PoolParty Extractor - Customize Configuration Files
04/06/2025
For stop word elimination, lemmatization of terms and rule-based person name recognition, the PoolParty Extractor (PPX) uses word lists, which you can customize.
PoolParty Extractor ships with defaults. If no customized configuration files are present in the POOLPARTY_CONFIG directory, the defaults are used.
The word lists for stop word elimination and lemmatization of terms are language-specific. To override the defaults, place a file per language in:
POOLPARTY_CONFIG/extractor/TYPE/LANG.txt
where TYPE is one of lemmatization or stopwords, and LANG is the language code without a country code (e.g. 'en', 'de').
Each request to the PoolParty Extractor API must specify the language of the input. Based on the requested language, the corresponding LANG.txt is loaded.
For example, for a request with language='en', the stop word filter loads POOLPARTY_CONFIG/extractor/stopwords/en.txt.
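The lookup described above can be sketched in shell as follows. The function name ppx_custom_file is hypothetical and not part of PoolParty; the sketch assumes POOLPARTY_CONFIG is available as an environment variable and only reports whether a custom file exists at the expected path for a given TYPE and LANG.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of PoolParty): report whether a custom
# word list exists for a given TYPE (stopwords | lemmatization) and LANG.
# Assumes POOLPARTY_CONFIG is set in the environment.
ppx_custom_file() {
    local type="$1" lang="$2"
    local file="${POOLPARTY_CONFIG:?POOLPARTY_CONFIG is not set}/extractor/${type}/${lang}.txt"
    if [ -f "$file" ]; then
        echo "custom: $file"
    else
        echo "default (no file at $file)"
    fi
}
```

Calling ppx_custom_file stopwords en prints the path of the custom English stop word file if one is present, and otherwise indicates that the shipped default would apply.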
To add custom words or rules to the default word lists for stop word elimination and lemmatization of terms, copy the default file, add your entries to the copy, and place it in the corresponding directory under POOLPARTY_CONFIG.
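One way to prepare such a file is to append your own entries to a copy of the default list and drop duplicates. This is a minimal sketch, not a PoolParty tool: the function name merge_wordlist and the file names in the usage note are hypothetical, and it assumes the default list has already been extracted as described under "How to Get Default Files".

```shell
#!/usr/bin/env bash
# Hypothetical helper: merge a default word list with custom entries.
# Usage: merge_wordlist DEFAULT_FILE CUSTOM_FILE OUTPUT_FILE
merge_wordlist() {
    local default_file="$1" custom_file="$2" output_file="$3"
    # Create the target directory if it does not exist yet.
    mkdir -p "$(dirname "$output_file")"
    # Keep the original order, skip blank lines, and keep only the
    # first occurrence of each line.
    awk 'NF && !seen[$0]++' "$default_file" "$custom_file" > "$output_file"
}
```

For example, merge_wordlist stopwords/en.txt my-stopwords.txt "$POOLPARTY_CONFIG/extractor/stopwords/en.txt" would write the combined English stop word list to the location from which the custom file is loaded.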
The word list for rule-based entity recognition contains first names. To customize the default, add all first names for all languages to the personNames.txt file and place the file in the POOLPARTY_CONFIG/extractor/personNames folder.
Note
Defaults and custom configuration files are mutually exclusive. If, for instance, a stop words file for English (stopwords/en.txt) exists, the stop word filter uses only this file and ignores the default.
As soon as a file gets loaded by an incoming request, the file is cached in memory. If the files are modified on disk, you need to restart the PoolParty server for the changes to take effect.
Structure
All configuration files will be read as UTF-8.
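Because the files are read as UTF-8, it can help to check the encoding of a customized file before restarting the server. The following is a minimal sketch that assumes the iconv utility is installed; the function name is_utf8 is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical check: succeed only if the given file is valid UTF-8.
# iconv exits with a non-zero status when it meets an invalid byte sequence.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}
```

A call such as is_utf8 personNames.txt || echo "not valid UTF-8" flags a file that was saved in a different encoding, for example Latin-1.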
Stopwords
The format of stopword files is a simple list of words to be filtered:
stopwords/en.txt
...
the
their
theirs
them
themselves
then
thence
there
...
Wordforms
The format of wordform files is a tab-separated list of two words, where the left word is an inflected form and the right word is its lemma.
lemmatization/en.txt
...
biked	bike
bikers	biker
bikes	bike
bikeways	bikeway
...
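Since the two columns must be separated by a tab, a quick check of a customized wordform file can catch lines that were typed with spaces instead. This is a sketch under assumptions: the function name check_wordforms is hypothetical, and the check is deliberately strict in that it also reports blank lines.

```shell
#!/usr/bin/env bash
# Hypothetical check: print every line that is not "inflected<TAB>lemma"
# and return a non-zero status if at least one such line is found.
# Blank lines are reported as well, which may be stricter than necessary.
check_wordforms() {
    awk -F'\t' '
        NF != 2 || $1 == "" || $2 == "" {
            printf "line %d: %s\n", NR, $0
            bad = 1
        }
        END { exit bad }
    ' "$1"
}
```

Running check_wordforms lemmatization/en.txt prints nothing and succeeds when every line consists of exactly two tab-separated, non-empty fields.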
Person names
The format of the person name file is a simple list of first names.
personNames.txt
...
Arbogast
Jayden
Jaromir
Leyla
Freia
Jolie
Marceau
Isaias
Isaiah
Francis
...
How to Get Default Files
Step 1: Extract Term Lists
In a terminal, go to the Apache Tomcat directory:
Stopwords:
jar xf webapps/extractor/WEB-INF/lib/poolparty-extractor-ara-*.jar at/punkt/poolparty/extractor/ara/pipeline/stopwords
mv at/punkt/poolparty/extractor/ara/pipeline/stopwords/ stopwords
rmdir -p at/punkt/poolparty/extractor/ara/pipeline/
ls stopwords
Lemmatization:
jar xf webapps/extractor/WEB-INF/lib/poolparty-extractor-ara-*.jar at/punkt/poolparty/extractor/ara/pipeline/wordforms
mv at/punkt/poolparty/extractor/ara/pipeline/wordforms/ wordforms
rmdir -p at/punkt/poolparty/extractor/ara/pipeline/
ls wordforms
Person names:
jar xf webapps/extractor/WEB-INF/lib/poolparty-extractor-ara-*.jar at/punkt/poolparty/extractor/ara/pipeline/regex/personNames.txt
mv at/punkt/poolparty/extractor/ara/pipeline/regex/personNames.txt personNames.txt
rmdir -p at/punkt/poolparty/extractor/ara/pipeline/regex/
Step 2: Move the Extracted Files Into the Config Directory
mv stopwords/ {POOLPARTY_HOME}/config/extractor/
mv wordforms/ {POOLPARTY_HOME}/config/extractor/
mkdir -p {POOLPARTY_HOME}/config/extractor/personNames
mv personNames.txt {POOLPARTY_HOME}/config/extractor/personNames/