PoolParty Extractor - Configure Language Files
For stop word elimination and lemmatization of terms, PoolParty Extractor (PPX) uses term lists that you can customize.
PoolParty Extractor ships with defaults. If no customized configuration files are present in the POOLPARTY_CONFIG directory, the defaults are used. To replace the defaults, a file per language can be placed in:
POOLPARTY_CONFIG/extractor/TYPE/LANG.txt
where TYPE is one of lemmatization or stopwords, and LANG is the language code without a country code (e.g. 'en', 'de').
Note
Defaults and language configuration files are exclusive!
If there is a stop word file for English (stopwords/en.txt), the stop word filter uses only this file and ignores the defaults.
To add custom words or rules to the defaults, copy the original file into POOLPARTY_CONFIG and extend the copy.
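Because a custom file replaces the defaults rather than extending them, adding a few words means starting from a copy of the default list. The following is a minimal sketch of that workflow, not an official procedure. It uses temporary stand-in directories and a fake three-word default file so that it can run anywhere, it assumes one word per line, and the appended words are hypothetical examples. In a real setup, POOLPARTY_CONFIG would be your actual configuration directory and the default file would come from the extractor jar.

```shell
# Sketch: extend the default English stop word list with custom words.
# All paths below are stand-ins (assumptions) so the example runs anywhere.
WORK_DIR="$(mktemp -d)"
POOLPARTY_CONFIG="$WORK_DIR/config"          # assumption: replace with your real config directory
DEFAULTS_DIR="$WORK_DIR/defaults/stopwords"  # assumption: location of the extracted default lists

# Fake a tiny default list for the demo; the real list ships inside the extractor jar.
mkdir -p "$DEFAULTS_DIR"
printf '%s\n' the their theirs > "$DEFAULTS_DIR/en.txt"

# 1. Start from a copy of the defaults, because a custom file replaces them entirely.
mkdir -p "$POOLPARTY_CONFIG/extractor/stopwords"
cp "$DEFAULTS_DIR/en.txt" "$POOLPARTY_CONFIG/extractor/stopwords/en.txt"

# 2. Append the custom stop words (hypothetical examples, assuming one word per line).
printf '%s\n' acme widget >> "$POOLPARTY_CONFIG/extractor/stopwords/en.txt"

# Show the resulting custom list.
cat "$POOLPARTY_CONFIG/extractor/stopwords/en.txt"
```

Appending to a copy keeps every default entry in place, so the filter behaves as before plus the additions.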
Each request to the extractor API has to provide the language of the input. Based on the requested language, a LANG.txt is loaded.
For example, for a request with language='en', the stop word filter loads:
POOLPARTY_CONFIG/extractor/stopwords/en.txt
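The lookup rule described above can be checked from the command line before sending a request. The helper below is a minimal sketch: override_status is a hypothetical name, not part of PoolParty, and it only tests whether a custom file exists at the documented location. It does not query the extractor itself. A temporary directory stands in for POOLPARTY_CONFIG.

```shell
# Hypothetical helper: report whether a custom language file would replace the defaults.
# Usage: override_status TYPE LANG   (TYPE: stopwords | lemmatization, LANG: e.g. en, de)
override_status() {
  file="$POOLPARTY_CONFIG/extractor/$1/$2.txt"
  if [ -f "$file" ]; then
    echo "custom file: $file"
  else
    echo "defaults (no $file)"
  fi
}

# Demo with a stand-in config directory (assumption: use your real POOLPARTY_CONFIG instead).
POOLPARTY_CONFIG="$(mktemp -d)"
mkdir -p "$POOLPARTY_CONFIG/extractor/stopwords"
: > "$POOLPARTY_CONFIG/extractor/stopwords/en.txt"

override_status stopwords en   # a custom file exists for English
override_status stopwords de   # no custom file, so the defaults apply
```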
As soon as a file is loaded by an incoming request, it is cached in memory. If files are modified on disk, the PoolParty Server must be restarted for the changes to take effect.
Structure
All configuration files will be read as UTF-8.
Stopwords
The format of stopword files is a simple list of words to be filtered:
stopwords/en.txt
...
the
their
theirs
them
themselves
then
thence
there
...
Wordforms
The format of wordform files is a tab separated list of two words, where the left word is an inflected form and the right is the lemma.
lemmatization/en.txt
...
biked	bike
bikers	biker
bikes	bike
bikeways	bikeway
...
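Since a space in place of a tab is easy to miss, it can help to check a wordform file before placing it in the config directory. The following sketch uses standard awk. check_wordforms is a hypothetical helper name, and the rule it enforces (exactly two non-empty, tab-separated fields per line) is derived from the format description above, not from the extractor's actual parser.

```shell
# Hypothetical helper: verify that every line is INFLECTED<TAB>LEMMA.
# Prints one message per offending line and returns non-zero if any line is malformed.
check_wordforms() {
  awk -F '\t' '
    NF != 2 || $1 == "" || $2 == "" {
      printf "line %d: expected INFLECTED<TAB>LEMMA\n", NR
      bad = 1
    }
    END { exit bad + 0 }
  ' "$1"
}

# Demo on a temporary file with two well-formed entries.
DEMO_FILE="$(mktemp)"
printf 'biked\tbike\nbikers\tbiker\n' > "$DEMO_FILE"
check_wordforms "$DEMO_FILE" && echo "wordform file looks well-formed"
```

Running the check on lemmatization/en.txt before a server restart catches formatting slips while they are still cheap to fix.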
How to Get Default Files
Step 1: Extract Term Lists
In a terminal, go to the Apache Tomcat directory:
Stopwords:
jar xf webapps/extractor/WEB-INF/lib/poolparty-extractor-ara-*.jar at/punkt/poolparty/extractor/ara/pipeline/stopwords
mv at/punkt/poolparty/extractor/ara/pipeline/stopwords/ stopwords
rmdir -p at/punkt/poolparty/extractor/ara/pipeline/
ls stopwords
Lemmatization:
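jar xf webapps/extractor/WEB-INF/lib/poolparty-extractor-ara-*.jar at/punkt/poolparty/extractor/ara/pipeline/wordforms
mv at/punkt/poolparty/extractor/ara/pipeline/wordforms/ wordforms
rmdir -p at/punkt/poolparty/extractor/ara/pipeline/
ls wordforms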
Step 2: Move the Extracted File Into the Config Directory
mv stopwords/ POOLPARTY_HOME/config/extractor/
mv wordforms/ POOLPARTY_HOME/config/extractor/