PoolParty Extractor - Configure Language Files

For stop word elimination and lemmatization of terms, PoolParty Extractor (PPX) uses term lists that you can customize.

PoolParty Extractor ships with default term lists. If no customized configuration files are present in the POOLPARTY_CONFIG directory, the defaults are used. To replace the defaults, place one file per language at:

  • POOLPARTY_CONFIG/extractor/TYPE/LANG.txt

  • where TYPE is one of

    • lemmatization

    • stopwords

  • and LANG is the language code without a country code (e.g. 'en', 'de', ...)

Note

Defaults and custom language files are mutually exclusive: they are never merged.

If there is a stop word file for English (stopwords/en.txt), the stop word filter uses only this file and ignores the defaults.

To add custom words or rules to the defaults, copy the original default file into POOLPARTY_CONFIG (see "How to Get Default Files") and edit the copy.
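Because a custom file replaces the defaults entirely, extending the defaults means starting from a full copy. The following is a minimal sketch of that workflow, not an official tool: the function name add_stopwords and both file arguments are hypothetical, and it assumes the default list has already been extracted to a plain file.

```shell
#!/bin/sh
# add_stopwords DEFAULT_FILE TARGET_FILE WORD...
# Hypothetical helper: copies an extracted default stop word list to the
# custom location and appends extra words that are not listed yet.
add_stopwords() {
    default_file=$1
    target_file=$2
    shift 2

    mkdir -p "$(dirname "$target_file")" || return 1

    # Start from the complete default list, because the custom file
    # is used instead of the defaults, not in addition to them.
    cp "$default_file" "$target_file" || return 1

    # Make sure the copy ends with a newline before appending.
    if [ -n "$(tail -c 1 "$target_file")" ]; then
        printf '\n' >> "$target_file"
    fi

    for word in "$@"; do
        # -x: match the whole line, -F: treat the word as a fixed string
        grep -qxF -e "$word" "$target_file" || printf '%s\n' "$word" >> "$target_file"
    done
}

# Usage (paths are placeholders):
#   add_stopwords stopwords/en.txt "$POOLPARTY_CONFIG/extractor/stopwords/en.txt" acme inc
```

Words that are already present are skipped, so the helper can be run repeatedly against the same default file without creating duplicates.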

Each request to the extractor API has to provide the language of the input. Based on the requested language, a LANG.txt is loaded.

For example, for a request with language='en', the stop word filter loads:

  • POOLPARTY_CONFIG/extractor/stopwords/en.txt

As soon as a file gets loaded by an incoming request, the file is cached in memory. If the files are modified on disk, a restart of the PoolParty Server is necessary for the changes to take effect.
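The file selection described in this section can be pictured with a short shell sketch. It only illustrates the documented rule (custom file if present, otherwise the default) and is not how PPX is implemented; select_term_list and DEFAULTS_DIR are hypothetical names, since the real defaults are bundled inside the application.

```shell
#!/bin/sh
# select_term_list CONFIG_DIR DEFAULTS_DIR TYPE LANG
# Prints the path of the term list that would be used for a request.
# DEFAULTS_DIR is a hypothetical directory standing in for the bundled defaults.
select_term_list() {
    config_dir=$1
    defaults_dir=$2
    kind=$3      # lemmatization or stopwords
    lang=$4      # language code without country code, e.g. en

    custom="$config_dir/extractor/$kind/$lang.txt"

    if [ -f "$custom" ]; then
        # A custom file wins outright; the default is ignored, not merged.
        printf '%s\n' "$custom"
    else
        printf '%s\n' "$defaults_dir/$kind/$lang.txt"
    fi
}

# Usage (paths are placeholders):
#   select_term_list "$POOLPARTY_CONFIG" ./defaults stopwords en
```

The sketch makes the decision per TYPE and LANG, matching the rule that a custom stopwords/en.txt replaces the English stop word defaults.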

Structure

All configuration files will be read as UTF-8.
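Since the files are read as UTF-8, it can be worth checking the encoding of a custom file before placing it in the config directory. A minimal sketch, assuming the standard iconv utility is installed; the function name is_utf8 is hypothetical:

```shell
#!/bin/sh
# is_utf8 FILE
# Succeeds (exit status 0) if FILE decodes cleanly as UTF-8.
is_utf8() {
    # Converting UTF-8 to UTF-8 changes nothing, but iconv fails
    # on any byte sequence that is not valid UTF-8.
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Usage (path is a placeholder):
#   is_utf8 stopwords/de.txt || echo "stopwords/de.txt is not valid UTF-8" >&2
```

If a file fails the check and its actual encoding is known, it can be converted with iconv as well, for example from ISO-8859-1 with `iconv -f ISO-8859-1 -t UTF-8 old.txt > new.txt`.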

Stopwords

The format of stopword files is a simple list of words to be filtered:

stopwords/en.txt

...
the
their
theirs
them
themselves
then
thence
there
...
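To preview the effect of a stop word list on a piece of text, the list can be applied with grep. This is only an approximation for testing a list by hand: the function name filter_stopwords is hypothetical, and exact, case-sensitive matching of whole tokens is an assumption of the sketch, not a statement about how PPX compares terms.

```shell
#!/bin/sh
# filter_stopwords STOPWORD_FILE
# Reads one token per line on standard input and prints the tokens
# that do not appear in STOPWORD_FILE.
filter_stopwords() {
    # -v: keep non-matching lines, -x: match whole lines only,
    # -F: treat entries as fixed strings, -f: read entries from the file
    grep -vxF -f "$1"
}

# Usage (path is a placeholder):
#   printf '%s\n' the cat saw them | filter_stopwords stopwords/en.txt
```

Because the format is one word per line, the file can be passed to grep directly as a pattern list without any preprocessing.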

Wordforms

The format of wordform (lemmatization) files is a tab-separated list of word pairs, one pair per line, where the left word is an inflected form and the right word is its lemma.

lemmatization/en.txt

...
biked     bike
bikers    biker
bikes     bike
bikeways  bikeway
...
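A missing or doubled tab is easy to overlook in an editor, so a quick format check before deploying a custom file may help. The sketch below assumes awk is available; check_wordforms and lemma_of are hypothetical helper names, and returning an unlisted word unchanged is an assumption made for illustration only.

```shell
#!/bin/sh
# check_wordforms FILE
# Prints every line that is not exactly "inflected<TAB>lemma" with its
# line number, and fails if at least one such line exists.
check_wordforms() {
    awk -F '\t' '
        NF != 2 || $1 == "" || $2 == "" { printf "%d: %s\n", NR, $0; bad = 1 }
        END { exit bad }
    ' "$1"
}

# lemma_of FILE WORD
# Prints the lemma listed for WORD, or WORD itself if it is not listed.
lemma_of() {
    awk -F '\t' -v w="$2" '
        $1 == w { print $2; found = 1; exit }
        END { if (!found) print w }
    ' "$1"
}

# Usage (path is a placeholder):
#   check_wordforms lemmatization/en.txt && lemma_of lemmatization/en.txt biked
```

Lines separated by spaces instead of a tab are reported as malformed, which is the most common mistake when such files are edited by hand.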

How to Get Default Files

Step 1: Extract Term Lists

In a terminal, change to the Apache Tomcat directory and run the commands below.

Stopwords:

jar xf webapps/extractor/WEB-INF/lib/poolparty-extractor-ara-*.jar at/punkt/poolparty/extractor/ara/pipeline/stopwords
mv at/punkt/poolparty/extractor/ara/pipeline/stopwords/ stopwords
rmdir -p at/punkt/poolparty/extractor/ara/pipeline/
ls stopwords

Lemmatization:

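jar xf webapps/extractor/WEB-INF/lib/poolparty-extractor-ara-*.jar at/punkt/poolparty/extractor/ara/pipeline/wordforms
mv at/punkt/poolparty/extractor/ara/pipeline/wordforms/ wordforms
rmdir -p at/punkt/poolparty/extractor/ara/pipeline/
ls wordforms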

Step 2: Move the Extracted Files Into the Config Directory

mv stopwords/ POOLPARTY_HOME/config/extractor/
mv wordforms/ POOLPARTY_HOME/config/extractor/lemmatization