
PoolParty Extractor - Customize Configuration Files

04/06/2025

For stop word elimination, lemmatization of terms and rule-based person name recognition, the PoolParty Extractor (PPX) uses word lists, which you can customize.

PoolParty Extractor ships with defaults. If no customized configuration files are present in the POOLPARTY_CONFIG directory, the defaults are used.

The word lists for stop word elimination and lemmatization of terms are language-specific. To override the defaults, place a file per language in:

  • POOLPARTY_CONFIG/extractor/TYPE/LANG.txt

  • where TYPE is one of

    • lemmatization

    • stopwords

  • and LANG is the language code without the country code (e.g. 'en', 'de', ...)

Each request to the PoolParty Extractor API has to provide the language of the input. Based on the requested language, a LANG.txt is loaded.

For example, for a request with language='en', the stop word filter loads POOLPARTY_CONFIG/extractor/stopwords/en.txt.
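The lookup rule above can be sketched as a small shell helper. This only illustrates the path convention and is not PoolParty code: wordlist_path is a hypothetical function, the value assigned to POOLPARTY_CONFIG is a placeholder, and stripping a country suffix such as "-US" is an assumption about how a caller might normalize the language.

```shell
# Hypothetical helper: print the override file that would apply to TYPE and LANG.
wordlist_path() {
  wl_type="$1"            # "stopwords" or "lemmatization"
  wl_lang="${2%%[-_]*}"   # assumption: drop a country suffix, "en-US" -> "en"
  case "$wl_type" in
    stopwords|lemmatization) ;;
    *) echo "unknown TYPE: $wl_type" >&2; return 1 ;;
  esac
  printf '%s/extractor/%s/%s.txt\n' "$POOLPARTY_CONFIG" "$wl_type" "$wl_lang"
}

POOLPARTY_CONFIG=/opt/poolparty/config   # placeholder, adjust to your installation
wordlist_path stopwords en
# prints /opt/poolparty/config/extractor/stopwords/en.txt
```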

To add custom words or rules to the default word lists for stop word elimination and lemmatization of terms, copy the default file (see How to Get Default Files), add your entries, and place the result in POOLPARTY_CONFIG/extractor/TYPE/LANG.txt.
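One way to extend a default list without editing it by hand is to merge it with a file of your own entries. The following is a minimal sketch under stated assumptions: merge_wordlist and all file names are hypothetical, and the demo uses throwaway files rather than a real default list.

```shell
# Hypothetical helper: write DEFAULT plus CUSTOM to TARGET, keeping the
# original order and dropping blank lines and repeated entries.
merge_wordlist() {
  mkdir -p "$(dirname "$3")"
  awk 'NF && !seen[$0]++' "$1" "$2" > "$3"
}

# Demo with throwaway files; in practice DEFAULT is the extracted default list.
demo_dir="$(mktemp -d)"
printf 'the\ntheir\n'    > "$demo_dir/default-en.txt"
printf 'their\nwhilst\n' > "$demo_dir/custom-en.txt"
merge_wordlist "$demo_dir/default-en.txt" "$demo_dir/custom-en.txt" \
  "$demo_dir/config/extractor/stopwords/en.txt"
cat "$demo_dir/config/extractor/stopwords/en.txt"
```

Keeping the custom entries in a separate file makes it easier to repeat the merge when the defaults change with a new release.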

The word list for rule-based entity recognition contains first names. To customize the default, add the first names for all languages to a single personNames.txt file and place the file in the POOLPARTY_CONFIG/extractor/personNames folder.

Note

Defaults and custom configuration files are mutually exclusive. If, for instance, there is a stop word file for English (stopwords/en.txt), the stop word filter uses only this file and ignores the default.

As soon as a file gets loaded by an incoming request, the file is cached in memory. If the files are modified on disk, you need to restart the PoolParty server for the changes to take effect.
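Because a custom file replaces the default for its language entirely, it can help to see which overrides are present. The sketch below lists them for a given configuration directory; list_overrides is a hypothetical helper and the demo runs against a throwaway directory instead of a real POOLPARTY_CONFIG.

```shell
# Hypothetical helper: print "TYPE LANG" for every custom word list found.
list_overrides() {
  for lo_type in stopwords lemmatization; do
    for lo_file in "$1/extractor/$lo_type"/*.txt; do
      [ -f "$lo_file" ] || continue   # the glob matched nothing
      printf '%s %s\n' "$lo_type" "$(basename "$lo_file" .txt)"
    done
  done
}

# Demo against a throwaway directory.
demo_cfg="$(mktemp -d)"
mkdir -p "$demo_cfg/extractor/stopwords"
touch "$demo_cfg/extractor/stopwords/en.txt"
list_overrides "$demo_cfg"
# prints: stopwords en
```

Every language listed this way uses only the custom file, and edits to that file require a server restart as described in the note.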

Structure

All configuration files are read as UTF-8.
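A file saved in another encoding may be misread, so it can be worth checking the encoding before copying a list into the configuration directory. This sketch relies on the standard iconv utility; is_utf8 and the demo file names are hypothetical.

```shell
# Hypothetical helper: succeed only if FILE decodes cleanly as UTF-8.
is_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

demo_dir="$(mktemp -d)"
printf 'Fre\303\255a\n' > "$demo_dir/utf8.txt"     # "í" encoded as UTF-8
printf 'Fre\355a\n'     > "$demo_dir/latin1.txt"   # "í" encoded as ISO-8859-1
is_utf8 "$demo_dir/utf8.txt"   && echo "utf8.txt: ok"
is_utf8 "$demo_dir/latin1.txt" || echo "latin1.txt: not valid UTF-8"

# A Latin-1 file can be converted before it is placed in the config directory.
iconv -f ISO-8859-1 -t UTF-8 "$demo_dir/latin1.txt" > "$demo_dir/converted.txt"
```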

Stopwords

The format of stopword files is a simple list of words to be filtered:

stopwords/en.txt

...
the
their
theirs
them
themselves
then
thence
there
...
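To illustrate what such a list is used for, the sketch below removes every token that appears in a stop word file. It is not the extractor's implementation: filter_stopwords is a hypothetical helper, and exact, case-sensitive matching is an assumption made for the example.

```shell
# Illustration only: drop every input line that appears verbatim in the list.
filter_stopwords() {
  grep -v -x -F -f "$1"
}

demo_dir="$(mktemp -d)"
printf 'the\ntheir\nthem\n' > "$demo_dir/en.txt"
printf 'the\ncyclists\nlost\ntheir\nbikes\n' | filter_stopwords "$demo_dir/en.txt"
# prints cyclists, lost and bikes, one per line
```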

Wordforms

Each line of a wordform file contains two words separated by a tab: the inflected form on the left and its lemma on the right.

lemmatization/en.txt

...
biked     bike
bikers    biker
bikes     bike
bikeways  bikeway
...
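Since the separator must be a tab, and tabs are easy to lose when a file is edited or pasted, a quick format check can catch mistakes early. The sketch below also shows how a form maps to its lemma. Both helpers are hypothetical and only illustrate the file format, not how the extractor applies it.

```shell
# Hypothetical check: report lines that do not have exactly two tab-separated fields.
check_wordforms() {
  awk -F '\t' 'NF != 2 { printf "line %d: expected FORM<TAB>LEMMA\n", NR; bad = 1 }
               END { exit bad + 0 }' "$1"
}

# Illustration only: print the lemma for WORD, or WORD itself if it is not listed.
lemma_of() {
  awk -F '\t' -v w="$2" '$1 == w { print $2; found = 1; exit }
                         END { if (!found) print w }' "$1"
}

demo_dir="$(mktemp -d)"
printf 'biked\tbike\nbikers\tbiker\nbikes\tbike\n' > "$demo_dir/en.txt"
check_wordforms "$demo_dir/en.txt" && echo "format ok"
lemma_of "$demo_dir/en.txt" bikers    # prints: biker
lemma_of "$demo_dir/en.txt" road      # prints: road
```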

Person names

The format of the person name file is a simple list of first names.

personNames.txt

...
Arbogast
Jayden
Jaromir
Leyla
Freia
Jolie
Marceau
Isaias
Isaiah
Francis
...
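When names are added to the list over time, duplicates are easy to introduce. The following sketch appends a name only if it is not already listed; add_first_name is a hypothetical helper and the demo works on a throwaway copy of the file.

```shell
# Hypothetical helper: append NAME to FILE unless an identical line already exists.
add_first_name() {
  grep -q -x -F -e "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

demo_dir="$(mktemp -d)"
printf 'Jayden\nLeyla\n' > "$demo_dir/personNames.txt"
add_first_name "$demo_dir/personNames.txt" Leyla    # already present, no change
add_first_name "$demo_dir/personNames.txt" Amara    # appended
cat "$demo_dir/personNames.txt"
```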

How to Get Default Files

Step 1: Extract Term Lists

In a terminal, go to the Apache Tomcat directory and run the commands for the word list you need:

Stopwords:

jar xf webapps/extractor/WEB-INF/lib/poolparty-extractor-ara-*.jar at/punkt/poolparty/extractor/ara/pipeline/stopwords
mv at/punkt/poolparty/extractor/ara/pipeline/stopwords/ stopwords
rmdir -p at/punkt/poolparty/extractor/ara/pipeline/
ls stopwords

Lemmatization:

jar xf webapps/extractor/WEB-INF/lib/poolparty-extractor-ara-*.jar at/punkt/poolparty/extractor/ara/pipeline/wordforms
mv at/punkt/poolparty/extractor/ara/pipeline/wordforms/ wordforms
rmdir -p at/punkt/poolparty/extractor/ara/pipeline/
ls wordforms

Person names:

jar xf webapps/extractor/WEB-INF/lib/poolparty-extractor-ara-*.jar at/punkt/poolparty/extractor/ara/pipeline/regex/personNames.txt
mv at/punkt/poolparty/extractor/ara/pipeline/regex/personNames.txt personNames.txt
rmdir -p at/punkt/poolparty/extractor/ara/pipeline/regex/

Step 2: Move the Extracted Files Into the Config Directory

mv stopwords/ {POOLPARTY_HOME}/config/extractor/
# rename on move so the directory matches the TYPE name "lemmatization" used above
mv wordforms/ {POOLPARTY_HOME}/config/extractor/lemmatization
mkdir -p {POOLPARTY_HOME}/config/extractor/personNames
mv personNames.txt {POOLPARTY_HOME}/config/extractor/personNames/