PoolParty Extractor

The PoolParty Extractor analyses documents and text and extracts meaningful phrases, concepts, categories or other metadata automatically with high throughput and accuracy.

Different data or metadata schemas can be mapped to a SKOS thesaurus that serves as a unified semantic knowledge model. During this process, the extracted entities are linked to the knowledge model (the thesaurus in the PoolParty Thesaurus Server) via URIs, which provide a direct path to integration following Semantic Web principles.
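The linking step can be illustrated with a minimal sketch. It assumes a tiny in-memory thesaurus and uses the Dublin Core `dcterms:subject` property to connect a document to a concept. The labels, the `example.org` URIs and the function name `link_entities` are made up for illustration and are not part of PoolParty.

```python
# Hypothetical thesaurus: lowercase labels mapped to concept URIs.
# Both labels and URIs are invented for this example.
THESAURUS = {
    "semantic web": "http://example.org/thesaurus/semantic-web",
    "text mining": "http://example.org/thesaurus/text-mining",
}

# Dublin Core property commonly used to state the topic of a resource.
DCTERMS_SUBJECT = "http://purl.org/dc/terms/subject"


def link_entities(doc_uri, extracted_terms):
    """Return N-Triples lines linking a document to thesaurus concepts."""
    triples = []
    for term in extracted_terms:
        concept_uri = THESAURUS.get(term.lower())
        if concept_uri is None:
            continue  # term is not part of the knowledge model
        triples.append(f"<{doc_uri}> <{DCTERMS_SUBJECT}> <{concept_uri}> .")
    return triples


if __name__ == "__main__":
    terms = ["Text Mining", "cooking"]
    for line in link_entities("http://example.org/docs/42", terms):
        print(line)
```

Because each extracted entity is expressed as a URI rather than a bare string, the resulting statements can be merged with any other RDF data that refers to the same concepts.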

The PoolParty Extractor is implemented as a pipeline of annotation units where each specific unit adds to the final result. This keeps the system flexible and allows it to be adapted quickly to new requirements.
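The pipeline idea can be sketched in a few lines of Python. This is an illustration of the general pattern only, not the Extractor's actual implementation. The class `Annotations`, the two units and the small vocabulary are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Annotations:
    """Shared result object that every unit in the pipeline adds to."""
    text: str
    tokens: list = field(default_factory=list)
    concepts: list = field(default_factory=list)


def tokenize(result):
    # Unit 1: split the text into lowercase tokens without punctuation.
    result.tokens = [t.strip(".,;:!?").lower() for t in result.text.split()]
    return result


def match_concepts(result, vocabulary=frozenset({"thesaurus", "ontology"})):
    # Unit 2: keep the tokens that appear in a (hypothetical) vocabulary.
    result.concepts = [t for t in result.tokens if t in vocabulary]
    return result


def run_pipeline(text, units):
    """Pass one result object through each unit in the given order."""
    result = Annotations(text=text)
    for unit in units:
        result = unit(result)
    return result


if __name__ == "__main__":
    out = run_pipeline("A thesaurus is not an ontology.", [tokenize, match_concepts])
    print(out.concepts)
```

Adapting such a pipeline to a new requirement means adding, removing or reordering entries in the list of units, while the other units stay unchanged.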

Advanced linguistic features include classification, named entity recognition, corpus statistics and disambiguation.

Documents are classified along the structure of a thesaurus, which allows the user to change the classification criteria flexibly. Named entity recognition finds common terms such as organisation names or persons without the need for a dedicated knowledge model in the background. Corpora (sets of domain-specific documents) add background knowledge to text mining processes: they provide term frequencies and distributions that improve the scoring of entities and drive the detection of new relevant entities in text. Ambiguity can greatly reduce the precision of entity extraction when identical terms refer to different entities. Such ambiguities can be modelled in PoolParty, which improves extraction quality and, in the end, the experience of users who interact with the annotation results.
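How corpus statistics can influence scoring is easiest to see with a generic weighting scheme. The sketch below uses a smoothed TF-IDF weight as a stand-in. The scoring formula actually used by the PoolParty Extractor is not described here, and the function names are hypothetical.

```python
import math
from collections import Counter


def document_frequencies(corpus):
    """Count in how many corpus documents each term occurs."""
    df = Counter()
    for doc in corpus:
        df.update(set(doc.lower().split()))
    return df


def score_terms(text, corpus):
    """Score each term of `text` by term frequency times smoothed IDF."""
    df = document_frequencies(corpus)
    n_docs = len(corpus)
    tf = Counter(text.lower().split())
    return {
        # Terms found in every corpus document get a weight of zero,
        # rarer terms get a higher weight.
        term: count * math.log((1 + n_docs) / (1 + df[term]))
        for term, count in tf.items()
    }


if __name__ == "__main__":
    corpus = ["the thesaurus", "the ontology", "the graph"]
    print(score_terms("the thesaurus", corpus))
```

In this toy corpus, "the" occurs in every document and therefore scores zero, while "thesaurus" occurs in only one and scores higher. The same mechanism lets a domain corpus push down generic vocabulary and surface candidate terms that are distinctive for a document.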

The text mining functionality of the PoolParty Extractor is integrated with other systems via a web service API that follows REST principles and produces results in JSON and RDF. The API is designed for high throughput. Where high availability or scalability is required, the system can also be operated in clustered mode. Out of the box, the system comes with connectors to RDF graph databases that make it easy to integrate text mining results with other RDF data.
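A client for such an API could look roughly like the sketch below, which uses only the Python standard library. Everything specific in it is an assumption made for illustration: the URL, the parameter names `text`, `projectId` and `language`, and the shape of the JSON response. The API documentation of the actual installation is authoritative for the real endpoint and fields.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint: replace with the URL of your own installation.
BASE_URL = "https://extractor.example.org/api/extract"


def build_extract_request(text, project_id, language="en"):
    """Build (but do not send) a POST request carrying the text to annotate."""
    # Parameter names are assumptions, not taken from the product documentation.
    body = urlencode({"text": text, "projectId": project_id, "language": language})
    return Request(
        BASE_URL,
        data=body.encode("utf-8"),
        headers={
            "Accept": "application/json",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


def concept_uris(response_body):
    """Pull concept URIs out of a JSON response with an assumed shape."""
    payload = json.loads(response_body)
    return [c["uri"] for c in payload.get("concepts", [])]


if __name__ == "__main__":
    req = build_extract_request("SKOS is a W3C standard.", "demo-project")
    print(req.get_method(), req.full_url)
    print(concept_uris('{"concepts": [{"uri": "http://example.org/c/skos"}]}'))
```

Sending the request would be a matter of passing it to `urllib.request.urlopen` and handing the response body to `concept_uris`. Keeping request construction and response parsing in separate functions makes both testable without a running server.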