Extractor - What does it do

Abstract

Text Analysis, Data Mining and Extraction

The PoolParty Extractor automatically analyses documents and texts to extract meaningful phrases, named entities, categories and other metadata, which can then be mapped to a SKOS thesaurus that serves as a unified semantic knowledge model. The knowledge model (the thesaurus) and the extracted entities are linked by URIs, so they integrate directly in line with Semantic Web principles. The PoolParty Extractor is implemented as a pipeline of annotation units, each of which contributes to the final result. This design keeps the system flexible and highly scalable, so new requirements can be addressed quickly. Supported advanced linguistic features include classification, corpus statistics and disambiguation.
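
The pipeline idea can be illustrated with a minimal sketch. It is not the PoolParty implementation: the names `Annotation`, `Document`, `make_thesaurus_unit`, `make_regex_unit` and `run_pipeline`, as well as the `example.org` URI, are hypothetical and chosen only for illustration. Each annotation unit receives the document, adds its own annotations and passes the document on.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Annotation:
    start: int
    end: int
    text: str
    kind: str      # e.g. "concept" or "year"
    uri: str = ""  # thesaurus concept URI, when one applies


@dataclass
class Document:
    text: str
    annotations: List[Annotation] = field(default_factory=list)


# Hypothetical unit type: takes a document, adds annotations, returns it.
AnnotationUnit = Callable[[Document], Document]


def make_thesaurus_unit(labels: Dict[str, str]) -> AnnotationUnit:
    """Build a unit that matches thesaurus labels (label -> concept URI)."""
    def unit(doc: Document) -> Document:
        for label, uri in labels.items():
            pattern = r"\b" + re.escape(label) + r"\b"
            for m in re.finditer(pattern, doc.text, flags=re.IGNORECASE):
                doc.annotations.append(
                    Annotation(m.start(), m.end(), m.group(0), "concept", uri))
        return doc
    return unit


def make_regex_unit(kind: str, pattern: str) -> AnnotationUnit:
    """Build a unit that annotates every match of a regular expression."""
    def unit(doc: Document) -> Document:
        for m in re.finditer(pattern, doc.text):
            doc.annotations.append(
                Annotation(m.start(), m.end(), m.group(0), kind))
        return doc
    return unit


def run_pipeline(text: str, units: List[AnnotationUnit]) -> Document:
    """Run the units in order; each one contributes to the final result."""
    doc = Document(text)
    for unit in units:
        doc = unit(doc)
    doc.annotations.sort(key=lambda a: a.start)
    return doc


# Illustrative usage with a made-up one-concept thesaurus.
units = [
    make_thesaurus_unit({"semantic web": "http://example.org/thesaurus/semantic-web"}),
    make_regex_unit("year", r"\b(?:19|20)\d{2}\b"),
]
for a in run_pipeline("The Semantic Web was proposed in 2001.", units).annotations:
    print(a.kind, a.text, a.uri)
```

Because every unit has the same signature, a new requirement can be covered by appending another unit to the list without changing the existing ones.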

Documents are classified along the structure of a thesaurus, which allows the user to change the classification criteria flexibly. Corpora (sets of domain-specific documents) are an effective way to add background knowledge to text mining processes. PoolParty corpus management tightly integrates the PoolParty Extractor into the thesaurus management process: it uses the Extractor's ability to analyse text and extract terms and phrases, which are then matched against the concepts in your thesaurus. You can integrate extracted domain-specific terms into your thesaurus as new concepts or as synonyms of existing concepts.

Ambiguity arises when identical terms refer to different entities, that is, when different concepts share the same label. It is a frequently observed challenge in text analytics: it can greatly reduce the precision of entity extraction and lead to incorrect annotations. The PoolParty Extractor can distinguish such occurrences based on the taxonomy structure and the local context of the ambiguous concepts.
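
A simplified sketch of context-based disambiguation follows. It is an illustration under stated assumptions, not the algorithm the PoolParty Extractor uses: each candidate concept is assumed to come with a set of words taken from its taxonomy neighbourhood (for example the labels of broader, narrower and related concepts), and the candidate whose neighbourhood overlaps most with the words around the ambiguous term wins. The function names, the `example.org` URIs and the neighbourhood words are all hypothetical.

```python
import re
from typing import Dict, List, Optional, Set


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def local_context(tokens: List[str], position: int, window: int) -> Set[str]:
    """Collect up to `window` tokens on each side of the given position."""
    left = tokens[max(0, position - window):position]
    right = tokens[position + 1:position + 1 + window]
    return set(left + right)


def disambiguate(text: str, label: str,
                 candidates: Dict[str, Set[str]],
                 window: int = 5) -> List[Optional[str]]:
    """Pick one concept URI per occurrence of a single-word label.

    `candidates` maps each concept URI sharing the label to words from its
    taxonomy neighbourhood (an assumption of this sketch). Returns None for
    an occurrence when no candidate clearly wins.
    """
    tokens = tokenize(text)
    target = label.lower()
    results: List[Optional[str]] = []
    for i, token in enumerate(tokens):
        if token != target:
            continue
        context = local_context(tokens, i, window)
        scores = {uri: len(context & words) for uri, words in candidates.items()}
        best = max(scores.values(), default=0)
        winners = [uri for uri, score in scores.items() if score == best]
        # Accept only a unique winner with at least one overlapping word.
        results.append(winners[0] if best > 0 and len(winners) == 1 else None)
    return results


# Two made-up concepts that share the label "Jaguar".
candidates = {
    "http://example.org/thesaurus/jaguar-animal": {"cat", "predator", "rainforest", "animal"},
    "http://example.org/thesaurus/jaguar-car": {"car", "engine", "vehicle", "sedan"},
}
text = ("The jaguar is a large cat living in the rainforest. "
        "The new Jaguar sedan has a powerful engine.")
print(disambiguate(text, "Jaguar", candidates))
```

In this example the first occurrence is surrounded by "cat" and resolves to the animal concept, while the second is surrounded by "sedan" and "engine" and resolves to the car concept.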

Machine-learning-based named-entity recognition (NER) is also integrated into the PoolParty Extractor. Organisations, people and locations are extracted with a trained model that uses maximum-entropy classification, a purely ML-based approach to named-entity extraction. This approach can be combined with PoolParty's graph-based extraction feature as well as with PoolParty's regex-based annotator.

The text mining functionality of the PoolParty Extractor can be integrated with other systems through a RESTful web service API that delivers results in JSON. The API is designed for high throughput and comes with connectors to RDF graph databases, which makes it easy to integrate text mining results with other RDF data.
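
As a rough illustration of how JSON extraction results could be combined with other RDF data, the sketch below turns a result into N-Triples statements that link a document URI to concept URIs. The JSON shape shown here (`document`, `concepts`, `uri`, `score`) is a made-up assumption for this example and is not the actual response format of the PoolParty API. The choice of `dcterms:subject` as the linking property and the function name `to_ntriples` are likewise assumptions.

```python
import json
from typing import List

# Dublin Core property assumed here for linking a document to a concept.
DCTERMS_SUBJECT = "http://purl.org/dc/terms/subject"


def to_ntriples(response_json: str, min_score: float = 0.0) -> List[str]:
    """Convert a hypothetical extraction result into N-Triples lines.

    Assumed (not official) input shape:
    {"document": "<doc URI>", "concepts": [{"uri": "...", "score": 87}, ...]}
    """
    data = json.loads(response_json)
    doc_uri = data["document"]
    triples = []
    for concept in data.get("concepts", []):
        # Keep only concepts at or above the requested score threshold.
        if concept.get("score", 0) >= min_score:
            triples.append(
                f"<{doc_uri}> <{DCTERMS_SUBJECT}> <{concept['uri']}> .")
    return triples


# Made-up response used only to exercise the function.
sample_response = json.dumps({
    "document": "http://example.org/docs/1",
    "concepts": [
        {"uri": "http://example.org/thesaurus/semantic-web", "score": 87},
        {"uri": "http://example.org/thesaurus/text-mining", "score": 12},
    ],
})
for line in to_ntriples(sample_response, min_score=50):
    print(line)
```

The resulting lines can be loaded into any RDF graph database that accepts N-Triples, where they sit alongside the thesaurus data and share its concept URIs.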

In this section: