The PoolParty Extractor analyses documents and text and extracts meaningful phrases, named entities, categories or other metadata automatically with high throughput and accuracy.
Different data or metadata schemas can be mapped to a SKOS thesaurus that serves as a unified semantic knowledge model. During this process, the extracted entities are linked to the knowledge model (the thesaurus in the PoolParty Thesaurus Server) by URIs, which provide a direct path to integration in line with Semantic Web principles.
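As a rough illustration of linking extracted terms to thesaurus concepts by URI, the following sketch matches terms against preferred and alternative labels. The `Concept` class, the function names and the `example.org` URIs are assumptions made for this example; they are not part of the PoolParty API, and a real SKOS thesaurus carries far more structure.

```python
from dataclasses import dataclass


@dataclass
class Concept:
    """Minimal stand-in for a SKOS concept (hypothetical structure)."""
    uri: str
    pref_label: str
    alt_labels: tuple = ()


def build_label_index(concepts):
    """Map every lower-cased preferred or alternative label to its concept URI."""
    index = {}
    for concept in concepts:
        for label in (concept.pref_label, *concept.alt_labels):
            index[label.lower()] = concept.uri
    return index


def link_terms(terms, index):
    """Return (term, uri) pairs for the terms that match a thesaurus label."""
    return [(t, index[t.lower()]) for t in terms if t.lower() in index]


# Example data; the URIs are placeholders, not real identifiers.
thesaurus = [
    Concept("http://example.org/concept/1", "Semantic Web", ("Web of Data",)),
    Concept("http://example.org/concept/2", "Thesaurus", ("Controlled vocabulary",)),
]
index = build_label_index(thesaurus)
links = link_terms(["web of data", "thesaurus", "ontology"], index)
```

Because every match resolves to a URI rather than a string, synonyms such as "Web of Data" and "Semantic Web" end up pointing at the same concept.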
The PoolParty Extractor is implemented as a pipeline of annotation units where each specific unit adds to the final result. This keeps the system flexible and allows it to be adapted quickly to new requirements.
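The pipeline idea can be sketched in a few lines. This is a generic illustration, assuming each annotation unit is a function that receives a document record and returns it with additional annotations; the unit names and the small vocabulary are hypothetical and do not reflect the actual PoolParty implementation.

```python
import re


def tokenize_unit(doc):
    """Add a list of lower-cased word tokens to the document record."""
    doc["tokens"] = re.findall(r"[a-z0-9]+", doc["text"].lower())
    return doc


def concept_unit(doc, vocabulary=("thesaurus", "pipeline")):
    """Annotate tokens that appear in a small hypothetical vocabulary."""
    doc["concepts"] = sorted({t for t in doc["tokens"] if t in vocabulary})
    return doc


def count_unit(doc):
    """Record the number of tokens found by the tokenizer unit."""
    doc["token_count"] = len(doc["tokens"])
    return doc


def run_pipeline(text, units):
    """Pass the document through each unit in order; every unit adds to the result."""
    doc = {"text": text}
    for unit in units:
        doc = unit(doc)
    return doc


result = run_pipeline(
    "The pipeline links text to a thesaurus.",
    [tokenize_unit, concept_unit, count_unit],
)
```

Adapting such a pipeline to a new requirement means adding, removing or reordering entries in the list of units, without touching the other units.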
Advanced linguistic features include classification, corpus statistics and disambiguation. Machine-learning-based named-entity recognition (NER) is also part of the PoolParty Extractor. The extraction of organizations, people and locations relies on a trained model that uses maximum-entropy classification, a purely ML-based approach to extracting named entities. This approach can be combined with the graph-based extraction feature of PoolParty as well as with PoolParty’s regex-based annotator.
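To make the maximum-entropy idea more concrete: such a model assigns each class a probability proportional to the exponential of a weighted sum of features, normalised over all classes. The sketch below shows only the inference step. The feature names and weights are invented for illustration; a real model would learn its weights from annotated training data, and nothing here describes the model shipped with PoolParty.

```python
import math

# Hypothetical feature weights per class (assumed values, not a trained model).
WEIGHTS = {
    "PERSON": {"is_capitalized": 1.0, "prev_is_title": 2.5},
    "ORGANIZATION": {"is_capitalized": 1.0, "has_corp_suffix": 3.0},
    "LOCATION": {"is_capitalized": 1.0, "prev_is_in": 2.0},
    "OTHER": {"bias": 1.5},
}


def class_probabilities(features):
    """Maximum-entropy inference: P(y|x) = exp(w_y . f(x)) / Z(x)."""
    scores = {
        label: sum(weights.get(name, 0.0) * value for name, value in features.items())
        for label, weights in WEIGHTS.items()
    }
    # Subtract the largest score before exponentiating for numerical stability.
    top = max(scores.values())
    exp_scores = {label: math.exp(score - top) for label, score in scores.items()}
    z = sum(exp_scores.values())
    return {label: value / z for label, value in exp_scores.items()}


# Assumed features for the token "Acme" in a phrase like "works at Acme Inc."
features = {"bias": 1.0, "is_capitalized": 1.0, "has_corp_suffix": 1.0}
probs = class_probabilities(features)
best = max(probs, key=probs.get)
```

The output is a probability distribution over classes, which is what makes it straightforward to combine with evidence from other annotators.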
Documents are classified along the structure of a thesaurus, which allows the user to change the classification criteria flexibly. Corpora (sets of domain-specific documents) are a great way to add background knowledge to text mining processes. They provide term frequencies and distributions that improve the scoring of entities and drive the detection of new relevant entities in text.
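One common way to use corpus statistics for scoring is a TF-IDF-style weight: a term scores higher when it is frequent in the document at hand but rare across the corpus. The sketch below uses this well-known technique as a stand-in; it is an assumption for illustration and not a description of the scoring the PoolParty Extractor actually applies. The corpus sentences are made up.

```python
import math
from collections import Counter


def document_frequencies(corpus):
    """Count in how many corpus documents each term occurs."""
    df = Counter()
    for doc in corpus:
        df.update(set(doc.lower().split()))
    return df


def score_terms(text, corpus):
    """Score each term by term frequency times a smoothed inverse document frequency."""
    df = document_frequencies(corpus)
    tf = Counter(text.lower().split())
    n = len(corpus)
    return {
        term: count * math.log((1 + n) / (1 + df[term]))
        for term, count in tf.items()
    }


# A tiny made-up reference corpus.
corpus = [
    "the thesaurus models the domain",
    "the extractor reads the text",
    "the corpus adds background knowledge",
]
scores = score_terms("the extractor uses the thesaurus", corpus)
```

In this toy example, "the" occurs in every corpus document and therefore scores zero, while "uses", which the corpus has never seen, scores highest and would be a candidate for a new entity.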
Ambiguity can greatly reduce the precision of entity extraction when identical terms refer to different entities. Such ambiguities can be modeled in PoolParty, which improves extraction quality and, ultimately, the experience of the users who interact with the annotation results.
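A simple way to picture disambiguation is to compare the words surrounding an ambiguous term with the labels related to each candidate concept and pick the candidate with the largest overlap. The following sketch assumes exactly that heuristic; the candidate structure, the related-label sets and the `example.org` URIs are hypothetical and do not represent how PoolParty models or resolves ambiguity internally.

```python
# Hypothetical candidates for one ambiguous term, each with related labels.
CANDIDATES = {
    "jaguar": [
        {
            "uri": "http://example.org/concept/jaguar-animal",
            "related": {"cat", "jungle", "predator"},
        },
        {
            "uri": "http://example.org/concept/jaguar-car",
            "related": {"car", "engine", "dealer"},
        },
    ]
}


def disambiguate(term, context_words, candidates):
    """Return the URI of the candidate whose related labels overlap most with the context."""
    context = {word.lower() for word in context_words}

    def overlap(candidate):
        return len(context & candidate["related"])

    return max(candidates[term], key=overlap)["uri"]


sentence = "The dealer serviced the engine of the jaguar"
uri = disambiguate("jaguar", sentence.split(), CANDIDATES)
```

Here the context words "dealer" and "engine" match the car sense, so the term is linked to the car concept rather than the animal.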
The text mining functionality of the PoolParty Extractor is integrated with other systems through a web service API that follows REST principles and returns results in JSON. The API is designed for high throughput. Where high availability or scalability is a special requirement, the system can also be operated in clustered mode. Out of the box, the system comes with connectors to RDF graph databases that make it easy to integrate the results of text mining processes with other RDF data.