
PoolParty Extractor - Background

Abstract

This page provides detailed information on the way PoolParty Extractor works.

The PoolParty Extractor (PPX) is responsible for enhancing documents' metadata by

  • Mapping metadata values to concepts in a thesaurus.

  • Extracting additional metadata from the document data itself and mapping this additional metadata to concepts in a thesaurus as well.

Establishing such mappings to concepts in a thesaurus aligns metadata across documents from different sources. Semantically enriched documents are easier to find again and therefore easier to reuse.

At its core, PPX uses an index in which all information from the thesauri is stored in a form optimized for fast lookups. This index has to be kept up to date along with the thesaurus itself. During document processing, newly discovered metadata is stored in the thesaurus right away, so that it can later be put into the right context and further improve PPX results.

Metadata Mapping

PPX interprets explicitly provided metadata as (semi)structured information that is ready to be mapped to thesaurus concepts. The basic configuration provides a mapping scheme between predefined metadata fields of documents on the one side and collections of concepts (concept schemes) in thesauri on the other. During document processing, PPX receives RDF-formatted metadata from the collector and processes it by looking up the values in the thesauri.

There are 3 common outcomes:

  • 1 matching concept found

    If a value matches a concept, that concept is associated with the metadata field and thus with the document itself, bringing along all the other information bound to the concept.

  • No matching concept found

    Metadata that does not match any concept label is added to the thesaurus for later reorganization and rework. Even so, identical new metadata values in the same field are collected under a single "free concept" right away and are therefore aligned. Although the system does not yet "know" anything more about the new value, the value already helps to organize documents and make them easier to find.

  • More than 1 matching concept found

    Metadata for which more than one matching concept is found is disambiguated automatically: the system maps the value to the first matching concept and warns about the further possible matches. This can be improved by taking inter-concept relations from the thesaurus into account, similar to the approach described for extraction below.

Finally, all mapped concepts are added to the metadata and handed on to the indexer, which uses them to build a better search index.
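The three outcomes above can be sketched in a few lines of Python. This is only an illustration and not PPX code: the in-memory `label_index` dictionary, the `free_concepts` store, the `map_value` function and all concept URIs are assumptions invented for this example, and the real lookup runs against the PPX index.

```python
from collections import defaultdict

# Hypothetical label index for one concept scheme: lowercased label -> concept URIs.
label_index = defaultdict(list)
label_index["vienna"].append("http://example.org/concept/vienna-austria")
label_index["paris"].extend([
    "http://example.org/concept/paris-france",
    "http://example.org/concept/paris-texas",
])

# Hypothetical store of free concepts, keyed by (field, normalized value).
free_concepts = {}


def map_value(field, value):
    """Map one metadata value to a concept and return (concept, warnings)."""
    key = value.strip().lower()
    matches = label_index.get(key, [])
    if len(matches) == 1:
        # Exactly one matching concept found.
        return matches[0], []
    if not matches:
        # No match: identical new values in the same field share one free concept.
        free = free_concepts.setdefault((field, key), f"free:{field}/{key}")
        return free, []
    # More than one match: take the first and warn about the others.
    warnings = [f"'{value}' also matches {uri}" for uri in matches[1:]]
    return matches[0], warnings
```

Calling `map_value("city", "Paris")` in this sketch returns the first Paris concept together with one warning, while two documents carrying the unknown value "Graz" in the same field end up with the same free concept.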

Metadata Extraction

In addition to the (semi)structured metadata explicitly provided by document authors, PPX is also designed to find new metadata in unstructured document text. For this it uses a mixed approach of NLP (natural language processing) techniques and statistics-based heuristics.

In a first step, the document text is analysed, and single words and multi-word phrases are collected from it and weighted according to their position and prominence in the text. In a second step, these words and phrases are looked up in a special index constructed from the thesauri.
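The first step can be illustrated with a minimal Python sketch. It is not the PPX implementation: the function name `collect_candidates`, the simple tokenizer and the position-based weighting formula are assumptions chosen for this example, and a real analysis would also handle stop words, word forms and document structure.

```python
import re
from collections import Counter


def collect_candidates(text, max_len=3):
    """Collect phrases of 1..max_len words with a simple position-based weight."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    weights = Counter()
    total = len(words)
    for start in range(total):
        # Assumed heuristic: phrases near the beginning of the text count more.
        position_boost = 2.0 - start / total
        for length in range(1, max_len + 1):
            if start + length > total:
                break
            phrase = " ".join(words[start:start + length])
            # Repeated phrases accumulate weight, which models prominence.
            weights[phrase] += position_boost
    return weights
```

The returned counter maps every word and phrase to its accumulated weight, so a term that appears early and often ends up with a higher weight than one that appears once near the end.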

This special index, the 'extraction model', is optimized for:

  • Fast lookups of large numbers of words and phrases.

  • Considering relations between words found in the text.

  • Considering relations between concepts and their different kinds of labels in the thesauri.

  • Bringing this all together and calculating score values on matches between words from the text and labels from the thesaurus.

The result of this second step is a list of words and phrases ordered by their significance for the given text. Its elements fall into the following kinds:

  • Words and phrases mapped to concepts

    These concepts replace the plain words and again bring all their semantic power with them.

  • Words and phrases that have not been mapped to concepts but are still significant for the text

    They are possible candidates for new concepts, or for new labels of existing concepts, and may be added to the thesaurus. As with mapping, these "free concepts" are already useful for improving metadata.
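The lookup and the split into mapped concepts and free concepts can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual extraction model: the `extraction_model` dictionary, the label kinds "preferred" and "alternative", the `LABEL_FACTOR` values and the `min_free_weight` threshold are all hypothetical, and relations between words and between concepts are left out.

```python
# Hypothetical extraction model: label -> (concept URI, label kind).
extraction_model = {
    "semantic search": ("http://example.org/concept/semantic-search", "preferred"),
    "thesauri": ("http://example.org/concept/thesaurus", "alternative"),
}

# Assumed factors: a preferred-label match counts more than an alternative one.
LABEL_FACTOR = {"preferred": 1.0, "alternative": 0.8}


def classify(candidates, min_free_weight=2.0):
    """Split weighted phrases into mapped concepts and free-concept candidates."""
    concepts, free = {}, {}
    for phrase, weight in candidates.items():
        hit = extraction_model.get(phrase)
        if hit is not None:
            uri, kind = hit
            # Several phrases may point to the same concept; add up their scores.
            concepts[uri] = concepts.get(uri, 0.0) + weight * LABEL_FACTOR[kind]
        elif weight >= min_free_weight:
            # Unmapped but significant: candidate for a free concept.
            free[phrase] = weight

    def by_score(pair):
        return pair[1]

    return (sorted(concepts.items(), key=by_score, reverse=True),
            sorted(free.items(), key=by_score, reverse=True))
```

Given a dictionary of weighted phrases, `classify` returns two lists ordered by score: the matched concepts and the remaining significant phrases that could become free concepts. Phrases below the threshold are dropped.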

Finally, all extracted concepts (and possibly free concepts) are added to the document as metadata (e.g. as tags) and handed on to the indexer, which uses them to build a better search index.