Corpus Management - Overview

Abstract

This section is dedicated to the corpus management functionality in PoolParty.

The corpus management functionality in PoolParty helps you extend thesauri with relevant terms derived from documents that match the domain of your thesauri. In addition, corpora are used to improve entity extraction: they provide improved scoring of terms and concepts and offer shadow concept suggestions based on co-occurrences.

You can also create a new thesaurus from scratch based on a corpus.

PoolParty's Corpus Management Functionality

To enrich your thesaurus with terms using the corpus management function, you can process documents (PDF, DOC, PowerPoint, TXT, etc.) that are related to your project's domain, or harvest RSS feeds, websites, and DBpedia resources linked to the concepts in your thesaurus.

PoolParty's corpus management tightly integrates the PoolParty Extractor into the thesaurus management process. It uses the extractor's ability to analyse text and extract terms and phrases, which are then matched against the concepts in your thesaurus. You can then integrate extracted domain-specific terms into your thesaurus as new concepts or as synonyms of existing concepts.
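The matching step described above can be pictured with a small, self-contained sketch. It is not PoolParty's extractor or API: the `Concept` class and the `normalise` and `split_extracted_terms` functions are hypothetical names introduced for illustration, and comparing normalised labels for exact equality is an assumed simplification of whatever matching the extractor actually performs.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A thesaurus concept with a preferred label and optional synonyms (hypothetical model)."""
    pref_label: str
    alt_labels: list[str] = field(default_factory=list)


def normalise(label: str) -> str:
    """Lower-case the label and collapse whitespace so labels compare loosely."""
    return " ".join(label.lower().split())


def split_extracted_terms(terms, concepts):
    """Separate extracted terms into thesaurus matches and unmatched terms.

    Returns (matches, unmatched): matches maps each matched term to the
    preferred label of its concept; unmatched keeps the remaining terms
    in their original order.
    """
    # Index every preferred label and synonym by its normalised form.
    index = {}
    for concept in concepts:
        for label in [concept.pref_label, *concept.alt_labels]:
            index[normalise(label)] = concept.pref_label

    matches, unmatched = {}, []
    for term in terms:
        key = normalise(term)
        if key in index:
            matches[term] = index[key]
        else:
            unmatched.append(term)
    return matches, unmatched


# Illustrative data only; not taken from a real PoolParty project.
thesaurus = [
    Concept("Mojito", alt_labels=["Mojito cocktail"]),
    Concept("Rum"),
]
extracted = ["mojito  cocktail", "Rum", "muddled mint", "lime wedge"]

matches, unmatched = split_extracted_terms(extracted, thesaurus)
print(matches)    # terms already covered by the thesaurus
print(unmatched)  # terms a reviewer might add as new concepts or synonyms
```

In this sketch, the unmatched terms play the role of extracted terms that a thesaurus manager could review and integrate as new concepts or as synonyms of existing ones.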

The terms you decide to select and use for integration into your thesaurus from the extracted terms are called 'Candidate Concepts' in PoolParty. Find details about their handling and the possible workflow here: Candidate Concepts List

The following image shows an example Corpus Management view in which a corpus called 'Cocktails' has already been created:

[Image: example Corpus Management view showing the 'Cocktails' corpus]

To learn in detail how to use the Corpus Management feature, refer to the following topics:

Note

Multiple corpora are available for PoolParty Enterprise Server and PoolParty Semantic Integrator.

PoolParty Advanced Server allows one corpus per project.

You can also manage your corpus or corpora programmatically, or automate their handling remotely, by using the PoolParty Corpus API services, such as: Web Service Method: Create a New Corpus, Web Service Method: Upload a Document to a Corpus, Method: analyse corpus, Web Service Method: Request Concept Matches of a Corpus, etc.

In addition, you can significantly improve the extraction results for free terms by using a corpus. Find details here: Free Terms Extraction Based on a Text Corpus

Tip

If you would like to learn more about this topic, please watch this PoolParty Academy Tutorial video:

2.4 Corpus Management Basics

If the video is not available, you can sign up for the PoolParty Academy.