Create a Document Corpus

This section contains a short guide on how to create a document corpus in PoolParty.

Creating a document corpus is crucial if you want to enrich your thesaurus with terms from relevant documents.

You can create a document corpus from documents (PDF, DOC, PowerPoint, TXT, etc.) that are related to your project's domain, or by harvesting RSS feeds, websites and DBpedia resources linked to the concepts in your thesaurus. Any file type supported by the Apache Tika library can be used. This topic shows how to create such a corpus.
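As a rough illustration only, a local folder can be pre-sorted by file extension before you upload anything. The sketch below is plain Python and not part of PoolParty. The function name split_candidates and the extension list are hypothetical; the list is a small sample based on the formats named above, while the authoritative list is whatever Apache Tika supports.

```python
from pathlib import Path

# Assumption: a hand-picked sample of extensions, NOT the full Apache Tika list.
CANDIDATE_SUFFIXES = {".pdf", ".doc", ".ppt", ".txt"}


def split_candidates(paths):
    """Split file paths into (likely usable, check manually) by extension."""
    usable, unknown = [], []
    for path in map(Path, paths):
        # Compare case-insensitively so "Report.PDF" matches ".pdf".
        target = usable if path.suffix.lower() in CANDIDATE_SUFFIXES else unknown
        target.append(path.name)
    return usable, unknown


if __name__ == "__main__":
    usable, unknown = split_candidates(["docs/Report.PDF", "notes.txt", "scan.xyz"])
    print("Likely usable:", usable)
    print("Check manually:", unknown)
```

Files in the second list are not necessarily unsupported; they only fall outside the sample list and should be checked against the Tika documentation.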

Prerequisites

  • An opened PoolParty project with an existing thesaurus.

How to Create a Document Corpus in PoolParty

PoolParty offers two ways to create a corpus: the main menu or the right-click option.

In your opened PoolParty project, follow these steps:

  1. Click Corpora in the Main Menu, then select Create Corpus.
    • Alternatively, click the Corpus Management icon (2), right-click the Corpora node and select Create Corpus (3).

  2. The Create Corpus dialogue opens (4). Define the following options here:
    • Title: Enter a title of your choice. It is used as the name of this corpus throughout the project, for example in the Hierarchy Tree.
    • Language: Choose the language of the corpus documents from the drop-down list. The available languages are determined by the project's Language Settings.
    • Server: Select the server your corpus should be saved to. The available choices depend on how PoolParty was set up with the Semantic Middleware Configurator.
  3. Click Create Corpus (5).


Now you can start uploading documents to your corpus.

Multiple corpora are available for PoolParty Enterprise Server and PoolParty Semantic Integrator.

PoolParty Basic Server and PoolParty Advanced Server allow one corpus per project. In addition, Basic Server limits the number and size of documents.