
Corpus Quality


This section is a short guide to the corpus quality information that PoolParty provides after a corpus analysis has been performed.

The quality of extracted terms depends strongly on the provided corpus. The following parameters affect corpus quality:

  • Number and size of documents:

    A corpus with a large number of small documents provides better results than a corpus with a small number of large documents.

  • Relevancy of documents:

    The documents should be selected as a representative sample of the domain, document set or topic your thesaurus should represent.

PoolParty indicates the quality of a corpus in the Corpus Analysis Summary panel.

This panel provides the following information:

  • Number of Extracted Concepts, in the first row of the summary.

  • Number of Extracted Terms, in the second row.

  • Extracted Concepts Occurrences, which shows how often the extracted concepts have been found in your corpus.

  • Extracted Terms Occurrences, which shows how often the extracted terms have been found in your corpus.

  • The Status icon, which indicates the quality of the corpus and is based on the Extracted Terms Occurrences.

Corpus Quality Status Examples

Below 250,000 Extracted Terms Occurrences, the corpus quality is considered low. This is indicated by a red Status icon, and extracted terms are not shown in the document details view.


In the range of 500,000 to 1,000,000 Extracted Terms Occurrences, the corpus quality is considered medium. This is indicated by a yellow Status icon, and extracted terms are shown in the documents details view.


Above 1,000,000 Extracted Terms Occurrences, the corpus quality is considered high. This is indicated by a green Status icon, and extracted terms are shown in the Documents Details tab.

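The thresholds above can be summarized as a small lookup. The following Python sketch is purely illustrative and is not part of PoolParty: the function names `corpus_status` and `terms_shown` are hypothetical, and treating both ends of the 500,000 to 1,000,000 range as inclusive is an assumption. This guide does not state a status for counts from 250,000 up to, but not including, 500,000, so the sketch returns `None` there rather than guessing.

```python
from typing import Optional

# Thresholds taken from the examples in this guide.
RED_BELOW = 250_000      # below this: red Status icon
MEDIUM_MIN = 500_000     # start of the medium (yellow) range
MEDIUM_MAX = 1_000_000   # end of the medium range; above this: green


def corpus_status(term_occurrences: int) -> Optional[str]:
    """Map an Extracted Terms Occurrences count to a Status icon color.

    Hypothetical helper. Returns None for counts the guide does not
    cover (250,000 up to 499,999).
    """
    if term_occurrences < 0:
        raise ValueError("term_occurrences must not be negative")
    if term_occurrences < RED_BELOW:
        return "red"
    if MEDIUM_MIN <= term_occurrences <= MEDIUM_MAX:  # inclusivity assumed
        return "yellow"
    if term_occurrences > MEDIUM_MAX:
        return "green"
    return None  # range not specified in this guide


def terms_shown(status: Optional[str]) -> bool:
    """Extracted terms are shown for yellow and green, not for red."""
    return status in ("yellow", "green")


if __name__ == "__main__":
    for count in (100_000, 750_000, 2_000_000):
        status = corpus_status(count)
        print(count, status, terms_shown(status))
```

A check like this can be useful when estimating whether more documents should be added to a corpus before the analysis is run again.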