Free Terms Extraction Based on a Text Corpus

During corpus analysis, free terms are extracted and scored using statistical methods (see Extracted Terms List in the corpus management section).

These scores indicate how relevant each term is within a given text corpus, and they can be used to improve the scoring of free terms when extracting from a single document. When you set the "corpusScoring" parameter, the relevance scores that the extracted terms have in the corpus are taken into account, so terms with higher relevance in the corpus are ranked higher in the document. This is especially useful for short documents, where term frequencies are low and scoring based on the document alone does not give satisfactory results.

Example text to be analysed:

A five-door version, called Sportback, was launched in November 2011, with sales starting in export markers during spring 2012. The A1 is designed to compete with the Mini (marque) Mini, and Alfa Romeo MiTo. The car is aimed mostly at young, affluent Urban area urban buyers. The A1 is produced at Audi Brussels Audi's Belgian factory in Forest, Belgium Forest, near Brussels.

Call without the "corpusScoring" parameter:

http://[PoolParty Server URL]/extractor/api/extract?projectId=1DBCB738-3DDE-0001-456B-1A80824632E0&language=en&numberOfTerms=10&text=A%20five-door%20version,%20called%20Sportback,%20was%20launched%20in%20November%202011,%20with%20sales%20starting%20in%20export%20markers%20during%20spring%202012.%20The%20A1%20is%20designed%20to%20compete%20with%20the%20Mini%20(marque)%20Mini,%20and%20Alfa%20Romeo%20MiTo.%20The%20car%20is%20aimed%20mostly%20at%20young,%20affluent%20Urban%20area%20urban%20buyers.%20The%20A1%20is%20produced%20at%20Audi%20Brussels%20Audi%27s%20Belgian%20factory%20in%20Forest,%20Belgium%20Forest,%20near%20Brussels.

Results:

{
        "freeTerms": [
                {
                        "textValue": "five-door version called sportback",
                        "score": 100,
                        "frequencyInDocument": 1
                },
                {
                        "textValue": "called sportback was launched",
                        "score": 95,
                        "frequencyInDocument": 1
                },
                {
                        "textValue": "sales starting in export",
                        "score": 77,
                        "frequencyInDocument": 1
                },
                {
                        "textValue": "starting in export markers",
                        "score": 75,
                        "frequencyInDocument": 1
                },
                {
                        "textValue": "export markers during spring",
                        "score": 70,
                        "frequencyInDocument": 1
                },
                {
                        "textValue": "compete with the mini",
                        "score": 52,
                        "frequencyInDocument": 1
                },
                {
                        "textValue": "five-door",
                        "score": 50,
                        "frequencyInDocument": 1
                },
                {
                        "textValue": "five-door version",
                        "score": 50,
                        "frequencyInDocument": 1
                },
                {
                        "textValue": "five-door version called",
                        "score": 50,
                        "frequencyInDocument": 1
                },
                {
                        "textValue": "version",
                        "score": 49,
                        "frequencyInDocument": 1
                }
        ]
}

Call with the "corpusScoring" parameter (the parameter value is the corpus ID as shown in the corpus details):

http://[PoolParty Server URL]/extractor/api/extract?projectId=1DBCB738-3DDE-0001-456B-1A80824632E0&language=en&numberOfTerms=10&corpusScoring=corpus:ace05665-5a18-4162-9f59-d8ea2f4c2226&text=A%20five-door%20version,%20called%20Sportback,%20was%20launched%20in%20November%202011,%20with%20sales%20starting%20in%20export%20markers%20during%20spring%202012.%20The%20A1%20is%20designed%20to%20compete%20with%20the%20Mini%20(marque)%20Mini,%20and%20Alfa%20Romeo%20MiTo.%20The%20car%20is%20aimed%20mostly%20at%20young,%20affluent%20Urban%20area%20urban%20buyers.%20The%20A1%20is%20produced%20at%20Audi%20Brussels%20Audi%27s%20Belgian%20factory%20in%20Forest,%20Belgium%20Forest,%20near%20Brussels.
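Such a request URL can also be assembled programmatically. The following is a minimal sketch using only the Python standard library; the function name build_extract_url, its argument names, and the placeholder host are assumptions made for illustration and are not part of the PoolParty API. Only the path and the query parameter names are taken from the calls shown above.

```python
from urllib.parse import urlencode, quote


def build_extract_url(server_url, project_id, text, language="en",
                      number_of_terms=10, corpus_id=None):
    """Build an extract URL like the ones shown above (hypothetical helper)."""
    params = {
        "projectId": project_id,
        "language": language,
        "numberOfTerms": number_of_terms,
    }
    # Only add corpusScoring when a corpus ID is given, so the same helper
    # covers the call without and the call with the parameter.
    if corpus_id is not None:
        params["corpusScoring"] = corpus_id
    params["text"] = text
    # quote_via=quote encodes spaces as %20 (not "+"); safe=":" keeps the
    # colon in "corpus:<id>" readable, as in the example call.
    query = urlencode(params, quote_via=quote, safe=":")
    return f"{server_url.rstrip('/')}/extractor/api/extract?{query}"


if __name__ == "__main__":
    # "poolparty.example.com" is a placeholder host, not a real server.
    print(build_extract_url(
        "http://poolparty.example.com",
        "1DBCB738-3DDE-0001-456B-1A80824632E0",
        "The A1 is designed to compete with the Mini (marque) Mini, and Alfa Romeo MiTo.",
        corpus_id="corpus:ace05665-5a18-4162-9f59-d8ea2f4c2226",
    ))
```

Unlike the hand-written URL above, this sketch also percent-encodes commas and parentheses in the text, which is still a valid encoding of the same query.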

The corpus contains a few hundred documents on the theme "cars", so terms related to that theme are now scored higher. Results:

{
        "freeTerms": [
                {
                        "textValue": "alfa romeo",
                        "score": 43,
                        "frequencyInDocument": 1
                },
                {
                        "textValue": "version called",
                        "score": 32,
                        "frequencyInDocument": 1
                },
                {
                        "textValue": "mini marque",
                        "score": 27,
                        "frequencyInDocument": 1
                },
                {
                        "textValue": "alfa romeo mito",
                        "score": 27,
                        "frequencyInDocument": 1
                },
                {
                        "textValue": "romeo mito",
                        "score": 22,
                        "frequencyInDocument": 1
                },
                {
                        "textValue": "young affluent urban area",
                        "score": 22,
                        "frequencyInDocument": 1
                },
                {
                        "textValue": "affluent urban area urban",
                        "score": 21,
                        "frequencyInDocument": 1
                },
                {
                        "textValue": "mini",
                        "score": 20,
                        "frequencyInDocument": 2
                },
                {
                        "textValue": "urban area urban buyers",
                        "score": 20,
                        "frequencyInDocument": 1
                },
                {
                        "textValue": "launched in november",
                        "score": 19,
                        "frequencyInDocument": 1
                }
        ]
}
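To compare the two result sets, the responses can be parsed and checked for terms that appear only when corpus scoring is enabled. This is a rough sketch that assumes the response body has the "freeTerms" structure shown above; the function names ranked_terms and newly_surfaced are hypothetical, and the sample data are short excerpts of the two results, not complete responses.

```python
import json


def ranked_terms(response_body):
    """Return (textValue, score) pairs, highest score first."""
    data = json.loads(response_body)
    terms = data.get("freeTerms", [])
    ordered = sorted(terms, key=lambda t: t["score"], reverse=True)
    return [(t["textValue"], t["score"]) for t in ordered]


def newly_surfaced(plain_body, corpus_body):
    """Terms present in the corpus-scored result but not in the plain one."""
    plain_names = {name for name, _ in ranked_terms(plain_body)}
    return [name for name, _ in ranked_terms(corpus_body)
            if name not in plain_names]


# Excerpts of the two results above (two entries each, for brevity).
PLAIN_EXCERPT = """{"freeTerms": [
  {"textValue": "five-door version called sportback", "score": 100, "frequencyInDocument": 1},
  {"textValue": "version", "score": 49, "frequencyInDocument": 1}
]}"""

CORPUS_EXCERPT = """{"freeTerms": [
  {"textValue": "alfa romeo", "score": 43, "frequencyInDocument": 1},
  {"textValue": "mini", "score": 20, "frequencyInDocument": 2}
]}"""

if __name__ == "__main__":
    for name in newly_surfaced(PLAIN_EXCERPT, CORPUS_EXCERPT):
        print(name)
```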