Thesaurus Based Disambiguation of Annotated Concepts

A frequently observed phenomenon in controlled vocabularies such as thesauri is ambiguous terms, that is, different concepts sharing the same label. Ambiguous terms lead to wrong annotations in the text extraction process.

The PoolParty Extractor can distinguish such occurrences based on the thesaurus structure and the local context of the ambiguous concepts.

Disambiguation Process Details

The following example explains the applied method:

In the thesaurus there are two concepts, 'Data mart' and 'Data mining', and both share the alternative label 'DM':

23902098.png

The method takes into account all other concepts found in the vicinity of the ambiguous label in a given text and evaluates how close they are to each candidate concept in the thesaurus.

For example, 'Data mart' has a related concept 'OLAP cube', whereas 'Data mining' is related to the concept 'SEMMA' in the thesaurus.

23902100.png

If one of those concepts occurs near the term 'DM' in the text, the system can decide how the term should be annotated, that is, whether it should return 'Data mining' or 'Data mart'. This improves the annotation quality of PoolParty's text mining feature.
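
The idea can be illustrated with a small Python sketch. It is not PoolParty's actual implementation, whose scoring is not described here; it assumes that "closeness" is the number of hops between two concepts over the enabled relations. The toy graph and the names `THESAURUS`, `distance` and `disambiguate` are hypothetical.

```python
from collections import deque

# Toy thesaurus as an undirected graph (hypothetical data). An edge stands for
# any relation enabled for disambiguation, here only 'has related'.
THESAURUS = {
    "Data mart": {"OLAP cube"},
    "OLAP cube": {"Data mart"},
    "Data mining": {"SEMMA"},
    "SEMMA": {"Data mining"},
}


def distance(graph, start, goal):
    """Breadth-first search: number of hops from start to goal, or None."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == goal:
                return hops + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    return None


def disambiguate(graph, candidates, context):
    """Keep the candidates that are closest to any concept found in the context.

    If no candidate is connected to the context, all candidates are kept.
    """
    scores = {}
    for candidate in candidates:
        hops = [distance(graph, candidate, concept) for concept in context]
        hops = [h for h in hops if h is not None]
        if hops:
            scores[candidate] = min(hops)
    if not scores:
        return list(candidates)
    best = min(scores.values())
    return [c for c in candidates if scores.get(c) == best]


candidates = ["Data mart", "Data mining"]  # both carry the label 'DM'
print(disambiguate(THESAURUS, candidates, ["OLAP cube"]))
print(disambiguate(THESAURUS, candidates, ["SEMMA"]))
```

In this sketch, a text mentioning 'OLAP cube' selects 'Data mart', a text mentioning 'SEMMA' selects 'Data mining', and a text with no connected concept leaves the ambiguity unresolved.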

You can define which relationships in the thesaurus should be considered to calculate distances among the concepts.

  1. Click CORPORA in the main menu.

  2. Select Disambiguation Settings.

    26411164.png

    The Disambiguation Settings dialogue opens.

  3. Select the Enable Disambiguation checkbox.

  4. Enable the relevant relation types. The most common relation types to consider are:

    • 'has broader / has narrower' (this is how concepts are related hierarchically),

    • 'is top concept in scheme / has top concept' (this is how concepts are related to concept schemes),

    • 'has related' (this is how concepts are related in a non-hierarchical manner).

    Other SKOS properties and custom properties can also be included. For more information on custom properties, see Create Custom Relations.

    Dismabiguation-Settings.jpg
  5. Refresh the extraction model for the changes to take effect. For more information, see Create an Extraction Model.

Example for Annotation 1

Now this text can be annotated: 'An OLAP cube is a specialization around a DM.'

The API call without the disambiguation parameter looks as follows:

http://localhost/extractor/api/extract?projectId=1DBCCDFA-41C8-0001-BC24-BA4E1BF03AE0&language=en&numberOfTerms=0&text=An%20OLAP%20cube%20is%20a%20specialization%20around%20a%20DM.

The result of this call is shown below. Besides 'OLAP cube', three concepts that all have the alternative label 'DM' are returned:

{
        "concepts": [
                {
                        "language": "en",
                        "id": "1DBCCDFA-41C8-0001-BC24-BA4E1BF03AE0:http://dbpedia.org/resource/OLAP_cube@en",
                        "prefLabel": "OLAP cube",
                        "score": 100,
                        "frequencyInDocument": 1,
                        "conceptSchemes": [
                                {
                                        "title": "Business intelligence",
                                        "uri": "http://dbpedia.org/resource/Category:Business_intelligence"
                                }
                        ],
                        "altLabels": [
                                "Olap cube",
                                "Cube (disambiguation)"
                        ],
                        "project": "1DBCCDFA-41C8-0001-BC24-BA4E1BF03AE0",
                        "uri": "http://dbpedia.org/resource/OLAP_cube"
                },
                {
                        "language": "en",
                        "id": "1DBCCDFA-41C8-0001-BC24-BA4E1BF03AE0:http://dbpedia.org/resource/Dimensional_modeling@en",
                        "prefLabel": "Dimensional modeling",
                        "score": 14,
                        "frequencyInDocument": 1,
                        "conceptSchemes": [
                                {
                                        "title": "Business intelligence",
                                        "uri": "http://dbpedia.org/resource/Category:Business_intelligence"
                                }
                        ],
                        "altLabels": [
                                "DM"
                        ],
                        "project": "1DBCCDFA-41C8-0001-BC24-BA4E1BF03AE0",
                        "uri": "http://dbpedia.org/resource/Dimensional_modeling"
                },
                {
                        "language": "en",
                        "id": "1DBCCDFA-41C8-0001-BC24-BA4E1BF03AE0:http://dbpedia.org/resource/Data_mart@en",
                        "prefLabel": "Data mart",
                        "score": 14,
                        "frequencyInDocument": 1,
                        "conceptSchemes": [
                                {
                                        "title": "Business intelligence",
                                        "uri": "http://dbpedia.org/resource/Category:Business_intelligence"
                                }
                        ],
                        "altLabels": [
                                "DM",
                                "Datamart",
                                "Data market",
                                "Mart"
                        ],
                        "project": "1DBCCDFA-41C8-0001-BC24-BA4E1BF03AE0",
                        "uri": "http://dbpedia.org/resource/Data_mart"
                },
                {
                        "language": "en",
                        "id": "1DBCCDFA-41C8-0001-BC24-BA4E1BF03AE0:http://dbpedia.org/resource/Data_mining@en",
                        "prefLabel": "Data mining",
                        "score": 14,
                        "frequencyInDocument": 1,
                        "conceptSchemes": [
                                {
                                        "title": "Data mining",
                                        "uri": "http://dbpedia.org/resource/Category:Data_mining"
                                }
                        ],
                        "altLabels": [
                                "Knowledge Discovery in Databases",
                                "Subject-based data mining",
                                "Data miner",
                                "Information-mining",
                                "Predictive software",
                                "Pattern mining",
                                "Information mining",
                                "Knowledge discovering in databases",
                                "Data-mining",
                                "Artificial Intelligence in Data Mining",
                                "Predictive Analytics Software",
                                "DATA MINING",
                                "Web data mining",
                                "Knowledge discovery in databases",
                                "DM",
                                "Datamining",
                                "Datamine",
                                "Visual Data Mining",
                                "Data Mining",
                                "Usage mining",
                                "Mining (disambiguation)",
                                "KDD",
                                "Knowledge mining",
                                "Pattern Mining"
                        ],
                        "project": "1DBCCDFA-41C8-0001-BC24-BA4E1BF03AE0",
                        "uri": "http://dbpedia.org/resource/Data_mining"
                }
        ]
}

The same call but with the parameter 'disambiguate=true':

http://localhost/extractor/api/extract?projectId=1DBCCDFA-41C8-0001-BC24-BA4E1BF03AE0&language=en&numberOfTerms=0&disambiguate=true&text=An%20OLAP%20cube%20is%20a%20specialization%20around%20a%20DM.

Now only the correct concept, 'Data mart', is returned for the label 'DM':

{
        "concepts": [
                {
                        "language": "en",
                        "id": "1DBCCDFA-41C8-0001-BC24-BA4E1BF03AE0:http://dbpedia.org/resource/OLAP_cube@en",
                        "prefLabel": "OLAP cube",
                        "score": 100,
                        "frequencyInDocument": 1,
                        "conceptSchemes": [
                                {
                                        "title": "Business intelligence",
                                        "uri": "http://dbpedia.org/resource/Category:Business_intelligence"
                                }
                        ],
                        "altLabels": [
                                "Olap cube",
                                "Cube (disambiguation)"
                        ],
                        "project": "1DBCCDFA-41C8-0001-BC24-BA4E1BF03AE0",
                        "uri": "http://dbpedia.org/resource/OLAP_cube"
                },
                {
                        "language": "en",
                        "id": "1DBCCDFA-41C8-0001-BC24-BA4E1BF03AE0:http://dbpedia.org/resource/Data_mart@en",
                        "prefLabel": "Data mart",
                        "score": 14,
                        "frequencyInDocument": 1,
                        "conceptSchemes": [
                                {
                                        "title": "Business intelligence",
                                        "uri": "http://dbpedia.org/resource/Category:Business_intelligence"
                                }
                        ],
                        "altLabels": [
                                "DM",
                                "Datamart",
                                "Data market",
                                "Mart"
                        ],
                        "project": "1DBCCDFA-41C8-0001-BC24-BA4E1BF03AE0",
                        "uri": "http://dbpedia.org/resource/Data_mart"
                }
        ]
}
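
As a minimal sketch, the two request URLs used in this example could be assembled in Python as follows. The helper name `build_extract_url` and its argument names are assumptions made for illustration; only the query parameters (`projectId`, `language`, `numberOfTerms`, `disambiguate`, `text`) and the base URL are taken from the calls shown above.

```python
from urllib.parse import quote, urlencode

# Base URL as used in the example calls; adjust host and path to your installation.
BASE_URL = "http://localhost/extractor/api/extract"


def build_extract_url(project_id, text, language="en",
                      number_of_terms=0, disambiguate=False):
    """Build the extraction URL, optionally with 'disambiguate=true'."""
    params = [
        ("projectId", project_id),
        ("language", language),
        ("numberOfTerms", number_of_terms),
    ]
    if disambiguate:
        params.append(("disambiguate", "true"))
    params.append(("text", text))
    # quote_via=quote encodes spaces as %20 instead of '+'.
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


project = "1DBCCDFA-41C8-0001-BC24-BA4E1BF03AE0"
sentence = "An OLAP cube is a specialization around a DM."
print(build_extract_url(project, sentence))
print(build_extract_url(project, sentence, disambiguate=True))
```

The resulting strings can be passed to any HTTP client. Authentication, if your installation requires it, is not covered by this sketch.
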
Example for Annotation 2

Another test uses the text 'SEMMA mainly focuses on the modeling tasks of DM projects, leaving the business aspects out.'

With disambiguation enabled, the call returns only 'Data mining' for the ambiguous label 'DM':

{
        "concepts": [
                {
                        "language": "en",
                        "id": "1DBCCDFA-41C8-0001-BC24-BA4E1BF03AE0:http://dbpedia.org/resource/SEMMA@en",
                        "prefLabel": "SEMMA",
                        "score": 100,
                        "frequencyInDocument": 1,
                        "conceptSchemes": [
                                {
                                        "title": "Data mining",
                                        "uri": "http://dbpedia.org/resource/Category:Data_mining"
                                },
                                {
                                        "title": "Business intelligence",
                                        "uri": "http://dbpedia.org/resource/Category:Business_intelligence"
                                }
                        ],
                        "project": "1DBCCDFA-41C8-0001-BC24-BA4E1BF03AE0",
                        "uri": "http://dbpedia.org/resource/SEMMA"
                },
                {
                        "language": "en",
                        "id": "1DBCCDFA-41C8-0001-BC24-BA4E1BF03AE0:http://dbpedia.org/resource/Data_mining@en",
                        "prefLabel": "Data mining",
                        "score": 36,
                        "frequencyInDocument": 1,
                        "conceptSchemes": [
                                {
                                        "title": "Data mining",
                                        "uri": "http://dbpedia.org/resource/Category:Data_mining"
                                }
                        ],
                        "altLabels": [
                                "Knowledge Discovery in Databases",
                                "Subject-based data mining",
                                "Data miner",
                                "Information-mining",
                                "Predictive software",
                                "Pattern mining",
                                "Information mining",
                                "Knowledge discovering in databases",
                                "Data-mining",
                                "Artificial Intelligence in Data Mining",
                                "Predictive Analytics Software",
                                "DATA MINING",
                                "Web data mining",
                                "Knowledge discovery in databases",
                                "DM",
                                "Datamining",
                                "Datamine",
                                "Visual Data Mining",
                                "Data Mining",
                                "Usage mining",
                                "Mining (disambiguation)",
                                "KDD",
                                "Knowledge mining",
                                "Pattern Mining"
                        ],
                        "project": "1DBCCDFA-41C8-0001-BC24-BA4E1BF03AE0",
                        "uri": "http://dbpedia.org/resource/Data_mining"
                }
        ]
}
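
To check programmatically which concept was chosen for an ambiguous label, the response can be filtered by alternative label. The following Python sketch assumes the response structure shown above; the function name `concepts_for_label` is hypothetical, and the sample is a shortened version of the response, not real additional output.

```python
import json


def concepts_for_label(response, label):
    """Return the prefLabels of all returned concepts that carry `label` as an altLabel."""
    return [
        concept["prefLabel"]
        for concept in response.get("concepts", [])
        if label in concept.get("altLabels", [])
    ]


# Shortened sample of the response above (most fields and altLabels omitted).
sample = json.loads("""
{
    "concepts": [
        {"prefLabel": "SEMMA", "score": 100},
        {"prefLabel": "Data mining", "score": 36, "altLabels": ["DM", "KDD"]}
    ]
}
""")

print(concepts_for_label(sample, "DM"))
```

If the list contains more than one entry, the label is still ambiguous in the result, as in the first call of Example 1.
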
Negation - Exclude Annotations Based on Other Occurring Concepts

You can also state explicitly that if a certain concept appears in a text, a related ambiguous concept is never considered in that context.

To illustrate this principle, let's assume the following thesaurus:

thesaurus_negative_relation.png

There are two concepts, 'Resource Description Framework' and 'Reality distortion field', that share the same label 'RDF':

24577217.png

Assume that whenever 'Steve Jobs' occurs in the text, 'RDF' does not mean 'Resource Description Framework'.

To express this, define a custom relation in a custom scheme that links 'Steve Jobs' to 'Resource Description Framework'. For more information on custom relations, see Create Custom Relations.

In this example, we defined a relation 'negative' in the custom scheme 'Disambiguation' (you can use any relation you define).

Then select this relation on the Negation tab of the Disambiguation Settings dialogue and refresh the extraction model.

Now a text like this can be annotated:

'The RDF was said by Andy Hertzfeld to be Steve Jobs' ability to convince himself and others to believe almost anything with a mix of charm, charisma, bravado, hyperbole, marketing, appeasement and persistence.'

The call to the API without disambiguation:
http://localhost/extractor/api/extract?projectId=1DCDEF6F-680D-0001-9AB3-FB1BF82067A0&language=en&numberOfTerms=0&text=%22The%20RDF%20was%20said%20by%20Andy%20Hertzfeld%20to%20be%20Steve%20Jobs%27%20ability%20to%20convince%20himself%20and%20others%20to%20believe%20almost%20anything%20with%20a%20mix%20of%20charm,%20charisma,%20bravado,%20hyperbole,%20marketing,%20appeasement%20and%20persistence.%22

In the result both concepts are returned:

{
        "concepts": [
                {
                        "language": "en",
                        "id": "1DCDEF6F-680D-0001-9AB3-FB1BF82067A0:http://localhost/Negativedisambiguation/Resource_Description_Framework@en",
                        "prefLabel": "Resource Description Framework",
                        "score": 100,
                        "frequencyInDocument": 1,
                        "conceptSchemes": [
                                {
                                        "title": "Semantic Web",
                                        "uri": "http://localhost/Negativedisambiguation/Semantic_Web"
                                }
                        ],
                        "altLabels": [
                                "RDF"
                        ],
                        "project": "1DCDEF6F-680D-0001-9AB3-FB1BF82067A0",
                        "uri": "http://localhost/Negativedisambiguation/Resource_Description_Framework"
                },
                {
                        "language": "en",
                        "id": "1DCDEF6F-680D-0001-9AB3-FB1BF82067A0:http://localhost/Negativedisambiguation/Reality_distortion_field@en",
                        "prefLabel": "Reality distortion field",
                        "score": 100,
                        "frequencyInDocument": 1,
                        "conceptSchemes": [
                                {
                                        "title": "Sociological terminology",
                                        "uri": "http://localhost/Negativedisambiguation/Sociological_terminology"
                                }
                        ],
                        "altLabels": [
                                "RDF"
                        ],
                        "project": "1DCDEF6F-680D-0001-9AB3-FB1BF82067A0",
                        "uri": "http://localhost/Negativedisambiguation/Reality_distortion_field"
                },
                {
                        "language": "en",
                        "id": "1DCDEF6F-680D-0001-9AB3-FB1BF82067A0:http://localhost/Negativedisambiguation/Steve_Jobs@en",
                        "prefLabel": "Steve Jobs",
                        "score": 66,
                        "frequencyInDocument": 1,
                        "conceptSchemes": [
                                {
                                        "title": "People",
                                        "uri": "http://localhost/Negativedisambiguation/People"
                                }
                        ],
                        "project": "1DCDEF6F-680D-0001-9AB3-FB1BF82067A0",
                        "uri": "http://localhost/Negativedisambiguation/Steve_Jobs"
                }
        ]
}
The call to the API with disambiguation:
http://localhost/extractor/api/extract?projectId=1DCDEF6F-680D-0001-9AB3-FB1BF82067A0&language=en&numberOfTerms=0&disambiguate=true&text=%22The%20RDF%20was%20said%20by%20Andy%20Hertzfeld%20to%20be%20Steve%20Jobs%27%20ability%20to%20convince%20himself%20and%20others%20to%20believe%20almost%20anything%20with%20a%20mix%20of%20charm,%20charisma,%20bravado,%20hyperbole,%20marketing,%20appeasement%20and%20persistence.%22

Now only the correct concept is returned:

{
        "concepts": [
                {
                        "language": "en",
                        "id": "1DCDEF6F-680D-0001-9AB3-FB1BF82067A0:http://localhost/Negativedisambiguation/Reality_distortion_field@en",
                        "prefLabel": "Reality distortion field",
                        "score": 100,
                        "frequencyInDocument": 1,
                        "conceptSchemes": [
                                {
                                        "title": "Sociological terminology",
                                        "uri": "http://localhost/Negativedisambiguation/Sociological_terminology"
                                }
                        ],
                        "altLabels": [
                                "RDF"
                        ],
                        "project": "1DCDEF6F-680D-0001-9AB3-FB1BF82067A0",
                        "uri": "http://localhost/Negativedisambiguation/Reality_distortion_field"
                },
                {
                        "language": "en",
                        "id": "1DCDEF6F-680D-0001-9AB3-FB1BF82067A0:http://localhost/Negativedisambiguation/Steve_Jobs@en",
                        "prefLabel": "Steve Jobs",
                        "score": 66,
                        "frequencyInDocument": 1,
                        "conceptSchemes": [
                                {
                                        "title": "People",
                                        "uri": "http://localhost/Negativedisambiguation/People"
                                }
                        ],
                        "project": "1DCDEF6F-680D-0001-9AB3-FB1BF82067A0",
                        "uri": "http://localhost/Negativedisambiguation/Steve_Jobs"
                }
        ]
}

Negative indications are just another way to think about disambiguation.

In this example, linking 'Steve Jobs' via skos:related to 'Reality distortion field' would lead to the same result (if skos:related is included in the relations to consider). Sometimes, however, a negation is the more elegant choice for modeling purposes.
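
The negation rule can be sketched in a few lines of Python. This is an illustration of the principle only, not PoolParty's implementation; the mapping `NEGATIVE` and the function `apply_negation` are hypothetical names, with the mapping standing in for the custom 'negative' relation defined above.

```python
# Hypothetical stand-in for the custom 'negative' relation:
# if the key concept occurs, the listed concepts are excluded.
NEGATIVE = {
    "Steve Jobs": {"Resource Description Framework"},
}


def apply_negation(candidates, context, negative_relations):
    """Drop every candidate that a concept found in the context excludes."""
    excluded = set()
    for concept in context:
        excluded |= negative_relations.get(concept, set())
    return [c for c in candidates if c not in excluded]


candidates = ["Resource Description Framework", "Reality distortion field"]
print(apply_negation(candidates, ["Steve Jobs"], NEGATIVE))
print(apply_negation(candidates, [], NEGATIVE))
```

With 'Steve Jobs' in the context, only 'Reality distortion field' remains. Without it, both candidates are kept.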