Normalizing the Capitalization of Language Tags

Abstract

Normalizing the Capitalization of Language Tags

The Problem

Most text values in PoolParty have the RDF datatype rdf:langString, which means they are annotated with a language tag, for example "Taxonomy"@en-US.

In PoolParty versions 6.1 and 6.2, two-part language tags can be generated and stored with inconsistent capitalization. This is not an issue within PoolParty, since RDF language tags are case-insensitive as required by the applicable standards, but when the data is exported, the inconsistent casing can confuse naive text-processing tools.

This section shows how to normalize the case of language tags for the purpose of exporting data in a consistent form.

For more on language tags and the scope of the problem, see Background: Capitalization of RDF Language Tags in Graph Databases.

Note

This issue affects PoolParty versions 6.1 and 6.2. In PoolParty 7, language tags are generated with consistent capitalization in the recommended interchange format, for example @en-US. The issue may also arise when inconsistently capitalized data from external sources is imported into PoolParty 6.1 or any later version.
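
To judge whether a project is affected at all, it can help to list the language tags it actually contains. The following read-only query is a minimal sketch, not part of the original fix: the variable names ?tag and ?uses are illustrative, and the explicit PREFIX line is included on the assumption that the SPARQL shell does not predefine rdf:.

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

# List each distinct language tag (as stored) and how often it is used.
SELECT ?tag (COUNT(*) AS ?uses)
WHERE
{ GRAPH ?graph
  {
    ?s ?p ?o .
    # Only language-tagged strings carry a language tag
    FILTER ( DATATYPE(?o) = rdf:langString )
    BIND(LANG(?o) AS ?tag)
  }
}
GROUP BY ?tag
ORDER BY ?tag
```

If the result lists both en-us and en-US, for example, the project contains inconsistently capitalized tags.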

Normalizing Language Tag Capitalization in RDF4J

This section shows how to adjust the capitalization of language tags through the SPARQL interface. The conversion is hindered by the fact that although RDF4J will store language tags in any case we specify, it does treat the case variants as equivalent.

Consequently it is difficult to persuade it to execute a conversion that in principle changes nothing. Even CONSTRUCT queries (for exporting the data) will ignore attempts to change the case of language tags.

Because RDF4J will not replace a language tag with a case variant of itself, we must do the conversion in two steps:

Query 1 will rewrite, for example, "Taxonomy"@en-us to "Taxonomy"@en-US-x-fixme.
Query 2 will rewrite it again to "Taxonomy"@en-US.
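
The tag rewriting performed by Query 1 can be previewed without modifying any data. The sketch below applies the same string functions as casefix-step1.sparql in a read-only SELECT; the variable names ?oldlang, ?region and ?templang are illustrative, and the PREFIX line assumes that rdf: is not predefined in your SPARQL shell.

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

# Dry run: show which tags step 1 would rewrite, and to what.
SELECT DISTINCT ?oldlang ?templang
WHERE
{ GRAPH ?graph
  {
    ?s ?p ?o .
    FILTER ( DATATYPE(?o) = rdf:langString && CONTAINS(LANG(?o), "-") )
    BIND(LANG(?o) AS ?oldlang)
    # Split the tag at the first hyphen
    BIND(STRBEFORE(?oldlang, "-") AS ?lgmain)
    BIND(STRAFTER(?oldlang, "-") AS ?region)
    # Keep only tags whose second part is not already uppercase
    FILTER ( ?region != UCASE(?region) )
    BIND(CONCAT(?lgmain, "-", UCASE(?region), "-x-fixme") AS ?templang)
  }
}
```

Because the query only computes strings and inserts nothing, the caching behavior described in the warning below should not affect it.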

Warning

RDF4J keeps deleted strings in its cache. If you insert a case variant of a deleted string, it will store the original form instead of the desired one.

To defeat the cache when working through the user interface, you have to restart PoolParty between the two queries. Fortunately this only needs to be done once, no matter how many PoolParty projects you normalize.

Note

This solution only works for two-part tags (language-REGION). Do not use it if your data contains tags with more than two components.
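
One way to check for such tags before applying the fix is a read-only query along the following lines. It is a sketch under the assumption that a tag with two or more hyphens has more than two components; the variable name ?tag is illustrative, and the PREFIX line assumes that rdf: is not predefined in your SPARQL shell.

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

# List language tags that have more than two components.
SELECT DISTINCT ?tag
WHERE
{ GRAPH ?graph
  {
    ?s ?p ?o .
    FILTER ( DATATYPE(?o) = rdf:langString )
    BIND(LANG(?o) AS ?tag)
    # Two hyphens with at least one character between them
    FILTER ( REGEX(?tag, "-[^-]+-") )
  }
}
```

Run it before step 1 of the fix: once step 1 has been applied, the temporary -x-fixme tags will match as well. If the query returns any tags on an unmodified project, the fix described here should not be used on it.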

Solution Summary

Run the query casefix-step1.sparql, included below. Commit. Repeat for each project that needs conversion.
Restart PoolParty.
Run the query casefix-step2.sparql, included below. Commit. Repeat for each project that needs conversion.

Detailed Instructions

The SPARQL queries below must be run separately on each PoolParty project that needs to be repaired. Because string literals are shared across graphs, it is necessary to process all RDF graphs of the project at once.

Log in to PoolParty.
Open the SPARQL shell: expand the Tools menu, then Admin Scripts, and click the link PP SPARQL shell.
From the drop-down Users Repository, select the repository of the PoolParty project to be repaired.
- For example, if the project is called 'Hello world', find and select 'Project: Hello world'.
Click Connect.
Delete all text from the SPARQL shell form, and paste in the contents of casefix-step1.sparql (see below).
Click Run Query. Ensure that the response is 'Query succeeded'.
Click Commit Updates.
- If you wish, you may inspect the interim result with the SELECT query shown at the end of these instructions.
If you plan to normalize more than one PoolParty project, connect to each project's repository, execute casefix-step1.sparql, and commit.
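Warning
Do not run the script twice on the same project. A second run would treat the temporary -x-fixme suffix as part of the region and rewrite the tags again, corrupting them.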
Shut down the PoolParty server.
- Shutting down the server clears the cache, allowing the next step to work as intended.
- This is necessary because the deleted labels with the lowercased language tags are still in the cache, and will be resurrected if we try to insert the same labels with an equivalent (differently capitalized) language tag.
Restart the server and wait until it is fully operational.
Your SPARQL shell session is now invalid. Click Disconnect, then click Connect to initiate a new connection. (The User Repository selector should still be showing your last project's repository.)
Delete everything in the SPARQL shell, and paste in the contents of the query casefix-step2.sparql (see below).
Click Run Query, check that it reports success, and click Commit Updates.
If you have multiple projects to normalize, connect to each project's repository and apply step 2 of the fix.
The conversion is finished. You can use the following query to inspect the result:

inspect-langstrings.sparql

# To inspect your thesaurus only, change ?graph to the URI of your thesaurus graph.
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT *
WHERE
{ GRAPH ?graph
  {
    ?s ?p ?o .
    FILTER ( DATATYPE(?o) = rdf:langString && CONTAINS(LANG(?o), "-") ) .
  }
} LIMIT 200
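
To confirm that step 2 removed every temporary tag, a count query such as the following may be useful. It is a sketch rather than part of the original fix: the variable name ?leftover is illustrative, and the PREFIX line assumes that rdf: is not predefined in your SPARQL shell.

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

# Count literals that still carry the temporary -x-fixme suffix.
SELECT (COUNT(*) AS ?leftover)
WHERE
{ GRAPH ?graph
  {
    ?s ?p ?o .
    FILTER ( DATATYPE(?o) = rdf:langString && CONTAINS(LANG(?o), "-x-fixme") )
  }
}
```

After a completed conversion the count should be zero. A non-zero count suggests that step 2 has not yet been run and committed for this project.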

SPARQL Queries to Normalize Capitalization

casefix-step1.sparql

# Step 1 of the fix: Convert and temporarily tag for distinctness
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

DELETE
{ GRAPH ?graph { ?s ?p ?o } }
INSERT
{ GRAPH ?graph { ?s ?p ?corrected } }
WHERE
{ GRAPH ?graph
  {
    ?s ?p ?o .
    # Restrict: langStrings with a two-part language tag
    FILTER ( DATATYPE(?o) = rdf:langString && CONTAINS(LANG(?o), "-") ) .

    # Build new language identifier
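    # Split the tag at the first hyphen into language and region
    BIND(STRBEFORE(LANG(?o), "-") AS ?lgmain)
    BIND(STRAFTER(LANG(?o), "-") AS ?region)
    # Only "fix" it if the region is not already uppercase
    FILTER( ?region != UCASE(?region) )

    # Append a temporary suffix so the new tag is not a mere case variant of the old one
    BIND(CONCAT(?lgmain, "-", UCASE(?region), "-x-fixme") AS ?newlang)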
    # Rebuild the langString
    BIND(STRLANG(STR(?o), ?newlang) AS ?corrected)
  }
}

The PoolParty server must be restarted between the two queries. Detailed instructions are given above.

casefix-step2.sparql

# Step 2 of the fix: Remove the temporary tags.
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

DELETE
{ GRAPH ?graph { ?s ?p ?o } }
INSERT
{ GRAPH ?graph { ?s ?p ?corrected } }
WHERE
{ GRAPH ?graph
  {
    ?s ?p ?o .
    # Restrict: langStrings that carry the temporary suffix from step 1
    FILTER ( DATATYPE(?o) = rdf:langString && CONTAINS(LANG(?o), "-x-fixme") ) .

    # Strip the temporary suffix to obtain the final language tag
    BIND(STRBEFORE(LANG(?o), "-x-fixme") AS ?real)
    BIND(STRLANG(STR(?o), ?real) AS ?corrected)
  }
}

Results

After running the queries in the previous section, PoolParty's graph database contains literals with consistently capitalized language tags that can be exported in the usual way.

Note

Continued use of PoolParty 6.1 or 6.2 may introduce more inconsistently capitalized language tags. You can avoid this problem by upgrading to PoolParty 7, which uses the conventional capitalization for all new language tags.
