Background: Capitalization of RDF Language Tags in Graph Databases

Abstract

This section documents the allowed and preferred capitalization of RDF language tags, and how popular graph databases support them.

Background

Most text fields in PoolParty have the RDF datatype rdf:langString, which means each value carries a language tag, e.g. "Taxonomy"@en-US. The standard dictates that language tags are case-insensitive: @en-us, @en-US, @En-uS and any other capitalization represent the same language tag. Although triple stores support the case-insensitive semantics of language tags, they differ in how that support is implemented.
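As a minimal illustration of the case-insensitive semantics, the sketch below compares two language tags by lowercasing both. The function name same_language_tag is hypothetical and not part of PoolParty or RDF4J. The sketch assumes the tags are plain ASCII strings without the leading @.

```python
def same_language_tag(first: str, second: str) -> bool:
    """Return True if two language tags are equal, ignoring case.

    Language tags are restricted to ASCII letters, digits and hyphens,
    so a simple lower() is sufficient for the comparison.
    """
    return first.lower() == second.lower()


# All capitalizations of the same tag compare equal.
print(same_language_tag("en-us", "en-US"))
print(same_language_tag("En-uS", "en-US"))
# Different tags stay different.
print(same_language_tag("en-US", "en-GB"))
```

A tool that compares tags with plain string equality instead would treat @en-us and @en-US as two different languages.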

RDF4J, which PoolParty uses for internal storage, used to store all language tags in lower case regardless of how they were entered. Starting with the version distributed with PoolParty 6.1 and 6.2, it stores language tags as delivered by the application, but interprets them in a case-independent manner, as required.

This has exposed an inconsistency in PoolParty: two-part tags (language-country, as in the examples above) are sometimes generated in lower case, and sometimes with the country part in capitals. Since capitalization is irrelevant in principle, everything still works: strings are handled correctly, and language tags are even normalized to the conventional form (region in upper case) in the PoolParty user interface. However, exported data retains the inconsistent capitalization, which can cause problems for downstream tools that compare language tags as plain, case-sensitive strings.
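One way to check whether a data set is affected is to group its language tags by their lower-case form and report every group with more than one spelling. The following is a sketch under stated assumptions: the helper find_inconsistent_tags is hypothetical, and it expects the tags to have already been extracted from the export as plain strings (how that extraction is done is outside the scope of this example).

```python
from collections import defaultdict


def find_inconsistent_tags(tags):
    """Group tags case-insensitively and return groups with mixed spellings.

    The result maps the lower-cased tag to the sorted list of distinct
    spellings found for it. Tags with a single spelling are omitted.
    """
    spellings = defaultdict(set)
    for tag in tags:
        spellings[tag.lower()].add(tag)
    return {
        key: sorted(variants)
        for key, variants in spellings.items()
        if len(variants) > 1
    }


# Hypothetical list of tags collected from an exported project.
exported_tags = ["en-US", "en-us", "de", "en-US", "de"]
print(find_inconsistent_tags(exported_tags))
# → {'en-us': ['en-US', 'en-us']}
```

An empty result means every tag in the input is spelled consistently, although not necessarily in the recommended capitalization.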

Note

This issue affects PoolParty versions 6.1 and 6.2.

In PoolParty 7, language tags are generated with consistent capitalization, namely in the recommended interchange format, e.g. @en-US. The issue can still arise when inconsistently capitalized data from external sources is imported into PoolParty 6.1 or any later version.

The section Normalising the Capitalisation of Language Tags contains detailed steps on how to transform a PoolParty project to consistent capitalization.

Standards and Graph Database Support

Applicable Standards

RDF requires language tags to be well-formed according to the BCP 47 specification. According to BCP 47, a language tag consists of an obligatory language part plus several optional components, including script and region (country) identifiers, an extension space, and a private use area. The parts are connected with hyphens. For example, sr-Latn-ME represents Serbian as spoken in Montenegro, written in the Latin script. The most common forms are the language alone (@en, @ar) or language plus region (@en-US, @sr-ME).

BCP 47 further specifies that language tags must be treated as case-insensitive (meaning that sr-Latn-ME is semantically identical to sr-latn-me, SR-LaTn-mE, etc.), but recommends that they be presented in the capitalization style of the registry for each component, namely:

Lower case for the language identifier.
Title case (first letter capital) for the script identifier.
Upper case for the region identifier.
Lower case for any additional components.

We will refer to this formatting style as canonical capitalization.
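The capitalization rules above can be sketched in a few lines of Python. This is an illustrative sketch, not the normalization PoolParty itself performs. The function name canonical_language_tag is hypothetical. It assumes the input is a well-formed tag without the leading @, and it identifies components by position and length, following the formatting convention in RFC 5646 (part of BCP 47): two-letter subtags after the first are upper case, four-letter subtags after the first are title case, and everything following a single-character subtag (extension or private use) is lower case. It does not validate subtags against the language subtag registry.

```python
def canonical_language_tag(tag: str) -> str:
    """Return the tag in the capitalization recommended by BCP 47.

    Assumes a well-formed tag; no registry validation is performed.
    """
    result = []
    after_singleton = False  # True once an extension/private-use marker is seen
    for index, subtag in enumerate(tag.split("-")):
        if index == 0 or after_singleton:
            # Language identifier, or anything inside an extension
            # or private-use sequence: lower case.
            result.append(subtag.lower())
        elif len(subtag) == 2:
            # Region identifier: upper case.
            result.append(subtag.upper())
        elif len(subtag) == 4:
            # Script identifier: first letter capital, rest lower case.
            result.append(subtag.capitalize())
        else:
            # Any additional component: lower case.
            result.append(subtag.lower())
        if len(subtag) == 1:
            after_singleton = True
    return "-".join(result)


print(canonical_language_tag("SR-LaTn-mE"))  # → sr-Latn-ME
print(canonical_language_tag("en-us"))       # → en-US
print(canonical_language_tag("EN-ca-X-CA"))  # → en-CA-x-ca
```

Deciding by position and length keeps the sketch independent of any registry data, at the cost of not detecting tags that are well-formed but use unregistered subtags.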

Note

The RDF specification includes the statement that 'The value space of language tags is always in lower case'. This is a technical device to achieve case insensitivity, and does not conflict with the BCP 47 recommendation for capitalization as given above.

The page Normalising the Capitalisation of Language Tags shows how to transform a PoolParty project to consistent capitalization.
