Background: Capitalisation of RDF Language Tags in Graph Databases

Abstract

This section documents the allowed and preferred capitalisation of RDF language tags, and their support in popular graph databases.

Background

Most text fields in PoolParty have the RDF datatype "langString", which means their values carry a language tag, e.g. "Taxonomy"@en-US. The standard dictates that language tags are case-insensitive: @en-us, @en-US, @En-uS and any other capitalisations represent the same language tag. But while triple stores support the case-insensitive semantics of language tags, they differ in how that support is implemented.

RDF4J, which PoolParty uses for internal storage, used to store all language tags in lower case regardless of how they were entered. Starting with the version distributed with PoolParty 6.1 and 6.2, it stores language tags as delivered by the application, but interprets them in a case-independent manner, as required.

This has exposed an inconsistency in PoolParty: two-part tags (language-country, as in the examples above) are sometimes generated in lower case, and sometimes with the country part in capitals. Since capitalisation is irrelevant in principle, everything still works fine: strings are handled correctly, and language tags are even normalised to the conventional form (region in upper case) in the PoolParty user interface. However, exported data retains the inconsistent capitalisation, which can cause problems for downstream tools that compare language tags as plain, case-sensitive strings.
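The following minimal Python sketch illustrates the kind of downstream problem described above. The sample labels and both function names are hypothetical and serve only as an illustration; they are not part of PoolParty or of any triple store.

```python
# Hypothetical exported labels as (text, language tag) pairs.
labels = [("Taxonomy", "en-US"), ("Thesaurus", "en-us"), ("Taxonomie", "de")]


def naive_filter(labels, tag):
    """Select labels by exact string comparison of the language tag."""
    return [text for text, lang in labels if lang == tag]


def case_insensitive_filter(labels, tag):
    """Select labels by comparing language tags without regard to case."""
    wanted = tag.lower()
    return [text for text, lang in labels if lang.lower() == wanted]


print(naive_filter(labels, "en-US"))             # ['Taxonomy']
print(case_insensitive_filter(labels, "en-US"))  # ['Taxonomy', 'Thesaurus']
```

A tool that behaves like the first function silently loses the label tagged @en-us, which is why consistent capitalisation in exports matters even though the tags are semantically identical.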

Note

This issue affects PoolParty versions 6.1 and 6.2.

In PoolParty 7, language tags are generated with consistent capitalisation, namely in the recommended interchange format, e.g. @en-US. The issue may also arise when data with inconsistent capitalisation is imported from external sources into PoolParty 6.1 or any later version.

The section Normalising the Capitalisation of Language Tags contains detailed steps on how to transform a PoolParty project to consistent capitalisation.

Standards and Graph Database Support

Applicable Standards

RDF requires language tags to be well-formed according to the BCP 47 specification. According to BCP 47, a language tag consists of an obligatory language part plus several optional components, including script and region (country) identifiers, an extension space, and a private use area. The parts are connected with hyphens. For example, sr-Latn-ME represents Serbian as spoken in Montenegro, written in the Latin script. The most common forms involve the language alone (@en, @ar) or language plus region (@en-US, @sr-ME).

BCP 47 further specifies that language tags must be treated as case-insensitive (meaning that sr-Latn-ME is semantically identical to sr-latn-me, SR-LaTn-mE, etc.), but recommends that they be presented in the capitalisation style of the registry for each component, namely:

Lower case for the language identifier.
Title case (first letter capital) for the script identifier.
Upper case for the region identifier.
Lower case for any additional components.

We will refer to this formatting style as canonical capitalisation.
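The capitalisation rules listed above can be sketched in Python as follows. This is an illustrative approximation, not the procedure PoolParty uses: the function name is hypothetical, it identifies components by position and length only (two letters for a region, four characters for a script), and it does not validate the tag against the language subtag registry. Following the BCP 47 formatting convention, everything after a single-character subtag (an extension or private use marker) is kept in lower case.

```python
def canonical_language_tag(tag: str) -> str:
    """Return the tag in canonical capitalisation (approximation by subtag length)."""
    formatted = []
    after_singleton = False
    for index, subtag in enumerate(tag.split("-")):
        if index == 0 or after_singleton:
            # Language identifier, or part of an extension or private use sequence.
            formatted.append(subtag.lower())
        elif len(subtag) == 2:
            # Two-letter subtag: region identifier.
            formatted.append(subtag.upper())
        elif len(subtag) == 4:
            # Four-character subtag: script identifier, first letter capital.
            formatted.append(subtag.capitalize())
        else:
            # Any additional component.
            formatted.append(subtag.lower())
        if len(subtag) == 1:
            after_singleton = True
    return "-".join(formatted)


print(canonical_language_tag("SR-LaTn-mE"))  # sr-Latn-ME
print(canonical_language_tag("en-us"))       # en-US
```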

Note

The RDF specification includes the statement that 'The value space of language tags is always in lower case'. This is a technical device to achieve case insensitivity, and does not conflict with the BCP 47 recommendation for capitalisation as given above.

Support by Graph Databases

The graph databases commonly used with PoolParty take different approaches to the handling of language tags. For example:

The current version of RDF4J stores language tags as formatted by the application, but applies case-insensitive semantics to matching and distinguishing them. Earlier versions stored language tags in lower case.
Stardog converts language tags to lower case and stores them in this form.
Virtuoso avoids capitalisation inconsistencies by only accepting language tags in the canonical capitalisation style (@en-GB etc.).

Because Stardog and Virtuoso each represent language tags in only one way, no conversion is possible within these stores. The only opportunity for conversion is during export.

But in RDF4J, the default triple store for PoolParty taxonomies, it is possible to have a mix of canonical and non-canonical capitalisation in the language tags of a single graph or project.
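To check whether an export actually contains such a mix, the language tags can be grouped by their lower-case form. The sketch below makes several assumptions: the data is available as N-Triples-like text lines, the regular expression is a simplified pattern rather than a full RDF parser, and the sample triples and the function name are hypothetical.

```python
import re
from collections import defaultdict

# Simplified pattern for a quoted literal followed by a language tag,
# e.g. "Taxonomy"@en-US. It is not a complete N-Triples grammar.
LANG_TAG = re.compile(r'"(?:[^"\\]|\\.)*"@([A-Za-z]+(?:-[A-Za-z0-9]+)*)')


def find_mixed_capitalisation(lines):
    """Map each lower-cased tag to its spellings, if more than one spelling occurs."""
    spellings = defaultdict(set)
    for line in lines:
        for tag in LANG_TAG.findall(line):
            spellings[tag.lower()].add(tag)
    return {key: sorted(forms) for key, forms in spellings.items() if len(forms) > 1}


# Hypothetical sample data for illustration.
sample = [
    '<http://example.org/c1> <http://www.w3.org/2004/02/skos/core#prefLabel> "Taxonomy"@en-US .',
    '<http://example.org/c2> <http://www.w3.org/2004/02/skos/core#prefLabel> "Thesaurus"@en-us .',
    '<http://example.org/c3> <http://www.w3.org/2004/02/skos/core#prefLabel> "Taxonomie"@de .',
]

print(find_mixed_capitalisation(sample))  # {'en-us': ['en-US', 'en-us']}
```

An empty result suggests that every language tag in the input is spelled in a single way, although it does not show that this spelling is the canonical one.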

The page Normalising the Capitalisation of Language Tags shows how to transform a PoolParty project to consistent capitalisation.
