Normalizing language tags

Andy Seaborne Thu, 28 Sep 2023 02:36:28 -0700

An active issue in the RDF 1.2 Working Group is whether to mandate the syntactic form of language tags.

Currently, RDF says that language tags are compared case-insensitively and also that "Lexical representations of language tags MAY be converted to lower case." It's actually in RDF Semantics as D-entailment.

The issue has come to prominence because of work on RDF canonicalization and hashing (RCH), which works on the syntax of graphs. Signing and Verifiable Credentials then rely on RCH. So the syntax matters, not the value.
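
To illustrate why syntax matters for hashing, here is a minimal sketch. It is not the RCH algorithm; it just hashes the written form of a literal with SHA-256. The class and method names are made up for this example. Two literals that differ only in the case of the language tag hash differently until the tag is put into one agreed form.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Jena or RCH.
final class SyntaxHash {
    // SHA-256 over the UTF-8 bytes of the literal exactly as written.
    static byte[] sha256(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            return md.digest(text.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException ex) {
            // Every conforming JDK is required to provide SHA-256.
            throw new IllegalStateException(ex);
        }
    }

    // True if the two written forms produce the same hash.
    static boolean sameHash(String a, String b) {
        return Arrays.equals(sha256(a), sha256(b));
    }

    public static void main(String[] args) {
        // Same value under case-insensitive comparison, different syntax.
        System.out.println(sameHash("\"abc\"@fr", "\"abc\"@FR"));
        // Identical syntax once both tags are written the same way.
        System.out.println(sameHash("\"abc\"@fr", "\"abc\"@fr"));
    }
}
```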

The language tags RFC 5646 (AKA BCP-47, which is a soft link to the current latest RFC on the subject) says that

"case distinctions do not carry meaning in language tags"

Canonicalization of Language Tags [2] is different and out of scope - it means use the preferred names, for example, for countries. That requires access to the global registry. It is not being considered by the RDF 1.2 WG.

A complication is that the RDF-defined preferred presentation of language tags is not the same as the RFC. RDF says "lower case".

In the RFC, each subtag has a preferred form. It's "en-US", "en-Latn-US"... The preferred form normalization rules only need the language tag string. Subtags are identified by their length and position: the first subtag is the language.
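
As a rough sketch of the case conventions in RFC 5646 section 2.1.1 [1], working from the tag string alone with no registry lookup: everything is lower case, except that two-letter subtags become upper case and four-letter subtags become title case, unless they are the first subtag or come after a singleton. The class name is hypothetical, the input is assumed to be well-formed, and treating every subtag after the first singleton as lower case is my reading of the RFC text. This is not the Jena implementation.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; no validation of the tag.
final class LangTagFormat {
    static String format(String tag) {
        String[] subtags = tag.split("-");
        StringBuilder out = new StringBuilder();
        boolean seenSingleton = false;
        for (int i = 0; i < subtags.length; i++) {
            // Default for every subtag is lower case.
            String s = subtags[i].toLowerCase(Locale.ROOT);
            if (i > 0 && !seenSingleton) {
                if (s.length() == 2) {
                    // Two-letter subtag, not first: region, e.g. "US".
                    s = s.toUpperCase(Locale.ROOT);
                } else if (s.length() == 4) {
                    // Four-letter subtag, not first: script, e.g. "Latn".
                    s = Character.toUpperCase(s.charAt(0)) + s.substring(1);
                }
            }
            if (s.length() == 1) {
                // Extension or private-use singleton: the rest stays lower case.
                seenSingleton = true;
            }
            if (i > 0) {
                out.append('-');
            }
            out.append(s);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(format("EN-us"));       // en-US
        System.out.println(format("en-latn-us"));  // en-Latn-US
        System.out.println(format("EN-CA-X-CA"));  // en-CA-x-ca
    }
}
```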


Jena defaults to treating language tags as given.
"abc"@fr and "abc"@FR are different RDF terms.

The Jena parsers have options to choose what to do.

  RDFParserBuilder.langTagLowerCase()
  RDFParserBuilder.langTagCanonical()


For Jena5:

1/ Do you think Jena should switch to one form?
   1a/ Should that be in parsers, setting the default to a single output form?
   1b/ Or should all langtags get normalized as the node is created?

2/ Which is your preferred form for RDF 1.2?
   2a/ Lower case
   2b/ RFC-preferred form
   2c/ No change

If Jena changes to have a common format of language tags, persistent data that has language tags in it will have to be reloaded.


Jena has a LangTag parser, org.apache.jena.riot.web.LangTag.

The current codebase defers to Locale.Builder in the JDK for normalization, but that can be intercepted, without application involvement, if the JDK is insufficient.
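
For reference, a JDK-only sketch of what deferring to Locale.Builder looks like. The class name is made up, and returning the input unchanged for an ill-formed tag is an assumption of this example, not a statement about what Jena does. Note that the JDK may do more than adjust case for some tags, so this is not purely the formatting rules of the RFC.

```java
import java.util.IllformedLocaleException;
import java.util.Locale;

// Hypothetical helper for illustration only; this is not Jena's LangTag code.
final class JdkLangTag {
    static String normalize(String tag) {
        try {
            // Parse the tag, then write it back out in the JDK's form.
            Locale locale = new Locale.Builder().setLanguageTag(tag).build();
            return locale.toLanguageTag();
        } catch (IllformedLocaleException ex) {
            // Assumption for this sketch: leave ill-formed tags as given.
            return tag;
        }
    }

    public static void main(String[] args) {
        System.out.println(normalize("EN-us"));      // en-US
        System.out.println(normalize("en-latn-us")); // en-Latn-US
    }
}
```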


    Andy


RFC 5646
https://datatracker.ietf.org/doc/html/rfc5646

[1] Formatting of language tags:
https://datatracker.ietf.org/doc/html/rfc5646#section-2.1.1

[2] Canonicalization of language tags
https://datatracker.ietf.org/doc/html/rfc5646#page-66

[3]
https://issues.apache.org/jira/browse/JENA-1384
