An active issue in the RDF 1.2 Working Group is whether to mandate the syntactic form of language tags.

Currently, RDF says that language tags are compared case-insensitively and also that "Lexical representations of language tags MAY be converted to lower case." It's actually in RDF Semantics, as D-entailment.

The issue has come to prominence because of work on RDF canonicalization and hashing (RCH), which works on the syntax of graphs. Signing and Verifiable Credentials then rely on RCH. So the syntax matters, not the value.
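
To illustrate why the syntax matters for hashing, here is a minimal sketch. It is not the RCH algorithm: it only hashes one N-Triples-style line with SHA-256, and the class name, the helper names and the example IRIs are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical demo class; not part of Jena or of RCH.
final class LangTagHashDemo {

    // Build one N-Triples-style line whose object carries the given language tag.
    static String triple(String langTag) {
        return "<http://example/s> <http://example/p> \"abc\"@" + langTag + " .";
    }

    // SHA-256 over the UTF-8 bytes of the text.
    static byte[] sha256(String text) {
        try {
            return MessageDigest.getInstance("SHA-256")
                                .digest(text.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    static boolean sameHash(String a, String b) {
        return Arrays.equals(sha256(a), sha256(b));
    }

    public static void main(String[] args) {
        // Tags as given: same value under case-insensitive comparison, different bytes.
        System.out.println(sameHash(triple("fr"), triple("FR")));   // false

        // Tags lower-cased before serialization: identical bytes.
        System.out.println(sameHash(triple("fr".toLowerCase(Locale.ROOT)),
                                    triple("FR".toLowerCase(Locale.ROOT))));   // true
    }
}
```

Any single agreed form would give the same effect; lower case is used here only because it is the simplest to write down.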

The language tags RFC, RFC 5646 (AKA BCP 47, which is a soft link to the latest RFC on the subject), says that
"case distinctions do not carry meaning in language tags"

Canonicalization of language tags [2] is different and out of scope - it means using the preferred names, for example, for countries. That requires access to the global registry. It is not being considered by the RDF 1.2 WG.

A complication is that the RDF-defined preferred presentation of language tags is not the same as the RFC's. RDF says "lower case".

In the RFC, each subtag has a preferred form [1]. It's "en-US", "en-Latn-US", ... The preferred-form normalization rules only need the language tag string. Different subtags are identified by length or, for the language part, by being first.
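
As a rough sketch of the RFC 5646 section 2.1.1 formatting rules: everything is lower case, except that two-letter subtags are upper case and four-letter subtags are title case when they are neither first nor after a singleton. The class and method names are made up for this illustration, it does no validation, and it reads "after a singleton" as "anywhere after", which is an assumption. It is not Jena's implementation.

```java
import java.util.Locale;

// Hypothetical helper; applies only the case rules, with no validation.
final class LangTagFormat {

    static String format(String langTag) {
        String[] subtags = langTag.split("-");
        StringBuilder out = new StringBuilder();
        boolean afterSingleton = false;   // true once an extension/private-use singleton is seen

        for (int i = 0; i < subtags.length; i++) {
            // Default: lower case.
            String subtag = subtags[i].toLowerCase(Locale.ROOT);

            if (i > 0 && !afterSingleton) {
                if (subtag.length() == 2) {
                    // Two-letter subtag, not first: upper case (region).
                    subtag = subtag.toUpperCase(Locale.ROOT);
                } else if (subtag.length() == 4) {
                    // Four-letter subtag, not first: title case (script).
                    subtag = Character.toUpperCase(subtag.charAt(0)) + subtag.substring(1);
                }
            }
            if (subtag.length() == 1) {
                afterSingleton = true;
            }
            if (i > 0) {
                out.append('-');
            }
            out.append(subtag);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(format("EN-us"));        // en-US
        System.out.println(format("en-latn-us"));   // en-Latn-US
        System.out.println(format("EN-CA-X-CA"));   // en-CA-x-ca
    }
}
```

Note that nothing here needs the registry: the string alone is enough, which is what separates this from canonicalization.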

Jena defaults to treating language tags as given.
"abc"@fr and "abc"@FR are different RDF terms.

The Jena parsers have options to choose what to do.

  RDFParserBuilder.langTagLowerCase()
  RDFParserBuilder.langTagCanonical()


For Jena5:

1/ Do you think Jena should switch to one form?
   1a/ Should that be in the parsers, with the default set to output a single form?
   1b/ Or should all langtags get normalized as the node is created?

2/ Which is your preferred form for RDF 1.2?
   2a/ Lower case
   2b/ RFC-preferred form
   2c/ No change


If Jena changes to have a common format of language tags, persistent data that has language tags in it will have to be reloaded.

Jena has a LangTag parser, org.apache.jena.riot.web.LangTag.
The current codebase defers to the JDK's Locale.Builder for normalization, but that can be intercepted, without application involvement, if the JDK proves insufficient.
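
For reference, this is roughly what going through Locale.Builder looks like using only the JDK. It is a sketch, not the Jena code path: the class and method names are made up, and returning the input unchanged for an ill-formed tag is an assumption made for this example.

```java
import java.util.IllformedLocaleException;
import java.util.Locale;

// Hypothetical helper; uses only java.util.Locale from the JDK.
final class JdkLangTag {

    static String normalize(String langTag) {
        try {
            // setLanguageTag rejects ill-formed tags;
            // toLanguageTag writes the tag back out in BCP 47 form.
            return new Locale.Builder().setLanguageTag(langTag).build().toLanguageTag();
        } catch (IllformedLocaleException e) {
            // Assumption for this sketch: keep an ill-formed tag as given.
            return langTag;
        }
    }

    public static void main(String[] args) {
        System.out.println(normalize("EN-us"));        // en-US
        System.out.println(normalize("en-latn-us"));   // en-Latn-US
        System.out.println(normalize("FR"));           // fr
    }
}
```

The JDK can do more than change case for some tags, which is one reason an interception point is useful.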

    Andy


RFC 5646
https://datatracker.ietf.org/doc/html/rfc5646

[1] Formatting of language tags:
https://datatracker.ietf.org/doc/html/rfc5646#section-2.1.1

[2] Canonicalization of language tags
https://datatracker.ietf.org/doc/html/rfc5646#page-66

[3]
https://issues.apache.org/jira/browse/JENA-1384
