An active issue in the RDF 1.2 Working Group is whether to mandate the
syntactic form of language tags.
Currently, RDF says that language tags are compared case-insensitively
and also that "Lexical representations of language tags MAY be converted
to lower case." It's actually in RDF Semantics, as D-entailment.
The issue has come to prominence because of work on RDF canonicalization
and hashing (RCH), which works on the syntax of graphs. Signing and
Verifiable Credentials then rely on RCH. So it is the syntax that
matters, not the value.
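
To illustrate why syntax matters for hashing, here is a minimal sketch.
It is not the RCH algorithm itself: it simply hashes a single
N-Triples-style line with SHA-256. The class name, the helper names and
the example IRIs are made up for the illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustration only: hashing the written form of a triple, not real RCH.
final class LangTagHashDemo {
    // Hypothetical helper: one N-Triples-style line with the given language tag.
    static String triple(String langTag) {
        return "<http://example/s> <http://example/p> \"abc\"@" + langTag + " .";
    }

    // SHA-256 over the UTF-8 bytes of the text.
    static byte[] sha256(String text) {
        try {
            return MessageDigest.getInstance("SHA-256")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Value-level comparison: language tags compared case-insensitively.
    static boolean sameValue(String tag1, String tag2) {
        return tag1.equalsIgnoreCase(tag2);
    }

    // Syntax-level comparison: do the two lines hash to the same digest?
    static boolean sameHash(String line1, String line2) {
        return MessageDigest.isEqual(sha256(line1), sha256(line2));
    }
}
```

With this sketch, sameValue("fr", "FR") is true but
sameHash(triple("fr"), triple("FR")) is false, which is the gap a
mandated syntactic form would close.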
The language tags RFC, RFC 5646 (AKA BCP 47, which is a soft link to the
current latest RFC on the subject), says that
"case distinctions do not carry meaning in language tags".
Canonicalization of Language Tags [2] is different and out of scope - it
means use the preferred names, for example, for countries. That requires
access to the global registry. It is not being considered by the RDF 1.2
WG.
A complication is that the RDF-defined preferred presentation of
language tags is not the same as the RFC. RDF says "lower case".
In the RFC, each subtag has a preferred form [1]. It's "en-US",
"en-Latn-US", ... The preferred form normalization rules only need the
language tag string. Different subtags are identified by length, or for
the language part - by being first.
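
As a rough sketch of the RFC 5646 section 2.1.1 formatting rule: all
subtags are lower case, except that two-letter subtags become upper case
and four-letter subtags become title case when they are neither at the
start of the tag nor after a singleton. The class and method names are
made up for the illustration (this is not Jena code), and the input is
assumed to be a well-formed language tag already.

```java
import java.util.Locale;

// Sketch only: format a well-formed language tag per RFC 5646, section 2.1.1.
// No registry lookup and no validation is done.
final class LangTagFormat {
    static String format(String langTag) {
        String[] subtags = langTag.split("-");
        StringBuilder out = new StringBuilder(langTag.length());
        boolean afterSingleton = false;   // true once an "x", "u", ... subtag is seen
        for (int i = 0; i < subtags.length; i++) {
            // Default for every subtag: lower case.
            String s = subtags[i].toLowerCase(Locale.ROOT);
            if (i > 0) {
                out.append('-');
                if (!afterSingleton) {
                    if (s.length() == 2) {
                        // Two letters, not first: region style, e.g. "US".
                        s = s.toUpperCase(Locale.ROOT);
                    } else if (s.length() == 4) {
                        // Four letters, not first: script style, e.g. "Latn".
                        s = Character.toUpperCase(s.charAt(0)) + s.substring(1);
                    }
                }
            }
            out.append(s);
            if (s.length() == 1) {
                // Everything after a singleton stays lower case.
                afterSingleton = true;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(format("EN-latn-us"));
        System.out.println(format("AZ-LATN-X-LATN"));
    }
}
```

Only the string itself is inspected, which is the point made above: the
formatting rule, unlike canonicalization [2], needs no registry access.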
Jena defaults to treating language tags as given.
"abc"@fr and "abc"@FR are different RDF terms.
The Jena parsers have options to choose what to do.
RDFParserBuilder.langTagLowerCase()
RDFParserBuilder.langTagCanonical()
For Jena 5:
1/ Do you think Jena should switch to one form?
1a/ Should that be in the parsers, defaulting to a single output form?
1b/ Or should all langtags get normalized as the node is created?
2/ Which is your preferred form for RDF 1.2?
2a/ Lower case
2b/ RFC-preferred form
2c/ No change
If Jena changes to have a common format of language tags, persistent
data that has language tags in it will have to be reloaded.
Jena has a LangTag parser, org.apache.jena.riot.web.LangTag.
The current codebase defers to the JDK's Locale.Builder for
normalization, but that can be intercepted, without application
involvement, if the JDK proves insufficient.
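
For reference, a minimal sketch of what deferring to Locale.Builder can
look like using only the JDK. The wrapper class, the method name and the
fallback of returning the input unchanged are assumptions made for the
illustration, not a description of Jena's actual code.

```java
import java.util.IllformedLocaleException;
import java.util.Locale;

// Sketch only: JDK-based formatting of a language tag.
final class JdkLangTag {
    static String normalize(String langTag) {
        try {
            // Parse the tag, then write it back out in the JDK's BCP 47 form.
            return new Locale.Builder()
                    .setLanguageTag(langTag)
                    .build()
                    .toLanguageTag();
        } catch (IllformedLocaleException ex) {
            // Assumed policy for this sketch: leave ill-formed tags as given.
            return langTag;
        }
    }
}
```

Note that the JDK may do more than change case for some inputs (for
example, grandfathered tags are converted), which is one reason an
interception point is useful.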
Andy
RFC 5646
https://datatracker.ietf.org/doc/html/rfc5646
[1] Formatting of language tags:
https://datatracker.ietf.org/doc/html/rfc5646#section-2.1.1
[2] Canonicalization of language tags:
https://datatracker.ietf.org/doc/html/rfc5646#page-66
[3]
https://issues.apache.org/jira/browse/JENA-1384