Re: Question about basic vs extended language ranges

Andy Seaborne Sat, 12 Sep 2020 09:32:10 -0700

This is from a discussion this last week:

    https://github.com/TopQuadrant/shacl/issues/100


On 12/09/2020 11:55, Håvard Ottestad wrote:

Hi,

I’ve been trying to get basic language ranges working for the SHACL engine in 
RDF4J and I’ve stumbled upon some differences between how RDF4J and Jena 
implement basic language ranges.

The SPARQL spec points to: https://www.ietf.org/rfc/rfc4647.txt
Specifically sections
  -  2.1.  Basic Language Range
  - 3.3.1.  Basic Filtering

Looking at the ABNF in 2.1.

    language-range   = (1*8ALPHA *("-" 1*8alphanum)) / "*"
    alphanum         = ALPHA / DIGIT

It looks like “*” is legal, “en” is legal and “en-gb” is legal (and even 
“a-ab-abc-12345678-a”). But “*-gb” is not legal and neither is “en-*”.
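As a quick check of that reading, the ABNF can be transcribed into a regular expression. This is only a sketch: the names `BASIC_RANGE` and `is_basic_range` are made up for illustration and are not taken from Jena or RDF4J.

```python
import re

# language-range = (1*8ALPHA *("-" 1*8alphanum)) / "*"
# alphanum       = ALPHA / DIGIT
BASIC_RANGE = re.compile(r"(?:[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*|\*)")


def is_basic_range(text: str) -> bool:
    """Return True if text is a syntactically valid basic language range."""
    return BASIC_RANGE.fullmatch(text) is not None


for candidate in ["*", "en", "en-gb", "a-ab-abc-12345678-a", "*-gb", "en-*"]:
    print(candidate, is_basic_range(candidate))
```

The first four candidates are accepted and the last two are rejected, which agrees with the reading of the grammar above.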

It seems like the range “en” would match a tag “en-gb” and a tag “en”.
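Basic filtering as described in RFC 4647 section 3.3.1 can be sketched in a few lines: a range matches a tag if the two are equal ignoring case, or if the range is a prefix of the tag and the next character in the tag is "-". The function name `basic_match` is hypothetical, and rejecting the empty tag for "*" is an assumption borrowed from how SPARQL's langMatches treats "*".

```python
def basic_match(lang_range: str, tag: str) -> bool:
    """Basic filtering, following a reading of RFC 4647 section 3.3.1."""
    r, t = lang_range.lower(), tag.lower()
    if r == "*":
        # Assumption: "*" matches any non-empty tag, as in SPARQL langMatches.
        return t != ""
    # Exact match, or prefix match that ends on a subtag boundary.
    return t == r or t.startswith(r + "-")


print(basic_match("en", "en"))             # True
print(basic_match("en", "en-gb"))          # True
print(basic_match("en", "eng"))            # False: no "-" after the prefix
print(basic_match("de-DE", "de-Latn-DE"))  # False under basic filtering
```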

I had a deep dive into the langMatch code in Jena and it seems to support “*” 
at any position in the range.

Is Jena supporting part of the extended range specification,

Jena LangMatches supports basic matching as required by SPARQL and SHACL, and does match some cases of "-*" but not properly by full RFC4647. More by accident than design, I suspect.

Calling it "part of extended" is generous. It fails to match "-*" to multiple subtags.


Basic is not completely compatible with extended.

Pattern "de-DE" matches "de-Latn-DE" by extended, but not basic.

Extended is sensitive to the fact that the second subtag, 'script', is 4ALPHA, and 'region' is 2ALPHA or 3DIGIT, so "de-DE" matches like "de-*-DE" on language and region, skipping script. Each part of a language tag has a slightly different syntax, and extended filtering seems to depend on this to do its jump ahead for "-*".
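For comparison, here is a sketch of the extended filtering steps as I read them in RFC 4647 section 3.3.2. The name `extended_match` is hypothetical and this is not the Jena code. In this reading, the one place where subtag shape matters is the singleton check: the jump ahead stops at a single-character subtag such as "x".

```python
def extended_match(lang_range: str, tag: str) -> bool:
    """Extended filtering, following a reading of RFC 4647 section 3.3.2."""
    rs = lang_range.lower().split("-")
    ts = tag.lower().split("-")
    # Step 1: the first subtags must match, unless the range starts with "*".
    if rs[0] != "*" and rs[0] != ts[0]:
        return False
    ri, ti = 1, 1
    # Step 2: walk the remaining range subtags.
    while ri < len(rs):
        if rs[ri] == "*":
            ri += 1            # wildcard: move on to the next range subtag
        elif ti >= len(ts):
            return False       # tag exhausted before the range
        elif rs[ri] == ts[ti]:
            ri += 1            # subtags match: advance both
            ti += 1
        elif len(ts[ti]) == 1:
            return False       # never skip over a singleton
        else:
            ti += 1            # skip a non-matching tag subtag
    # Step 3: every range subtag was accounted for.
    return True


print(extended_match("de-DE", "de-Latn-DE"))    # True
print(extended_match("de-*-DE", "de-Latn-DE"))  # True
print(extended_match("de-DE", "de-x-DE"))       # False
```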

I haven't got my head around the full impact of extended matching. It assumes valid language tags, and invalid (by RFC 5646) language tags exist. In the real world, bad tags are common.

But SPARQL and Turtle have a catch-all parse syntax based on the earlier RFC 3066 and HTTP at the time. And in the real world, bad tags are common.

"a-ab-abc-12345678-a" is not a legal language tag by 5646 or 4646 in several ways; it is legal by 3066.

To add to the language tag fun, RDF and RFC 4646 disagree on the canonical form of language tags.

> or am I missing something? (I have been missing a lot of things lately
> :P so I wouldn’t be surprised).


This? :-)
https://github.com/TopQuadrant/shacl/issues/100#issuecomment-690100566

"""

The NodeFunctions.langMatches code does look like it gets basic matching right (as SPARQL requires), test cases to the contrary welcome, but the handling of extended matching looks wrong for "-*" with multiple occurrences of subtags.

Extended matching is complicated and relies on (1) valid language tag input (2) the different parts of a language tag having different syntax.


"de-DE" does not match "de-Latn-DE" by basic but does by extended.
"""

    Andy


Cheers,
Håvard



PS: From 2.2.  Extended Language Range

    extended-language-range = (1*8ALPHA / "*") *("-" (1*8alphanum / "*"))
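Transcribing this grammar into a regular expression shows that the two ranges rejected by the basic grammar are accepted here. Again only a sketch, with the hypothetical names `EXTENDED_RANGE` and `is_extended_range`.

```python
import re

# extended-language-range = (1*8ALPHA / "*") *("-" (1*8alphanum / "*"))
EXTENDED_RANGE = re.compile(r"(?:[A-Za-z]{1,8}|\*)(?:-(?:[A-Za-z0-9]{1,8}|\*))*")


def is_extended_range(text: str) -> bool:
    """Return True if text is a syntactically valid extended language range."""
    return EXTENDED_RANGE.fullmatch(text) is not None


for candidate in ["*", "en", "*-gb", "en-*", "de-*-DE", "en--gb"]:
    print(candidate, is_extended_range(candidate))
```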
