This is from a discussion this last week:

    https://github.com/TopQuadrant/shacl/issues/100

On 12/09/2020 11:55, Håvard Ottestad wrote:
Hi,

I’ve been trying to get basic language ranges working for the SHACL engine in 
RDF4J and I’ve stumbled upon some differences between how RDF4J and Jena 
implement basic language ranges.

The SPARQL spec points to: https://www.ietf.org/rfc/rfc4647.txt
Specifically sections
  -  2.1.  Basic Language Range
  - 3.3.1.  Basic Filtering

Looking at the ABNF in 2.1.

    language-range   = (1*8ALPHA *("-" 1*8alphanum)) / "*"
    alphanum         = ALPHA / DIGIT

It looks like “*” is legal, “en” is legal and “en-gb” is legal (and even 
“a-ab-abc-12345678-a”). But “*-gb” is not legal and neither is “en-*”.
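
A minimal sketch of that reading of the ABNF, written as a Python regular expression. The name is_basic_range is made up for illustration; it is not taken from RDF4J or Jena.

```python
import re

# RFC 4647 section 2.1:
#   language-range = (1*8ALPHA *("-" 1*8alphanum)) / "*"
#   alphanum       = ALPHA / DIGIT
BASIC_RANGE = re.compile(r"(?:[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*|\*)")


def is_basic_range(text: str) -> bool:
    """Return True if text is syntactically a basic language range."""
    return BASIC_RANGE.fullmatch(text) is not None
```

With this pattern, "*", "en", "en-gb" and "a-ab-abc-12345678-a" are accepted, while "*-gb" and "en-*" are rejected.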

It seems like the range “en” would match a tag “en-gb” and a tag “en”.
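
As a sketch of how I read basic filtering in section 3.3.1: a range matches a tag if the two are equal ignoring case, or if the range is a prefix of the tag and the next character in the tag is "-". The function name basic_filter_matches is hypothetical, and the sketch does not validate its inputs.

```python
def basic_filter_matches(language_range: str, tag: str) -> bool:
    """Basic filtering, as read from RFC 4647 section 3.3.1 (a sketch)."""
    # The special range "*" matches any tag.
    if language_range == "*":
        return True
    # Comparison is case-insensitive.
    r = language_range.lower()
    t = tag.lower()
    # Exact match, or the range is a prefix that ends on a subtag boundary.
    return t == r or t.startswith(r + "-")
```

So basic_filter_matches("en", "en-gb") and basic_filter_matches("en", "en") both hold, but the range "en" does not match the tag "eng".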

I had a deep dive into the langMatch code in Jena and it seems to support “*” 
at any position in the range.

Is Jena supporting part of the extended range specification,

Jena LangMatches supports basic matching as required by SPARQL and SHACL, and does match some cases of "-*" but not properly by full RFC 4647. More by accident than design, I suspect.

Calling it "part of extended" is generous. It fails to match "-*" against multiple subtags.

Basic is not completely compatible with extended.

Pattern "de-DE" matches "de-Latn-DE" by extended, but not basic.

Extended is sensitive to the fact that the second subtag, 'script', is 4ALPHA, and 'region' is 2ALPHA or 3DIGIT, so "de-DE" matches like "de-*-DE" on language and region, skipping script. Each part of a language tag has a slightly different syntax, and extended filtering seems to depend on this to do its jump ahead for "-*".
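
To make the jump-ahead concrete, here is a sketch of the extended filtering steps as I read them in RFC 4647 section 3.3.2. The function name extended_filter_matches is hypothetical and this is not the Jena code. The sketch does not validate tags: the only syntax it looks at is whether a tag subtag is a single character (a singleton), in which case it refuses to skip over it.

```python
def extended_filter_matches(language_range: str, tag: str) -> bool:
    """Extended filtering, as read from RFC 4647 section 3.3.2 (a sketch)."""
    # Step 1: split into subtags; comparison is case-insensitive.
    range_parts = language_range.lower().split("-")
    tag_parts = tag.lower().split("-")

    # Step 2: the first subtags must match, unless the range starts with "*".
    if range_parts[0] != "*" and range_parts[0] != tag_parts[0]:
        return False

    r, t = 1, 1
    # Step 3: consume the remaining range subtags.
    while r < len(range_parts):
        if range_parts[r] == "*":
            r += 1                      # a wildcard consumes nothing in the tag
        elif t >= len(tag_parts):
            return False                # the tag ran out before the range did
        elif range_parts[r] == tag_parts[t]:
            r += 1                      # subtags match: advance both
            t += 1
        elif len(tag_parts[t]) == 1:
            return False                # never skip past a singleton such as "x"
        else:
            t += 1                      # jump ahead over a non-matching subtag
    # Step 4: every range subtag was accounted for.
    return True
```

Here the range "de-DE" matches the tags "de-DE", "de-Latn-DE" and "de-DE-x-goethe", but not "de", "de-x-DE" or "de-Deva".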

I haven't got my head around the full impact of extended matching. It assumes valid language tags, yet tags that are invalid (by RFC 5646) exist.

But SPARQL and Turtle have a catch-all parse syntax based on the earlier RFC 3066 and HTTP at the time. And in the real world, bad tags are common.

"a-ab-abc-12345678-a" is not a legal language tag by 5646 or 4646 in several ways; it is legal by 3066.

To add to the language tag fun, RDF and RFC 4646 disagree on the canonical form of language tags.

> or am I missing something? (I have been missing a lot of things lately :P so I wouldn’t be surprised).

This? :-)
https://github.com/TopQuadrant/shacl/issues/100#issuecomment-690100566

"""
The NodeFunctions.langMatches code does look like it gets basic matching right (as SPARQL requires), test cases to the contrary welcome, but the handling of extended matching looks wrong for "-*" with multiple occurrences of subtags.

Extended matching is complicated and relies on (1) valid language tag input (2) the different parts of a language tag having different syntax.

"de-DE" does not match "de-Latn-DE" by basic but does by extended.
"""

    Andy


Cheers,
Håvard



PS: From 2.2.  Extended Language Range

    extended-language-range = (1*8ALPHA / "*") *("-" (1*8alphanum / "*"))
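
For comparison with the basic form, a sketch of this ABNF as a Python regular expression. The name is_extended_range is made up for illustration. Unlike the basic range, it allows "*" in any subtag position.

```python
import re

# RFC 4647 section 2.2:
#   extended-language-range = (1*8ALPHA / "*") *("-" (1*8alphanum / "*"))
EXTENDED_RANGE = re.compile(r"(?:[A-Za-z]{1,8}|\*)(?:-(?:[A-Za-z0-9]{1,8}|\*))*")


def is_extended_range(text: str) -> bool:
    """Return True if text is syntactically an extended language range."""
    return EXTENDED_RANGE.fullmatch(text) is not None
```

Under this pattern "*-gb", "en-*" and "de-*-DE" are all legal ranges, as are the basic forms "en" and "*".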

