Re: Language codes

Benson Margulies Wed, 02 Jul 2014 15:02:32 -0700

A squirrel ran by and I clicked 'send' too fast.

Well, I'm exaggerating. "Perfectly awful" should be "mildly inconvenient".


In my space, it's typical to assume that language code comparisons
know the equivalence between en and eng. So, if one expects to process
a range of data including languages only distinguished in -3 space,
one just works with -3 codes.

However, there's lots of RDF out there with -1 codes (e.g. @en). So, I
can't just throw the switch, as it were, to -3 codes and expect to
match against it. I need to be careful to generate triples that use -1
codes except for those languages where -3 codes are required to
distinguish them.
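
To make that rule concrete, here is a minimal Python sketch. It is only
an illustration: the table ISO639_3_TO_1 and the helpers preferred_tag
and same_language are hypothetical names, and the table is a tiny
hand-picked subset rather than the full ISO 639 registry. It assumes
the rule is "use the -1 code when one exists, otherwise keep the -3
code", applied to the primary subtag only.

```python
# Partial, illustrative table: ISO 639-3 code -> ISO 639-1 code.
# Assumption: a real system would load the full mapping from the
# ISO 639-3 registry instead of hard-coding a few entries.
ISO639_3_TO_1 = {
    "eng": "en",
    "fra": "fr",
    "deu": "de",
    "fas": "fa",  # Persian (macrolanguage)
}


def preferred_tag(tag: str) -> str:
    """Return the tag with its primary subtag mapped to a -1 code if one exists.

    Codes without a -1 equivalent in the table (e.g. "prs" for Dari)
    are kept as they are, apart from lowercasing the primary subtag.
    """
    primary, sep, rest = tag.partition("-")
    primary = primary.lower()
    return ISO639_3_TO_1.get(primary, primary) + sep + rest


def same_language(a: str, b: str) -> bool:
    """Compare two tags after mapping both to the preferred form."""
    return preferred_tag(a) == preferred_tag(b)
```

With this, "eng" and "en" both come out as "en" before a triple is
written, while "prs" and "pes" stay distinct because neither has a -1
code of its own.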

Am I making sense?


On Wed, Jul 2, 2014 at 6:00 PM, Benson Margulies <[email protected]> wrote:
> On Wed, Jul 2, 2014 at 5:55 PM, Andy Seaborne <[email protected]> wrote:
>> On 02/07/14 22:27, Benson Margulies wrote:
>>>
>>> On Wed, Jul 2, 2014 at 5:11 PM, Andy Seaborne <[email protected]> wrote:
>>>>
>>>> On 02/07/14 21:45, Benson Margulies wrote:
>>>>>
>>>>>
>>>>> Andy,
>>>>>
>>>>> The upshot of all of this is that ISO-639-3 codes should work.
>>>>> However, that leaves a mystery to me. If I store a triple with @en,
>>>>> and someone queries with @eng, are they supposed to match? In
>>>>> practical terms, do they match in TDB or any other common triple
>>>>> stores?
>>>>
>>>>
>>>>
>>>> No.
>>>>
>>>> ""@en and ""@eng are different RDF terms.  As is ""@en-uk.
>>>>
>>>> All the stores I know of treat language tags as (normalized) strings.
>>>
>>>
>>> That's perfectly clear and perfectly awful, at least for people who
>>> care about Persian, Dari, and that ilk. Thanks.
>>
>>
>> Why?  All ISO-639 systems are supported - but there are no equivalence tables
>> between the different systems built in.  Or within the systems (B and T
>> codes).
>
> Well, I'm exaggerating. "Perfectly awful" should be 'mildly inconvenient'/
>
> In my space, it's typical to assume that language code comparisons
> know the equivalence between en and eng. So, if one expect to process
> a range of data including languages only distinguished in -3 space.
> There's lots of RDF out there with @en. So, I can't just throw the
> switch, as it were, to -3 codes and expect to match against it. I need
> to be careful to use -1 codes except for those languages where -3
> codes are required to distinguish.
>
> Am I making sense?
>
>
>>
>> (This is all outside the RDF specs - they just inherit from W3C
>> Internationalization and BCP 47).
>>
>>
>> Experiment with:
>> http://www.sparql.org/data-validator.html
>>
>>         Andy
>>
>>
>>>
>>>>
>>>> SPARQL uses LANGMATCHES, which is the algorithm from RFC 4647 "Matching
>>>> of
>>>> Language Tags".
>>>>
>>>> If you want semantic (ha!) equality, then canonicalizing on input is
>>>> best.
>>>> Then worry about en-uk.
>>>>
>>>>          Andy
>>>>
>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Jul 2, 2014 at 12:34 PM, Andy Seaborne <[email protected]> wrote:
>>>>>>
>>>>>>
>>>>>> On 02/07/14 12:01, Benson Margulies wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I always see two-letter ISO-639-1 language codes. This isn't enough,
>>>>>>> not all languages have them.
>>>>>>>
>>>>>>> Does the spec specifically call for these, or does it also allow for
>>>>>>> -3?
>>>>>>>
>>>>>>> --benson
>>>>>>>
>>>>>>
>>>>>> RDF 1.1 Concepts:
>>>>>>
>>>>>> http://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal
>>>>>>
>>>>>> so it's BCP 47 / RFC 5646
>>>>>>
>>>>>> The grammars do not include the RFC grammar (because a big language tag
>>>>>> grammar would dwarf the rest).
>>>>>>
>>>>>> http://www.w3.org/TR/turtle/#grammar-production-LANGTAG
>>>>>>
>>>>>> [144s]  LANGTAG         ::=     '@' [a-zA-Z]+ ('-' [a-zA-Z0-9]+)*
>>>>>>
>>>>>> So neutral and the grammars provide a more general match to language
>>>>>> codes.
>>>>>>
>>>>>> Jena has a language tag parser: LangTag.
>>>>>>
>>>>>>           Andy
>>>>>>
>>>>
>>
