Re: Language-specific collation in ARQ

Andy Seaborne Tue, 25 Oct 2016 06:20:20 -0700

Sorting is based on the "<" operation, so SPARQL defines what happens based on that operation.

The working group (DAWG - SPARQL 1.0) did not mandate handling language tags, so this is an extension point.


The idea of making it lang sensitive is a good one.

There is a gotcha though :-) Currently, Jena sorts literals by lexical form then by language tag, which leads to interesting effects once per-language collation is mixed in.


Suppose language "@zzz" collates with E before G.
Suppose language "@aaa" collates with G before E.

When the two literals have different languages, use codepoint then language tag.

"G"@aaa < "E"@aaa  (collate by language 'aaa')
"E"@aaa < "E"@zzz  (lang tag)
"E"@zzz < "G"@aaa  (codepoint)
===>
"E"@zzz < "G"@aaa < "E"@aaa  so "E"@zzz < "E"@aaa

but "E"@aaa < "E"@zzz

Oops.

At the root of sorting is Arrays.sort(..., comparator) and if the comparator is inconsistent (not transitive, as above), it gets weird (or it complains).
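
The cycle above can be reproduced in a few lines of Java. This is only a sketch: Lit and MixedCompareDemo are hypothetical names, not Jena classes, the "aaa" and "zzz" collations are the made-up ones from the example, and String.compareTo stands in for codepoint order.

```java
import java.util.Comparator;
import java.util.Map;

// Hypothetical literal: lexical form plus language tag (not a Jena class).
final class Lit {
    final String lex;
    final String lang;
    Lit(String lex, String lang) { this.lex = lex; this.lang = lang; }
}

class MixedCompareDemo {
    // Made-up collations from the example: "aaa" puts G before E, "zzz" puts E before G.
    static final Map<String, Comparator<String>> COLLATION = Map.of(
        "aaa", Comparator.<String>naturalOrder().reversed(),
        "zzz", Comparator.<String>naturalOrder());

    // Same language: use that language's collation.
    // Different languages: lexical form first, then language tag.
    static int compare(Lit a, Lit b) {
        if (a.lang.equals(b.lang))
            return COLLATION.get(a.lang).compare(a.lex, b.lex);
        int byLex = a.lex.compareTo(b.lex);
        return byLex != 0 ? byLex : a.lang.compareTo(b.lang);
    }

    public static void main(String[] args) {
        Lit eZ = new Lit("E", "zzz");
        Lit gA = new Lit("G", "aaa");
        Lit eA = new Lit("E", "aaa");
        System.out.println(compare(eZ, gA) < 0); // "E"@zzz < "G"@aaa (codepoint)
        System.out.println(compare(gA, eA) < 0); // "G"@aaa < "E"@aaa (collation 'aaa')
        System.out.println(compare(eA, eZ) < 0); // "E"@aaa < "E"@zzz (lang tag)
    }
}
```

All three comparisons report "less than", so the three literals form a cycle and no sorted order satisfies the comparator.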

What could be done is a mode whereby sorting changes to order by language tag, then by language collation within the same language tag.

When used in a predominantly single-language situation, this is natural, and stray other languages go to the start or end of the sort order.

In a mixed-language system (say, languages that have similar spellings, or whose collations are unknown), that may not be what is natural.
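
A rough sketch of such a mode as a plain Java comparator, outside of Jena. LangString and LangThenCollation are hypothetical names, and java.text.Collator is assumed as the source of per-language collation; this is not how ARQ is implemented.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical holder for a language-tagged literal (not a Jena class).
final class LangString {
    final String lex;
    final String lang;
    LangString(String lex, String lang) { this.lex = lex; this.lang = lang; }
    @Override public String toString() { return "\"" + lex + "\"@" + lang; }
}

class LangThenCollation {
    // One Collator per language tag, created on first use (not thread-safe).
    private static final Map<String, Collator> COLLATORS = new HashMap<>();

    // Language tag first; within one tag, that language's collation;
    // plain string order last, so distinct literals never compare as equal.
    static final Comparator<LangString> ORDER = (a, b) -> {
        int byTag = a.lang.compareTo(b.lang);
        if (byTag != 0) return byTag;
        Collator collator = COLLATORS.computeIfAbsent(
            a.lang, tag -> Collator.getInstance(Locale.forLanguageTag(tag)));
        int byCollation = collator.compare(a.lex, b.lex);
        return byCollation != 0 ? byCollation : a.lex.compareTo(b.lex);
    };

    static List<LangString> sorted(List<LangString> in) {
        List<LangString> out = new ArrayList<>(in);
        out.sort(ORDER);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(sorted(List.of(
            new LangString("banana", "en"),
            new LangString("omena", "fi"),
            new LangString("Apple", "en"),
            new LangString("apple", "en"))));
    }
}
```

Because the language tag is always compared first, literals from different languages never interleave, which avoids the cycle in the "@aaa"/"@zzz" example at the cost of grouping the output by language.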


See
  NodeUtils.compareLiteralsBySyntax

and the sorters:

QueryIterSort and QueryIterTopN

        Andy

On 24/10/16 13:02, Osma Suominen wrote:

Hi!

I'm looking into an issue [1] we have in the Skosmos application with
the ordering of literals as returned by a SPARQL query served by Fuseki.
It appears that when using ORDER BY, the order of literals is based on
Unicode collation order (or something similar). This is not always
optimal for user-facing applications where language-specific collation
order would be expected.

For example, this SPARQL query:

--cut--
SELECT ?label WHERE {
  VALUES ?label { "tsahurin kieli"@fi "tšekin kieli"@fi "tulun kieli"@fi
"töyhtöhyyppä"@fi }
}
ORDER BY ?label
--cut--

returns the literals in the following order:

1. tsahurin kieli
2. tulun kieli
3. töyhtöhyyppä
4. tšekin kieli

This is not expected by users; according to Finnish collation rules [2],
"š" should be collated together with "s" so the ordering of the literals
should be the same as was used in the VALUES statement.

Based on what I found out, SPARQL doesn't really state the collation
order of literals [3,4,5]. Often generic Unicode collation is used.
However, Dydra, a cloud-based triple store, has special support for
language-specific collation [5]. There, the logic is this: "plain
literals which share a language tag are ordered according to the
collation rules for the respective language" [5,6]. Implementing
collation this way makes a lot of sense to me.

Could the same be done with Jena ARQ? Either by changing the current
sorting implementation to be language-aware, or by using some custom
extension function to pre-process the literals into strings that can
then be compared using ORDER BY? Would this be a lot of work to implement?

I note that there is basic language-sensitive collation support
available in the java.text.Collator class [7] of the Java standard
library. A possibly more complete (and apparently faster) Collator
implementation [8] is available in the ICU4J library.
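
One possible reading of the "pre-process the literals into strings" idea, sketched with java.text.Collator: derive a sort key string per literal so that plain string comparison of the keys follows the language's collation. SortKeyDemo and sortKey are hypothetical names, and nothing here is an existing ARQ extension function.

```java
import java.text.Collator;
import java.util.Locale;

class SortKeyDemo {
    // Hypothetical helper: hex-encode the collation key of s for the given
    // language tag. Comparing two such keys as plain strings gives the same
    // result as comparing the underlying CollationKey byte arrays.
    static String sortKey(String s, String langTag) {
        Collator collator = Collator.getInstance(Locale.forLanguageTag(langTag));
        byte[] key = collator.getCollationKey(s).toByteArray();
        StringBuilder hex = new StringBuilder(key.length * 2);
        for (byte b : key) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // Plain string order puts "B" before "a"; English collation does not.
        System.out.println("a".compareTo("B") > 0);
        System.out.println(sortKey("a", "en").compareTo(sortKey("B", "en")) < 0);
    }
}
```

The keys are only comparable with each other when they come from the same Collator settings, so a key built for "fi" should not be compared with one built for "en".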

-Osma


[1] https://github.com/NatLibFi/Skosmos/issues/559

[2] https://en.wikipedia.org/wiki/Finnish_orthography#Collation_order

[3] https://www.w3.org/TR/2013/REC-sparql11-query-20130321/#modOrderBy

[4] http://stackoverflow.com/questions/38961492/how-do-you-set-the-collation-for-a-sparql-query


[5] http://blog.dydra.com/2015/05/06/collation

[6] https://github.com/dydra/http-api-tests/blob/master/extensions/sparql-protocol/collation/README.md


[7] https://docs.oracle.com/javase/7/docs/api/java/text/Collator.html

[8] http://icu-project.org/apiref/icu4j/com/ibm/icu/text/Collator.html
