Hi!

I'm looking into an issue [1] we have in the Skosmos application with the ordering of literals returned by a SPARQL query served by Fuseki. It appears that when using ORDER BY, literals are ordered by Unicode code point (or something similar). This is not always optimal for user-facing applications, where users expect language-specific collation order.

For example, this SPARQL query:

--cut--
SELECT ?label WHERE {
VALUES ?label { "tsahurin kieli"@fi "tšekin kieli"@fi "tulun kieli"@fi "töyhtöhyyppä"@fi }
}
ORDER BY ?label
--cut--

returns the literals in the following order:

1. tsahurin kieli
2. tulun kieli
3. töyhtöhyyppä
4. tšekin kieli

This is not expected by users; according to Finnish collation rules [2], "š" should be collated together with "s" so the ordering of the literals should be the same as was used in the VALUES statement.

Based on what I found out, SPARQL doesn't really specify a collation order for language-tagged literals [3,4,5]. Implementations often fall back to a generic, language-independent ordering. However, Dydra, a cloud-based triple store, has special support for language-specific collation [5]. There, the logic is this: "plain literals which share a language tag are ordered according to the collation rules for the respective language" [5,6]. Implementing collation this way makes a lot of sense to me.
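
To make the rule concrete, here is a minimal Java sketch of a comparator that follows it, using java.text.Collator from the Java standard library. The LangLiteral class and all other names are hypothetical stand-ins for illustration, not Jena or Dydra APIs. How to order literals with different language tags is my own assumption (group by tag); the quoted rule only covers literals that share a tag.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

class LangAwareOrderSketch {
    // Hypothetical stand-in for a language-tagged literal; not a Jena class.
    static final class LangLiteral {
        final String lexical;
        final String lang;

        LangLiteral(String lexical, String lang) {
            this.lexical = lexical;
            this.lang = lang;
        }
    }

    static final Comparator<LangLiteral> LANG_AWARE = (a, b) -> {
        if (a.lang.equalsIgnoreCase(b.lang)) {
            // Same language tag: use that language's collation rules.
            // A real implementation would cache the Collator per tag.
            Collator collator = Collator.getInstance(Locale.forLanguageTag(a.lang));
            int c = collator.compare(a.lexical, b.lexical);
            // Tie-break on code points so the order is deterministic.
            return c != 0 ? c : a.lexical.compareTo(b.lexical);
        }
        // Assumption: literals with different tags are grouped by tag.
        return a.lang.compareToIgnoreCase(b.lang);
    };

    public static void main(String[] args) {
        List<LangLiteral> labels = new ArrayList<>();
        labels.add(new LangLiteral("tulun kieli", "fi"));
        labels.add(new LangLiteral("t\u0161ekin kieli", "fi")); // s with caron
        labels.add(new LangLiteral("tsahurin kieli", "fi"));
        labels.sort(LANG_AWARE);
        for (LangLiteral l : labels) {
            System.out.println(l.lexical + "@" + l.lang);
        }
    }
}
```

The result depends on the collation data shipped with the JVM, so I would treat this as a way to experiment rather than as a reference for Finnish ordering.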

Could the same be done with Jena ARQ? Either by changing the current sorting implementation to be language-aware, or by using some custom extension function to pre-process the literals into strings that can then be compared using ORDER BY? Would this be a lot of work to implement?

I note that there is basic language-sensitive collation support available in the java.text.Collator class [7] of the Java standard library. A possibly more complete (and apparently faster) Collator implementation [8] is available in the ICU4J library.
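
As a rough sketch of the pre-processing idea, the standard Collator can produce a collation key whose bytes can be hex-encoded into a plain string. Plain code point comparison of those strings should then give the same order as the collator itself. The class and method names below are made up for illustration, and wiring this into ARQ as an extension function is not shown.

```java
import java.text.Collator;
import java.util.Locale;

class SortKeySketch {
    // Hypothetical helper: turns a literal into a string whose plain
    // code point order follows the collator for the given language tag.
    static String sortKey(String lexical, String languageTag) {
        Collator collator = Collator.getInstance(Locale.forLanguageTag(languageTag));
        // Collation key bytes are most significant byte first and can be
        // compared bytewise to get the collator's ordering.
        byte[] key = collator.getCollationKey(lexical).toByteArray();
        StringBuilder hex = new StringBuilder(key.length * 2);
        for (byte b : key) {
            // Two lowercase hex digits per byte preserve the byte order.
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        String[] labels = {
            "tsahurin kieli",
            "t\u0161ekin kieli", // s with caron
            "tulun kieli"
        };
        for (String label : labels) {
            System.out.println(sortKey(label, "fi") + "  " + label);
        }
    }
}
```

A query could then sort on the generated key instead of on the label itself. The keys are only comparable when they come from the same collator, so keys for different languages should not be mixed in one ORDER BY.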

-Osma


[1] https://github.com/NatLibFi/Skosmos/issues/559

[2] https://en.wikipedia.org/wiki/Finnish_orthography#Collation_order

[3] https://www.w3.org/TR/2013/REC-sparql11-query-20130321/#modOrderBy

[4] http://stackoverflow.com/questions/38961492/how-do-you-set-the-collation-for-a-sparql-query

[5] http://blog.dydra.com/2015/05/06/collation

[6] https://github.com/dydra/http-api-tests/blob/master/extensions/sparql-protocol/collation/README.md

[7] https://docs.oracle.com/javase/7/docs/api/java/text/Collator.html

[8] http://icu-project.org/apiref/icu4j/com/ibm/icu/text/Collator.html

--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
[email protected]
http://www.nationallibrary.fi
