Sorting is based on the "<" operation so SPARQL defines what happens
based on that operation.
The working group (DAWG - SPARQL 1.0) did not mandate handling language
tags so this is an extension point.
The idea of making it lang sensitive is a good one.
There is a gotcha though :-) Currently, Jena sorts literals by lexical
form then by language tag. Combining that with per-language collation
leads to interesting effects.
Suppose language "@zzz" collates with E before G.
Suppose language "@aaa" collates with G before E.
When two literals have different language tags, compare by codepoint,
then by language tag.
"G"@aaa < "E"@aaa (collate by language 'aaa')
"E"@aaa < "E"@zzz (lang tag)
"E"@zzz < "G"@aaa (codepoint)
===>
"E"@zzz < "G"@aaa < "E"@aaa so "E"@zzz < "E"@aaa
but "E"@aaa < "E"@zzz
Oops.
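The cycle can be reproduced with a small sketch. Everything here is made up for illustration: the Lit class, the collate helper and the two imaginary languages are assumptions, not Jena code.

```java
import java.util.Comparator;

class MixedCollationCycle {
    // Hypothetical literal: lexical form plus language tag.
    static final class Lit {
        final String lex;
        final String lang;
        Lit(String lex, String lang) { this.lex = lex; this.lang = lang; }
    }

    // Made-up collations for the imaginary languages above:
    // "aaa" reverses codepoint order (G before E), anything else keeps it.
    static int collate(String lang, String a, String b) {
        int cmp = a.compareTo(b);
        return lang.equals("aaa") ? -cmp : cmp;
    }

    // Same language: use that language's collation.
    // Different languages: codepoint order, then language tag.
    static final Comparator<Lit> MIXED = (x, y) -> {
        if (x.lang.equals(y.lang))
            return collate(x.lang, x.lex, y.lex);
        int cmp = x.lex.compareTo(y.lex);
        return cmp != 0 ? cmp : x.lang.compareTo(y.lang);
    };

    public static void main(String[] args) {
        Lit eA = new Lit("E", "aaa");
        Lit gA = new Lit("G", "aaa");
        Lit eZ = new Lit("E", "zzz");
        // All three comparisons are negative, so the ordering is a cycle.
        System.out.println(MIXED.compare(gA, eA) < 0);
        System.out.println(MIXED.compare(eA, eZ) < 0);
        System.out.println(MIXED.compare(eZ, gA) < 0);
    }
}
```

Each of the three comparisons says "less than", so no total order exists for these three literals.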
At the root of sorting is Arrays.sort(array, comparator) and if the
comparator is inconsistent (not transitive, as in the example), the
results get weird (or it complains that the comparator violates its
contract).
What could be done is a mode whereby sorting changes to order by
language tag, then by language collation within the same language tag.
When used in a predominantly single-language situation, this is natural
and stray other languages go to the start or end of sorting.
In a mixed-language system (say, languages that have similar spellings,
or that have unknown collations), that may not be what is natural.
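A minimal sketch of such a mode, using java.text.Collator. The Lit class and the comparator name are hypothetical and this is not how Jena is implemented. It assumes language tags are already normalised (same case), and the collation rules you get depend on the data shipped with the Java runtime.

```java
import java.text.Collator;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class LangThenCollation {
    // Hypothetical literal: lexical form plus language tag ("" for none).
    static final class Lit {
        final String lex;
        final String lang;
        Lit(String lex, String lang) { this.lex = lex; this.lang = lang; }
    }

    // One Collator per language tag. Simple cache, not thread-safe.
    static final Map<String, Collator> COLLATORS = new HashMap<>();

    static Collator collatorFor(String lang) {
        return COLLATORS.computeIfAbsent(
            lang, tag -> Collator.getInstance(Locale.forLanguageTag(tag)));
    }

    static final Comparator<Lit> BY_LANG_THEN_COLLATION = (x, y) -> {
        // 1. Language tag first, so different languages are never interleaved.
        int cmp = x.lang.compareTo(y.lang);
        if (cmp != 0) return cmp;
        // 2. Same tag: that language's collation.
        cmp = collatorFor(x.lang).compare(x.lex, y.lex);
        if (cmp != 0) return cmp;
        // 3. Tie-break on codepoints so strings that collate equal
        //    still get a fixed order.
        return x.lex.compareTo(y.lex);
    };
}
```

Because the language tag is compared first and each collation is only applied inside one tag, the comparator is a lexicographic combination of total orders and does not produce the cycle shown earlier.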
See
NodeUtils.compareLiteralsBySyntax
and the sorters:
QueryIterSort and QueryIterTopN
Andy
On 24/10/16 13:02, Osma Suominen wrote:
Hi!
I'm looking into an issue [1] we have in the Skosmos application with
the ordering of literals as returned by a SPARQL query served by Fuseki.
It appears that when using ORDER BY, the order of literals is based on
Unicode collation order (or something similar). This is not always
optimal for user-facing applications where language-specific collation
order would be expected.
For example, this SPARQL query:
--cut--
SELECT ?label WHERE {
VALUES ?label { "tsahurin kieli"@fi "tšekin kieli"@fi "tulun kieli"@fi
"töyhtöhyyppä"@fi }
}
ORDER BY ?label
--cut--
returns the literals in the following order:
1. tsahurin kieli
2. tulun kieli
3. töyhtöhyyppä
4. tšekin kieli
This is not expected by users; according to Finnish collation rules [2],
"š" should be collated together with "s" so the ordering of the literals
should be the same as was used in the VALUES statement.
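The reported order is what plain string comparison gives, which can be checked with a small sketch like this one. The class name is made up, and the non-ASCII letters are written as Unicode escapes only to keep the source ASCII.

```java
import java.util.Arrays;

class CodepointOrderDemo {
    static String[] sortedByCodepoint() {
        String[] labels = {
            "tsahurin kieli",
            "t\u0161ekin kieli",            // tšekin kieli
            "tulun kieli",
            "t\u00f6yht\u00f6hyypp\u00e4"   // töyhtöhyyppä
        };
        // String's natural order compares UTF-16 code units:
        // s (U+0073) < u (U+0075) < ö (U+00F6) < š (U+0161)
        Arrays.sort(labels);
        return labels;
    }

    public static void main(String[] args) {
        for (String label : sortedByCodepoint())
            System.out.println(label);
    }
}
```

Passing a Collator for the Finnish locale as the second argument to Arrays.sort would be the language-aware variant; its result depends on the collation data available in the Java runtime.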
Based on what I found out, SPARQL doesn't really state the collation
order of literals [3,4,5]. Often generic Unicode collation is used.
However, Dydra, a cloud-based triple store, has special support for
language-specific collation [5]. There, the logic is this: "plain
literals which share a language tag are ordered according to the
collation rules for the respective language" [5,6]. Implementing
collation this way makes a lot of sense to me.
Could the same be done with Jena ARQ? Either by changing the current
sorting implementation to be language-aware, or by using some custom
extension function to pre-process the literals into strings that can
then be compared using ORDER BY? Would this be a lot of work to implement?
I note that there is basic language-sensitive collation support
available in the Collator class [7] of the Java standard library. A
possibly more complete (and apparently faster) Collator implementation
[8] is available in the ICU4J library.
-Osma
[1] https://github.com/NatLibFi/Skosmos/issues/559
[2] https://en.wikipedia.org/wiki/Finnish_orthography#Collation_order
[3] https://www.w3.org/TR/2013/REC-sparql11-query-20130321/#modOrderBy
[4] http://stackoverflow.com/questions/38961492/how-do-you-set-the-collation-for-a-sparql-query
[5] http://blog.dydra.com/2015/05/06/collation
[6] https://github.com/dydra/http-api-tests/blob/master/extensions/sparql-protocol/collation/README.md
[7] https://docs.oracle.com/javase/7/docs/api/java/text/Collator.html
[8] http://icu-project.org/apiref/icu4j/com/ibm/icu/text/Collator.html