Hello,
We're in the process of replacing an old server that was still running
Fuseki1 from Jena 3.8.0 with a TDB1 store. The new server has Fuseki2
from Jena 3.16.0 and a TDB2 store.
While testing the new server, I noticed that the new Fuseki runs a
particular SPARQL query much more slowly than the old one. This is a query
performed by Skosmos to find out all the letters of the alphabet for an
alphabetical index by looking at all the skos:prefLabel values in a
specific language. It's expected to be a bit slow (several seconds)
since it needs to look at all the labels - but on the new server, the
query is almost an order of magnitude slower, which is causing timeout
issues.
To investigate this more closely, I decided to drop Fuseki out of the
equation and just use Jena command line utilities. I wanted to compare
the effect of Jena versions (3.8.0 vs 3.16.0), store type (TDB1 vs
TDB2), and variations of the original SPARQL query. For the data, I used
the newly published KANTO/finaf data set (an authority file of named
entities, i.e. persons and organizations) which can be downloaded from
finto.fi [1]. It has around 3M triples, 200k skos:Concept instances and
the same number of skos:prefLabel values.
I loaded this into a TDB1 data set using Jena 3.7.0 (because of
JENA-1575) like this:
apache-jena-3.7.0/bin/tdbloader --loc tdb1 --graph http://example.org/finaf finaf-skos.ttl
Likewise, I loaded the same data set into a TDB2 store using Jena 3.16.0:
apache-jena-3.16.0/bin/tdb2.tdbloader --loc tdb2 --graph http://example.org/finaf finaf-skos.ttl
This is the original SPARQL query:
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT (ucase(str(substr(?label, 1, 1))) as ?l)
FROM <http://example.org/finaf>
WHERE {
  ?c a skos:Concept .
  ?c skos:prefLabel ?label .
  FILTER(langMatches(lang(?label), 'fi'))
}
The query should return 68 results. They come in no particular order,
since there is no ORDER BY; they are just a set of letters and other
characters such as digits and punctuation.
I ran it using tdbquery / tdb2.tdbquery, separately on both Jena 3.8.0
and 3.16.0, using the options --time --repeat 2,10 (two warm-up rounds
followed by ten measured rounds) and wrote down the average query time,
rounded to one decimal place. I'm doing the benchmarks on an i5-7200U
laptop with a pretty fast SSD.
Jena 3.8.0 / TDB1: 2.1s
Jena 3.16.0 / TDB1: 2.4s
Jena 3.8.0 / TDB2: 11.8s
Jena 3.16.0 / TDB2: 12.0s
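For anyone who wants to reproduce these runs, the four invocations can be sketched roughly as follows. This is a sketch under assumptions, not necessarily the exact commands used: the directory layout and the query file name alphabet.rq are hypothetical, and the script only prints the command lines (a dry run) instead of executing them.

```shell
#!/bin/sh
# Sketch of the benchmark matrix: 2 Jena versions x 2 store types.
# Assumptions: each Jena release is unpacked as ./apache-jena-<version>,
# the stores live in ./tdb1 and ./tdb2, and the query is saved as
# alphabet.rq (hypothetical file name).

# Print the command line for one Jena version / store combination.
bench_cmd() {
    version=$1
    store=$2
    case "$store" in
        tdb1) tool=tdbquery ;;
        tdb2) tool=tdb2.tdbquery ;;
        *) echo "unknown store type: $store" >&2; return 1 ;;
    esac
    echo "apache-jena-$version/bin/$tool --loc $store --time --repeat 2,10 --query alphabet.rq"
}

# Dry run: list all four command lines.
for version in 3.8.0 3.16.0; do
    for store in tdb1 tdb2; do
        bench_cmd "$version" "$store"
    done
done
```

Piping the output of the script to sh would actually run the four benchmarks one after another.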
The difference between Jena versions is not very significant, but TDB2
is 5-6 times slower than TDB1. Here is how tdbquery -v explains the
query on the TDB level:
17:06:07 INFO exec :: TDB
(distinct
  (project (?l)
    (extend ((?l (ucase (str (substr ?label 1 1)))))
      (filter (langMatches (lang ?label) "fi")
        (bgp
          (triple ?c
            <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
            <http://www.w3.org/2004/02/skos/core#Concept>)
          (triple ?c <http://www.w3.org/2004/02/skos/core#prefLabel>
            ?label)
        )))))
The explanation is identical for tdb2.tdbquery so I won't repeat it.
I then looked at ways of optimizing the query to make it perform better.
After trying many variations (for example reordering the clauses and/or
moving the substring expression to a BIND variable), the only change
that seemed to have a significant effect was to remove the FROM clause
and instead insert a GRAPH clause targeting the same graph, like this:
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT (ucase(str(substr(?label, 1, 1))) as ?l)
WHERE {
  GRAPH <http://example.org/finaf> {
    ?c a skos:Concept .
    ?c skos:prefLabel ?label .
    FILTER(langMatches(lang(?label), 'fi'))
  }
}
Benchmark results for this GRAPH version of the query:
Jena 3.8.0 / TDB1: 0.9s
Jena 3.16.0 / TDB1: 1.3s
Jena 3.8.0 / TDB2: 1.4s
Jena 3.16.0 / TDB2: 1.9s
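To put the two result tables side by side, the speedup from switching FROM to GRAPH can be worked out per configuration with a small awk sketch. The timings are copied from the tables in this message; the function names are just illustrative.

```shell
#!/bin/sh
# Average query times in seconds, copied from the two result tables:
# configuration, FROM variant, GRAPH variant.
timings() {
    cat <<'EOF'
3.8.0/TDB1 2.1 0.9
3.16.0/TDB1 2.4 1.3
3.8.0/TDB2 11.8 1.4
3.16.0/TDB2 12.0 1.9
EOF
}

# Print how many times faster the GRAPH variant is for each configuration
# (FROM time divided by GRAPH time, one decimal place).
speedups() {
    timings | awk '{ printf "%s %.1fx\n", $1, $2 / $3 }'
}

speedups
```

For these numbers the GRAPH variant comes out roughly 2.3x and 1.8x faster on TDB1, and 8.4x and 6.3x faster on TDB2, so the FROM penalty is far larger on TDB2.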
The results are much more even this time, though Jena 3.16.0 is about
40% slower than 3.8.0 and TDB2 is about 50% slower than TDB1. tdbquery
-v (and tdb2.tdbquery -v) explains the query like this:
17:13:02 INFO exec :: TDB
(distinct
  (project (?l)
    (extend ((?l (ucase (str (substr ?label 1 1)))))
      (filter (langMatches (lang ?label) "fi")
        (quadpattern
          (quad <http://example.org/finaf> ?c
            <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
            <http://www.w3.org/2004/02/skos/core#Concept>)
          (quad <http://example.org/finaf> ?c
            <http://www.w3.org/2004/02/skos/core#prefLabel> ?label)
        )))))
The difference I see compared to the previous query is the use of "quad"
instead of "triple". My understanding of operations on the TDB level is
pretty naive, but it seems to me this is now targeting the correct graph
directly, instead of indirectly as in the first case. This is a bit
surprising, since to me the "FROM <http://example.org/finaf>" clause in
the first query says the same thing as the GRAPH clause: just target
triples in this particular graph. Is there a missed opportunity for some
optimization here? Why is FROM (much) slower than GRAPH?
I also wonder why TDB2 is so much slower than TDB1, especially for the
first version of the query. It should be an improvement, right? Of
course there are trade-offs in implementing any complex system, but it
makes me wonder whether we should stick to TDB1 for the time being, as
there are no obvious benefits in using TDB2 for our current use.
Similarly, there seems to have been a mild decrease in performance
between Jena 3.8.0 and 3.16.0, though I didn't look at intermediate
versions to pinpoint the change (or changes) causing the slowdown. If
there's interest, I can try other versions as well.
For now we will probably just change Skosmos to use the GRAPH variant of
the query, which should fix the immediate problems with timeouts.
Unfortunately I don't have the skills to work directly on the ARQ
optimizer or TDB2 code bases. But I'd be happy to test other variations
and potential fixes to these performance problems.
Cheers,
Osma
[1] https://finto.fi/rest/v1/finaf/data?format=text/turtle
--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 15 (Unioninkatu 36)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
http://www.nationallibrary.fi