Hello,

We're in the process of replacing an old server that was still running Fuseki1 from Jena 3.8.0 with a TDB1 store. The new server has Fuseki2 from Jena 3.16.0 and a TDB2 store.

While testing the new server, I noticed that the new Fuseki is running a particular SPARQL query much slower than the old one. This is a query performed by Skosmos to find out all the letters of the alphabet for an alphabetical index by looking at all the skos:prefLabel values in a specific language. It's expected to be a bit slow (several seconds) since it needs to look at all the labels - but on the new server, the query is almost an order of magnitude slower, which is causing timeout issues.

To investigate this more closely, I decided to drop Fuseki out of the equation and just use Jena command line utilities. I wanted to compare the effect of Jena versions (3.8.0 vs 3.16.0), store type (TDB1 vs TDB2), and variations of the original SPARQL query. For the data, I used the newly published KANTO/finaf data set (an authority file of named entities, i.e. persons and organizations) which can be downloaded from finto.fi [1]. It has around 3M triples, 200k skos:Concept instances and the same number of skos:prefLabel values.

I loaded this into a TDB1 data set using Jena 3.7.0 (because of JENA-1575) like this:

apache-jena-3.7.0/bin/tdbloader --loc tdb1 --graph http://example.org/finaf finaf-skos.ttl

Likewise, I loaded the same data set into a TDB2 store using Jena 3.16.0:

apache-jena-3.16.0/bin/tdb2.tdbloader --loc tdb2 --graph http://example.org/finaf finaf-skos.ttl


This is the original SPARQL query:

        PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

        SELECT DISTINCT (ucase(str(substr(?label, 1, 1))) as ?l)
        FROM <http://example.org/finaf>
        WHERE {
          ?c a skos:Concept .
          ?c skos:prefLabel ?label .
          FILTER(langMatches(lang(?label), 'fi'))
        }

The query should return 68 results. They come in no particular order since there is no ORDER BY; they are just a set of letters and special characters such as numbers and punctuation.
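To make the intent of the query concrete, here is a minimal Python sketch of the same computation on a handful of made-up labels. The sample labels and the helper names are hypothetical and not taken from the finaf data; the language matching only covers a simple range such as 'fi' (not the '*' wildcard), and Python's upper() may differ from SPARQL's ucase() for a few unusual characters.

```python
# Hypothetical (label, language tag) pairs; not taken from the finaf data.
labels = [
    ("Aalto, Alvar", "fi"),
    ("aalto-yliopisto", "fi"),
    ("Åbo Akademi", "sv"),
    ("1. divisioona", "fi"),
    ("Öljyalan keskusliitto", "fi-FI"),
]

def lang_matches_basic(tag, lang_range):
    """Simplified langMatches: exact match or a subtag prefix match (no '*' support)."""
    tag, lang_range = tag.lower(), lang_range.lower()
    return tag == lang_range or tag.startswith(lang_range + "-")

def first_letters(pairs, lang_range="fi"):
    """Distinct upper-cased first characters of the labels in the given language."""
    return {
        label[0].upper()
        for label, tag in pairs
        if label and lang_matches_basic(tag, lang_range)
    }

print(sorted(first_letters(labels)))
```

The set comprehension plays the role of DISTINCT, which is also why the result has no inherent order.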

I ran it using tdbquery / tdb2.tdbquery, separately on both Jena 3.8.0 and 3.16.0, using the options --time --repeat 2,10 (benchmark; two rounds of warming up, ten rounds of benchmarking) and wrote down the average query time, rounded to one decimal place. I'm doing the benchmarks on an i5-7200U laptop with a pretty fast SSD.
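A benchmark run along these lines could be scripted roughly as follows. This is only a sketch: the query file name letters.rq and the helper names are hypothetical, passing the query with --query is an assumption about how the command was invoked, and the commands are printed rather than executed.

```python
import statistics

def build_command(tool, loc, query_file, warmup=2, runs=10):
    """Build the argument list for one timed tdbquery / tdb2.tdbquery run."""
    return [tool, "--loc", loc, "--query", query_file,
            "--time", "--repeat", f"{warmup},{runs}"]

def mean_seconds(timings):
    """Average the measured rounds, rounded to one decimal place."""
    return round(statistics.mean(timings), 1)

# The four combinations compared in this post; letters.rq is a hypothetical file name.
combinations = [
    ("apache-jena-3.8.0/bin/tdbquery", "tdb1"),
    ("apache-jena-3.16.0/bin/tdbquery", "tdb1"),
    ("apache-jena-3.8.0/bin/tdb2.tdbquery", "tdb2"),
    ("apache-jena-3.16.0/bin/tdb2.tdbquery", "tdb2"),
]
for tool, loc in combinations:
    print(" ".join(build_command(tool, loc, "letters.rq")))
```

The printed lines can be pasted into a shell on a machine that has both Jena versions unpacked side by side.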

Jena 3.8.0 / TDB1: 2.1s
Jena 3.16.0 / TDB1: 2.4s
Jena 3.8.0 / TDB2: 11.8s
Jena 3.16.0 / TDB2: 12.0s
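
As a quick check on how far apart the two store types are, the ratios can be worked out from the four averages listed above. A small Python sketch (the dictionary layout and function name are just illustrative):

```python
# Average query times in seconds for the FROM version, as listed above.
from_times = {
    ("3.8.0", "TDB1"): 2.1,
    ("3.16.0", "TDB1"): 2.4,
    ("3.8.0", "TDB2"): 11.8,
    ("3.16.0", "TDB2"): 12.0,
}

def tdb2_slowdown(times, version):
    """How many times longer the query takes on TDB2 than on TDB1."""
    return times[(version, "TDB2")] / times[(version, "TDB1")]

for version in ("3.8.0", "3.16.0"):
    ratio = tdb2_slowdown(from_times, version)
    print(f"Jena {version}: TDB2 takes {ratio:.1f}x as long as TDB1")
```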

The difference between Jena versions is not very significant, but TDB2 is 5-6 times slower than TDB1. Here is how tdbquery -v explains the query on the TDB level:

17:06:07 INFO  exec            :: TDB
  (distinct
    (project (?l)
      (extend ((?l (ucase (str (substr ?label 1 1)))))
        (filter (langMatches (lang ?label) "fi")
          (bgp
            (triple ?c <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept>)
            (triple ?c <http://www.w3.org/2004/02/skos/core#prefLabel> ?label)
          )))))

The explanation is identical for tdb2.tdbquery so I won't repeat it.

I then looked at ways of optimizing the query to make it perform better. After trying many variations (for example reordering the clauses and/or moving the substring expression to a BIND variable), the only change that seemed to have a significant effect was to remove the FROM clause and instead insert a GRAPH clause targeting the same graph, like this:

        PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

        SELECT DISTINCT (ucase(str(substr(?label, 1, 1))) as ?l)
        WHERE {
          GRAPH <http://example.org/finaf> {
            ?c a skos:Concept .
            ?c skos:prefLabel ?label .
            FILTER(langMatches(lang(?label), 'fi'))
          }
        }

Benchmark results for this GRAPH version of the query:

Jena 3.8.0 / TDB1: 0.9s
Jena 3.16.0 / TDB1: 1.3s
Jena 3.8.0 / TDB2: 1.4s
Jena 3.16.0 / TDB2: 1.9s
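
The relative differences in this table, along both the Jena version axis and the store type axis, can be computed from the four averages like this (a Python sketch; the names are illustrative):

```python
# Average query times in seconds for the GRAPH version, as listed above.
graph_times = {
    ("3.8.0", "TDB1"): 0.9,
    ("3.16.0", "TDB1"): 1.3,
    ("3.8.0", "TDB2"): 1.4,
    ("3.16.0", "TDB2"): 1.9,
}

def percent_slower(slow, fast):
    """Relative slowdown of `slow` compared with `fast`, in percent."""
    return (slow / fast - 1) * 100

# Effect of the Jena version, holding the store type fixed.
for store in ("TDB1", "TDB2"):
    pct = percent_slower(graph_times[("3.16.0", store)], graph_times[("3.8.0", store)])
    print(f"{store}: 3.16.0 is {pct:.0f}% slower than 3.8.0")

# Effect of the store type, holding the Jena version fixed.
for version in ("3.8.0", "3.16.0"):
    pct = percent_slower(graph_times[(version, "TDB2")], graph_times[(version, "TDB1")])
    print(f"Jena {version}: TDB2 is {pct:.0f}% slower than TDB1")
```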

The results are much more even this time, though Jena 3.16.0 is about 40% slower than 3.8.0 and TDB2 is about 50% slower than TDB1. tdbquery -v (and tdb2.tdbquery -v) explains the query like this:

17:13:02 INFO  exec            :: TDB
  (distinct
    (project (?l)
      (extend ((?l (ucase (str (substr ?label 1 1)))))
        (filter (langMatches (lang ?label) "fi")
          (quadpattern
            (quad <http://example.org/finaf> ?c <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept>)
            (quad <http://example.org/finaf> ?c <http://www.w3.org/2004/02/skos/core#prefLabel> ?label)
          )))))

The difference I see compared to the previous query is the use of "quad" instead of "triple". My understanding of operations on the TDB level is pretty naive, but it seems to me this is now targeting the correct graph directly, instead of indirectly, as in the first case. This is a bit surprising to me since the "FROM <http://example.org/finaf>" clause in the first query is, to me, saying the same thing as the GRAPH clause: just target triples in this particular graph. Is there a missed opportunity for some optimization here? Why is FROM (much) worse than GRAPH?

I also wonder why TDB2 is so much slower than TDB1, especially for the first version of the query. It should be an improvement, right? Of course there are trade-offs in implementing any complex system. But it makes me wonder whether we should stick to TDB1 for the time being, as there are no obvious benefits in using TDB2 for our current use.

Likewise, this makes me wonder whether there has been a mild decrease in performance between Jena 3.8.0 and 3.16.0 - though I didn't look at intermediate versions to pinpoint the exact change (or several) that would be causing the slowdown. If there's interest, I can try other versions as well.

For now we will probably just change Skosmos to use the GRAPH variant of the query, which should fix the immediate problems with timeouts. Unfortunately I don't have the skills to work directly on the ARQ optimizer or TDB2 code bases. But I'd be happy to test other variations and potential fixes to these performance problems.
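
To illustrate the kind of change involved, here is a hypothetical Python sketch of a query builder that can emit either variant. Skosmos itself is not written in Python and this is not its actual code; the function name and parameters are made up, and the inputs are inserted without any escaping, so it is only suitable for trusted values.

```python
def letters_query(graph_uri, lang="fi", use_graph=True):
    """Build the alphabetical-index query, scoped by GRAPH (default) or by FROM."""
    pattern = (
        "?c a skos:Concept . "
        "?c skos:prefLabel ?label . "
        f"FILTER(langMatches(lang(?label), '{lang}'))"
    )
    if use_graph:
        # GRAPH variant: the graph is named inside the WHERE clause.
        dataset, where = "", f"GRAPH <{graph_uri}> {{ {pattern} }}"
    else:
        # FROM variant: the graph is named in the dataset description.
        dataset, where = f"FROM <{graph_uri}>\n", pattern
    return (
        "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n"
        "SELECT DISTINCT (ucase(str(substr(?label, 1, 1))) as ?l)\n"
        f"{dataset}WHERE {{ {where} }}"
    )

print(letters_query("http://example.org/finaf"))
```

Keeping the FROM variant behind a flag makes it easy to rerun both forms against the same store when testing potential fixes.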

Cheers,
Osma

[1] https://finto.fi/rest/v1/finaf/data?format=text/turtle


--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 15 (Unioninkatu 36)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
[email protected]
http://www.nationallibrary.fi
