Performance regressions in Jena and TDB2

Osma Suominen Mon, 30 Nov 2020 07:34:47 -0800

Hello,

We're in the process of replacing an old server that was still running Fuseki1 from Jena 3.8.0 with a TDB1 store. The new server has Fuseki2 from Jena 3.16.0 and a TDB2 store.

While testing the new server, I noticed that the new Fuseki is running a particular SPARQL query much slower than the old one. This is a query performed by Skosmos to find out all the letters of the alphabet for an alphabetical index by looking at all the skos:prefLabel values in a specific language. It's expected to be a bit slow (several seconds) since it needs to look at all the labels - but on the new server, the query is almost an order of magnitude slower, which is causing timeout issues.

To investigate this more closely, I decided to drop Fuseki out of the equation and just use Jena command line utilities. I wanted to compare the effect of Jena versions (3.8.0 vs 3.16.0), store type (TDB1 vs TDB2), and variations of the original SPARQL query. For the data, I used the newly published KANTO/finaf data set (an authority file of named entities, i.e. persons and organizations) which can be downloaded from finto.fi [1]. It has around 3M triples, 200k skos:Concept instances and the same number of skos:prefLabel values.

I loaded this into a TDB1 data set using Jena 3.7.0 (because of JENA-1575) like this:

apache-jena-3.7.0/bin/tdbloader --loc tdb1 --graph http://example.org/finaf finaf-skos.ttl


Likewise, I loaded the same data set into a TDB2 store using Jena 3.16.0:

apache-jena-3.16.0/bin/tdb2.tdbloader --loc tdb2 --graph http://example.org/finaf finaf-skos.ttl



This is the original SPARQL query:

        PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

        SELECT DISTINCT (ucase(str(substr(?label, 1, 1))) as ?l)
        FROM <http://example.org/finaf>
        WHERE {
          ?c a skos:Concept .
          ?c skos:prefLabel ?label .
          FILTER(langMatches(lang(?label), 'fi'))
        }

The query should return 68 results. There is no particular order since there is no ORDER BY; they are just a set of letters and special characters such as numbers and punctuation.

I ran it using tdbquery / tdb2.tdbquery, separately on both Jena 3.8.0 and 3.16.0, using the options --time --repeat 2,10 (benchmark; two rounds of warming up, ten rounds of benchmarking) and wrote down the average query time, rounded to one decimal place. I'm doing the benchmarks on an i5-7200U laptop with a pretty fast SSD.
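
A minimal sketch of how these four runs could be scripted, for anyone wanting to repeat them. It only builds and prints the command lines. The directory layout (Jena distributions unpacked side by side, stores in tdb1 and tdb2) follows the loader commands above, while the query file name query.rq and the use of --query to pass it are assumptions; check tdbquery --help for your version.

```python
import shlex

# Assumed layout: apache-jena-<version>/ directories next to the stores
# ./tdb1 and ./tdb2, with the SPARQL query saved as query.rq (hypothetical).
JENA_VERSIONS = ["3.8.0", "3.16.0"]
STORES = {"tdb1": "tdbquery", "tdb2": "tdb2.tdbquery"}


def build_command(version, store, query_file="query.rq"):
    """Return the argument list for one benchmark run."""
    tool = STORES[store]
    return [
        f"apache-jena-{version}/bin/{tool}",
        "--loc", store,
        "--time",
        "--repeat", "2,10",  # 2 warm-up rounds, then 10 timed rounds
        "--query", query_file,
    ]


if __name__ == "__main__":
    # Print one command per Jena version and store type.
    for version in JENA_VERSIONS:
        for store in STORES:
            print(shlex.join(build_command(version, store)))
```

The printed lines can be pasted into a shell, or passed to subprocess.run if the runs should be automated.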


Jena 3.8.0 / TDB1: 2.1s
Jena 3.16.0 / TDB1: 2.4s
Jena 3.8.0 / TDB2: 11.8s
Jena 3.16.0 / TDB2: 12.0s

The difference between Jena versions is not very significant, but TDB2 is 5-6 times slower than TDB1. Here is how tdbquery -v explains the query on the TDB level:


17:06:07 INFO  exec            :: TDB
  (distinct
    (project (?l)
      (extend ((?l (ucase (str (substr ?label 1 1)))))
        (filter (langMatches (lang ?label) "fi")
          (bgp
            (triple ?c <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept>)
            (triple ?c <http://www.w3.org/2004/02/skos/core#prefLabel> ?label)
          )))))

The explanation is identical for tdb2.tdbquery so I won't repeat it.

I then looked at ways of optimizing the query to make it perform better. After trying many variations (for example reordering the clauses and/or moving the substring expression to a BIND variable), the only change that seemed to have a significant effect was to remove the FROM clause and instead insert a GRAPH clause targeting the same graph, like this:


        PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

        SELECT DISTINCT (ucase(str(substr(?label, 1, 1))) as ?l)
        WHERE {
          GRAPH <http://example.org/finaf> {
            ?c a skos:Concept .
            ?c skos:prefLabel ?label .
            FILTER(langMatches(lang(?label), 'fi'))
          }
        }

Benchmark results for this GRAPH version of the query:

Jena 3.8.0 / TDB1: 0.9s
Jena 3.16.0 / TDB1: 1.3s
Jena 3.8.0 / TDB2: 1.4s
Jena 3.16.0 / TDB2: 1.9s
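
To make the two tables easier to compare, the relative slowdowns can be worked out with a few lines of Python. This is only a sketch: the times are the averages reported above, and the dictionary layout and the ratio helper are purely illustrative.

```python
# Average query times in seconds, copied from the two tables above.
TIMES = {
    "FROM": {
        ("3.8.0", "TDB1"): 2.1, ("3.16.0", "TDB1"): 2.4,
        ("3.8.0", "TDB2"): 11.8, ("3.16.0", "TDB2"): 12.0,
    },
    "GRAPH": {
        ("3.8.0", "TDB1"): 0.9, ("3.16.0", "TDB1"): 1.3,
        ("3.8.0", "TDB2"): 1.4, ("3.16.0", "TDB2"): 1.9,
    },
}


def ratio(variant, slow, fast):
    """How many times slower the `slow` configuration is than `fast`."""
    return TIMES[variant][slow] / TIMES[variant][fast]


if __name__ == "__main__":
    for variant in TIMES:
        # Store type effect, holding the Jena version fixed.
        for version in ("3.8.0", "3.16.0"):
            r = ratio(variant, (version, "TDB2"), (version, "TDB1"))
            print(f"{variant}, Jena {version}: TDB2/TDB1 = {r:.2f}")
        # Jena version effect, holding the store type fixed.
        for store in ("TDB1", "TDB2"):
            r = ratio(variant, ("3.16.0", store), ("3.8.0", store))
            print(f"{variant}, {store}: 3.16.0/3.8.0 = {r:.2f}")
```

Since the inputs are rounded to one decimal, the ratios for the sub-second GRAPH timings are only rough indications.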

The results are much more even this time, though Jena 3.16.0 is about 40% slower than 3.8.0 and TDB2 is about 50% slower than TDB1. tdbquery -v (and tdb2.tdbquery -v) explains the query like this:


17:13:02 INFO  exec            :: TDB
  (distinct
    (project (?l)
      (extend ((?l (ucase (str (substr ?label 1 1)))))
        (filter (langMatches (lang ?label) "fi")
          (quadpattern
            (quad <http://example.org/finaf> ?c <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept>)
            (quad <http://example.org/finaf> ?c <http://www.w3.org/2004/02/skos/core#prefLabel> ?label)
          )))))

The difference I see compared to the previous query is the use of "quad" instead of "triple". My understanding of operations on the TDB level is pretty naive, but it seems to me this is now targeting the correct graph directly, instead of indirectly, as in the first case. This is a bit surprising to me since the "FROM <http://example.org/finaf>" clause in the first query is, to me, saying the same thing as the GRAPH clause: just target triples in this particular graph. Is there a missed opportunity for some optimization here? Why is FROM (much) worse than GRAPH?

I also wonder why TDB2 is so much slower than TDB1, especially for the first version of the query. It should be an improvement, right? Of course there are trade-offs in implementing any complex system. But it makes me wonder whether we should stick to TDB1 for the time being, as there are no obvious benefits in using TDB2 for our current use.

Likewise, this makes me wonder whether there has been a mild decrease in performance between Jena 3.8.0 and 3.16.0 - though I didn't look at intermediate versions to pinpoint the exact change (or several) that would be causing the slowdown. If there's interest, I can try other versions as well.

For now we will probably just change Skosmos to use the GRAPH variant of the query, which should fix the immediate problems with timeouts. Unfortunately I don't have the skills to work directly on the ARQ optimizer or TDB2 code bases. But I'd be happy to test other variations and potential fixes to these performance problems.


Cheers,
Osma

[1] https://finto.fi/rest/v1/finaf/data?format=text/turtle


--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 15 (Unioninkatu 36)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
[email protected]
http://www.nationallibrary.fi
