query performance on named graph vs. default graph

Jim Balhoff Mon, 18 Mar 2024 10:46:43 -0700

Hi,

I’m running a particular query in a Fuseki server which performs very 
differently if the data is in a named graph vs. the default graph. I’m 
wondering if it’s expected to have a large performance hit if a named graph is 
specified. The dataset consists of ~462 million triples; it’s this dataset with 
all graphs merged together: 
https://github.com/INCATools/ubergraph?tab=readme-ov-file#downloads


I have loaded all the triples into a named graph in TDB2 using this command: 

tdb2.tdbloader --loc tdb --graph 'http://example.org/ubergraph’ ubergraph.nt.gz

My fuseki config is like this:

[] rdf:type fuseki:Server ;
    ja:context [ ja:cxtName "arq:queryTimeout" ;  ja:cxtValue "120000" ] ;
    fuseki:services ( <#my-service> ) .

<#my-service> rdf:type fuseki:Service ;
    fuseki:name                  "union" ;
    fuseki:serviceQuery          "sparql" ;
    fuseki:serviceReadGraphStore "get" ;
    fuseki:dataset               <#dataset> .

<#dataset> rdf:type      tdb2:DatasetTDB2 ;
    tdb2:location "tdb" ;
    tdb2:unionDefaultGraph true .

This is my query:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX cell: <http://purl.obolibrary.org/obo/CL_0000000>
PREFIX organ: <http://purl.obolibrary.org/obo/UBERON_0000062>
PREFIX abdomen: <http://purl.obolibrary.org/obo/UBERON_0000916>
PREFIX part_of: <http://purl.obolibrary.org/obo/BFO_0000050>
SELECT DISTINCT ?cell ?organ 
FROM <http://example.org/ubergraph>
WHERE {
  ?cell rdfs:subClassOf cell: .
  ?cell part_of: ?organ .
  ?organ rdfs:subClassOf organ: .
  ?organ part_of: abdomen: .
  ?cell rdfs:label ?cell_label .
  ?organ rdfs:label ?organ_label .
}

Using the FROM line causes the query to complete in about 40 seconds. Deleting 
the FROM line allows the query to complete in about 5 seconds.

The reason I was testing this in TDB2 is that I first noticed this behavior 
with an HDT backend, and wanted to make sure it wasn’t only an HDT issue. If I 
create a dataset using an HDT graph as the default graph, the query completes 
in a fraction of a second, but if I use the graph as a named graph the time 
jumps to about 20 seconds. For both of these scenarios (TDB2 and HDT) there is 
only a single named graph in the dataset.

Is there any way to improve performance when using FROM in the query?

Thank you,
Jim

query performance on named graph vs. default graph

Reply via email to