Hi,

I’m running a query on a Fuseki server that performs very differently 
depending on whether the data is in a named graph or the default graph. Is a 
large performance hit expected when a named graph is specified? The dataset 
consists of ~462 million triples; it’s this dataset with all graphs merged 
together: 
https://github.com/INCATools/ubergraph?tab=readme-ov-file#downloads

I have loaded all the triples into a named graph in TDB2 using this command: 

tdb2.tdbloader --loc tdb --graph 'http://example.org/ubergraph' ubergraph.nt.gz

My fuseki config is like this:

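PREFIX rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX fuseki: <http://jena.apache.org/fuseki#>
PREFIX ja:     <http://jena.hpl.hp.com/2005/11/Assembler#>
PREFIX tdb2:   <http://jena.apache.org/2016/tdb#>

[] rdf:type fuseki:Server ;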
    ja:context [ ja:cxtName "arq:queryTimeout" ;  ja:cxtValue "120000" ] ;
    fuseki:services ( <#my-service> ) .

<#my-service> rdf:type fuseki:Service ;
    fuseki:name                  "union" ;
    fuseki:serviceQuery          "sparql" ;
    fuseki:serviceReadGraphStore "get" ;
    fuseki:dataset               <#dataset> .

<#dataset> rdf:type      tdb2:DatasetTDB2 ;
    tdb2:location "tdb" ;
    tdb2:unionDefaultGraph true .

This is my query:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX cell: <http://purl.obolibrary.org/obo/CL_0000000>
PREFIX organ: <http://purl.obolibrary.org/obo/UBERON_0000062>
PREFIX abdomen: <http://purl.obolibrary.org/obo/UBERON_0000916>
PREFIX part_of: <http://purl.obolibrary.org/obo/BFO_0000050>
SELECT DISTINCT ?cell ?organ 
FROM <http://example.org/ubergraph>
WHERE {
  ?cell rdfs:subClassOf cell: .
  ?cell part_of: ?organ .
  ?organ rdfs:subClassOf organ: .
  ?organ part_of: abdomen: .
  ?cell rdfs:label ?cell_label .
  ?organ rdfs:label ?organ_label .
}

With the FROM line, the query completes in about 40 seconds. Without the FROM 
line, it completes in about 5 seconds.
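
For anyone who wants to reproduce the comparison, here is a rough sketch of 
how the two timings could be collected with curl. It is only an illustration: 
the endpoint URL assumes Fuseki on its default port 3030 with the "union" 
service and "sparql" query endpoint from the config above, query.rq is a 
hypothetical file holding the query above, and strip_from and time_query are 
helper names made up for this sketch.

```shell
#!/bin/sh
# Assumed endpoint: Fuseki default port 3030, service "union", endpoint "sparql".
# Override by setting ENDPOINT in the environment.
ENDPOINT="${ENDPOINT:-http://localhost:3030/union/sparql}"

# Print the query in file $1 with any FROM / FROM NAMED lines removed.
# Assumes each FROM clause sits on its own line, as in the query above.
strip_from() {
    grep -v -i '^[[:space:]]*FROM[[:space:]]' "$1"
}

# POST the query in file $1 as a form-encoded "query" parameter and print
# the total request time in seconds (results are discarded).
time_query() {
    curl -s -o /dev/null -w '%{time_total}\n' \
        -H 'Accept: application/sparql-results+json' \
        --data-urlencode "query@$1" "$ENDPOINT"
}

# Usage (query.rq is a hypothetical file name):
#   time_query query.rq                      # with FROM
#   strip_from query.rq > query-nofrom.rq
#   time_query query-nofrom.rq               # without FROM
```

Running each variant a few times should help separate cold-cache effects from 
the difference attributable to FROM.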

The reason I was testing this in TDB2 is that I first noticed this behavior 
with an HDT backend, and wanted to make sure it wasn’t only an HDT issue. If I 
create a dataset using an HDT graph as the default graph, the query completes 
in a fraction of a second, but if I use the graph as a named graph the time 
jumps to about 20 seconds. For both of these scenarios (TDB2 and HDT) there is 
only a single named graph in the dataset.

Is there any way to improve performance when using FROM in the query?

Thank you,
Jim
