On 18/03/2024 17:46, Jim Balhoff wrote:
Hi,

I’m running a particular query in a Fuseki server which performs very 
differently if the data is in a named graph vs. the default graph. I’m 
wondering if it’s expected to have a large performance hit if a named graph is 
specified. The dataset consists of ~462 million triples; it’s this dataset with 
all graphs merged together: 
https://github.com/INCATools/ubergraph?tab=readme-ov-file#downloads

I have loaded all the triples into a named graph in TDB2 using this command:

tdb2.tdbloader --loc tdb --graph 'http://example.org/ubergraph' ubergraph.nt.gz

My fuseki config is like this:

[] rdf:type fuseki:Server ;
     ja:context [ ja:cxtName "arq:queryTimeout" ;  ja:cxtValue "120000" ] ;
     fuseki:services ( <#my-service> ) .

<#my-service> rdf:type fuseki:Service ;
     fuseki:name                  "union" ;
     fuseki:serviceQuery          "sparql" ;
     fuseki:serviceReadGraphStore "get" ;
     fuseki:dataset               <#dataset> .

<#dataset> rdf:type      tdb2:DatasetTDB2 ;
     tdb2:location "tdb" ;
     tdb2:unionDefaultGraph true .

This is my query:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX cell: <http://purl.obolibrary.org/obo/CL_0000000>
PREFIX organ: <http://purl.obolibrary.org/obo/UBERON_0000062>
PREFIX abdomen: <http://purl.obolibrary.org/obo/UBERON_0000916>
PREFIX part_of: <http://purl.obolibrary.org/obo/BFO_0000050>
SELECT DISTINCT ?cell ?organ
FROM <http://example.org/ubergraph>
WHERE {
   ?cell rdfs:subClassOf cell: .
   ?cell part_of: ?organ .
   ?organ rdfs:subClassOf organ: .
   ?organ part_of: abdomen: .
   ?cell rdfs:label ?cell_label .
   ?organ rdfs:label ?organ_label .
}

Using the FROM line causes the query to complete in about 40 seconds. Deleting 
the FROM line allows the query to complete in about 5 seconds.

The reason I was testing this in TDB2 is that I first noticed this behavior 
with an HDT backend, and wanted to make sure it wasn’t only an HDT issue. If I 
create a dataset using an HDT graph as the default graph, the query completes 
in a fraction of a second, but if I use the graph as a named graph the time 
jumps to about 20 seconds. For both of these scenarios (TDB2 and HDT) there is 
only a single named graph in the dataset.

Is there any way to improve performance when using FROM in the query?

Hi Jim,

What happens if you use GRAPH rather than FROM?

WHERE {
   GRAPH <http://example.org/ubergraph> {
     ?cell rdfs:subClassOf cell: .
     ?cell part_of: ?organ .
     ?organ rdfs:subClassOf organ: .
     ?organ part_of: abdomen: .
     ?cell rdfs:label ?cell_label .
     ?organ rdfs:label ?organ_label .
   }
}

FROM builds a "view dataset" which is general purpose (e.g. multiple FROM are possible) but which is less efficient for basic graph pattern matching. It does not use the TDB2 basic graph pattern matcher.

GRAPH restricts matching to a single graph, and the query goes directly to the TDB2 basic graph pattern matcher.
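
For reference, here is a sketch of the complete query in the GRAPH form. It assumes the same prefixes and the same graph name as the original query, with the FROM line removed so that no view dataset is built. I have not timed it against this data.

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX cell: <http://purl.obolibrary.org/obo/CL_0000000>
PREFIX organ: <http://purl.obolibrary.org/obo/UBERON_0000062>
PREFIX abdomen: <http://purl.obolibrary.org/obo/UBERON_0000916>
PREFIX part_of: <http://purl.obolibrary.org/obo/BFO_0000050>

SELECT DISTINCT ?cell ?organ
# No FROM clause: the dataset is used as stored.
WHERE {
   # All six triple patterns are matched inside the one named graph.
   GRAPH <http://example.org/ubergraph> {
     ?cell rdfs:subClassOf cell: .
     ?cell part_of: ?organ .
     ?organ rdfs:subClassOf organ: .
     ?organ part_of: abdomen: .
     ?cell rdfs:label ?cell_label .
     ?organ rdfs:label ?organ_label .
   }
}
```

With a single named graph in the dataset, this should return the same rows as the FROM version; only the evaluation path differs.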

----

If there is only one named graph, is there a reason to have it as a named graph? Using the default graph and no unionDefaultGraph may be

    Andy


Thank you,
Jim
