On 18/03/2024 17:46, Jim Balhoff wrote:
Hi,
I’m running a particular query in a Fuseki server which performs very
differently if the data is in a named graph vs. the default graph. I’m
wondering if it’s expected to have a large performance hit if a named graph is
specified. The dataset consists of ~462 million triples; it’s this dataset with
all graphs merged together:
https://github.com/INCATools/ubergraph?tab=readme-ov-file#downloads
I have loaded all the triples into a named graph in TDB2 using this command:
tdb2.tdbloader --loc tdb --graph 'http://example.org/ubergraph' ubergraph.nt.gz
My fuseki config is like this:
[] rdf:type fuseki:Server ;
    ja:context [ ja:cxtName "arq:queryTimeout" ; ja:cxtValue "120000" ] ;
    fuseki:services ( <#my-service> ) .

<#my-service> rdf:type fuseki:Service ;
    fuseki:name "union" ;
    fuseki:serviceQuery "sparql" ;
    fuseki:serviceReadGraphStore "get" ;
    fuseki:dataset <#dataset> .

<#dataset> rdf:type tdb2:DatasetTDB2 ;
    tdb2:location "tdb" ;
    tdb2:unionDefaultGraph true .
This is my query:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX cell: <http://purl.obolibrary.org/obo/CL_0000000>
PREFIX organ: <http://purl.obolibrary.org/obo/UBERON_0000062>
PREFIX abdomen: <http://purl.obolibrary.org/obo/UBERON_0000916>
PREFIX part_of: <http://purl.obolibrary.org/obo/BFO_0000050>
SELECT DISTINCT ?cell ?organ
FROM <http://example.org/ubergraph>
WHERE {
  ?cell rdfs:subClassOf cell: .
  ?cell part_of: ?organ .
  ?organ rdfs:subClassOf organ: .
  ?organ part_of: abdomen: .
  ?cell rdfs:label ?cell_label .
  ?organ rdfs:label ?organ_label .
}
Using the FROM line causes the query to complete in about 40 seconds. Deleting
the FROM line allows the query to complete in about 5 seconds.
The reason I was testing this in TDB2 is that I first noticed this behavior
with an HDT backend, and wanted to make sure it wasn’t only an HDT issue. If I
create a dataset using an HDT graph as the default graph, the query completes
in a fraction of a second, but if I use the graph as a named graph the time
jumps to about 20 seconds. For both of these scenarios (TDB2 and HDT) there is
only a single named graph in the dataset.
Is there any way to improve performance when using FROM in the query?
Hi Jim,
What happens if you use GRAPH rather than FROM?
WHERE {
  GRAPH <http://example.org/ubergraph> {
    ?cell rdfs:subClassOf cell: .
    ?cell part_of: ?organ .
    ?organ rdfs:subClassOf organ: .
    ?organ part_of: abdomen: .
    ?cell rdfs:label ?cell_label .
    ?organ rdfs:label ?organ_label .
  }
}
FROM builds a "view dataset" which is general purpose (e.g. multiple
FROM clauses are possible) but which is less efficient for basic graph
pattern matching. It does not use the TDB2 basic graph pattern matcher.
GRAPH restricts the pattern to a single graph, and the query goes directly
to the TDB2 basic graph pattern matcher.
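
For reference, the complete query with GRAPH in place of FROM might look like the sketch below. It reuses the prefixes and triple patterns from the original query unchanged; it has not been timed here, so whether it matches the default-graph timing would need checking against the actual dataset.

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX cell: <http://purl.obolibrary.org/obo/CL_0000000>
PREFIX organ: <http://purl.obolibrary.org/obo/UBERON_0000062>
PREFIX abdomen: <http://purl.obolibrary.org/obo/UBERON_0000916>
PREFIX part_of: <http://purl.obolibrary.org/obo/BFO_0000050>

SELECT DISTINCT ?cell ?organ
WHERE {
  # No FROM clause: the named graph is selected inside the pattern instead.
  GRAPH <http://example.org/ubergraph> {
    ?cell rdfs:subClassOf cell: .
    ?cell part_of: ?organ .
    ?organ rdfs:subClassOf organ: .
    ?organ part_of: abdomen: .
    ?cell rdfs:label ?cell_label .
    ?organ rdfs:label ?organ_label .
  }
}
```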
----
If there is only one named graph, is there a reason to have it as a named
graph? Using the default graph and no unionDefaultGraph may be
Andy
Thank you,
Jim