Re: query performance on named graph vs. default graph

2024-04-08 Thread Jim Balhoff


> On Mar 24, 2024, at 5:08 PM, Andy Seaborne  wrote:
> 
> 
> 
> On 21/03/2024 00:21, Jim Balhoff wrote:
>> Hi Lorenz,
>> These both do speed things up quite a bit, but it prevents matching patterns 
>> that cross graphs in the case where I include multiple graphs.
>> Thanks,
>> Jim
> 
> It is the combination of choosing certain graphs and wanting cross-graph 
> patterns that pushes the code into working in a general way. It works in 
> Nodes, and that means string comparisons. That loses the TDB ability to do 
> faster joins using NodeIds, which both avoids string comparisons and defers 
> retrieving the strings until they are known to be needed for the results.

Thanks, this explanation makes sense.

> 
> Is there a reason for not having a union default graph over all the named 
> graphs instead of selecting certain ones? If it is all named graphs, the 
> union is done at the TDB2 level.

For certain use cases, we would like to include and exclude graphs depending 
on the query. Also, we are typically running off of HDT rather than TDB2.

> 
> You can have a Fuseki setup with two endpoints - one that does union default 
> graph, one that does not, for the same dataset.





Re: query performance on named graph vs. default graph

2024-03-24 Thread Andy Seaborne




On 21/03/2024 00:21, Jim Balhoff wrote:

Hi Lorenz,

These both do speed things up quite a bit, but it prevents matching patterns 
that cross graphs in the case where I include multiple graphs.

Thanks,
Jim


It is the combination of choosing certain graphs and wanting cross-graph 
patterns that pushes the code into working in a general way. It works in 
Nodes, and that means string comparisons. That loses the TDB ability 
to do faster joins using NodeIds, which both avoids string comparisons 
and defers retrieving the strings until they are known to be needed for 
the results.
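
To illustrate the difference between joining on Nodes and joining on 
NodeIds, here is a minimal Python sketch. It is a conceptual illustration 
only, not TDB2 code: the NodeTable class, the join_on_ids function and the 
ex: terms are all made up for this example. Terms are mapped to integers 
once, the join compares integers only, and strings are looked up only for 
rows that survive the join.

```python
from typing import Dict, List, Tuple


class NodeTable:
    """Hypothetical dictionary that maps each RDF term to an integer id and back."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}
        self._terms: List[str] = []

    def id_for(self, term: str) -> int:
        # Assign the next free id the first time a term is seen.
        if term not in self._ids:
            self._ids[term] = len(self._terms)
            self._terms.append(term)
        return self._ids[term]

    def term_for(self, node_id: int) -> str:
        return self._terms[node_id]


def join_on_ids(
    left: List[Tuple[int, int]], right: List[Tuple[int, int]]
) -> List[Tuple[int, int, int]]:
    """Hash join of (subject, object) pairs where left.object == right.subject.

    Only integers are compared; no string is touched during the join.
    """
    index: Dict[int, List[int]] = {}
    for s, o in right:
        index.setdefault(s, []).append(o)
    return [(s, mid, o) for s, mid in left for o in index.get(mid, [])]


table = NodeTable()
part_of = [
    (table.id_for("ex:hepatocyte"), table.id_for("ex:liver")),
    (table.id_for("ex:neuron"), table.id_for("ex:brain")),
]
located_in = [(table.id_for("ex:liver"), table.id_for("ex:abdomen"))]

id_rows = join_on_ids(part_of, located_in)

# Strings are retrieved only for the rows that made it through the join.
results = [tuple(table.term_for(i) for i in row) for row in id_rows]
```

A join that works on the term strings directly would have to compare and 
carry the full strings for every candidate row, including the ones that 
are later discarded.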


Is there a reason for not having a union default graph over all the named 
graphs instead of selecting certain ones? If it is all named graphs, the 
union is done at the TDB2 level.


You can have a Fuseki setup with two endpoints - one that does union 
default graph, one that does not, for the same dataset.


Andy









Re: query performance on named graph vs. default graph

2024-03-20 Thread Jim Balhoff
Hi Lorenz,

These both do speed things up quite a bit, but it prevents matching patterns 
that cross graphs in the case where I include multiple graphs.
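
A small Python sketch of why evaluating the pattern inside GRAPH ?g loses 
cross-graph matches. This is a toy in-memory model, not Jena code; the 
graph names, terms and function names are hypothetical. The two triples 
needed for one match sit in different graphs, so matching per graph finds 
nothing, while matching over the union of the chosen graphs finds the row.

```python
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]
Quad = Tuple[str, str, str, str]

# Hypothetical data: the two triples of one match live in different graphs.
QUADS: List[Quad] = [
    ("ex:g1", "ex:hepatocyte", "ex:part_of", "ex:liver"),
    ("ex:g2", "ex:liver", "ex:label", "liver"),
]


def match_pattern(triples: Iterable[Triple]) -> List[Tuple[str, str]]:
    """Match { ?cell ex:part_of ?organ . ?organ ex:label ?label } over one triple set."""
    triples = list(triples)
    return [
        (cell, label)
        for cell, p1, organ in triples
        if p1 == "ex:part_of"
        for organ2, p2, label in triples
        if p2 == "ex:label" and organ2 == organ
    ]


def per_graph(quads: List[Quad], graphs: Set[str]) -> List[Tuple[str, str]]:
    """Like GRAPH ?g { ... }: the whole pattern must match inside a single graph."""
    rows: List[Tuple[str, str]] = []
    for g in sorted(graphs):
        rows += match_pattern((s, p, o) for gg, s, p, o in quads if gg == g)
    return rows


def over_union(quads: List[Quad], graphs: Set[str]) -> List[Tuple[str, str]]:
    """Like a default graph formed as the union of the chosen graphs."""
    return match_pattern((s, p, o) for gg, s, p, o in quads if gg in graphs)


chosen = {"ex:g1", "ex:g2"}
per_graph_rows = per_graph(QUADS, chosen)
union_rows = over_union(QUADS, chosen)
```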

Thanks,
Jim


> On Mar 20, 2024, at 4:28 AM, Lorenz Buehmann 
>  wrote:
> 
> Hi,
> 
> what about
> 
> SELECT *
> FROM NAMED 
> FROM NAMED 
> FROM NAMED  ...
> FROM NAMED 
> {
>   GRAPH ?g {
>   ...
>   }
> }
> 
> or
> 
> SELECT *
> {
>  VALUES ?g {  ... }
>   GRAPH ?g {
> ...
>   }
> }
> 
> 
> does that work better?
> 
> On 19.03.24 15:21, Jim Balhoff wrote:
>> Hi Andy,
>> 
>>> On Mar 19, 2024, at 5:02 AM, Andy Seaborne  wrote:
>>> Hi Jim,
>>> 
>>> What happens if you use GRAPH rather than FROM?
>>> 
>>> WHERE {
>>>   GRAPH <http://example.org/ubergraph> {
>>> ?cell rdfs:subClassOf cell: .
>>> ?cell part_of: ?organ .
>>> ?organ rdfs:subClassOf organ: .
>>> ?organ part_of: abdomen: .
>>> ?cell rdfs:label ?cell_label .
>>> ?organ rdfs:label ?organ_label .
>>>   }
>>> }
>>> 
>> This does help. With TDB this is actually faster than using the default 
>> graph. With the HDT setup it’s about the same (fast). But it doesn’t work 
>> that well for what I’m trying to do (below).
>> 
>>> FROM builds a "view dataset" which is general purpose (e.g. multiple FROM 
>>> are possible) but which is less efficient for basic graph pattern matching. 
>>> It does not use the TDB2 basic graph pattern matcher.
>>> 
>>> GRAPH restricts to a single graph and the query goes direct to TDB2 basic 
>>> graph pattern matcher.
>>> 
>>> 
>>> 
>>> If there is only one named graph, is there a reason to have it as a named 
>>> graph? Using the default graph and no unionDefaultGraph may be
>> What I am really trying to do is have a suite of large graphs that I can 
>> choose to include or not in a particular query, depending on what data 
>> sources I want to use in the query. I have several HDT files, one for each 
>> data source. I set this up as a dataset with a named graph for each data 
>> file, and was at first very happy with how it performed while turning on and 
>> off graphs using FROM lines. For example I have Wikidata in one HDT file, 
>> and it looks like having it available doesn’t slow down queries on other 
>> graphs when it’s not included. However I did see that performance issue in 
>> the query I asked about, and found it wasn’t related to having multiple 
>> graphs loaded; it happens even with just that one graph configured.
>> 
>> If I wrote my own server that accepted a list of data source names in a 
>> query parameter, and then for each request constructed a union model for 
>> executing the query over the required HDT graphs, would that work any 
>> better? Or is that basically the same as what FROM is doing?
>> 
>> Thank you,
>> Jim
>> 
>> 
> -- 
> Lorenz Bühmann
> Research Associate/Scientific Developer
> 
> Email buehm...@infai.org
> 
> Institute for Applied Informatics e.V. (InfAI) | Goerdelerring 9 | 04109 
> Leipzig | Germany
> 



Re: Re: query performance on named graph vs. default graph

2024-03-20 Thread Lorenz Buehmann

Hi,

what about

SELECT *
FROM NAMED 
FROM NAMED 
FROM NAMED  ...
FROM NAMED 
{
  GRAPH ?g {
  ...
  }
}

or

SELECT *
{
 VALUES ?g {  ... }
  GRAPH ?g {
    ...
  }
}


does that work better?




--
Lorenz Bühmann
Research Associate/Scientific Developer

Email buehm...@infai.org

Institute for Applied Informatics e.V. (InfAI) | Goerdelerring 9 | 04109 
Leipzig | Germany



Re: query performance on named graph vs. default graph

2024-03-19 Thread Jim Balhoff
Hi Andy,

> On Mar 19, 2024, at 5:02 AM, Andy Seaborne  wrote:
>> 
> 
> Hi Jim,
> 
> What happens if you use GRAPH rather than FROM?
> 
> WHERE {
>   GRAPH <http://example.org/ubergraph> {
> ?cell rdfs:subClassOf cell: .
> ?cell part_of: ?organ .
> ?organ rdfs:subClassOf organ: .
> ?organ part_of: abdomen: .
> ?cell rdfs:label ?cell_label .
> ?organ rdfs:label ?organ_label .
>   }
> }
> 

This does help. With TDB this is actually faster than using the default graph. 
With the HDT setup it’s about the same (fast). But it doesn’t work that well 
for what I’m trying to do (below).

> FROM builds a "view dataset" which is general purpose (e.g. multiple FROM are 
> possible) but which is less efficient for basic graph pattern matching. It 
> does not use the TDB2 basic graph pattern matcher.
> 
> GRAPH restricts to a single graph and the query goes direct to TDB2 basic 
> graph pattern matcher.
> 
> 
> 
> If there is only one named graph, is there a reason to have it as a named 
> graph? Using the default graph and no unionDefaultGraph may be

What I am really trying to do is have a suite of large graphs that I can choose 
to include or not in a particular query, depending on what data sources I want 
to use in the query. I have several HDT files, one for each data source. I set 
this up as a dataset with a named graph for each data file, and was at first 
very happy with how it performed while turning on and off graphs using FROM 
lines. For example I have Wikidata in one HDT file, and it looks like having it 
available doesn’t slow down queries on other graphs when it’s not included. 
However I did see that performance issue in the query I asked about, and found 
it wasn’t related to having multiple graphs loaded; it happens even with just 
that one graph configured.
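
For illustration, turning graphs on and off with FROM lines could be 
driven by a helper like the following Python sketch. The SOURCES mapping, 
the wikidata graph IRI and the function name are hypothetical. The helper 
only generates the query text; it does not change how FROM is evaluated 
by the server.

```python
from typing import Dict, Iterable

# Hypothetical mapping from short data source names to named graph IRIs.
SOURCES: Dict[str, str] = {
    "ubergraph": "http://example.org/ubergraph",
    "wikidata": "http://example.org/wikidata",
}


def with_from_clauses(select_clause: str, where_clause: str, names: Iterable[str]) -> str:
    """Insert one FROM line per requested data source between SELECT and WHERE.

    Raises KeyError if a name is not a known data source.
    """
    from_lines = [f"FROM <{SOURCES[name]}>" for name in names]
    return "\n".join([select_clause, *from_lines, where_clause])


query = with_from_clauses(
    "SELECT ?s",
    "WHERE { ?s ?p ?o }",
    ["ubergraph", "wikidata"],
)
```

Looking names up in a fixed mapping, rather than pasting request 
parameters into the query, keeps arbitrary text out of the FROM lines.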

If I wrote my own server that accepted a list of data source names in a query 
parameter, and then for each request constructed a union model for executing 
the query over the required HDT graphs, would that work any better? Or is that 
basically the same as what FROM is doing?

Thank you,
Jim



Re: query performance on named graph vs. default graph

2024-03-19 Thread Andy Seaborne




On 18/03/2024 17:46, Jim Balhoff wrote:

Hi,

I’m running a particular query in a Fuseki server which performs very 
differently if the data is in a named graph vs. the default graph. I’m 
wondering if it’s expected to have a large performance hit if a named graph is 
specified. The dataset consists of ~462 million triples; it’s this dataset with 
all graphs merged together: 
https://github.com/INCATools/ubergraph?tab=readme-ov-file#downloads

I have loaded all the triples into a named graph in TDB2 using this command:

tdb2.tdbloader --loc tdb --graph 'http://example.org/ubergraph' ubergraph.nt.gz

My fuseki config is like this:

[] rdf:type fuseki:Server ;
 ja:context [ ja:cxtName "arq:queryTimeout" ;  ja:cxtValue "12" ] ;
 fuseki:services ( <#my-service> ) .

<#my-service> rdf:type fuseki:Service ;
 fuseki:name  "union" ;
 fuseki:serviceQuery  "sparql" ;
 fuseki:serviceReadGraphStore "get" ;
 fuseki:dataset   <#dataset> .

<#dataset> rdf:type  tdb2:DatasetTDB2 ;
 tdb2:location "tdb" ;
 tdb2:unionDefaultGraph true .

This is my query:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX cell: 
PREFIX organ: 
PREFIX abdomen: 
PREFIX part_of: 
SELECT DISTINCT ?cell ?organ
FROM <http://example.org/ubergraph>
WHERE {
   ?cell rdfs:subClassOf cell: .
   ?cell part_of: ?organ .
   ?organ rdfs:subClassOf organ: .
   ?organ part_of: abdomen: .
   ?cell rdfs:label ?cell_label .
   ?organ rdfs:label ?organ_label .
}

Using the FROM line causes the query to complete in about 40 seconds. Deleting 
the FROM line allows the query to complete in about 5 seconds.

The reason I was testing this in TDB2 is that I first noticed this behavior 
with an HDT backend, and wanted to make sure it wasn’t only an HDT issue. If I 
create a dataset using an HDT graph as the default graph, the query completes 
in a fraction of a second, but if I use the graph as a named graph the time 
jumps to about 20 seconds. For both of these scenarios (TDB2 and HDT) there is 
only a single named graph in the dataset.

Is there any way to improve performance when using FROM in the query?


Hi Jim,

What happens if you use GRAPH rather than FROM?

WHERE {
   GRAPH <http://example.org/ubergraph> {
      ?cell rdfs:subClassOf cell: .
      ?cell part_of: ?organ .
      ?organ rdfs:subClassOf organ: .
      ?organ part_of: abdomen: .
      ?cell rdfs:label ?cell_label .
      ?organ rdfs:label ?organ_label .
   }
}

FROM builds a "view dataset" which is general purpose (e.g. multiple 
FROM are possible) but which is less efficient for basic graph pattern 
matching. It does not use the TDB2 basic graph pattern matcher.


GRAPH restricts to a single graph and the query goes direct to TDB2 
basic graph pattern matcher.




If there is only one named graph, is there a reason to have it as a named 
graph? Using the default graph and no unionDefaultGraph may be


Andy



Thank you,
Jim