Re: query performance on named graph vs. default graph

2024-03-19 Thread Jim Balhoff
Hi Andy,

> On Mar 19, 2024, at 5:02 AM, Andy Seaborne  wrote:
>> 
> 
> Hi Jim,
> 
> What happens if you use GRAPH rather than FROM?
> 
> WHERE {
>   GRAPH <http://example.org/ubergraph> {
>     ?cell rdfs:subClassOf cell: .
>     ?cell part_of: ?organ .
>     ?organ rdfs:subClassOf organ: .
>     ?organ part_of: abdomen: .
>     ?cell rdfs:label ?cell_label .
>     ?organ rdfs:label ?organ_label .
>   }
> }
> 

This does help. With TDB this is actually faster than using the default graph. 
With the HDT setup it’s about the same (fast). But it doesn’t work that well 
for what I’m trying to do (below).

> FROM builds a "view dataset" which is general purpose (e.g. multiple FROM are 
> possible) but which is less efficient for basic graph pattern matching. It 
> does not use the TDB2 basic graph pattern matcher.
> 
> GRAPH restricts to a single graph and the query goes direct to TDB2 basic 
> graph pattern matcher.
> 
> 
> 
> If there is only one named graph, is there a reason to have it as a named 
> graph? Using the default graph and no unionDefaultGraph may be

What I am really trying to do is have a suite of large graphs that I can choose 
to include or not in a particular query, depending on what data sources I want 
to use in the query. I have several HDT files, one for each data source. I set 
this up as a dataset with a named graph for each data file, and was at first 
very happy with how it performed while turning on and off graphs using FROM 
lines. For example I have Wikidata in one HDT file, and it looks like having it 
available doesn’t slow down queries on other graphs when it’s not included. 
However I did see that performance issue in the query I asked about, and found 
it wasn’t related to having multiple graphs loaded; it happens even with just 
that one graph configured.

If I wrote my own server that accepted a list of data source names in a query 
parameter, and then for each request constructed a union model for executing 
the query over the required HDT graphs, would that work any better? Or is that 
basically the same as what FROM is doing?
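
To make the idea concrete, here is a minimal sketch of only the request-handling part of such a server: turning a query parameter into the list of graphs to include. It says nothing about whether a union model built from those graphs would outperform FROM. The parameter name `source`, the registry, and the Wikidata graph IRI are assumptions made up for illustration.

```python
from urllib.parse import parse_qs

# Hypothetical registry: data source name -> graph IRI.
# Only the ubergraph IRI appears in this thread; the other is made up.
SOURCES = {
    "ubergraph": "http://example.org/ubergraph",
    "wikidata": "http://example.org/wikidata",
}

def selected_graphs(query_string, param="source"):
    """Return the graph IRIs for the data sources named in the request."""
    names = parse_qs(query_string).get(param, [])
    unknown = [n for n in names if n not in SOURCES]
    if unknown:
        raise KeyError(f"unknown data source(s): {unknown}")
    # Preserve request order but drop repeated names.
    return list(dict.fromkeys(SOURCES[n] for n in names))

print(selected_graphs("source=ubergraph&source=wikidata"))
```

Rejecting unknown names up front keeps arbitrary IRIs from the request out of whatever dataset is assembled afterwards.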

Thank you,
Jim



Re: [EXTERNAL] Re: Query Performance Degrade With Sorting In Subquery

2024-03-19 Thread Andy Seaborne

Hi there,

Could you give some background as to what the sub-select / ORDER / LIMIT 
blocks are trying to achieve? Maybe there is another way.


Andy

On 19/03/2024 10:50, Rob @ DNR wrote:

You haven’t specified how your data is stored, but assuming you are using Jena’s 
TDB/TDB2, the triples/quads themselves are already indexed for efficient 
access.  TDB also inlines some value types, which speeds up some comparisons and 
filters, including those used in simple ORDER BY expressions as in your example.

This assumes that your objects for relations:hasUserCount triples are properly 
typed as xsd:integer or another well-known XSD numeric type; if not, Jena is 
forced to fall back to simpler lexical string sorting, which can be more 
expensive.

However, there is no indexing available for sorting because SPARQL allows for 
arbitrarily complex sort expressions, and the inputs to those expressions may 
themselves be dynamically computed values that don’t exist in the underlying 
dataset directly.

Rob

From: Chirag Ratra 
Date: Tuesday, 19 March 2024 at 10:39
To: users@jena.apache.org , Andy Seaborne , 
dcchabg...@gmail.com 
Subject: Re: [EXTERNAL] Re: Query Performance Degrade With Sorting In Subquery
Is there any way to create an index or something?

On Tue, Mar 19, 2024 at 3:46 PM Rob @ DNR  wrote:


This is due to Jena’s lazy evaluation in its query engine.

When you include a LIMIT clause on its own, Jena only needs to find the first
N results (10 in your example), at which point it can abort any further
processing and return results.  In this case evaluation is lazy.

When you include LIMIT and ORDER BY clauses Jena has to find all possible
results, sort them, and then return only the first N results.  In this case
full evaluation is required.

One possible approach might be to split this into multiple queries, i.e. run one
query to get your main set of results, and then separately issue the
related-item sub-queries with concrete values substituted for your
?concept and ?titleSkosxl variables. Jena will still need to do full
evaluation, but injecting a concrete value will constrain the query evaluation
further.

Hope this helps,

Rob

From: Chirag Ratra 
Date: Tuesday, 19 March 2024 at 07:46
To: users@jena.apache.org 
Subject: Query Performance Degrade With Sorting In Subquery
Hi,

I am facing a big performance degradation when sorting inside a subquery.
If I run the query without sorting, the response time is around 200 ms,
but when I add the ORDER BY clauses, it rises to around 4-5
seconds.

Here is my query :

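PREFIX text: <http://jena.apache.org/text#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX skosxl: <http://www.w3.org/2008/05/skos-xl#>
PREFIX relations: <https://cxdata.bold.com/ontologies/myDomain>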

SELECT ?concept ?titleSkosxl ?title ?languageCode (GROUP_CONCAT(DISTINCT
?relatedTitle; separator=", ") AS ?relatedTitles) (GROUP_CONCAT(DISTINCT
?alternate; separator=", ") AS ?alternates)
WHERE
{
   (?titleSkosxl ?score) text:query ('cashier').

?concept skosxl:prefLabel ?titleSkosxl.
   ?titleSkosxl skosxl:literalForm ?title.
   ?titleSkosxl relations:usedInLocale ?controlledList.
   ?controlledList relations:languageMarketCode ?languageCode
FILTER(?languageCode = 'en-US').


#  get alternate title
OPTIONAL
   {
 Select ?alternate  {
 ?concept skosxl:altLabel ?alternateSkosxl.
 ?alternateSkosxl skosxl:literalForm ?alternate;
   relations:hasUserCount ?alternateUserCount.
 }
ORDER BY DESC (?alternateUserCount) LIMIT 10
}

#  get related titles
   OPTIONAL
   {
   Select ?relatedTitle
   {
 ?titleSkosxl relations:isRelatedTo ?relatedSkosxl.
 ?relatedSkosxl skosxl:literalForm ?relatedTitle;
 relations:hasUserCount ?relatedUserCount.
   }
ORDER BY DESC (?relatedUserCount) LIMIT 10
}
}
GROUP BY ?concept ?titleSkosxl ?title ?languageCode ?alternateJobTitle
?notation
ORDER BY DESC(?jobtitleWeight) DESC(?score)
LIMIT 10

The following sort clauses cause the huge performance degradation:
ORDER BY DESC (?alternateUserCount) and ORDER BY DESC (?relatedUserCount)

How can this be improved? This sorting will be used in each and every query
in my application.

--








This email may contain material that is confidential, privileged,
or for the sole use of the intended recipient.  Any review, disclosure,
reliance, or distribution by others or forwarding without express
permission is strictly prohibited.  If you are not the intended recipient,
please contact the sender and delete all copies, including attachments.




Re: [EXTERNAL] Re: Query Performance Degrade With Sorting In Subquery

2024-03-19 Thread Rob @ DNR
You haven’t specified how your data is stored, but assuming you are using Jena’s 
TDB/TDB2, the triples/quads themselves are already indexed for efficient 
access.  TDB also inlines some value types, which speeds up some comparisons and 
filters, including those used in simple ORDER BY expressions as in your example.

This assumes that your objects for relations:hasUserCount triples are properly 
typed as xsd:integer or another well-known XSD numeric type; if not, Jena is 
forced to fall back to simpler lexical string sorting, which can be more 
expensive.

However, there is no indexing available for sorting because SPARQL allows for 
arbitrarily complex sort expressions, and the inputs to those expressions may 
themselves be dynamically computed values that don’t exist in the underlying 
dataset directly.
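
As a rough illustration of why the typing of the sort key matters (the second paragraph above), here is a small sketch in plain Python. It is not Jena's comparator, and the count values are invented; it only shows how ordering by lexical form differs from ordering by numeric value.

```python
# Hypothetical hasUserCount values as they would look if stored as plain strings.
counts = ["9", "10", "100", "25"]

# Descending sort on the lexical form compares character by character.
lexical_desc = sorted(counts, reverse=True)

# Descending sort on the numeric value, as one would expect for xsd:integer.
numeric_desc = sorted(counts, key=int, reverse=True)

print(lexical_desc)   # ['9', '25', '100', '10']
print(numeric_desc)   # ['100', '25', '10', '9']
```

So untyped counts can change which ten rows a sub-select with ORDER BY DESC and LIMIT 10 returns, not just how long the sort takes.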

Rob

From: Chirag Ratra 
Date: Tuesday, 19 March 2024 at 10:39
To: users@jena.apache.org , Andy Seaborne 
, dcchabg...@gmail.com 
Subject: Re: [EXTERNAL] Re: Query Performance Degrade With Sorting In Subquery
Is there any way to create an index or something?

On Tue, Mar 19, 2024 at 3:46 PM Rob @ DNR  wrote:

> This is due to Jena’s lazy evaluation in its query engine.
>
> When you include a LIMIT clause on its own, Jena only needs to find the first
> N results (10 in your example), at which point it can abort any further
> processing and return results.  In this case evaluation is lazy.
>
> When you include LIMIT and ORDER BY clauses Jena has to find all possible
> results, sort them, and then return only the first N results.  In this case
> full evaluation is required.
>
> One possible approach might be to split this into multiple queries, i.e. run one
> query to get your main set of results, and then separately issue the
> related-item sub-queries with concrete values substituted for your
> ?concept and ?titleSkosxl variables. Jena will still need to do full
> evaluation, but injecting a concrete value will constrain the query evaluation
> further.
>
> Hope this helps,
>
> Rob
>
> From: Chirag Ratra 
> Date: Tuesday, 19 March 2024 at 07:46
> To: users@jena.apache.org 
> Subject: Query Performance Degrade With Sorting In Subquery
> Hi,
>
> I am facing a big performance degradation when sorting inside a subquery.
> If I run the query without sorting, the response time is around 200 ms,
> but when I add the ORDER BY clauses, it rises to around 4-5
> seconds.
>
> Here is my query :
>
> PREFIX text: <http://jena.apache.org/text#>
> PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
> PREFIX skosxl: <http://www.w3.org/2008/05/skos-xl#>
> PREFIX relations: <https://cxdata.bold.com/ontologies/myDomain>
>
> SELECT ?concept ?titleSkosxl ?title ?languageCode (GROUP_CONCAT(DISTINCT
> ?relatedTitle; separator=", ") AS ?relatedTitles) (GROUP_CONCAT(DISTINCT
> ?alternate; separator=", ") AS ?alternates)
> WHERE
> {
>   (?titleSkosxl ?score) text:query ('cashier').
>
> ?concept skosxl:prefLabel ?titleSkosxl.
>   ?titleSkosxl skosxl:literalForm ?title.
>   ?titleSkosxl relations:usedInLocale ?controlledList.
>   ?controlledList relations:languageMarketCode ?languageCode
> FILTER(?languageCode = 'en-US').
>
>
> #  get alternate title
> OPTIONAL
>   {
> Select ?alternate  {
> ?concept skosxl:altLabel ?alternateSkosxl.
> ?alternateSkosxl skosxl:literalForm ?alternate;
>   relations:hasUserCount ?alternateUserCount.
> }
> ORDER BY DESC (?alternateUserCount) LIMIT 10
> }
>
> #  get related titles
>   OPTIONAL
>   {
>   Select ?relatedTitle
>   {
> ?titleSkosxl relations:isRelatedTo ?relatedSkosxl.
> ?relatedSkosxl skosxl:literalForm ?relatedTitle;
> relations:hasUserCount ?relatedUserCount.
>   }
> ORDER BY DESC (?relatedUserCount) LIMIT 10
>}
> }
> GROUP BY ?concept ?titleSkosxl ?title ?languageCode ?alternateJobTitle
> ?notation
> ORDER BY DESC(?jobtitleWeight) DESC(?score)
> LIMIT 10
>
> The following sort clauses cause the huge performance degradation:
> ORDER BY DESC (?alternateUserCount) and ORDER BY DESC (?relatedUserCount)
>
> How can this be improved? This sorting will be used in each and every query
> in my application.
>


Re: [EXTERNAL] Re: Query Performance Degrade With Sorting In Subquery

2024-03-19 Thread Chirag Ratra
Is there any way to create an index or something?



Re: Query Performance Degrade With Sorting In Subquery

2024-03-19 Thread Rob @ DNR
This is due to Jena’s lazy evaluation in its query engine.

When you include a LIMIT clause on its own, Jena only needs to find the first N 
results (10 in your example), at which point it can abort any further processing 
and return results.  In this case evaluation is lazy.

When you include LIMIT and ORDER BY clauses Jena has to find all possible 
results, sort them, and then return only the first N results.  In this case 
full evaluation is required.
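
The difference between the two cases can be sketched outside Jena with a plain Python generator. This is only an analogy under made-up data, not how the query engine is implemented: `islice` plays the role of LIMIT on a lazy stream, and `heapq.nlargest` plays the role of ORDER BY DESC plus LIMIT.

```python
import heapq
from itertools import islice

def matches(rows, counter):
    """Yield rows one at a time, counting how many were actually produced."""
    for row in rows:
        counter["produced"] += 1
        yield row

# Hypothetical (label, user_count) solutions; the values are invented.
rows = [(f"title-{i}", (i * 37) % 1000) for i in range(1000)]

# LIMIT 10 alone: stop as soon as 10 solutions have been seen.
lazy = {"produced": 0}
first_ten = list(islice(matches(rows, lazy), 10))

# ORDER BY DESC(count) LIMIT 10: every solution has to be inspected,
# because the largest count could be the last one produced.
full = {"produced": 0}
top_ten = heapq.nlargest(10, matches(rows, full), key=lambda r: r[1])

print(lazy["produced"], full["produced"])  # prints: 10 1000
```

The cost of the sorted case grows with the total number of matches, not with N, which is consistent with the jump from about 200 ms to several seconds reported in the question.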

One possible approach might be to split this into multiple queries, i.e. run one query 
to get your main set of results, and then separately issue the related-item 
sub-queries with concrete values substituted for your ?concept and 
?titleSkosxl variables. Jena will still need to do full evaluation, but 
injecting a concrete value will constrain the query evaluation further.
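
A minimal sketch of the second step of that split, assuming the first query has already returned the ?concept IRIs. The function name and the example IRI are made up for illustration, and the PREFIX declarations from the original query are assumed to be prepended by the caller.

```python
import re

# Conservative check for an absolute IRI that is safe to place inside <...>.
_IRI = re.compile(r'^[A-Za-z][A-Za-z0-9+.-]*:[^\s<>"{}|\\^`]*$')

def alternates_query(concept_iri, limit=10):
    """Build the follow-up query for one concrete ?concept value."""
    if not _IRI.match(concept_iri):
        raise ValueError(f"not a usable IRI: {concept_iri!r}")
    return (
        "SELECT ?alternate WHERE {\n"
        f"  <{concept_iri}> skosxl:altLabel ?alternateSkosxl .\n"
        "  ?alternateSkosxl skosxl:literalForm ?alternate ;\n"
        "                   relations:hasUserCount ?alternateUserCount .\n"
        "}\n"
        "ORDER BY DESC(?alternateUserCount)\n"
        f"LIMIT {int(limit)}"
    )

# Hypothetical concept IRI taken from the first query's results.
print(alternates_query("http://example.org/concept/123"))
```

Each follow-up query still sorts all of its matches, but only the alternates of a single concept rather than every altLabel in the dataset. The same pattern would apply to the related-titles sub-select with ?titleSkosxl.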

Hope this helps,

Rob

From: Chirag Ratra 
Date: Tuesday, 19 March 2024 at 07:46
To: users@jena.apache.org 
Subject: Query Performance Degrade With Sorting In Subquery
Hi,

I am facing a big performance degradation when sorting inside a subquery.
If I run the query without sorting, the response time is around 200 ms,
but when I add the ORDER BY clauses, it rises to around 4-5
seconds.

Here is my query :

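PREFIX text: <http://jena.apache.org/text#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX skosxl: <http://www.w3.org/2008/05/skos-xl#>
PREFIX relations: <https://cxdata.bold.com/ontologies/myDomain>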

SELECT ?concept ?titleSkosxl ?title ?languageCode (GROUP_CONCAT(DISTINCT
?relatedTitle; separator=", ") AS ?relatedTitles) (GROUP_CONCAT(DISTINCT
?alternate; separator=", ") AS ?alternates)
WHERE
{
  (?titleSkosxl ?score) text:query ('cashier').

?concept skosxl:prefLabel ?titleSkosxl.
  ?titleSkosxl skosxl:literalForm ?title.
  ?titleSkosxl relations:usedInLocale ?controlledList.
  ?controlledList relations:languageMarketCode ?languageCode
FILTER(?languageCode = 'en-US').


#  get alternate title
OPTIONAL
  {
Select ?alternate  {
?concept skosxl:altLabel ?alternateSkosxl.
?alternateSkosxl skosxl:literalForm ?alternate;
  relations:hasUserCount ?alternateUserCount.
}
ORDER BY DESC (?alternateUserCount) LIMIT 10
}

#  get related titles
  OPTIONAL
  {
  Select ?relatedTitle
  {
?titleSkosxl relations:isRelatedTo ?relatedSkosxl.
?relatedSkosxl skosxl:literalForm ?relatedTitle;
relations:hasUserCount ?relatedUserCount.
  }
ORDER BY DESC (?relatedUserCount) LIMIT 10
   }
}
GROUP BY ?concept ?titleSkosxl ?title ?languageCode ?alternateJobTitle
?notation
ORDER BY DESC(?jobtitleWeight) DESC(?score)
LIMIT 10

The following sort clauses cause the huge performance degradation:
ORDER BY DESC (?alternateUserCount) and ORDER BY DESC (?relatedUserCount)

How can this be improved? This sorting will be used in each and every query
in my application.



Re: query performance on named graph vs. default graph

2024-03-19 Thread Andy Seaborne




On 18/03/2024 17:46, Jim Balhoff wrote:

Hi,

I’m running a particular query in a Fuseki server which performs very 
differently if the data is in a named graph vs. the default graph. I’m 
wondering if it’s expected to have a large performance hit if a named graph is 
specified. The dataset consists of ~462 million triples; it’s this dataset with 
all graphs merged together: 
https://github.com/INCATools/ubergraph?tab=readme-ov-file#downloads

I have loaded all the triples into a named graph in TDB2 using this command:

tdb2.tdbloader --loc tdb --graph 'http://example.org/ubergraph' ubergraph.nt.gz

My fuseki config is like this:

[] rdf:type fuseki:Server ;
 ja:context [ ja:cxtName "arq:queryTimeout" ;  ja:cxtValue "12" ] ;
 fuseki:services ( <#my-service> ) .

<#my-service> rdf:type fuseki:Service ;
 fuseki:name  "union" ;
 fuseki:serviceQuery  "sparql" ;
 fuseki:serviceReadGraphStore "get" ;
 fuseki:dataset   <#dataset> .

<#dataset> rdf:type  tdb2:DatasetTDB2 ;
 tdb2:location "tdb" ;
 tdb2:unionDefaultGraph true .

This is my query:

PREFIX rdfs: 
PREFIX cell: 
PREFIX organ: 
PREFIX abdomen: 
PREFIX part_of: 
SELECT DISTINCT ?cell ?organ
FROM <http://example.org/ubergraph>
WHERE {
   ?cell rdfs:subClassOf cell: .
   ?cell part_of: ?organ .
   ?organ rdfs:subClassOf organ: .
   ?organ part_of: abdomen: .
   ?cell rdfs:label ?cell_label .
   ?organ rdfs:label ?organ_label .
}

Using the FROM line causes the query to complete in about 40 seconds. Deleting 
the FROM line allows the query to complete in about 5 seconds.

The reason I was testing this in TDB2 is that I first noticed this behavior 
with an HDT backend, and wanted to make sure it wasn’t only an HDT issue. If I 
create a dataset using an HDT graph as the default graph, the query completes 
in a fraction of a second, but if I use the graph as a named graph the time 
jumps to about 20 seconds. For both of these scenarios (TDB2 and HDT) there is 
only a single named graph in the dataset.

Is there any way to improve performance when using FROM in the query?


Hi Jim,

What happens if you use GRAPH rather than FROM?

WHERE {
  GRAPH <http://example.org/ubergraph> {
    ?cell rdfs:subClassOf cell: .
    ?cell part_of: ?organ .
    ?organ rdfs:subClassOf organ: .
    ?organ part_of: abdomen: .
    ?cell rdfs:label ?cell_label .
    ?organ rdfs:label ?organ_label .
  }
}

FROM builds a "view dataset" which is general purpose (e.g. multiple 
FROM are possible) but which is less efficient for basic graph pattern 
matching. It does not use the TDB2 basic graph pattern matcher.


GRAPH restricts to a single graph and the query goes direct to TDB2 
basic graph pattern matcher.
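
For anyone generating queries programmatically, the GRAPH form is easy to produce. The following is a small sketch with a hypothetical helper (`graph_query` is not part of Jena); the PREFIX declarations are assumed to be prepended separately.

```python
def graph_query(graph_iri, patterns, projection="*"):
    """Wrap a list of triple patterns in a GRAPH block for one named graph."""
    body = "\n".join(f"    {p} ." for p in patterns)
    return (
        f"SELECT DISTINCT {projection}\n"
        "WHERE {\n"
        f"  GRAPH <{graph_iri}> {{\n"
        f"{body}\n"
        "  }\n"
        "}"
    )

# Graph name as used in the tdbloader command earlier in this thread.
print(graph_query(
    "http://example.org/ubergraph",
    ["?cell rdfs:subClassOf cell:", "?cell part_of: ?organ"],
    projection="?cell ?organ",
))
```

Note that a GRAPH block matches the whole pattern inside one graph, so this is not a drop-in replacement for several FROM clauses when a pattern has to join across graphs.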




If there is only one named graph, is there a reason to have it as a named 
graph? Using the default graph and no unionDefaultGraph may be


Andy



Thank you,
Jim



Query Performance Degrade With Sorting In Subquery

2024-03-19 Thread Chirag Ratra
Hi,

I am facing a big performance degradation when sorting inside a subquery.
If I run the query without sorting, the response time is around 200 ms,
but when I add the ORDER BY clauses, it rises to around 4-5
seconds.

Here is my query :

PREFIX text: <http://jena.apache.org/text#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX skosxl: <http://www.w3.org/2008/05/skos-xl#>
PREFIX relations: <https://cxdata.bold.com/ontologies/myDomain>

SELECT ?concept ?titleSkosxl ?title ?languageCode
       (GROUP_CONCAT(DISTINCT ?relatedTitle; separator=", ") AS ?relatedTitles)
       (GROUP_CONCAT(DISTINCT ?alternate; separator=", ") AS ?alternates)
WHERE
{
  (?titleSkosxl ?score) text:query ('cashier') .

  ?concept skosxl:prefLabel ?titleSkosxl .
  ?titleSkosxl skosxl:literalForm ?title .
  ?titleSkosxl relations:usedInLocale ?controlledList .
  ?controlledList relations:languageMarketCode ?languageCode
  FILTER(?languageCode = 'en-US') .

  # get alternate title
  OPTIONAL
  {
    SELECT ?alternate
    {
      ?concept skosxl:altLabel ?alternateSkosxl .
      ?alternateSkosxl skosxl:literalForm ?alternate ;
                       relations:hasUserCount ?alternateUserCount .
    }
    ORDER BY DESC(?alternateUserCount) LIMIT 10
  }

  # get related titles
  OPTIONAL
  {
    SELECT ?relatedTitle
    {
      ?titleSkosxl relations:isRelatedTo ?relatedSkosxl .
      ?relatedSkosxl skosxl:literalForm ?relatedTitle ;
                     relations:hasUserCount ?relatedUserCount .
    }
    ORDER BY DESC(?relatedUserCount) LIMIT 10
  }
}
GROUP BY ?concept ?titleSkosxl ?title ?languageCode ?alternateJobTitle ?notation
ORDER BY DESC(?jobtitleWeight) DESC(?score)
LIMIT 10

The following sort clauses cause the huge performance degradation:
ORDER BY DESC (?alternateUserCount) and ORDER BY DESC (?relatedUserCount)

How can this be improved? This sorting will be used in each and every query
in my application.
