Re: Re: [EXTERNAL] Re: Query Performance Degrade With Sorting In Subquery

Lorenz Buehmann Wed, 20 Mar 2024 02:06:08 -0700

This sounds more like a use case for correlated queries, which are on the way to being added to the SPARQL standard in 1.2.

I also think that your current query doesn't do what you expect: the subquery is evaluated independently, and there will be a Cartesian product because you do not propagate the join variable. For example, the first subquery needs to project ?concept.

You can try a LATERAL clause to inline the bound data into the subquery. I'm not sure whether this happens currently, as subqueries will be evaluated first.


Like

OPTIONAL {
   LATERAL {
      { SELECT ?alternate ?concept {
          ?concept skosxl:altLabel ?alternateSkosxl .
          ?alternateSkosxl skosxl:literalForm ?alternate ;
                           relations:hasUserCount ?alternateUserCount .
        }
        ORDER BY DESC(?alternateUserCount) LIMIT 10 }
   }
}
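
To make the difference concrete, here is a small Python sketch (toy data and hypothetical function names, not Jena code and not a statement about how Jena evaluates anything). It contrasts a subquery evaluated once on its own, which yields a single top-N list for the whole dataset, with a per-concept evaluation, which is what a correlated subquery is meant to give:

```python
import heapq
from collections import defaultdict

# Toy rows of (concept, altLabel, userCount); all values are made up.
ROWS = [
    ("c1", "cashier", 50),
    ("c1", "checkout clerk", 30),
    ("c1", "till operator", 5),
    ("c2", "teller", 40),
    ("c2", "bank clerk", 20),
]


def global_top(rows, n):
    """Independent subquery: one top-n list over all rows, ignoring the concept."""
    best = heapq.nlargest(n, rows, key=lambda row: row[2])
    return [label for _, label, _ in best]


def per_concept_top(rows, n):
    """Correlated evaluation: a separate top-n list for every concept."""
    grouped = defaultdict(list)
    for concept, label, count in rows:
        grouped[concept].append((count, label))
    return {
        concept: [label for _, label in heapq.nlargest(n, items)]
        for concept, items in grouped.items()
    }
```

With n = 2, global_top mixes labels from both concepts, while per_concept_top keeps two labels for each concept separately.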


Lorenz

On 20.03.24 08:23, Chirag Ratra wrote:

Hi,

Before I share the background: @Rob, the answer to your question is that we are
using TDB2, and the object type for relations:hasUserCount is
<http://www.w3.org/2001/XMLSchema#integer>.

BACKGROUND:

The use case is that we need to do a full-text search for the search term on
the SKOS-XL prefLabel and SKOS-XL altLabel. After resolving the search term
through full-text search, we need to return the metadata, which includes all
the SKOS-XL altLabels. Since there could be many SKOS-XL altLabels, we need
to return the top 10 SKOS-XL altLabels as a collection of arrays.

Each SKOS-XL label has another triple whose
predicate is <https://cxdata.bold.com/ontologies/myDomain#hasUserCount> and
whose object is an integer value (<http://www.w3.org/2001/XMLSchema#integer>),
which is basically the rank of the label.

So I need to return the highest-ranked SKOS-XL altLabels in the collection.
The use case for related SKOS-XL labels is similar.



Here is the TTL file:

@prefix : <https://data.coypu.org/> .
@prefix fuseki: <http://jena.apache.org/fuseki#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix text: <http://jena.apache.org/text#> .
@prefix tdb2: <http://jena.apache.org/2016/tdb#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix relations: <https://cxdata.bold.com/ontologies/CareerDomain#> .

:service_tdb_all  rdf:type  fuseki:Service;
         fuseki:name      "cdm";
fuseki:endpoint [ fuseki:operation fuseki:query; fuseki:name "sparql" ];
fuseki:endpoint [ fuseki:operation fuseki:query; fuseki:name "query" ];
fuseki:endpoint [ fuseki:operation fuseki:update; fuseki:name "update"];
fuseki:endpoint [ fuseki:operation fuseki:gsp-r; ];
fuseki:endpoint [ fuseki:operation fuseki:gsp-r; fuseki:name "get" ];
fuseki:endpoint [ fuseki:operation fuseki:gsp-rw; fuseki:name "data" ];

         fuseki:dataset <#myTextDS>.

<#myTextDS> rdf:type text:TextDataset ;
     text:dataset <#myDatasetReadWrite> ;
     text:index <#indexLucene> ;
     .

<#indexLucene> a text:TextIndexLucene ;
     text:analyzer [ a text:StandardAnalyzer ];
     text:directory "run/databases/cdm-text-index";
     text:storeValues true ;
     text:entityMap <#entMap> ;
     .

<#entMap> a text:EntityMap ;
     text:entityField "uri" ;
     text:graphField "graph" ;
     text:defaultField "title" ;
     text:map (
         [ text:field "title"; text:predicate skosxl:literalForm; ]

     ) .

<#myDatasetReadWrite>
         rdf:type       tdb2:DatasetTDB2;
         tdb2:location
  "/apache-jena-fuseki/apache-jena-fuseki-5.0.0-rc1/run/databases/cdm" .



Here is my current query

PREFIX text: <http://jena.apache.org/text#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX skosxl: <http://www.w3.org/2008/05/skos-xl#>
PREFIX relations: <https://cxdata.bold.com/ontologies/myDomain#>

SELECT ?concept ?titleSkosxl ?title ?languageCode (GROUP_CONCAT(DISTINCT
?relatedTitle; separator=", ") AS ?relatedTitles) (GROUP_CONCAT(DISTINCT
?alternate; separator=", ") AS ?alternates)
WHERE
{
   (?titleSkosxl ?score) text:query ('cashier').

?concept skosxl:prefLabel ?titleSkosxl.
   ?titleSkosxl skosxl:literalForm ?title.
   ?titleSkosxl relations:usedInLocale ?controlledList.
   ?controlledList relations:languageMarketCode ?languageCode
FILTER(?languageCode = 'en-US').


#  get alternate title
OPTIONAL
   {
         Select ?alternate  {
         ?concept skosxl:altLabel ?alternateSkosxl.
         ?alternateSkosxl skosxl:literalForm ?alternate;
   relations:hasUserCount ?alternateUserCount.
         }
ORDER BY DESC (?alternateUserCount) LIMIT 10
}

#  get related titles
   OPTIONAL
   {
       Select ?relatedTitle
       {
             ?titleSkosxl relations:isRelatedTo ?relatedSkosxl.
             ?relatedSkosxl skosxl:literalForm ?relatedTitle;
             relations:hasUserCount ?relatedUserCount.
       }
ORDER BY DESC (?relatedUserCount) LIMIT 10
    }
}
GROUP BY ?concept ?titleSkosxl ?title ?languageCode ?alternateJobTitle
?notation
ORDER BY DESC(?jobtitleWeight) DESC(?score)
LIMIT 10

The sorting clauses below cause a huge performance degradation:
ORDER BY DESC (?alternateUserCount) and ORDER BY DESC (?relatedUserCount)

On Tue, Mar 19, 2024 at 5:21 PM Andy Seaborne <[email protected]> wrote:

Hi there,

Could you give some background as to what the sub-select / ORDER / LIMIT
blocks are trying to achieve? Maybe there is another way.

      Andy

On 19/03/2024 10:50, Rob @ DNR wrote:

You haven’t specified how your data is stored, but assuming you are using
Jena’s TDB/TDB2, the triples/quads themselves are already indexed for
efficient access.  It also inlines some value types, which speeds up some
comparisons and filters, including those used in simple ORDER BY expressions
as in your example.

This assumes that your objects for relations:hasUserCount triples are
properly typed as xsd:integer or another well-known XSD numeric type; if
not, Jena is forced to fall back to more simplistic lexical string sorting,
which can be more expensive.
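
As a side illustration of why the typing matters, here is a minimal Python sketch (toy values, not Jena code). It only shows the general point that ordering numbers as strings gives a different ranking from ordering them as integers, so it is worth confirming the datatype before relying on ORDER BY DESC for ranking:

```python
# Toy user counts, once as plain strings and once as integers.
counts_as_strings = ["9", "10", "100"]
counts_as_integers = [int(value) for value in counts_as_strings]

# Lexical ordering compares character by character, so "9" sorts above "100".
lexical_desc = sorted(counts_as_strings, reverse=True)

# Numeric ordering compares the values themselves.
numeric_desc = sorted(counts_as_integers, reverse=True)

print(lexical_desc)
print(numeric_desc)
```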

However, there is no indexing available for sorting because SPARQL
allows for arbitrarily complex sort expressions, and the inputs to those
expressions may themselves be dynamically computed values that don’t exist
in the underlying dataset directly.

Rob

From: Chirag Ratra <[email protected]>
Date: Tuesday, 19 March 2024 at 10:39
To: [email protected] <[email protected]>, Andy Seaborne <[email protected]>, [email protected] <[email protected]>
Subject: Re: [EXTERNAL] Re: Query Performance Degrade With Sorting In Subquery

Is there any way to create an index or something?

On Tue, Mar 19, 2024 at 3:46 PM Rob @ DNR <[email protected]> wrote:

This is due to Jena’s lazy evaluation in its query engine.

When you include a LIMIT clause on its own, Jena only needs to find the first
N results (10 in your example), at which point it can abort any further
processing and return results.  In this case evaluation is lazy.

When you include LIMIT and ORDER BY clauses, Jena has to find all possible
results, sort them, and then return only the first N results.  In this case
full evaluation is required.
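
The contrast between the two cases can be sketched in plain Python (a toy model with made-up rows and hypothetical helper names; it says nothing about Jena's actual internals). A counter records how many rows the source had to produce for each strategy:

```python
import heapq
from itertools import islice


def counting_source(total, counter):
    """Yield `total` synthetic (label, user_count) rows, counting each one produced."""
    for i in range(total):
        counter["produced"] += 1
        yield (f"label{i}", (i * 37) % 101)


def limit_only(rows, k):
    """LIMIT alone: stop pulling rows as soon as k have been seen."""
    return list(islice(rows, k))


def order_by_desc_then_limit(rows, k):
    """ORDER BY DESC + LIMIT: every row must be read before the top k are known."""
    return heapq.nlargest(k, rows, key=lambda row: row[1])


lazy_counter = {"produced": 0}
limit_only(counting_source(1000, lazy_counter), 10)

full_counter = {"produced": 0}
order_by_desc_then_limit(counting_source(1000, full_counter), 10)

print(lazy_counter["produced"], full_counter["produced"])
```

Whether the top rows are found with a heap or a full sort, the whole input still has to be produced in the second case, which is the cost described above.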

One possible approach might be to split this into multiple queries, i.e. do one
query to get your main set of results, and then separately issue the
related-item sub-queries with concrete values substituted in for your
?concept and ?titleSkosxl values. While Jena will still need to do full
evaluation, injecting a concrete value will constrain the query evaluation
further.
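
A minimal sketch of that second step in Python, assuming the main query has already returned the concept IRIs. The function name alternates_query, the template constant and the example IRI are hypothetical, and sending the query to the endpoint is deliberately left out:

```python
# Follow-up query for one concept; the prefixes are copied from the original query.
FOLLOW_UP_TEMPLATE = """\
PREFIX skosxl: <http://www.w3.org/2008/05/skos-xl#>
PREFIX relations: <https://cxdata.bold.com/ontologies/myDomain#>

SELECT ?alternate
WHERE {
  <%(concept)s> skosxl:altLabel ?alternateSkosxl .
  ?alternateSkosxl skosxl:literalForm ?alternate ;
                   relations:hasUserCount ?alternateUserCount .
}
ORDER BY DESC(?alternateUserCount)
LIMIT %(limit)d
"""

# Characters that must not appear inside a SPARQL <...> IRI reference.
FORBIDDEN_IN_IRI = '<>"{}|^`\\ '


def alternates_query(concept_iri, limit=10):
    """Build the per-concept alternates query with a concrete IRI substituted in."""
    if any(ch in FORBIDDEN_IN_IRI for ch in concept_iri):
        raise ValueError(f"not a usable IRI: {concept_iri!r}")
    return FOLLOW_UP_TEMPLATE % {"concept": concept_iri, "limit": int(limit)}


# Hypothetical concept IRI, as it might come back from the main query.
print(alternates_query("https://example.org/concept/1"))
```

The same pattern would apply to the related-titles sub-query, with ?titleSkosxl replaced by the concrete label IRI.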

Hope this helps,

Rob

From: Chirag Ratra <[email protected]>
Date: Tuesday, 19 March 2024 at 07:46
To: [email protected] <[email protected]>
Subject: Query Performance Degrade With Sorting In Subquery
Hi,

I am facing a big performance degradation when sorting inside a subquery.
If I run the query without sorting, the response time is around 200 ms,
but when I add the ORDER BY, it rises to around 4-5
seconds.

Here is my query :

PREFIX text: <http://jena.apache.org/text#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX skosxl: <http://www.w3.org/2008/05/skos-xl#>
PREFIX relations: <https://cxdata.bold.com/ontologies/myDomain#>

SELECT ?concept ?titleSkosxl ?title ?languageCode (GROUP_CONCAT(DISTINCT
?relatedTitle; separator=", ") AS ?relatedTitles) (GROUP_CONCAT(DISTINCT
?alternate; separator=", ") AS ?alternates)
WHERE
{
    (?titleSkosxl ?score) text:query ('cashier').

?concept skosxl:prefLabel ?titleSkosxl.
    ?titleSkosxl skosxl:literalForm ?title.
    ?titleSkosxl relations:usedInLocale ?controlledList.
    ?controlledList relations:languageMarketCode ?languageCode
FILTER(?languageCode = 'en-US').


#  get alternate title
OPTIONAL
    {
          Select ?alternate  {
          ?concept skosxl:altLabel ?alternateSkosxl.
          ?alternateSkosxl skosxl:literalForm ?alternate;
    relations:hasUserCount ?alternateUserCount.
          }
ORDER BY DESC (?alternateUserCount) LIMIT 10
}

#  get related titles
    OPTIONAL
    {
        Select ?relatedTitle
        {
              ?titleSkosxl relations:isRelatedTo ?relatedSkosxl.
              ?relatedSkosxl skosxl:literalForm ?relatedTitle;
              relations:hasUserCount ?relatedUserCount.
        }
ORDER BY DESC (?relatedUserCount) LIMIT 10
     }
}
GROUP BY ?concept ?titleSkosxl ?title ?languageCode ?alternateJobTitle
?notation
ORDER BY DESC(?jobtitleWeight) DESC(?score)
LIMIT 10

The sorting clauses below cause a huge performance degradation:
ORDER BY DESC (?alternateUserCount) and ORDER BY DESC (?relatedUserCount)

How can this be improved? This sorting will be used in each and every query
in my application.

--

This email may contain material that is confidential, privileged,
or for the sole use of the intended recipient.  Any review, disclosure,
reliance, or distribution by others or forwarding without express
permission is strictly prohibited.  If you are not the intended recipient,
please contact the sender and delete all copies, including attachments.



--
Lorenz Bühmann
Research Associate/Scientific Developer

Email [email protected]

Institute for Applied Informatics e.V. (InfAI) | Goerdelerring 9 | 04109 
Leipzig | Germany
