Re: Optimization of query plan for pipe operator

Andy Seaborne Tue, 23 Feb 2021 03:40:19 -0800

Adam,

While the queries are functionally the same, they are executed in different ways.

TDB uses NodeIds (64 bit ids) to identify RDF Terms (class Node). The mapping between the two is the node table. Nodes are retrieved "on-demand" if results or another part of the query need them.

Count is handled specially in TDB. When counting, the actual node form (URI, literal etc.) isn't rebuilt when all that is needed is to count. Just the internal NodeId is enough.


{ ?s <http://someURI1> ?o . }
UNION
{ ?s <http://someURI2> ?o . }

is two TDB-level patterns - the matching is done and results involve NodeIds but not Nodes.


In

{ ?s <http://someURI1> | <http://someURI2> ?o .}

(you wrote "<http://someURI1> | <http://someURI1>" -- same URI)

ARQ passes the expression to the path evaluator which works on nodes.

There is some rewrite of paths (e.g. "/") but by default that does not include "|".

This only makes a pronounced difference because of the count; otherwise I don't think you'd see a difference. The time is going on retrieving nodes and it's an HDD.
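To illustrate the point about counting, here is a toy Python model. It is not Jena or TDB code: `ToyNodeTable`, `count_by_ids`, `count_by_nodes` and the example URIs are all made up for the sketch. It only shows why a count that stays on ids never touches the node table, while an evaluator that works on nodes pays one lookup per term.

```python
from dataclasses import dataclass


@dataclass
class ToyNodeTable:
    """Hypothetical stand-in for a node table: NodeId -> RDF term."""
    terms: dict
    lookups: int = 0

    def get_node(self, node_id):
        # Each call models one potentially slow read (e.g. an HDD seek).
        self.lookups += 1
        return self.terms[node_id]


# Triples are stored as (subject, predicate, object) ids, as an assumption.
P1, P2, P3 = 100, 101, 102
triples = [(1, P1, 2), (3, P1, 4), (5, P2, 6), (7, P3, 8)]
table = ToyNodeTable({i: f"<http://example/{i}>" for i in range(1, 103)})


def count_by_ids(triples, predicate_ids):
    """Count matches using ids only; the node table is never consulted."""
    return sum(1 for (_, p, _) in triples if p in predicate_ids)


def count_by_nodes(triples, predicate_ids, table):
    """Count matches after rebuilding every term, as a node-based evaluator would."""
    wanted = {table.get_node(p) for p in predicate_ids}
    total = 0
    for s, p, o in triples:
        s_node, p_node, o_node = (table.get_node(x) for x in (s, p, o))
        if p_node in wanted:
            total += 1
    return total


ids_total = count_by_ids(triples, {P1, P2})
lookups_after_ids = table.lookups
nodes_total = count_by_nodes(triples, {P1, P2}, table)
lookups_after_nodes = table.lookups
print(ids_total, lookups_after_ids, nodes_total, lookups_after_nodes)
```

Both functions return the same count; only the number of node table lookups differs.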

And if you run a warm-up query first, which will fill the node cache, the speed should be closer to the 1.1s case.
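The warm-up effect can be sketched the same way. Again this is a toy model with invented names (`CachedNodeTable`, `run_query`), not the actual TDB node cache: the first pass pays a slow read per distinct node, the second pass is served from the cache.

```python
class CachedNodeTable:
    """Hypothetical node table with an unbounded in-memory cache in front."""

    def __init__(self, terms):
        self._terms = terms   # stands in for the on-disk table
        self._cache = {}      # stands in for the node cache
        self.disk_reads = 0

    def get_node(self, node_id):
        if node_id not in self._cache:
            self.disk_reads += 1  # cache miss: models a slow disk read
            self._cache[node_id] = self._terms[node_id]
        return self._cache[node_id]


def run_query(table, node_ids):
    """Materialise every node the (imaginary) query touches."""
    return [table.get_node(i) for i in node_ids]


cached_table = CachedNodeTable({i: f"<http://example/{i}>" for i in range(1000)})
touched_ids = list(range(0, 1000, 2))

run_query(cached_table, touched_ids)            # cold run fills the cache
cold_reads = cached_table.disk_reads
run_query(cached_table, touched_ids)            # warm run hits the cache
warm_reads = cached_table.disk_reads - cold_reads
print(cold_reads, warm_reads)
```

A real cache is bounded, so the warm run would only be this cheap if the touched nodes fit in it.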


    Andy

On 22/02/2021 15:09, Adam K wrote:

Hi Andy,
Thanks for the response. It's tested on Jena 3.10.0 and 3.17.0 on HDD TDB1 - 
UNION query counts 63966 results in 1.1s while pipe query finished with timeout 
after 2 minutes. Whole dataset has 1322457 triples.
Thanks,

On 2021/02/22 14:01:57, Andy Seaborne <a...@apache.org> wrote:

Hi Adam,

It would be useful to also know:

      which version of Jena this is
      What the storage is - in-memory, or TDB
          TDB1 or TDB2?
          If TDB: What the hardware is disk or SSD?
      What the times actually are and what the count result is?

Count is handled specially in TDB and maybe that interacts with the "|"
usage.

      Andy

On 22/02/2021 13:18, Adam K wrote:

Hi all, I executed two simple equivalent queries having a big performance
difference on a large dataset:


     1. First matching by two alternative predicates using pipe operator

     SELECT (count(*) as ?total) WHERE {
       { ?s <http://someURI1> | <http://someURI1> ?o . }
     }

     this one is very slow and query plan shows the following matching
     pattern:
     (path ?subject (alt  <http://someURI1>  <http://someURI2> ) ?object)))))

     2. If I use UNION operator instead of pipe the query becomes fast

     SELECT (count(*) as ?total) WHERE {
       { ?s <http://someURI1> ?o . }
       UNION
       { ?s <http://someURI2> ?o . }
     }

     query plan here is different and shows UNION of two BGP matches:
     (union (bgp (triple ?s <http://someURI1> ?o )) (bgp (triple ?s <http://someURI2> ?o ))))))


Documentation here
https://jena.apache.org/documentation/query/property_paths.html tells that:

     1. "Paths are “simple” if they involve only operators / (sequence), ^
     (reverse, unary or binary) and the form {n}, for some single integer n."
     2. "A path is “complex”  if it involves one or more of the operators
     *,?, + and {}."

These statements do not define the implications of | - it should act like union,
but the query plan is different - is it a bug or a feature? Is there a general
recommendation to use UNION instead of pipe?

Thanks for help!
