Adam,
While the queries are functionally the same, they are executed in
different ways.
TDB uses NodeIds (64 bit ids) to identify RDF Terms (class Node). The
mapping between the two is the node table. Nodes are retrieved
"on-demand" if results or another part of the query need them.
Count is handled specially in TDB. When counting, the actual node form
(URI, literal etc) isn't rebuilt when all that is needed is to count.
Just the internal NodeId is enough.
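To make the difference concrete, here is a toy sketch in Python. It is not TDB code: the NodeTable class, both count functions and the sample data are invented for illustration, and get_node simply stands in for the costly node table read. It only shows that a count over ids never needs to touch the node table, while a count over nodes pays for a lookup on every term.

```python
# Toy model only: NOT TDB's implementation, all names here are hypothetical.
class NodeTable:
    """Maps integer ids to RDF terms and counts how often a term is fetched."""

    def __init__(self, terms):
        self._terms = dict(enumerate(terms))
        self.lookups = 0

    def get_node(self, node_id):
        # Stands in for the expensive step (a disk read on an HDD).
        self.lookups += 1
        return self._terms[node_id]


def count_by_id(id_triples, predicate_ids):
    """Count matches by comparing ids only; no term is materialised."""
    return sum(1 for (_, p, _) in id_triples if p in predicate_ids)


def count_by_node(table, id_triples, predicate_terms):
    """Count matches after turning every id back into its term."""
    total = 0
    for s, p, o in id_triples:
        # Every position is rebuilt, even though only the predicate is tested.
        _s_node, p_node, _o_node = (table.get_node(i) for i in (s, p, o))
        if p_node in predicate_terms:
            total += 1
    return total


table = NodeTable(
    ["<http://someURI1>", "<http://someURI2>", "<http://example/a>", '"x"']
)
# Triples stored as (subject id, predicate id, object id).
triples = [(2, 0, 3), (2, 1, 3), (2, 0, 2)]

by_id = count_by_id(triples, {0, 1})
lookups_after_id_count = table.lookups

by_node = count_by_node(
    table, triples, {"<http://someURI1>", "<http://someURI2>"}
)
lookups_after_node_count = table.lookups
```

Both counts agree, but only the second one increments table.lookups, which is the cost that dominates on a cold HDD.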
{ ?s <http://someURI1> ?o . }
UNION
{ ?s <http://someURI2> ?o . }
is two TDB-level patterns - the matching is done and results involve
NodeIds but not Nodes.
In
{ ?s <http://someURI1> | <http://someURI2> ?o .}
(you wrote "<http://someURI1> | <http://someURI1>" -- same URI)
ARQ passes the expression to the path evaluator which works on nodes.
There is some rewrite of paths (e.g. "/") but by default that does not
include "|".
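For illustration only, this is roughly what such a rewrite of "|" would have to produce, written as a toy Python function over tuples. It is not ARQ's optimizer and the tuple encoding is made up for this sketch; it just turns the (path ... (alt ...)) form into the (union (bgp ...) (bgp ...)) form seen in the two query plans.

```python
# Toy sketch: hypothetical tuple encoding of the algebra, NOT ARQ's classes.
def rewrite_alt(op):
    """Rewrite ('path', s, ('alt', p1, p2), o) into a union of triple patterns."""
    if op[0] == "path" and isinstance(op[2], tuple) and op[2][0] == "alt":
        _, s, (_, left, right), o = op
        # Each side of the alternative becomes its own branch of the union.
        return (
            "union",
            rewrite_alt(("path", s, left, o)),
            rewrite_alt(("path", s, right, o)),
        )
    if op[0] == "path" and isinstance(op[2], str):
        # A path that is a single predicate is just a triple pattern.
        _, s, p, o = op
        return ("bgp", ("triple", s, p, o))
    # Anything else (e.g. "*" or "+" paths) is left untouched.
    return op


plan = rewrite_alt(
    ("path", "?s", ("alt", "<http://someURI1>", "<http://someURI2>"), "?o")
)
```

With a plan in this shape, each branch is an ordinary pattern that can be matched on ids, which is why the hand-written UNION form is the fast one here.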
This only makes a pronounced difference because of the count; otherwise I
don't think you'd see a difference. The time is going on retrieving
nodes, and it's an HDD.
And if you run a warm-up query first, which will fill the node cache, the
speed should be closer to the 1.1s case.
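The warm-up effect can be sketched with a toy cache in Python. This is an assumption-laden model, not the real node cache: CachedNodeTable, run_query and the disk_reads counter are all hypothetical names, and a "disk read" is just a dictionary access.

```python
# Toy model of a node cache: NOT Jena code, all names are hypothetical.
class CachedNodeTable:
    """Node table with an in-memory cache in front of the slow storage."""

    def __init__(self, terms):
        self._terms = dict(enumerate(terms))  # pretend this lives on disk
        self._cache = {}
        self.disk_reads = 0

    def get_node(self, node_id):
        if node_id not in self._cache:
            # Cache miss: pay for the slow read once, then remember the term.
            self.disk_reads += 1
            self._cache[node_id] = self._terms[node_id]
        return self._cache[node_id]


def run_query(table, node_ids):
    """Stand-in for a query that needs the full term for every id it sees."""
    return [table.get_node(i) for i in node_ids]


table = CachedNodeTable(["<http://someURI1>", "<http://someURI2>", '"x"'])
ids_touched = [0, 1, 2, 0, 1]

run_query(table, ids_touched)          # warm-up run fills the cache
reads_after_warmup = table.disk_reads

run_query(table, ids_touched)          # second run is served from the cache
reads_after_second_run = table.disk_reads
```

The second run adds no reads at all in this model. How close the real system gets depends on whether the nodes touched by the query fit in the cache.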
Andy
On 22/02/2021 15:09, Adam K wrote:
Hi Andy,
Thanks for the response. It's tested on Jena 3.10.0 and 3.17.0 on HDD TDB1 -
the UNION query counts 63966 results in 1.1s while the pipe query timed out
after 2 minutes. The whole dataset has 1322457 triples.
Thanks,
On 2021/02/22 14:01:57, Andy Seaborne <a...@apache.org> wrote:
Hi Adam,
It would be useful to also know:
which version of Jena this is
What the storage is - in-memory, or TDB
TDB1 or TDB2?
If TDB: is the hardware disk or SSD?
What the times actually are and what the count result is?
Count is handled specially in TDB and maybe that interacts with the "|"
usage.
Andy
On 22/02/2021 13:18, Adam K wrote:
Hi all, I executed two simple equivalent queries having a big performance
difference on a large dataset:
1. First matching by two alternative predicates using pipe operator
SELECT (count(*) as ?total) WHERE {
  { ?s <http://someURI1> | <http://someURI1> ?o .}
}
this one is very slow and query plan shows the following matching
pattern:
(path ?subject (alt <http://someURI1> <http://someURI2> ) ?object)))))
2. If I use UNION operator instead of pipe the query becomes fast
SELECT (count(*) as ?total) WHERE {
  { ?s <http://someURI1> ?o . } UNION { ?s <http://someURI2> ?o . }
}
query plan here is different and shows UNION of two BGP matches:
(union (bgp (triple ?s <http://someURI1> ?o )) (bgp (triple ?s <http://someURI2> ?o ))))))
The documentation at
https://jena.apache.org/documentation/query/property_paths.html says that:
1. "Paths are “simple” if they involve only operators / (sequence), ^
(reverse, unary or binary) and the form {n}, for some single integer n."
2. "A path is “complex” if it involves one or more of the operators
*,?, + and {}."
These statements do not define the implications of | - it should act like union,
but the query plan is different - is it a bug or a feature? Is there a general
recommendation to use UNION instead of pipe?
Thanks for help!