To share some numbers with and without an artificial join key variable:
Query used from initial thread:
|select (count(*) as ?C) where { { select ?X ?Y (struuid() as ?UUID)
where { values ?X_i { 0 1 2 3 4 5 6 7 8 9 } values ?X_j { 0 1 2 3 4 5 6
7 8 9 } bind ( ?X_i + 10 * ?X_j as ?X) values ?Y_i { 0 1 2 3 4 5 6 7 8 9
} values ?Y_j { 0 1 2 3 4 5 6 7 8 9 } bind ( ?Y_i + 10 * ?Y_j as ?Y) } }
{ select ?X ?Y where { { select ?X ?Y (rand() as ?RAND) where { values
?X_i { 0 1 2 3 4 5 6 7 8 9 } values ?X_j { 0 1 2 3 4 5 6 7 8 9 } bind (
?X_i + 10 * ?X_j as ?X) values ?Y_i { 0 1 2 3 4 5 6 7 8 9 } values ?Y_j
{ 0 1 2 3 4 5 6 7 8 9 } bind ( ?Y_i + 10 * ?Y_j as ?Y) } } filter (?RAND
< 0.95) } } }|
and with an artificial join key
|select (count(*) AS ?cnt)
where {
{
select
(CONCAT(STR(?X), "|", STR(?Y)) AS ?key) (struuid() as ?UUID)
where {
values ?X_i { 0 1 2 3 4 5 6 7 8 9 10 }
values ?X_j { 0 1 2 3 4 5 6 7 8 9 10 }
bind ( ?X_i + 10 * ?X_j as ?X)
values ?Y_i { 0 1 2 3 4 5 6 7 8 9 10 }
values ?Y_j { 0 1 2 3 4 5 6 7 8 9 10 }
bind ( ?Y_i + 10 * ?Y_j as ?Y)
}
}
{
select
(CONCAT(STR(?X), "|", STR(?Y)) AS ?key)
where {
{
select ?X ?Y (rand() as ?RAND)
where {
values ?X_i { 0 1 2 3 4 5 6 7 8 9 10 }
values ?X_j { 0 1 2 3 4 5 6 7 8 9 10 }
bind ( ?X_i + 10 * ?X_j as ?X)
values ?Y_i { 0 1 2 3 4 5 6 7 8 9 10 }
values ?Y_j { 0 1 2 3 4 5 6 7 8 9 10 }
bind ( ?Y_i + 10 * ?Y_j as ?Y)
}
}
filter (?RAND < 0.95)
}
}
}|
To increase data I also increased the sequence from 10 to 20 and 30 in
further runs.
I disabled the join rewriter via flag because otherwise the join is
being rewritten to a sequence, so I'm forcing the hash join in this query:
|arq --explain --time --set arq:optIndexJoinStrategy=false --query test.rq|
#0..10:
##Without:
|---------
| cnt |
=========
| 18880 |
---------
Time: 2.283 sec|
##With:
|---------
| cnt |
=========
| 18849 |
---------
Time: 0.868 sec|
#0..20:
##Without:
|----------
| cnt |
==========
| 802195 |
----------
Time: 47.565 sec|
##With:
|----------
| cnt |
==========
| 802318 |
----------
Time: 3.282 sec|
|
|
#0..30:
##Without:
|-----------
| cnt |
===========
| 8066735 |
-----------
Time: 688.268 sec|
##With:
|-----------
| cnt |
===========
| 8072254 |
-----------
Time: 13.429 sec|
So for the current query its way faster to force the join being done on
all matching variables (X, Y) - but I don't know implications for other
queries.
Cheers,
Lorenz
On 02.04.24 21:13, John Walker wrote:
Hi James
-----Original Message-----
From: James Anderson<[email protected]>
Sent: Tuesday, 2 April 2024 18:53
To: usersjena. apache. org<[email protected]>
Subject: Re: Performance question with joins
good evening;
On 2. Apr 2024, at 12:27, Lorenz Buehmann <[email protected]
leipzig.de> wrote:
if this description is accurate
according to the hash join implementation in Jena in class
AbstractIterHashJoin a join key is created via line
joinKey = JoinKey.createVarKey(varsLeft, varsRight) ; That method
does take only the first variable in both bindings as join key instead of all
matching variables. In our case that would probably be ?wafer I guess?
and the estimate of constituent cardinality is correct,
The cardinality on the left and right side of the join is around 125k.
then, depending on the distribution of the ?wafer values, this could produce a
large intermediate cross-join.
In my test case, every solution will have the same value for ?wafer variable.
So, if that is being used as the join key for a hash join, then it can explain
the problem.
I tried running the query without projecting ?wafer from the subqueries (so
forcing another variable to be used as the join key) and it completes in 13
seconds.
I also tried other things like moving ?wafer to be the last variable in the
select of both subqueries and renaming the variables ?X_ and ?Y_ to ?a and ?b,
but that does not appear to help.
So, the logic for taking the "first" variable in both bindings is something of
a mystery.
what is the cardinality of the results prior to the distinct operation?
Somewhere around 120k.
That distinct is essentially redundant, each position on each wafer will be
associated to a globally unique uid value.
---
james anderson |[email protected] |https://dydra.com
--
Lorenz Bühmann
Research Associate/Scientific Developer
[email protected]
Institute for Applied Informatics e.V. (InfAI) | Goerdelerring 9 | 04109
Leipzig | Germany