Hi James

> -----Original Message-----
> From: James Anderson <[email protected]>
> Sent: Tuesday, 2 April 2024 18:53
> To: users@jena.apache.org <[email protected]>
> Subject: Re: Performance question with joins
> 
> good evening;
> 
> > On 2. Apr 2024, at 12:27, Lorenz Buehmann <[email protected]
> leipzig.de> wrote:
> 
> if this description is accurate
> 
> > according to the hash join implementation in Jena in class
> > AbstractIterHashJoin a join key is created via line
> >     joinKey = JoinKey.createVarKey(varsLeft, varsRight) ;
> > That method does take only the first variable in both bindings as join
> > key instead of all matching variables. In our case that would probably
> > be ?wafer I guess?
> 
> and the estimate of constituent cardinality is correct,
> 
> >  The cardinality on the left and right side of the join is around 125k.
> 
> then, depending on the distribution of the ?wafer values, this could produce a
> large intermediate cross-join.

In my test case, every solution has the same value for the ?wafer variable.
So, if ?wafer is being used as the join key for the hash join, that would
explain the problem.
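
To illustrate why, here is a minimal Python sketch of a hash join. It is not
Jena's code: hash_join, the toy data and the ?pos variable are made up for
illustration, and it assumes the join really does bucket on a single variable.
It counts how many row pairs end up in the same bucket when the key is only
?wafer versus when it is all shared variables.

```python
from collections import defaultdict


def hash_join(left, right, key_vars):
    """Hash join of two lists of bindings (dicts) on key_vars.

    Returns (results, candidate_pairs), where candidate_pairs counts the
    row pairs that landed in the same bucket and had to be checked for
    compatibility on all shared variables.
    """
    # Build phase: bucket the left rows by their values for key_vars.
    buckets = defaultdict(list)
    for row in left:
        buckets[tuple(row[v] for v in key_vars)].append(row)

    results, candidate_pairs = [], 0
    # Probe phase: each right row is compared with every row in its bucket.
    for r in right:
        for l in buckets.get(tuple(r[v] for v in key_vars), []):
            candidate_pairs += 1
            shared = l.keys() & r.keys()
            if all(l[v] == r[v] for v in shared):
                results.append({**l, **r})
    return results, candidate_pairs


# Toy data (hypothetical): one ?wafer value everywhere, unique ?pos per row.
n = 500
left = [{"wafer": "w1", "pos": i, "x": i * 2} for i in range(n)]
right = [{"wafer": "w1", "pos": i, "y": i * 3} for i in range(n)]

_, pairs_first_var = hash_join(left, right, ["wafer"])
_, pairs_all_vars = hash_join(left, right, ["wafer", "pos"])
print(pairs_first_var, pairs_all_vars)  # 250000 500
```

Both calls return the same 500 results, but keying on ?wafer alone checks
n * n pairs because everything falls into one bucket. If the same thing
happens with around 125k rows on each side, that is on the order of
125,000 * 125,000, or roughly 1.6e10, pair checks.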

I tried running the query without projecting ?wafer from the subqueries (so 
forcing another variable to be used as the join key) and it completes in 13 
seconds.

I also tried other things, like moving ?wafer to be the last variable in the
select of both subqueries and renaming the variables ?X_ and ?Y_ to ?a and ?b,
but neither appears to help.

So, how the "first" variable in both bindings gets picked is still something
of a mystery.

> 
> what is the cardinality of the results prior to the distinct operation?

Somewhere around 120k.

That distinct is essentially redundant: each position on each wafer is
associated with a globally unique uid value.

> 
> ---
> james anderson | [email protected] | https://dydra.com
> 
