Hi James

> -----Original Message-----
> From: James Anderson <[email protected]>
> Sent: Tuesday, 2 April 2024 18:53
> To: usersjena. apache. org <[email protected]>
> Subject: Re: Performance question with joins
>
> good evening;
>
> > On 2. Apr 2024, at 12:27, Lorenz Buehmann <[email protected]
> leipzig.de> wrote:
>
> if this description is accurate
>
> > according to the hash join implementation in Jena in class
> > AbstractIterHashJoin a join key is created via line
> > joinKey = JoinKey.createVarKey(varsLeft, varsRight) ; That method
> > does take only the first variable in both bindings as join key instead of
> > all matching variables. In our case that would probably be ?wafer I guess?
>
> and the estimate of constituent cardinality is correct,
>
> > The cardinality on the left and right side of the join is around 125k.
>
> then, depending on the distribution of the ?wafer values, this could produce a
> large intermediate cross-join.

In my test case, every solution has the same value for the ?wafer variable. So, if that is being used as the join key for a hash join, it would explain the problem. I tried running the query without projecting ?wafer from the subqueries (forcing another variable to be used as the join key) and it completes in 13 seconds.

I also tried other things, like moving ?wafer to be the last variable in the select of both subqueries and renaming the variables ?X_ and ?Y_ to ?a and ?b, but that does not appear to help. So the logic for taking the "first" variable in both bindings is something of a mystery.

> what is the cardinality of the results prior to the distinct operation?

Somewhere around 120k. That distinct is essentially redundant: each position on each wafer is associated with a globally unique uid value.

> ---
> james anderson | [email protected] | https://dydra.com
>
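
To illustrate why a single join key with only one distinct value would hurt this much, here is a rough Python sketch. It is independent of Jena's actual code: the `hash_join` function, the variable names and the toy data are all made up for illustration. It only assumes that rows are bucketed on the chosen key variables and that the remaining shared variables are compared pair by pair within a bucket.

```python
from collections import defaultdict


def hash_join(left, right, key_vars):
    """Toy hash join over lists of solution dicts.

    Rows are bucketed on key_vars only. All shared variables are then
    checked pairwise inside the matching bucket, and every such pairwise
    check is counted. Returns (results, comparisons).
    """
    shared = set(left[0]) & set(right[0]) if left and right else set()

    # Build phase: bucket the left side on the join key.
    buckets = defaultdict(list)
    for row in left:
        buckets[tuple(row[v] for v in key_vars)].append(row)

    # Probe phase: compare each right row with every row in its bucket.
    results, comparisons = [], 0
    for r in right:
        for l in buckets.get(tuple(r[v] for v in key_vars), []):
            comparisons += 1
            if all(l[v] == r[v] for v in shared):
                results.append({**l, **r})
    return results, comparisons


# Hypothetical data: every row has the same wafer, positions are unique.
n = 200
left = [{"wafer": "w1", "x": i, "y": i, "a": i} for i in range(n)]
right = [{"wafer": "w1", "x": i, "y": i, "b": i} for i in range(n)]

single_results, single_comparisons = hash_join(left, right, ["wafer"])
full_results, full_comparisons = hash_join(left, right, ["wafer", "x", "y"])

print(len(single_results), single_comparisons)  # 200 40000
print(len(full_results), full_comparisons)      # 200 200
```

Both runs return the same 200 rows, but keying on wafer alone puts everything in one bucket, so the work grows with the product of the two sides. At roughly 125k rows per side that would be on the order of 1.5e10 pairwise comparisons, if the join key really is ?wafer alone.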
