Re: [HACKERS] Fixing Grittner's planner issues

2009-02-19 Thread Tom Lane
[ back to planner stuff after a hiatus ]

Robert Haas robertmh...@gmail.com writes:
 Right, so maybe I wasn't as clear as I could have been in asking the
 question.  I do understand how it can be a win to unique B and use it
 as the OUTER relation (jointype JOIN_UNIQUE_OUTER).  What I don't
 understand is how it can ever be a win to unique B and use it as the
 INNER relation (jointype JOIN_UNIQUE_INNER).

Hmm, well, maybe B is *really* nonunique and unique'ifying it makes it
small enough to fit in a single-batch hash table?

Also, seriously nonunique RHS data is pretty awful for mergejoining
(too much rescanning) so I could imagine wanting to do it for a
mergejoin too.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing Grittner's planner issues

2009-02-19 Thread Tom Lane
[ after re-reading the code a bit ]

Robert Haas robertmh...@gmail.com writes:
 Cool.  On the topic of documentation, I find the following comment in
 joinrels.c rather impenetrable:

 /*
  * Do these steps only if we actually have a
 regular semijoin,
  * as opposed to a case where we should
 unique-ify the RHS.
  */

The point here is that a semijoin ordinarily requires forming the whole
LHS relation (ie, min_lefthand) before applying the special join rule.
However, if you unique-ify the RHS then it's a regular inner join and
you don't have to obey that restriction, ie, you can join to just part
of min_lefthand.  Now ordinarily that's not an amazingly good idea but
there are important special cases where it is.  IIRC the case that
motivated this was

SELECT FROM a, b WHERE (a.x, b.y) IN (SELECT c1, c2 FROM c)

If you do this as a semijoin then you are forced to form the cartesian
product of a*b before you can semijoin to c.  If you uniqueify c
then you can join it to a first and then b using regular joins (possibly
indexscans on a.x and then b.y), or b and then a.

So join_is_legal allows such a join order, and the code in make_join_rel
has to be careful not to claim that a semijoin c is a valid way of
forming that join.

I'll change the comment.  Does this help?

/*
 * We might have a normal semijoin, or a case where we don't have
 * enough rels to do the semijoin but can unique-ify the RHS and
 * then do an innerjoin.  In the latter case we can't apply
 * JOIN_SEMI joining.
 */

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing Grittner's planner issues

2009-02-19 Thread Robert Haas
On Thu, Feb 19, 2009 at 1:38 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 [ after re-reading the code a bit ]

 Robert Haas robertmh...@gmail.com writes:
 Cool.  On the topic of documentation, I find the following comment in
 joinrels.c rather impenetrable:

 /*
  * Do these steps only if we actually have a
 regular semijoin,
  * as opposed to a case where we should
 unique-ify the RHS.
  */

 The point here is that a semijoin ordinarily requires forming the whole
 LHS relation (ie, min_lefthand) before applying the special join rule.
 However, if you unique-ify the RHS then it's a regular inner join and
 you don't have to obey that restriction, ie, you can join to just part
 of min_lefthand.  Now ordinarily that's not an amazingly good idea but
 there are important special cases where it is.  IIRC the case that
 motivated this was

SELECT FROM a, b WHERE (a.x, b.y) IN (SELECT c1, c2 FROM c)

 If you do this as a semijoin then you are forced to form the cartesian
 product of a*b before you can semijoin to c.  If you uniqueify c
 then you can join it to a first and then b using regular joins (possibly
 indexscans on a.x and then b.y), or b and then a.

 So join_is_legal allows such a join order, and the code in make_join_rel
 has to be careful not to claim that a semijoin c is a valid way of
 forming that join.

Gotcha.

 I'll change the comment.  Does this help?

/*
 * We might have a normal semijoin, or a case where we don't have
 * enough rels to do the semijoin but can unique-ify the RHS and
 * then do an innerjoin.  In the latter case we can't apply
 * JOIN_SEMI joining.
 */

It's an improvement, but your example above is so helpful in
understanding what is going on here that it might be worth explicitly
mentioning it in the comment, maybe something like this:

/*
 * In a case like the following, we don't have enough rels to plan
this as a semijoin,
 * but we don't give up completely, because it might be possible to
unique-ify the
 * RHS and perform part of the join at this level.
 *
 * SELECT FROM a, b WHERE (a.x, b.y) IN (SELECT c1, c2 FROM c)
 */

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing Grittner's planner issues

2009-02-19 Thread Robert Haas
On Thu, Feb 19, 2009 at 1:20 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 [ back to planner stuff after a hiatus ]

 Robert Haas robertmh...@gmail.com writes:
 Right, so maybe I wasn't as clear as I could have been in asking the
 question.  I do understand how it can be a win to unique B and use it
 as the OUTER relation (jointype JOIN_UNIQUE_OUTER).  What I don't
 understand is how it can ever be a win to unique B and use it as the
 INNER relation (jointype JOIN_UNIQUE_INNER).

 Hmm, well, maybe B is *really* nonunique and unique'ifying it makes it
 small enough to fit in a single-batch hash table?

 Also, seriously nonunique RHS data is pretty awful for mergejoining
 (too much rescanning) so I could imagine wanting to do it for a
 mergejoin too.

Well, as I wrote upthread:

# For a merge join or nested loop, I don't see how this can ever be a
win over teaching the executor to just not rescan B.  For a hash
# join, it can be a win if B turns out to have duplicates, but then
again you could also just teach the executor to skip the insertion of
# the duplicate into the table in the first place (it has to hash 'em
anyway...).

A nestjoin seems like the clearest example.  If the inner path is not
unique and not sorted, you'll need to either sort it or hash it.  That
seems like a lot of trouble considering that you could just scan the
unsorted inner path until you hit the first match, and then move on to
the next outer tuple (and I think this is pretty much what JOIN_SEMI
does anyway).

If the inner path is not unique but does happen to be sorted, then
unique-ifying should be cheap, but I would think it would still be
faster to do it the JOIN_SEMI way rather than insert a separate unique
node to do basically the same work.

If add a materialize node for the inner path after you unique-ify it,
you might reduce the number of tuples by enough to pay for the
unique-ify and materialize steps, but if that's the case you should
probably be doing it that way for JOIN_SEMI too.

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing Grittner's planner issues

2009-02-19 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Thu, Feb 19, 2009 at 1:20 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 [ back to planner stuff after a hiatus ]

 Well, as I wrote upthread:

What you're actually suggesting is modifying the executor to incorporate
the unique-fication logic into hashjoin and/or mergejoin.  Maybe, but
that code is way too complex already for my taste (especially mergejoin)
and what we'd save is, hmm, four lines in the planner.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing Grittner's planner issues

2009-02-19 Thread Robert Haas
On Thu, Feb 19, 2009 at 2:54 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Thu, Feb 19, 2009 at 1:20 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 [ back to planner stuff after a hiatus ]

 Well, as I wrote upthread:

 What you're actually suggesting is modifying the executor to incorporate
 the unique-fication logic into hashjoin and/or mergejoin.  Maybe, but
 that code is way too complex already for my taste (especially mergejoin)
 and what we'd save is, hmm, four lines in the planner.

What you'd save is planning time.

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing Grittner's planner issues

2009-02-19 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Thu, Feb 19, 2009 at 1:38 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I'll change the comment.  Does this help?

 It's an improvement, but your example above is so helpful in
 understanding what is going on here that it might be worth explicitly
 mentioning it in the comment, maybe something like this:

Done, see
http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/optimizer/path/joinrels.c?r1=1.97r2=1.98

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing Grittner's planner issues

2009-02-19 Thread Robert Haas
On Thu, Feb 19, 2009 at 3:44 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Thu, Feb 19, 2009 at 1:38 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I'll change the comment.  Does this help?

 It's an improvement, but your example above is so helpful in
 understanding what is going on here that it might be worth explicitly
 mentioning it in the comment, maybe something like this:

 Done, see
 http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/optimizer/path/joinrels.c?r1=1.97r2=1.98

Thanks, that's much clearer IMO.

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing Grittner's planner issues

2009-02-19 Thread Greg Stark
On Thu, Feb 19, 2009 at 7:54 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Thu, Feb 19, 2009 at 1:20 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 [ back to planner stuff after a hiatus ]

 Well, as I wrote upthread:

 What you're actually suggesting is modifying the executor to incorporate
 the unique-fication logic into hashjoin and/or mergejoin.  Maybe, but
 that code is way too complex already for my taste (especially mergejoin)
 and what we'd save is, hmm, four lines in the planner.

I'm not entirely following the implications for semijoins but I know
I've noticed more than a few cases where an option to Hash to only
gather unique values seems like it would be a win.

Consider cases like this where we hash the values twice:

postgres=# explain select * from generate_series(1,1000) as a(i) where
i in (select * from generate_series(1,100) as b(i));
 QUERY PLAN

 Hash Join  (cost=19.50..45.75 rows=1000 width=4)
   Hash Cond: (a.i = b.i)
   -  Function Scan on generate_series a  (cost=0.00..12.50 rows=1000 width=4)
   -  Hash  (cost=17.00..17.00 rows=200 width=4)
 -  HashAggregate  (cost=15.00..17.00 rows=200 width=4)
   -  Function Scan on generate_series b
(cost=0.00..12.50 rows=1000 width=4)
(6 rows)


It's tempting to have Hash cheat and just peek at the node beneath it
to see if it's a HashAggregate, in which case it could call a special
method to request the whole hash. But it would have to know that it's
just a plain uniquify and not implementing a GROUP BY.

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing Grittner's planner issues

2009-02-19 Thread Tom Lane
Greg Stark st...@enterprisedb.com writes:
 It's tempting to have Hash cheat and just peek at the node beneath it
 to see if it's a HashAggregate, in which case it could call a special
 method to request the whole hash. But it would have to know that it's
 just a plain uniquify and not implementing a GROUP BY.

More to the point, it would have to check if it's unique-ifying on the
same columns and with the same operators as the upper hash is using.
If we were going to do something like this, making it a real option to
the Hash node and teaching the planner about that would be *much*
easier, and would also allow saner cost estimation.

I agree that doing something like this on the inner side of a hashjoin
might not be too unreasonable --- it was the mergejoin case that really
seemed ugly when I thought about it.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing Grittner's planner issues

2009-02-19 Thread Robert Haas
On Thu, Feb 19, 2009 at 4:53 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Greg Stark st...@enterprisedb.com writes:
 It's tempting to have Hash cheat and just peek at the node beneath it
 to see if it's a HashAggregate, in which case it could call a special
 method to request the whole hash. But it would have to know that it's
 just a plain uniquify and not implementing a GROUP BY.

 More to the point, it would have to check if it's unique-ifying on the
 same columns and with the same operators as the upper hash is using.
 If we were going to do something like this, making it a real option to
 the Hash node and teaching the planner about that would be *much*
 easier, and would also allow saner cost estimation.

 I agree that doing something like this on the inner side of a hashjoin
 might not be too unreasonable --- it was the mergejoin case that really
 seemed ugly when I thought about it.

Hmm, for some reason I thought hash join would be the harder case
(since the logic to de-dupe the hash table would be all new).  In the
merge-join and nest-join cases, isn't this pretty much what JOIN_SEMI
already does?

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing Grittner's planner issues

2009-02-11 Thread Tom Lane
[ forgot to respond to this earlier, sorry ]

Robert Haas robertmh...@gmail.com writes:
 On a related note, I have some vague unease about planning A SEMI JOIN
 B as A INNER JOIN (UNIQUE B), as make_one_rel currently attempts to
 do.  For a merge join or nested loop, I don't see how this can ever be
 a win over teaching the executor to just not rescan B.  For a hash
 join, it can be a win if B turns out to have duplicates, but then
 again you could also just teach the executor to skip the insertion of
 the duplicate into the table in the first place (it has to hash 'em
 anyway...).  I think maybe I'm not understanding something about the
 logic here.

The case where this is a win is where B is small (say a few rows) and
not unique, and A is large, and there's a relevant index on A.  Then
considering this join approach lets us produce a plan that looks like

NestLoop
HashAggregate (or GroupAggregate)
Scan B
IndexScan A
Index Condition : A.x = B.y

Every other possible plan for this join involves reading all of A.

If B produces too many rows for the nestloop indexscan to be a win,
then one of the other join approaches will beat out this one in the
cost comparisons.

 One thing I notice is that src/backend/optimizer/README should
 probably be updated with the rules for commuting SEMI and ANTI joins;
 it currently only mentions INNER, LEFT, RIGHT, and FULL.

Yeah, I noticed that too.  How embarrassing.  Will fix it as part of
the patch, which I hope to start on tomorrow.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing Grittner's planner issues

2009-02-11 Thread Robert Haas
On Wed, Feb 11, 2009 at 8:24 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 [ forgot to respond to this earlier, sorry ]

Thanks for responding now.

 Robert Haas robertmh...@gmail.com writes:
 On a related note, I have some vague unease about planning A SEMI JOIN
 B as A INNER JOIN (UNIQUE B), as make_one_rel currently attempts to
 do.  For a merge join or nested loop, I don't see how this can ever be
 a win over teaching the executor to just not rescan B.  For a hash
 join, it can be a win if B turns out to have duplicates, but then
 again you could also just teach the executor to skip the insertion of
 the duplicate into the table in the first place (it has to hash 'em
 anyway...).  I think maybe I'm not understanding something about the
 logic here.

 The case where this is a win is where B is small (say a few rows) and
 not unique, and A is large, and there's a relevant index on A.  Then
 considering this join approach lets us produce a plan that looks like

NestLoop
HashAggregate (or GroupAggregate)
Scan B
IndexScan A
Index Condition : A.x = B.y

Right, so maybe I wasn't as clear as I could have been in asking the
question.  I do understand how it can be a win to unique B and use it
as the OUTER relation (jointype JOIN_UNIQUE_OUTER).  What I don't
understand is how it can ever be a win to unique B and use it as the
INNER relation (jointype JOIN_UNIQUE_INNER).

 One thing I notice is that src/backend/optimizer/README should
 probably be updated with the rules for commuting SEMI and ANTI joins;
 it currently only mentions INNER, LEFT, RIGHT, and FULL.

 Yeah, I noticed that too.  How embarrassing.  Will fix it as part of
 the patch, which I hope to start on tomorrow.

Cool.  On the topic of documentation, I find the following comment in
joinrels.c rather impenetrable:

/*
 * Do these steps only if we actually have a
regular semijoin,
 * as opposed to a case where we should
unique-ify the RHS.
 */

I don't think regular semijoin is a term of art, so I'm somewhat at
a loss to understand what this means.  And why as opposed to a case
where we should unique-ify the RHS?  ISTM the code will sometimes try
both...

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing Grittner's planner issues

2009-02-05 Thread Kevin Grittner
 Tom Lane t...@sss.pgh.pa.us wrote: 
 SELECT ... FROM
   Case C
   LEFT OUTER JOIN CaseDispo CD
 ON (CD.caseNo = C.caseNo) AND (CD.countyNo =
C.countyNo)
 AND (NOT (EXISTS (SELECT 1 FROM CaseDispo CD2
   WHERE (CD2.caseNo = CD.caseNo)
 AND (CD2.countyNo = CD.countyNo)
 AND (CD2.dispoDate 
CD.dispoDate
 WHERE some-clause-that-selects-just-a-few-C-rows
 
 that is, the EXISTS clause is part of the ON condition of an outer
join.
 If it referred to any variables of the left side of the upper join
 (ie, C here) then we couldn't convert it to a separate join at
all.
 
 I wondered if anyone had any comments
 
The only thing that comes to mind for me that seems possibly helpful
is that we have typically considered it obvious that in the context of
the NOT EXISTS clause we have already established that (CD.caseNo
= C.caseNo) AND (CD.countyNo = C.countyNo) and have not
been at all consistent about whether we used C or CD to compare to
CD2.  Our operating assumption has been that it didn't matter in that
context.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing Grittner's planner issues

2009-02-05 Thread Tom Lane
Kevin Grittner kevin.gritt...@wicourts.gov writes:
 Tom Lane t...@sss.pgh.pa.us wrote: 
 If it referred to any variables of the left side of the upper join
 (ie, C here) then we couldn't convert it to a separate join at
 all.
 
 The only thing that comes to mind for me that seems possibly helpful
 is that we have typically considered it obvious that in the context of
 the NOT EXISTS clause we have already established that (CD.caseNo
 = C.caseNo) AND (CD.countyNo = C.countyNo) and have not
 been at all consistent about whether we used C or CD to compare to
 CD2.  Our operating assumption has been that it didn't matter in that
 context.

Yeah, I had thought about suggesting that you could use that to
determine which type of plan got chosen, but immediately dismissed it
as too bletcherous for words.  In any case the same type of problem
could occur in scenarios where the upper and lower join conditions
aren't so neatly related.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing Grittner's planner issues

2009-02-05 Thread Robert Haas
 I think the only fix for this that's practical in the 8.4 time frame is
 to give up trying to flatten EXISTS/NOT EXISTS that occur within the ON
 condition of an outer join, ie, revert to 8.3's level of intelligence
 about this case.  It seems like a general purpose solution would involve
 somehow being able to separate the semantic effects of an outer join
 (in particular, null-insertion) from the time at which it's performed,
 and I've really got no idea how to do that or what consequences it would
 have for both the planner and the executor.

I think that A LEFT JOIN (B ANTI JOIN C) is equivalent to (A LEFT JOIN
B) LEFT JOIN (UNIQUE C) if you rewrite each attribute provided by b as
CASE WHEN (no matching tuple found in C) THEN b.attribute END.

On a related note, I have some vague unease about planning A SEMI JOIN
B as A INNER JOIN (UNIQUE B), as make_one_rel currently attempts to
do.  For a merge join or nested loop, I don't see how this can ever be
a win over teaching the executor to just not rescan B.  For a hash
join, it can be a win if B turns out to have duplicates, but then
again you could also just teach the executor to skip the insertion of
the duplicate into the table in the first place (it has to hash 'em
anyway...).  I think maybe I'm not understanding something about the
logic here.

It seems like you might want some infrastructure for cases where we
know we are just probing the inner side of the relation for a match.
If you find at least one, you either make a decision as to whether or
not to filter the tuple (regular semi/anti-join) or whether or not to
force certain fields to NULL (semi/anti-join under ON-clause).  But
all these cases can share the
we-don't-care-about-multiple-inner-matches logic.

 Reflecting on this further, I suspect there are also some bugs in the
 planner's rules about when semi/antijoins can commute with other joins;
 but that's not directly causing Kevin's problem, because the rules do
 make the correct decision for this case.

One thing I notice is that src/backend/optimizer/README should
probably be updated with the rules for commuting SEMI and ANTI joins;
it currently only mentions INNER, LEFT, RIGHT, and FULL.

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers