Re: [HACKERS] Cost estimates for parameterized paths

2011-11-09 Thread Tom Lane
More than a year ago, I wrote in
http://archives.postgresql.org/message-id/14624.1283463...@sss.pgh.pa.us

 Awhile back I ranted about replacing the planner's concept of inner
 indexscans with a more generalized notion of parameterized paths:
 http://archives.postgresql.org/pgsql-hackers/2009-10/msg00994.php

 The executor fixes for that are done, and now I'm grappling with getting
 the planner to do something useful with it.  The biggest problem I've run
 into is that a parameterized path can't really be assigned a fixed cost
 in the same way that a normal path can.  The current implementation of
 cost_index() depends on knowing the size of the outer relation --- that
 is, the expected number of execution loops for the indexscan --- in order
 to account for cache effects sanely while estimating the average cost of
 any one inner indexscan.

Since this project has been stalled for so long, I am thinking that what
I need to do to make some progress is to punt on the repeated-execution
cache effects problem, at least for the first cut.  I propose costing
parameterized inner paths on the worst case basis that they're only
executed once, and don't get any benefit from caching across repeated
executions.  This seems like a reasonably sane first-order approximation
on two grounds:

1. In most of the cases where such a plan is of interest, the outer
relation for the nestloop actually does provide only one or a few rows.
If it generates a lot of rows, you probably don't want a nestloop
anyhow.

2. In the cases where we really care, the alternatives are so much worse
that the parameterized nestloop will win even if it's estimated very
conservatively.

Another thing in the back of my mind is that the whole issue of cache
effects is something we know we don't model very well, so putting large
amounts of time into enlarging the present approach to handle more
complicated plan structures may be misplaced effort anyway.

So unless somebody's got a better idea, I'm going to push forward with
getting the parameterized-path infrastructure in place, and just settle
for a crude cost model for the moment.  Perhaps the implementation
experience will shake loose some new ideas about how to do the cost
modeling.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cost estimates for parameterized paths

2011-11-09 Thread Robert Haas
On Wed, Nov 9, 2011 at 5:12 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 More than a year ago, I wrote in
 http://archives.postgresql.org/message-id/14624.1283463...@sss.pgh.pa.us

 Awhile back I ranted about replacing the planner's concept of inner
 indexscans with a more generalized notion of parameterized paths:
 http://archives.postgresql.org/pgsql-hackers/2009-10/msg00994.php

 The executor fixes for that are done, and now I'm grappling with getting
 the planner to do something useful with it.  The biggest problem I've run
 into is that a parameterized path can't really be assigned a fixed cost
 in the same way that a normal path can.  The current implementation of
 cost_index() depends on knowing the size of the outer relation --- that
 is, the expected number of execution loops for the indexscan --- in order
 to account for cache effects sanely while estimating the average cost of
 any one inner indexscan.

 Since this project has been stalled for so long, I am thinking that what
 I need to do to make some progress is to punt on the repeated-execution
 cache effects problem, at least for the first cut.  I propose costing
 parameterized inner paths on the worst case basis that they're only
 executed once, and don't get any benefit from caching across repeated
 executions.  This seems like a reasonably sane first-order approximation
 on two grounds:

 1. In most of the cases where such a plan is of interest, the outer
 relation for the nestloop actually does provide only one or a few rows.
 If it generates a lot of rows, you probably don't want a nestloop
 anyhow.

 2. In the cases where we really care, the alternatives are so much worse
 that the parameterized nestloop will win even if it's estimated very
 conservatively.

I agree, on all counts.  Errors that make new planner possibilities
look unduly expensive aren't as serious as those that go the other
way, because the alternative is that you can never generate the new
plan at all.  That's why I'm sweating about the costing index-only
scans a bit.

 Another thing in the back of my mind is that the whole issue of cache
 effects is something we know we don't model very well, so putting large
 amounts of time into enlarging the present approach to handle more
 complicated plan structures may be misplaced effort anyway.

True.  And I think that might not even be the highest priority project
to tackle anyway.  The things that are hurting people most routinely
and hardest to fix with existing tools seem to be things like
cross-column correlation, and other selectivity estimation errors.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cost estimates for parameterized paths

2010-09-03 Thread Robert Haas
On Thu, Sep 2, 2010 at 5:31 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Awhile back I ranted about replacing the planner's concept of inner
 indexscans with a more generalized notion of parameterized paths:
 http://archives.postgresql.org/pgsql-hackers/2009-10/msg00994.php

 The executor fixes for that are done, and now I'm grappling with getting
 the planner to do something useful with it.  The biggest problem I've run
 into is that a parameterized path can't really be assigned a fixed cost
 in the same way that a normal path can.  The current implementation of
 cost_index() depends on knowing the size of the outer relation --- that
 is, the expected number of execution loops for the indexscan --- in order
 to account for cache effects sanely while estimating the average cost of
 any one inner indexscan.  We know that that is an important thing to do
 because the cost estimates seem to be a lot closer to reality now that we
 do that than what we were getting before; so dropping the consideration is
 entirely out of the question.

 The planner is already cheating on this to a considerable extent, because
 it estimates the cost of an inner indexscan only once, using the first
 outer rel we try to join to.  That cost is cached and reused with other
 potential outer-rel join partners, even though very different numbers of
 outer rows might be involved.  This problem will get a lot worse with the
 types of plans that I hope the planner will be able to come up with after
 this fix goes in, because the indexscan might be at the bottom of a join
 nest.  So we need a real fix not another hack.

 The best idea I can come up with at the moment is to compute best case
 and worst case costs for a parameterized path, corresponding to the
 largest and smallest numbers of loops we might expect it to be repeated
 for.  The largest number of loops could be estimated via the cartesian
 product of the sizes of the other tables in the query, for example.  The
 worst case cost is its cost if only repeated once.  Then, during path
 pruning in add_path, we only discard a parameterized path if its best-case
 cost is worse than the worst-case cost of some otherwise comparable path.
 Whenever we join the parameterized path with the required outer relation,
 we redo the cost calculation using that rel's actual rowcount estimate in
 order to form a final cost estimate for the no-longer-parameterized join
 path.

Interestingly, I previously proposed almost exactly this approach to
handle a couple of other problems:

http://archives.postgresql.org/pgsql-hackers/2009-09/msg01379.php
(index only scans)
http://archives.postgresql.org/pgsql-hackers/2009-10/msg00125.php
(per-query work mem)

I'm not entirely sure whether we can use this approach for more than
one kind of problem at a time; if we can't, it's probably not a good
idea to do it at all.  I also fear that any venture into this area
will  involve slowing down add_path(), which is a hotspot when
planning large join nests.  That might not be a win, if this is the
only case it allows us to improve.

*thinks*

It strikes me that the outer tuple count is essentially being used to
derate the cost of the index probes.  So another way to look at this
is that the best-case cost of an index probe is just the CPU cost, and
the worst-case cost of an index probe is the CPU cost plus the disk
cost.  So your fear that not many parameterized paths will be
discarded is probably valid.  But then, maybe that's OK.  It seems to
me that the critical point is to make sure that we don't form them in
the first place in cases where we could get the same benefit by
commuting the joins.  Once we've gotten to the point of considering a
plan of this type, the chances that it's actually the best plan seem
pretty high.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cost estimates for parameterized paths

2010-09-03 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Thu, Sep 2, 2010 at 5:31 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 The best idea I can come up with at the moment is to compute best case
 and worst case costs for a parameterized path,

 Interestingly, I previously proposed almost exactly this approach to
 handle a couple of other problems:

I thought it seemed familiar ;-)

 I'm not entirely sure whether we can use this approach for more than
 one kind of problem at a time; if we can't, it's probably not a good
 idea to do it at all.

I don't see why we couldn't do it in principle.  The issue of course
is how many paths survive, and can we tolerate that from a planner
performance standpoint?

On reflection I think that for parameterized paths the problem won't be
too bad, because (a) we'll ignore parameterized paths except when
considering a join to the right outer rel, so their presence in the
rel's pathlist won't cost much otherwise, and (b) we will only generate
them when there's a suitable joinclause, so the number of potential
paths will be limited.  I am not sure those statements can be made for
the other applications you suggested, though.

There's probably not much we can do at this point except code up the
idea and see how well it works.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cost estimates for parameterized paths

2010-09-03 Thread Robert Haas
On Fri, Sep 3, 2010 at 2:04 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 On reflection I think that for parameterized paths the problem won't be
 too bad, because (a) we'll ignore parameterized paths except when
 considering a join to the right outer rel, so their presence in the
 rel's pathlist won't cost much otherwise,

Hmm.  Maybe they should go into a separate path list, and perhaps we
could do the min/max calculations only with that pathlist (at least
for now), thus avoiding taking a generalized penalty to handle this
specific case.  IIUC, a parameterized path should never cause an
unparamaterized path to be thrown out, even if the unparameterized
path costs more than the worst-case cost for the parameterized path,
because the parameterized path constrains the possible join strategies
higher up, so what looked like a great idea at first blush might turn
out to suck when the chips are down.  Then, too, I'm not sure we can
even guarantee it will always be possible to form a valid plan around
a given parameterized path, which is another good reason not to
discard any unparameterized alternatives that may exist.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cost estimates for parameterized paths

2010-09-03 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Fri, Sep 3, 2010 at 2:04 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 On reflection I think that for parameterized paths the problem won't be
 too bad, because (a) we'll ignore parameterized paths except when
 considering a join to the right outer rel, so their presence in the
 rel's pathlist won't cost much otherwise,

 Hmm.  Maybe they should go into a separate path list, and perhaps we
 could do the min/max calculations only with that pathlist (at least
 for now), thus avoiding taking a generalized penalty to handle this
 specific case.  IIUC, a parameterized path should never cause an
 unparamaterized path to be thrown out,

Yeah, but the converse isn't true.  I had considered the idea of keeping
parameterized paths in a separate list, but you'd still have to examine
the main list to look for unparameterized paths that might dominate them.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cost estimates for parameterized paths

2010-09-03 Thread Robert Haas
On Fri, Sep 3, 2010 at 5:53 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Fri, Sep 3, 2010 at 2:04 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 On reflection I think that for parameterized paths the problem won't be
 too bad, because (a) we'll ignore parameterized paths except when
 considering a join to the right outer rel, so their presence in the
 rel's pathlist won't cost much otherwise,

 Hmm.  Maybe they should go into a separate path list, and perhaps we
 could do the min/max calculations only with that pathlist (at least
 for now), thus avoiding taking a generalized penalty to handle this
 specific case.  IIUC, a parameterized path should never cause an
 unparamaterized path to be thrown out,

 Yeah, but the converse isn't true.  I had considered the idea of keeping
 parameterized paths in a separate list, but you'd still have to examine
 the main list to look for unparameterized paths that might dominate them.

Definitely true, but if it avoids slowing down add_path() in the
common case, it's worth it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers