On Fri, Dec 5, 2014 at 12:46 AM, Simon Riggs si...@2ndquadrant.com wrote:
On 30 September 2014 at 05:53, Simon Riggs si...@2ndquadrant.com wrote:
On 29 September 2014 16:00, Merlin Moncure mmonc...@gmail.com wrote:
On Fri, Sep 26, 2014 at 3:06 AM, Simon Riggs si...@2ndquadrant.com wrote:
The problem, as I see it, is different. We assume that if there are
100 distinct values and you use LIMIT 1, you would only need to scan
1% of the rows. That is, we assume the data is arranged in the table
in a very homogeneous layout. When it is not, and it seldom is, we
get problems.
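A minimal sketch of the kind of query where that assumption bites; the
table, index, and constant here are hypothetical, not taken from the
original report:

    -- Planner logic, roughly: customer_id = 42 matches ~1% of rows, so an
    -- id-ordered scan should find a match within the first ~100 rows read.
    -- If the matching rows are all clustered near the end of the id range,
    -- the scan instead reads most of the index before it can stop.
    CREATE TABLE events (id serial PRIMARY KEY, customer_id int, payload text);
    CREATE INDEX ON events (customer_id);

    SELECT *
    FROM events
    WHERE customer_id = 42
    ORDER BY id
    LIMIT 1;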
Hm, good point -- 'data proximity'. At least in theory, can't this be
measured and quantified? For example, given the number of distinct
values, you could estimate the percentage of pages (or maybe the
number of non-sequential seeks relative to the number of pages) you'd
need to read to find all instances of a particular value in the
average (or perhaps the worst) case. One way of calculating that would
be to look at the proximity of values in sampled pages (perhaps with a
penalty for high update activity relative to table size). Data
proximity would then become a cost coefficient applied to the benefit
of the LIMIT.
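PostgreSQL already tracks something in this direction: pg_stats.correlation
measures how well a column's values follow the physical row order, though
not how tightly the duplicates of any one value are clustered. A rough way
to inspect it, reusing the hypothetical table above:

    -- correlation near +/-1: values laid out in physical order;
    -- near 0: values scattered across the whole table
    SELECT attname, n_distinct, correlation
    FROM pg_stats
    WHERE tablename = 'events';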
The necessary first step here is to realise that we can't simply
apply the LIMIT as a reduction in query cost in all cases.
The way I'm seeing it, you can't assume the LIMIT will apply to any
IndexScan that doesn't have an index condition. If it has just a
filter, or no condition at all (only an ordering), then it could
easily end up scanning the whole index if the stats are wrong.
So plans like the following could be wrong, because they assume the
scan will end earlier because of the LIMIT than it actually will.
Limit
  IndexScan (no index cond)

Limit
  NestJoin
    IndexScan (no index cond)
    SomeScan

Limit
  NestJoin
    NestJoin
      IndexScan (no index cond)
      SomeScan
    SomeScan

and deeper...
I'm looking for a way to identify and exclude such plans, assuming
that this captures at least some of the problem plans.
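Concretely, the first of those shapes might look like this in EXPLAIN
output (reusing the hypothetical events table from above; the plan text
is illustrative, not from a real run):

    EXPLAIN SELECT * FROM events
    WHERE customer_id = 42
    ORDER BY id
    LIMIT 1;

     Limit
       ->  Index Scan using events_pkey on events
             Filter: (customer_id = 42)

The Index Scan has a Filter but no Index Cond, so nothing bounds how far
it has to scan; the discount the LIMIT gives it may be unearned.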
After looking at this for some time I now have a patch that solves it.
It relies on the observation that index scans with no bounded quals
don't play nicely with LIMIT. The solution relies upon the point that
LIMIT does not reduce the startup cost of a plan, only the total cost.
So we can solve the problem by keeping the total cost estimate the
same but moving some of it into the startup cost, so that LIMIT does
not reduce the cost as much as before.
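To see why that is enough, recall that the planner discounts a path's
cost under a LIMIT roughly as (a simplified form; the exact
interpolation lives in the planner's costing code):

    cost_under_limit ~= startup_cost
                        + (total_cost - startup_cost) * (rows_to_fetch / rows_estimated)

So with startup_cost = 0, total_cost = 100000 and LIMIT 1 over an
estimated 1,000,000 rows, the scan looks essentially free. Shift, say,
half of the total into the startup cost and at least 50000 stays
visible to the planner no matter how small the LIMIT is. (These
numbers are made up purely for illustration.)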
It's a simple patch, but it solves the test cases I know about and
adds almost nothing to planning time.
I tried much less subtle approaches involving directly preventing
LIMIT pushdown, but the code was much too complex for my liking.
Neat -- got any test cases (and would this have prevented the OP's problem)?
merlin