Re: [HACKERS] Linear vs. nonlinear planner cost estimates

2016-12-16 Thread Robert Haas
On Wed, Dec 14, 2016 at 3:46 PM, Tom Lane wrote:
> The really *critical* aspect of this, which could cost far more than a
> few extra executions of a costing function, comes in if it damages
> add_path's ability to prune inferior paths early.  So we would not want
> to simply drop the startup_cost field; and we certainly don't want
> add_path calling this function during every path comparison, either.

Agreed.

> So my thought is that we'd need to keep startup_cost for add_path to
> use, though we could (and likely should) redefine it as below.  We
> should still be able to assume that if path A dominates path B at
> first-tuple and last-tuple costs, it dominates everywhere in between.

In theory, there's no reason that has to be true.  B could start out
a little more expensive than A, and then grow very slowly thereafter
until suddenly becoming super-expensive for the very last tuple, so
that for limits greater than 1 but less than the total number of tuples
produced, B actually wins.  This actually isn't a totally unrealistic
case: consider an index scan with a filter condition where all of the
tuples that pass the filter are clustered near the beginning.  Getting
any number of tuples that are actually present is cheap, but the very
last fetch that fails to find any more tuples is crushingly expensive.

That having been said, I doubt it's realistic to make the cost model
good enough to hope we'd get such cases correct, so making the
assumption that you propose is probably the right idea anyway.
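
A toy illustration of that shape (invented numbers, not planner code):
path A is cheaper at both the first-tuple and last-tuple endpoints, yet
B is cheaper for every limit in between.

#include <stdio.h>

#define NTUPLES 100

/* Path A: constant per-tuple cost */
static double
cost_a(int n)
{
    return 0.5 + 1.0 * n;
}

/* Path B: slightly worse first tuple, slow growth, huge final probe */
static double
cost_b(int n)
{
    double c = 1.6 + 0.1 * n;

    if (n == NTUPLES)
        c += 200.0;             /* last fetch scans the rest of the index */
    return c;
}

int
main(void)
{
    int         n;
    int         lo = 0;
    int         hi = 0;

    for (n = 1; n <= NTUPLES; n++)
    {
        if (cost_b(n) < cost_a(n))
        {
            if (lo == 0)
                lo = n;
            hi = n;
        }
    }
    printf("A cheaper at n=1 (%.1f vs %.1f) and n=%d (%.1f vs %.1f),\n",
           cost_a(1), cost_b(1), NTUPLES, cost_a(NTUPLES), cost_b(NTUPLES));
    printf("but B cheaper for limits %d..%d\n", lo, hi);
    return 0;
}

With these numbers, B wins for every limit from 2 through 99 even
though it loses at both endpoints.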

> A thought that occurred to me after more study of the problem example
> is that the "good" index is a bit bigger than the "bad" one, meaning
> it will get charged a bit more in index descent costs.  With our current
> definition of startup_cost as being only the index descent cost, that
> means the good index looks worse on startup_cost than the bad one.  In
> this example, the good index should survive the add_path tournament
> anyway because it has better total_cost --- but it's easy to imagine
> cases where that wouldn't be true.  So I'm thinking that redefining
> startup_cost as time to fetch the first tuple would be a good thing
> in terms of making sure that desirable plans don't lose out in add_path.

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Linear vs. nonlinear planner cost estimates

2016-12-14 Thread Tom Lane
Robert Haas writes:
> On Wed, Dec 14, 2016 at 2:12 PM, Tom Lane wrote:
>> I don't have a concrete proposal right now about how to fix this.  The
>> most expansive response would be to decorate every path with an explicitly
>> nonlinear cost function, which would need to be able to report the cost
>> to fetch the first N tuples rather than the total scan cost.  That would
>> be a lot of work though.

> And it would have some significant overhead, I bet.

I think the extra cost of running such a function would only come in when
we're actually considering a query with LIMIT and are ready to choose
which path gets the LIMIT stuck on top of it.  That seems okay to me.
My real pain point with extra planner work is when it happens in queries
that get no benefit.

The really *critical* aspect of this, which could cost far more than a
few extra executions of a costing function, comes in if it damages
add_path's ability to prune inferior paths early.  So we would not want
to simply drop the startup_cost field; and we certainly don't want
add_path calling this function during every path comparison, either.
So my thought is that we'd need to keep startup_cost for add_path to
use, though we could (and likely should) redefine it as below.  We
should still be able to assume that if path A dominates path B at
first-tuple and last-tuple costs, it dominates everywhere in between.
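
A minimal sketch of what that assumption buys (simplified; the real
add_path() in src/backend/optimizer/util/pathnode.c also compares
pathkeys, parameterization, and row counts, and applies a fuzz factor
to the costs):

#include <stdbool.h>

/* Only the two cost endpoints; the real Path struct has much more. */
typedef struct ToyPath
{
    double      startup_cost;   /* cost to fetch the first tuple */
    double      total_cost;     /* cost to fetch all tuples */
} ToyPath;

/*
 * Under the linearity assumption, "no worse at both endpoints" implies
 * "no worse at every intermediate fetch count", so the dominated path
 * can be discarded without evaluating any per-limit costing function.
 */
static bool
path_dominates(const ToyPath *a, const ToyPath *b)
{
    return a->startup_cost <= b->startup_cost &&
           a->total_cost <= b->total_cost;
}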

A thought that occurred to me after more study of the problem example
is that the "good" index is a bit bigger than the "bad" one, meaning
it will get charged a bit more in index descent costs.  With our current
definition of startup_cost as being only the index descent cost, that
means the good index looks worse on startup_cost than the bad one.  In
this example, the good index should survive the add_path tournament
anyway because it has better total_cost --- but it's easy to imagine
cases where that wouldn't be true.  So I'm thinking that redefining
startup_cost as time to fetch the first tuple would be a good thing
in terms of making sure that desirable plans don't lose out in add_path.
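
A hedged sketch of that redefinition for an index scan; the function
and parameter names here are hypothetical, not the actual cost_index()
interface:

/*
 * Estimate "cost to fetch the first tuple and then stop": descent cost
 * plus a proportional share of the run cost for the tuples we expect to
 * visit before the first one passes the filter.  Note this assumes
 * matches are spread uniformly through the scan -- exactly the
 * assumption the clustered-matches counterexample above violates.
 */
static double
index_first_tuple_cost(double descent_cost,   /* cost to descend the btree */
                       double run_cost,       /* cost to visit all tuples */
                       double tuples_scanned, /* tuples the scan will visit */
                       double selectivity)    /* fraction passing the filter */
{
    double      rows_to_first;

    if (tuples_scanned <= 0.0)
        return descent_cost;

    /* expected tuples visited before one passes the filter: 1/selectivity */
    rows_to_first = (selectivity > 0.0) ? 1.0 / selectivity : tuples_scanned;
    if (rows_to_first > tuples_scanned)
        rows_to_first = tuples_scanned;

    return descent_cost + run_cost * (rows_to_first / tuples_scanned);
}

Under this definition the bigger "good" index still pays its slightly
higher descent cost, but gets credit for reaching its first matching
tuple sooner, so it no longer loses the startup_cost comparison for the
wrong reason.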

>> Maybe we could make this work just by setting the startup cost of an
>> indexscan differently.  We could redefine "startup cost" as "cost to fetch
>> the first tuple and then stop", and compute that independently of the
>> total cost for plan types where it'd matter.  That would fix the problem
>> for LIMIT 1 cases, and even for cases with a larger limit, scaling
>> linearly from that to the total cost would produce a better estimate than
>> what we get now.  But it feels like it's still a band-aid.

> I think that would be a good place to start.  I bet it would help
> materially, and it's a lot less work than anything more sophisticated.
> There are plenty of cases (semi joins, and many nested loops that are
> not semi joins) where the number of rows that we expect to fetch is 1
> or very close to 1; LIMIT 100 or LIMIT 1000 happens, but it's a lot
> less common.

Yeah.  Also, per the further thoughts above, we'd probably still end up
redefining startup_cost like this even if we then plastered on a function
to do more complex interpolation in between first and last tuple.  So
it'd make sense to do this as a first step even if we add the additional
mechanism later.
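
For reference, one way the interpolation step could look once
startup_cost means first-tuple cost (names invented; the planner's
LIMIT costing already does essentially this arithmetic with today's
startup_cost):

/*
 * Linearly interpolate the cost of fetching the first n of path_rows
 * tuples.  A per-path nonlinear cost function could later replace the
 * linear term without changing callers.
 */
static double
cost_fetch_first_n(double startup_cost,  /* cost to fetch first tuple */
                   double total_cost,    /* cost to fetch all tuples */
                   double path_rows,     /* estimated rows in the path */
                   double n)             /* rows the caller will fetch */
{
    if (n >= path_rows || path_rows <= 1.0)
        return total_cost;

    /* first tuple costs startup_cost; later tuples share the remainder */
    return startup_cost +
        (total_cost - startup_cost) * ((n - 1.0) / (path_rows - 1.0));
}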

I'll put this on my to-do list...

regards, tom lane



Re: [HACKERS] Linear vs. nonlinear planner cost estimates

2016-12-14 Thread Robert Haas
On Wed, Dec 14, 2016 at 2:12 PM, Tom Lane wrote:
> I don't have a concrete proposal right now about how to fix this.  The
> most expansive response would be to decorate every path with an explicitly
> nonlinear cost function, which would need to be able to report the cost
> to fetch the first N tuples rather than the total scan cost.  That would
> be a lot of work though.

And it would have some significant overhead, I bet.

> Maybe we could make this work just by setting the startup cost of an
> indexscan differently.  We could redefine "startup cost" as "cost to fetch
> the first tuple and then stop", and compute that independently of the
> total cost for plan types where it'd matter.  That would fix the problem
> for LIMIT 1 cases, and even for cases with a larger limit, scaling
> linearly from that to the total cost would produce a better estimate than
> what we get now.  But it feels like it's still a band-aid.

I think that would be a good place to start.  I bet it would help
materially, and it's a lot less work than anything more sophisticated.
There are plenty of cases (semi joins, and many nested loops that are
not semi joins) where the number of rows that we expect to fetch is 1
or very close to 1; LIMIT 100 or LIMIT 1000 happens, but it's a lot
less common.
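
A worked toy comparison of the two candidate definitions in the
LIMIT 1 case, where the interpolated cost collapses to just the startup
cost (numbers invented for illustration):

#include <stdio.h>

int
main(void)
{
    /* "bad" index: small, cheap descent, but first match is far away */
    double      bad_descent = 4.0;
    double      bad_first_tuple = 90.0;

    /* "good" index: bigger, costlier descent, first match immediate */
    double      good_descent = 6.0;
    double      good_first_tuple = 7.0;

    printf("descent-cost definition picks the %s index\n",
           bad_descent < good_descent ? "bad" : "good");
    printf("first-tuple definition picks the %s index\n",
           bad_first_tuple < good_first_tuple ? "bad" : "good");
    return 0;
}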

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

