Re: [HACKERS] Risk Estimation WAS: Planner hints in Postgresql

2014-03-20 Thread Robert Haas
On Thu, Mar 20, 2014 at 10:45 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> So you might think that the problem here is that we're assuming
>> uniform density.  Let's say there are a million rows in the table, and
>> there are 100 that match our criteria, so the first one is going to
>> appear 1/10,000th of the way through the table.  Thus we set SC =
>> 0.0001 * TC, and that turns out to be an underestimate if the
>> distribution isn't as favorable as we're hoping.  However, that is NOT
>> what we are doing.  What we are doing is setting SC = 0.  I mean, not
>> quite 0, but yeah, effectively 0.  Essentially, no matter how selective
>> the filter condition may be, we assume that it will match *the very
>> first row*.
>
> I think this is wrong.  Yeah, the SC may be 0 or near it, but the time to
> fetch the first tuple is estimated as SC + (TC-SC)/N.

Hmm, you're right, and experimentation confirms that the total cost of
the limit comes out to about TC/selectivity.  So scratch that theory.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Risk Estimation WAS: Planner hints in Postgresql

2014-03-20 Thread Atri Sharma
On Thu, Mar 20, 2014 at 8:51 PM, Tom Lane  wrote:

> Atri Sharma  writes:
> > Now, why cannot we take the estimate of all the buckets behind the bucket
> > in which our value is present? Will that estimate not give us the fraction
> > of tuples that are expected to be before the first matching row?
>
> Uh, no, not unless you assume that the table happens to be perfectly
> sorted by the column's value.
>
>
>

Yes, that is true.  So, if the attribute has an index on it, could we do
this somehow?

Regards,

Atri



-- 
Regards,

Atri
*l'apprenant*


Re: [HACKERS] Risk Estimation WAS: Planner hints in Postgresql

2014-03-20 Thread Tom Lane
Atri Sharma  writes:
> Now, why cannot we take the estimate of all the buckets behind the bucket
> in which our value is present? Will that estimate not give us the fraction
> of tuples that are expected to be before the first matching row?

Uh, no, not unless you assume that the table happens to be perfectly
sorted by the column's value.

regards, tom lane




Re: [HACKERS] Risk Estimation WAS: Planner hints in Postgresql

2014-03-20 Thread Atri Sharma
On Thu, Mar 20, 2014 at 8:10 PM, Robert Haas  wrote:

> On Tue, Mar 18, 2014 at 2:41 PM, Tom Lane  wrote:
> > Atri Sharma  writes:
> >> One of the factors that leads to bad estimates is that the histogram
> >> of the values of a column maintained by the planner gets stale over
> >> time as the data in the column changes.  So the histogram is no longer
> >> an accurate view of the data, and it leads to bad selectivity
> >> estimates.
> >
> > TBH, this is so far down the list of problems that it'll be a long time
> > before we need to worry about it.  It's certainly not the number one
> > priority for any project to model risk in the planner.
> >
> > The thing that I think is probably the number one problem is estimates
> > that depend on an assumption of uniform distribution of sought-after rows
> > among those encountered by a scan.  This is usually where bad plans for
> > LIMIT queries are coming from.  We could certainly add some sort of fudge
> > factor to those costs, but I'd like to have a more-or-less principled
> > framework for doing so.
>
> I think the problem is, in some sense, more basic than that.  I think
> the kind of query we're talking about here is:
>
> SELECT * FROM foo WHERE unlikely ORDER BY indexed_column LIMIT 1
>
> Assume for the sake of argument that there are 100 rows that would be
> returned in the absence of the limit.  Let SC and TC be the startup
> cost and total cost of the index scan.  As a matter of general policy,
> we're going to say that the cost of this is SC + 0.01 * (TC - SC).
> What makes this path look appealing to the planner is that SC is small
> relative to TC.  If we knew, for example, that we weren't going to
> find the first match until 90% of the way through the index scan, then
> we could set SC = 90% * TC and, all else being equal, the planner
> would make the right decision.
>
> So you might think that the problem here is that we're assuming
> uniform density.  Let's say there are a million rows in the table, and
> there are 100 that match our criteria, so the first one is going to
> appear 1/10,000th of the way through the table.  Thus we set SC =
> 0.0001 * TC, and that turns out to be an underestimate if the
> distribution isn't as favorable as we're hoping.  However, that is NOT
> what we are doing.  What we are doing is setting SC = 0.  I mean, not
> quite 0, but yeah, effectively 0.  Essentially, no matter how selective
> the filter condition may be, we assume that it will match *the very
> first row*.
>
>

Couldn't we reuse the same histogram the planner already has for this?
I mean, AFAIK, the heuristic we have is that we divide the histogram
into equal-size buckets, find the bucket in which our predicate value
lies, and then take some fraction of that bucket plus all of the buckets
before it, right?

So, for a query like SELECT * FROM table WHERE a > 10, we would find the
bucket that 10 lies in, right?

Now, why cannot we take the estimate of all the buckets behind the bucket
in which our value is present? Will that estimate not give us the fraction
of tuples that are expected to be before the first matching row?

It's pretty wild, but I wanted to know whether my understanding of this
scenario is correct.
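
For concreteness, here is a rough sketch (not the planner's actual code;
the bucket boundaries, names, and numbers are made up) of the
equal-population-bucket arithmetic for a predicate like a > 10.  The
complement of this fraction is the "buckets before the match" quantity
described above.

/*
 * Rough sketch of the heuristic described above: equal-population
 * histogram buckets, with linear interpolation inside the bucket that
 * contains the constant.  Illustrative only.
 */
#include <stdio.h>

static double
hist_selectivity_gt(const double *bounds, int nbounds, double constval)
{
    int nbuckets = nbounds - 1;
    int i;

    if (constval < bounds[0])
        return 1.0;                 /* everything is > constval */
    if (constval >= bounds[nbounds - 1])
        return 0.0;                 /* nothing is > constval */

    for (i = 0; i < nbuckets; i++)
    {
        if (constval < bounds[i + 1])
        {
            /* fraction of this bucket above constval, by interpolation */
            double frac = (bounds[i + 1] - constval) /
                          (bounds[i + 1] - bounds[i]);

            /* plus all of the buckets after this one */
            return (frac + (nbuckets - i - 1)) / nbuckets;
        }
    }
    return 0.0;                     /* not reached */
}

int
main(void)
{
    /* 5 equal-population buckets over values 0..100 (hypothetical) */
    double bounds[] = {0, 20, 40, 60, 80, 100};

    /* WHERE a > 10: half of bucket 0 plus buckets 1..4 => ~0.9 */
    printf("selectivity(a > 10) = %.3f\n",
           hist_selectivity_gt(bounds, 6, 10.0));
    return 0;
}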

Regards,

Atri

-- 
Regards,

Atri
*l'apprenant*


Re: [HACKERS] Risk Estimation WAS: Planner hints in Postgresql

2014-03-20 Thread Tom Lane
Robert Haas  writes:
> So you might think that the problem here is that we're assuming
> uniform density.  Let's say there are a million rows in the table, and
> there are 100 that match our criteria, so the first one is going to
> appear 1/10,000th of the way through the table.  Thus we set SC =
> 0.0001 * TC, and that turns out to be an underestimate if the
> distribution isn't as favorable as we're hoping.  However, that is NOT
> what we are doing.  What we are doing is setting SC = 0.  I mean, not
> quite 0, but yeah, effectively 0.  Essentially, no matter how selective
> the filter condition may be, we assume that it will match *the very
> first row*.

I think this is wrong.  Yeah, the SC may be 0 or near it, but the time to
fetch the first tuple is estimated as SC + (TC-SC)/N.
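
A minimal sketch of that interpolation (illustrative only, not the
actual costing code; names and numbers are made up), assuming the
pro-rating generalizes linearly to fetching k of the N rows a path would
return:

/*
 * Sketch: charge a LIMIT a pro-rated share of the underlying path's
 * run cost, on top of its startup cost.
 */
#include <stdio.h>

static double
limited_cost(double startup_cost, double total_cost,
             double path_rows, double limit_rows)
{
    if (limit_rows >= path_rows)
        return total_cost;
    return startup_cost +
        (total_cost - startup_cost) * (limit_rows / path_rows);
}

int
main(void)
{
    /* Hypothetical index scan: SC = 5, TC = 100000, N = 100 matching rows */
    double sc = 5.0, tc = 100000.0, n = 100.0;

    /* LIMIT 1: SC + (TC - SC)/N, i.e. roughly TC/100 here, not just SC */
    printf("estimated LIMIT 1 cost: %.2f\n", limited_cost(sc, tc, n, 1.0));
    return 0;
}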

regards, tom lane




Re: [HACKERS] Risk Estimation WAS: Planner hints in Postgresql

2014-03-20 Thread Robert Haas
On Tue, Mar 18, 2014 at 2:41 PM, Tom Lane  wrote:
> Atri Sharma  writes:
>> One of the factors that leads to bad estimates is that the histogram of
>> the values of a column maintained by the planner gets stale over time as
>> the data in the column changes.  So the histogram is no longer an
>> accurate view of the data, and it leads to bad selectivity estimates.
>
> TBH, this is so far down the list of problems that it'll be a long time
> before we need to worry about it.  It's certainly not the number one
> priority for any project to model risk in the planner.
>
> The thing that I think is probably the number one problem is estimates
> that depend on an assumption of uniform distribution of sought-after rows
> among those encountered by a scan.  This is usually where bad plans for
> LIMIT queries are coming from.  We could certainly add some sort of fudge
> factor to those costs, but I'd like to have a more-or-less principled
> framework for doing so.

I think the problem is, in some sense, more basic than that.  I think
the kind of query we're talking about here is:

SELECT * FROM foo WHERE unlikely ORDER BY indexed_column LIMIT 1

Assume for the sake of argument that there are 100 rows that would be
returned in the absence of the limit.  Let SC and TC be the startup
cost and total cost of the index scan.  As a matter of general policy,
we're going to say that the cost of this is SC + 0.01 * (TC - SC).
What makes this path look appealing to the planner is that SC is small
relative to TC.  If we knew, for example, that we weren't going to
find the first match until 90% of the way through the index scan, then
we could set SC = 90% * TC and, all else being equal, the planner
would make the right decision.

So you might think that the problem here is that we're assuming
uniform density.  Let's say there are a million rows in the table, and
there are 100 that match our criteria, so the first one is going to
appear 1/10,000th of the way through the table.  Thus we set SC =
0.0001 * TC, and that turns out to be an underestimate if the
distribution isn't as favorable as we're hoping.  However, that is NOT
what we are doing.  What we are doing is setting SC = 0.  I mean, not
quite 0, but yeah, effectively 0.  Essentially, no matter how selective
the filter condition may be, we assume that it will match *the very
first row*.
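
To put illustrative numbers on the three assumptions (effectively-zero
startup, uniform density, and a first match 90% of the way through the
scan), here is a small arithmetic sketch using the SC + (TC - SC)/N
interpolation that comes up elsewhere in this thread.  The costs are
made up; this is not planner output.

/*
 * Compare the LIMIT-1 cost estimate under three assumptions about where
 * the first matching row is found.  TC is a made-up total cost for the
 * index scan; 100 of 1,000,000 rows match the filter.
 */
#include <stdio.h>

int
main(void)
{
    double tc = 100000.0;       /* hypothetical total cost of the scan */
    double nmatch = 100.0;      /* rows that satisfy the filter */

    /* Startup cost under three assumptions about the first match: */
    double sc_best = 0.0;            /* current behaviour: effectively 0 */
    double sc_uniform = 0.0001 * tc; /* first match 1/10,000th of the way in */
    double sc_late = 0.90 * tc;      /* first match 90% of the way in */

    /* LIMIT 1 cost = SC + (TC - SC)/N */
    printf("best case:    %.1f\n", sc_best + (tc - sc_best) / nmatch);
    printf("uniform:      %.1f\n", sc_uniform + (tc - sc_uniform) / nmatch);
    printf("90%% through:  %.1f\n", sc_late + (tc - sc_late) / nmatch);
    return 0;
}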

So we're not assuming the average case and getting hosed when things
come out worse than average.  We're assuming the *best* case.  So
unless things happen to really swing in our favor, we get hosed.

Now it might be that a fudge factor of 2 or 1.5 or 10 or 3 or 17 is
appropriate, so that we actually assume we're going to have to scan a
little more of the index than we expect.  That can perhaps be
justified by the possibility that there may actually be NO rows
matching the filter condition, and we'll have to try scanning the
entire index to get off the ground.  We could also try to come up with
a mathematical model for that.  But that fudge factor would presumably
be a multiplier on the effort of finding the first tuple.  And right
now we assume that finding the first tuple will be trivial.  So I
think we should fix THAT problem first, and then if that turns out to
be insufficient, we can worry about what further fudging is required.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Risk Estimation WAS: Planner hints in Postgresql

2014-03-18 Thread Tom Lane
Atri Sharma  writes:
> One of the factors that leads to bad estimates is that the histogram of
> the values of a column maintained by the planner gets stale over time as
> the data in the column changes.  So the histogram is no longer an
> accurate view of the data, and it leads to bad selectivity estimates.

TBH, this is so far down the list of problems that it'll be a long time
before we need to worry about it.  It's certainly not the number one
priority for any project to model risk in the planner.

The thing that I think is probably the number one problem is estimates
that depend on an assumption of uniform distribution of sought-after rows
among those encountered by a scan.  This is usually where bad plans for
LIMIT queries are coming from.  We could certainly add some sort of fudge
factor to those costs, but I'd like to have a more-or-less principled
framework for doing so.

regards, tom lane




Re: [HACKERS] Risk Estimation WAS: Planner hints in Postgresql

2014-03-18 Thread Atri Sharma
On Tue, Mar 18, 2014 at 11:43 PM, Josh Berkus  wrote:

>
> > On Mon, Mar 17, 2014 at 8:47 PM, Tom Lane  wrote:
> >> Yeah.  I would like to see the planner's cost estimates extended to
> >> include some sort of uncertainty estimate, whereupon risk-averse people
> >> could ask it to prefer low-uncertainty plans over high-uncertainty ones
> >> (the plans we typically choose for ORDER BY ... LIMIT queries being great
> >> examples of the latter).  But it's a long way from wishing that to making
> >> it so.  Right now it's not even clear (to me anyway) how we'd measure or
> >> model such uncertainty.
>
>

I have been thinking of ways to attach a risk estimate to each
selectivity that our planner gives.  One way to do it might be as
follows:

One of the factors that leads to bad estimates is that the histogram of
the values of a column maintained by the planner gets stale over time as
the data in the column changes.  So the histogram is no longer an
accurate view of the data, and it leads to bad selectivity estimates.

One thing we can try is to attach an error factor to the selectivity we
give.  This lets us factor in the probability that the data has changed,
i.e. an estimate of how much the current histogram differs from a
histogram built on the data actually present in the column right now.

We can use the Central Limit Theorem
(http://en.wikipedia.org/wiki/Central_limit_theorem).  Essentially, the
theorem says that, given a distribution with finite mean and variance,
we can take random independent samples from the data and calculate the
mean and standard deviation of each sample.  If we take a large enough
number of samples and plot the sample means, they will follow a normal
distribution.
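
For reference, the standard statement of the theorem being appealed to
(nothing PostgreSQL-specific; X_1, X_2, ... are i.i.d. samples with mean
\mu and finite variance \sigma^2):

  \[
    \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i,
    \qquad
    \sqrt{n}\,\bigl(\bar{X}_n - \mu\bigr)
      \;\xrightarrow{d}\; \mathcal{N}(0, \sigma^2)
    \quad \text{as } n \to \infty
  \]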

What is interesting is that this could allow us to predict the SD of a
given dataset from that curve, and the SD should be directly
proportional to the deviation the data has from the planner's histogram.

I am no mathematician, hence it's hard for me to explain; I think this
link [1] will be more helpful.

So, we can have a probability value for the random variable and that shall
model the confidence we have in our estimate.

I may be wrong in some parts but I hope I have been able to convey the
general idea.

If this idea develops, I shall be happy to work on it, but my hands are
full with ROLLUPS right now, so on my part it will take time.  For now I
just want to float the idea and get a general feel for it.

Please let me know your comments and feedback on this.

Regards,

Atri

[1]: http://www.theriac.org/DeskReference/viewDocument.php?id=177
-- 
Regards,

Atri
*l'apprenant*


Re: [HACKERS] Risk Estimation WAS: Planner hints in Postgresql

2014-03-18 Thread Josh Berkus

> On Mon, Mar 17, 2014 at 8:47 PM, Tom Lane  wrote:
>> Yeah.  I would like to see the planner's cost estimates extended to
>> include some sort of uncertainty estimate, whereupon risk-averse people
>> could ask it to prefer low-uncertainty plans over high-uncertainty ones
>> (the plans we typically choose for ORDER BY ... LIMIT queries being great
>> examples of the latter).  But it's a long way from wishing that to making
>> it so.  Right now it's not even clear (to me anyway) how we'd measure or
>> model such uncertainty.

This is not a model, but here's some starting thoughts:

A "high risk" plan has three components:

a) our statistical data is out-of-date or inadequate

b) the potential execution time is high if our estimates of selectivity
are wrong

c) the cost ratio of certain operations is wrong.

Factor (a) can be modeled two ways:

1. If last_analyze was a long time ago, the risk is increased.
   (Ideally, we'd have some idea of the change rate on the table vs.
   the last analyze time; right now we don't have those stats.)

2. Certain patterns, such as multi-column selectivity and GIN/GiST
selectivity, are known to have poor estimates and to be higher risk.
Certainly, selectivity functions which have been programmed with a flat
coefficient (like the default 0.05 selectivity for gist_ops) could also
return a fairly high risk factor.

Factor (b) can be modeled simply by estimating the cost of a plan where
all row estimates are changed by 10X, or, even better, by a calculation
based on the risk factor from (a).  This would then give us the "failure
cost" of the bad plan.  Note that we need to estimate in both
directions, both for higher estimates and lower ones; "abort early"
plans fail because the rows returned are lower than expected, for example.

Estimating (b) would be expensive if we tried every combination of wrong
estimates across the entire plan, so I'm wondering if it would be
adequate to just estimate the effect of the selectivity being off on a
per-node basis.
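
A hypothetical sketch of that per-node perturbation, with a made-up
linear cost function standing in for the real cost model and an optional
staleness multiplier standing in for the factor (a) risk.  None of this
is existing planner code; every name here is invented for illustration.

/*
 * Perturb one node's row estimate by 10X in each direction and report
 * the worst-case cost delta as that node's "failure cost", optionally
 * scaled by a factor-(a)-style staleness risk.
 */
#include <stdio.h>

/* Stand-in per-node cost: pretend cost grows linearly with rows. */
static double
node_cost(double rows)
{
    return 1000.0 + 0.01 * rows;
}

static double
failure_cost(double est_rows, double staleness_risk)
{
    double nominal = node_cost(est_rows);
    double high = node_cost(est_rows * 10.0);  /* estimate was 10X too low */
    double low = node_cost(est_rows / 10.0);   /* estimate was 10X too high */
    double worst = (high > low) ? high : low;

    /* staleness_risk >= 1.0, e.g. growing with time since last ANALYZE */
    return (worst - nominal) * staleness_risk;
}

int
main(void)
{
    printf("failure cost, fresh stats: %.0f\n", failure_cost(1e6, 1.0));
    printf("failure cost, stale stats: %.0f\n", failure_cost(1e6, 2.0));
    return 0;
}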

(c) we can't realistically estimate at all (i.e., if we knew the cost
factor was wrong, we'd fix it), so I suggest ignoring it for risk
estimation.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

