Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-12 Thread Nathan Boley
 Assuming that the threshold
 for switching to an indexscan is somewhere around selectivity 0.005
 (I am not certain offhand, but it's in that general area), this cannot
 possibly require more than 200 MCV slots, and for most data
 distributions it'd be a whole lot less.

That's a really good point.

 Given such an MCV list, the planner will always make the right choice
 of whether to do index or seqscan

Given that, wouldn't it be smarter to consider a value as an mcv
candidate iff it has a density greater than 0.005, rather than having
a count greater than 1.5*average? This would allow people to raise the
hard mcv limit without having to worry as much about including
worthless mcv values...
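A rough Python sketch of the difference between the two candidacy rules, using made-up per-value counts (only the 0.005 cutoff and the 1.5*average rule come from this thread; the data and names are invented for illustration):

    # Toy distribution of value -> rowcount: two genuinely common values plus
    # many values that are only slightly above the average count.
    counts = {'common1': 20_000, 'common2': 8_000}
    counts.update({f'mid{i}': 20 for i in range(500)})        # a bit above average
    counts.update({f'rare{i}': 5 for i in range(190_000)})    # the long tail
    n = sum(counts.values())
    avg = n / len(counts)

    by_avg = [v for v, c in counts.items() if c > 1.5 * avg]        # current-style rule
    by_density = [v for v, c in counts.items() if c / n > 0.005]    # suggested rule

    print(f"n={n}, ndistinct={len(counts)}, avg count={avg:.2f}")
    print(f"count > 1.5*avg  -> {len(by_avg)} candidates")      # includes all the 'mid' values
    print(f"density > 0.005  -> {len(by_density)} candidates")  # only the two common values

With a large statistics target, the first rule happily admits values that will never change the seqscan/indexscan decision, while the density rule does not.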

Cheers,
Nathan



Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-12 Thread Tom Lane
Nathan Boley [EMAIL PROTECTED] writes:
 Given that, wouldn't it be smarter to consider a value as an mcv
 candidate iff it has a density greater than 0.005, rather than having
 a count greater than 1.5*average?

Yeah, perhaps ... want to experiment with that?  Though I'd be a bit
worried about how to get the threshold right, seeing that it will
depend a whole lot on what the user has selected for random_page_cost
and other parameters.

regards, tom lane



Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-11 Thread Gregory Stark
Tom Lane [EMAIL PROTECTED] writes:

 Gregory Stark [EMAIL PROTECTED] writes:
 It's possible that the second option I described -- teaching Append when to
 use something other than sum() -- would only work in the cases where
 constraint exclusion could be fixed though. In which case having fractional
 row counts might actually be necessary. 

 The above is just armwaving.  IMHO, if you don't understand the
 structure of the table set then you're not going to be able to get the
 desired behavior via fractional rowcounts either.

That's only a specific subset of cases. You could just as easily have quals
which are only coincidentally related to the partition key, or not related to
it at all, but which are simply very selective and produce no records from
some partitions.

The bottom line is that if you have a large table, our statistics do a good
job of estimating the selectivity of a WHERE clause, with the row estimate
clamped to a minimum of 1. If you partition it into 100 partitions, then the
minimum is effectively clamped to 100.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Ask me about EnterpriseDB's On-Demand Production Tuning



Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Zeugswetter Andreas OSB sIT

 Obviously we run into problems when
 a) we have a poor estimate for ndistinct - but then we have
 worse problems
 b) our length measure doesn't correspond well with ndistinct
 in an interval

One more problem with low ndistinct values is that the condition might very
well hit no rows at all, but Idea 1 will then grossly overestimate the number
of hits.

e.g. a char(2) field has a histogram bin for 'a1' - 'b1' with ndistinct = 2,
because the actual values in the bin are 'a1' and 'a2'. A query for 'a3' now
gets a bogus estimate of nrowsperbin / 2.
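A tiny numeric sketch of that failure mode, with made-up row counts (nothing here comes from real statistics; it only illustrates the arithmetic):

    # One histogram bucket covering the range 'a1' .. 'b1', holding 10,000 rows,
    # but only two values are actually present in it: 'a1' and 'a2'.
    rows_in_bucket = 10_000
    ndistinct_in_bucket = 2
    actual_rows = {'a1': 6_000, 'a2': 4_000}

    # Idea 1 estimate for any probe value that falls inside this bucket's range:
    estimate = rows_in_bucket / ndistinct_in_bucket          # 5000

    for probe in ('a1', 'a2', 'a3'):
        print(probe, "estimate:", estimate, "actual:", actual_rows.get(probe, 0))
    # 'a3' sorts inside the bucket's range but matches nothing:
    # estimated 5000 rows, actual 0.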

I think for low ndistinct values we will want to know the exact
value + counts and not a bin. So I think we will want additional stats rows
that represent value 'a1' stats.

Andreas




Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Gregory Stark
Zeugswetter Andreas OSB sIT [EMAIL PROTECTED] writes:

 I think for low ndistinct values we will want to know the exact
 value + counts and not a bin. So I think we will want additional stats rows
 that represent value 'a1' stats.

Isn't that what our most frequent values list does?

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Ask me about EnterpriseDB's Slony Replication support!



Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Nathan Boley
  One more problem with low ndistinct values is that the condition might 
  very well
  hit no rows at all. But Idea 1 will largely overestimate the number of 
  hits.

That's a good point, but I don't see a clear solution. Maybe we could look at
past queries and keep track of how often they return empty result sets?

It seems that, in some ways, we care about the distribution of the
query values in addition to the column values...

  I think for low ndistinct values we will want to know the exact
  value + counts and not a bin. So I think we will want additional stats rows
  that represent value 'a1' stats.

 Isn't that what our most frequent values list does?

 Maybe? Do we have the relevant stats for each?
 But the trick is then to exclude those values from the histogram bins.

Currently, the histogram is only made up of non-mcv values.



Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Jeff Davis
On Tue, 2008-06-10 at 08:51 -0700, Nathan Boley wrote:
   One more problem with low ndistinct values is that the condition might 
   very well
   hit no rows at all. But Idea 1 will largely overestimate the number of 
   hits.
 
 Thats a good point, but I don't see a clear solution. Maybe we could

I think that MCVs are the solution, right? A low ndistinct means that
those values will likely be MCVs.

Since you brought up a different process for choosing MCVs, perhaps a
better name would be "Most Interesting Values".

Regards,
Jeff Davis






Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Nathan Boley
   One more problem with low ndistinct values is that the condition might 
   very well
   hit no rows at all. But Idea 1 will largely overestimate the number of 
   hits.

 Thats a good point, but I don't see a clear solution. Maybe we could

 I think that MCVs are the solution, right?

Only if they cover the entire range of values in the table.

 A low ndistinct means that those values will likely be MCVs.

Yes, but I don't think that's the point.

If we query on values that aren't in the table, the planner will
always overestimate the expected number of returned rows because it (
implicitly ) assumes that every query will return at least 1 record.



Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Tom Lane
Nathan Boley [EMAIL PROTECTED] writes:
 If we query on values that aren't in the table, the planner will
 always overestimate the expected number of returned rows because it (
 implicitly ) assumes that every query will return at least 1 record.

That's intentional and should not be changed.  I can't see the value
of allowing fractional-row estimates anyway.  Now if we can *prove*
the query matches no rows, that's a whole nother matter, but statistics
won't help us with that.

regards, tom lane



Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Nathan Boley
 If we query on values that aren't in the table, the planner will
 always overestimate the expected number of returned rows because it (
 implicitly ) assumes that every query will return at least 1 record.

 That's intentional and should not be changed.

Why?  What if ( somehow ) we knew that there was a 90% chance that the
query would return an empty result set on a big table with 20 non-mcv
distinct values? Currently the planner would always choose a seq scan,
when an index scan might be better. Better yet, couldn't that be
optimized to *if a record exists, execute the seq scan*? That being said, I
think queries are generally searching for values that exist in the
table.

 I can't see the value of allowing fractional-row estimates anyway.

Neither can I, but I could probably think of cases where knowing the
SD of the result set size could lead to better plans.

-Nathan



Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Tom Lane
Nathan Boley [EMAIL PROTECTED] writes:
 If we query on values that aren't in the table, the planner will
 always overestimate the expected number of returned rows because it (
 implicitly ) assumes that every query will return at least 1 record.
 
 That's intentional and should not be changed.

 Why?  What if ( somehow ) we knew that there was a 90% chance that
 query would return an empty result set on a big table with 20 non-mcv
 distinct values. Currently the planner would always choose a seq scan,
 where an index scan might be better.

(1) On what grounds do you assert the above?

(2) What makes you think that an estimate of zero rather than one row
would change the plan?

(In fact, I don't think the plan would change, in this case.  The reason
for the clamp to 1 row is to avoid foolish results for join situations.)

regards, tom lane



Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Nathan Boley
 Why?  What if ( somehow ) we knew that there was a 90% chance that
 query would return an empty result set on a big table with 20 non-mcv
 distinct values. Currently the planner would always choose a seq scan,
 where an index scan might be better.

 (1) On what grounds do you assert the above?

For a table with 100 non-mcv rows, the planner estimates a result
set of cardinality 100/20 = 5, not 1.

 (2) What makes you think that an estimate of zero rather than one row
 would change the plan?

I see where the confusion is coming from. When I said

 What if ( somehow ) we knew that there was a 90%
 chance that query would return an empty result set

I meant that the planner doesn't know that information. And how could it?

The estimate for ndistinct is an estimate for the number of distinct
values in the table, not an estimate for the number of distinct values
that will be queried for.  My original point was that we sometimes
care about the distribution of what's being queried for and not just
what's in the table.

But this is all silly anyway: if this were really a concern you would
write a function

    if values exist:
        return values
    else:
        return none

 (In fact, I don't think the plan would change, in this case.  The reason
 for the clamp to 1 row is to avoid foolish results for join situations.)

Which makes sense. My point certainly wasn't, in any way, a criticism
of clamping the row estimate to 1.

Cheers,
Nathan



Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Tom Lane
Nathan Boley [EMAIL PROTECTED] writes:
 (1) On what grounds do you assert the above?

 For a table with 100 non-mcv rows, the planner estimates a result
 set of cardinality 100/20 = 5, not 1.

The real problem in that situation is that you need another twenty slots
in the MCV list.  The MCV list should *always* exhaust the set of values
for which it'd be bad to do an indexscan.  Assuming that the threshold
for switching to an indexscan is somewhere around selectivity 0.005
(I am not certain offhand, but it's in that general area), this cannot
possibly require more than 200 MCV slots, and for most data
distributions it'd be a whole lot less.

Given such an MCV list, the planner will always make the right choice
of whether to do index or seqscan ... as long as it knows the value
being searched for, that is.  Parameterized plans have a hard time here,
but that's not really the fault of the statistics.

 I see where the confusion is coming from. When I said
 What if ( somehow ) we knew that there was a 90%
 chance that query would return an empty result set
 I meant that the planner doesn't know that information. And how could it?

Hmm.  IIRC the estimates are set up on the assumption that you are
searching for a value that occurs in the table.  I suppose there are
applications where that's often false, but as you say, it's hard to know
that in advance.

regards, tom lane



Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Gregory Stark
Tom Lane [EMAIL PROTECTED] writes:

 (In fact, I don't think the plan would change, in this case.  The reason
 for the clamp to 1 row is to avoid foolish results for join situations.)

The screw case I've seen is when you have a large partitioned table where
constraint_exclusion fails to exclude the irrelevant partitions. You're going
to get 0 rows from all but the one partition which contains the 1 row you're
looking for. But since each partition is clamped to 1 you end up with an
estimate of a few hundred rows coming out of the Append node.

The natural way to kill this is to allow fractional rows for these scans. We
know they're usually going to produce 0 rows, so if the estimates reflected
the right average expected value, the sum would add up to 1 and the Append
node would get the right value.

Alternatively we could make Append more clever about estimating the number of
rows it produces. Somehow it could be informed of some holistic view of the
quals on its children and how they're inter-dependent. If it's told that only
one of its children will produce rows then it can use max() instead of sum()
to calculate its rows estimate.
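A back-of-the-envelope sketch of the three estimates being discussed, with assumed numbers (100 partitions, a lookup of one unique value, constraint exclusion failing):

    n_partitions = 100

    # Per-partition row estimates when the planner has to scan every partition
    # for a value that exists in exactly one of them.
    clamped = [1] * n_partitions                    # current behaviour: each scan clamped to >= 1 row
    fractional = [1 / n_partitions] * n_partitions  # hypothetical fractional estimates

    print("Append via sum(), clamped   :", sum(clamped))      # ~100 rows, far too high
    print("Append via sum(), fractional:", sum(fractional))   # 1.0, about right
    print("Append via max(), clamped   :", max(clamped))      # 1, also about right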

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Ask me about EnterpriseDB's Slony Replication support!



Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Tom Lane
Gregory Stark [EMAIL PROTECTED] writes:
 The screw case I've seen is when you have a large partitioned table where
 constraint_exclusion fails to exclude the irrelevant partitions. You're going
 to get 0 rows from all but the one partition which contains the 1 row you're
 looking for. But since each partition is clamped to 1 you end up with an
 estimate of a few hundred rows coming out of the Append node.

 The natural way to kill this is to allow fractional rows for these scans.

No, the right fix is to fix the constraint-exclusion failure.

 Alternatively we could make Append more clever about estimating the number of
 rows it produces. Somehow it could be informed of some holistic view of the
 quals on its children and how they're inter-dependent. If it's told that only
 one of its children will produce rows then it can use max() instead of sum()
 to calculate its rows estimate.

This gets back to the discussions at PGCon about needing to have a more
explicit representation of partitioning.  Right now, for a
many-partition table we spend huge amounts of time deriving the expected
behavior from first principles, each time we make a plan.  And even then
we can only get it right for limited cases (eg, no parameterized
queries).  If the planner actually understood that a set of tables
formed a partitioned set then it'd be a lot easier and faster to get the
desired behavior, not only with respect to the rowcount estimates but
the plan's structure.

regards, tom lane



Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Gregory Stark
Tom Lane [EMAIL PROTECTED] writes:

 Gregory Stark [EMAIL PROTECTED] writes:
 The screw case I've seen is when you have a large partitioned table where
 constraint_exclusion fails to exclude the irrelevant partitions. You're going
 to get 0 rows from all but the one partition which contains the 1 row you're
 looking for. But since each partition is clamped to 1 you end up with an
 estimate of a few hundred rows coming out of the Append node.

 The natural way to kill this is to allow fractional rows for these scans.

 No, the right fix is to fix the constraint-exclusion failure.

There are always going to be cases where that's not possible.

It's possible that the second option I described -- teaching Append when to
use something other than sum() -- would only work in the cases where
constraint exclusion could be fixed though. In which case having fractional
row counts might actually be necessary. 

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Ask me about EnterpriseDB's Slony Replication support!



Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Tom Lane
Gregory Stark [EMAIL PROTECTED] writes:
 It's possible that the second option I described -- teaching Append when to
 use something other than sum() -- would only work in the cases where
 constraint exclusion could be fixed though. In which case having fractional
 row counts might actually be necessary. 

The above is just armwaving.  IMHO, if you don't understand the
structure of the table set then you're not going to be able to get the
desired behavior via fractional rowcounts either.

regards, tom lane



Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Andrew Dunstan



Tom Lane wrote:

This gets back to the discussions at PGCon about needing to have a more
explicit representation of partitioning.  Right now, for a
many-partition table we spend huge amounts of time deriving the expected
behavior from first principles, each time we make a plan.  And even then
we can only get it right for limited cases (eg, no parameterized
queries).  If the planner actually understood that a set of tables
formed a partitioned set then it'd be a lot easier and faster to get the
desired behavior, not only with respect to the rowcount estimates but
the plan's structure.


  


Even if this doesn't solve every problem, it's surely worth doing for 
those it will solve.


cheers

andrew



Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-09 Thread Jeff Davis
On Sun, 2008-06-08 at 19:03 -0400, Tom Lane wrote:
 Your argument seems to consider only columns having a normal
 distribution.  How badly does it fall apart for non-normal
 distributions?  (For instance, Zipfian distributions seem to be pretty
 common in database work, from what I've seen.)
 

If using Idea 1 (keep an array of stadistinct values, one per histogram
bucket), I would expect it to still give a better estimate than we get
currently, because it keeps a separate ndistinct for each histogram
bucket.

Regards,
Jeff Davis




Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-09 Thread Nathan Boley
 Your argument seems to consider only columns having a normal
 distribution.

My example was based upon normally distributed data because people are
usually familiar with the normal distribution and such data is reasonably
common.

 How badly does it fall apart for non-normal distributions?

This should work to the extent that there is a correlation between the
number of distinct values in a histogram bin and the 'size' of the
bin. I know that this isn't a very good answer, but it's a hard
problem in the general case.

However, for any non-uniform probability distribution there exists a
measure under which there is a non-zero correlation between these. I
wrote up a hand-wavy semi-formal argument at the end, but it's not
tough to see intuitively.

Just think about the graph of the CDF. The histogram boundaries are
horizontal lines that are equally spaced from top to bottom. If we
rescale the x-axis such that there is a fixed distance between each distinct
value, and define the distance as ( ndistinct in a given
interval )/( total ndistinct ), then the only place where this doesn't
hold is when the CDF is f(x) = x, i.e. the distribution is uniform. And even
then we just get a 0 coefficient, which is exactly what we are always
assuming now.
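A small simulation sketch of that intuition (made-up data; equal-frequency buckets built naively from a sorted sample, which is a simplification of the real histogram machinery): for a normal sample the wide outer buckets hold far more distinct values than the narrow middle ones, while for a uniform sample the per-bucket ndistinct stays roughly flat.

    import random

    def bucket_stats(values, n_buckets=10):
        """Split a sorted sample into equal-frequency buckets and report
        (bucket width, ndistinct in bucket) for each one."""
        values = sorted(values)
        per = len(values) // n_buckets
        stats = []
        for b in range(n_buckets):
            chunk = values[b * per:(b + 1) * per]
            stats.append((chunk[-1] - chunk[0], len(set(chunk))))
        return stats

    random.seed(0)
    # Values rounded to a grid so that ndistinct is meaningful
    # (like heights measured to the nearest centimeter).
    normal = [round(random.gauss(0, 1), 1) for _ in range(100_000)]
    uniform = [round(random.uniform(-3, 3), 1) for _ in range(100_000)]

    for name, sample in (("normal", normal), ("uniform", uniform)):
        print(name)
        for width, nd in bucket_stats(sample):
            print(f"  width={width:5.2f}  ndistinct={nd}")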

Obviously we run into problems when
a) we have a poor estimate for ndistinct - but then we have worse problems
b) our length measure doesn't correspond well with ndistinct in an interval

expanding on b)...

Your mention of Zipfian distributions is a particularly good example of
where this could be poor. Right now ( someone correct me if I'm wrong )
words are sorted alphabetically. However, if we wanted this
estimator to do a good job, we would sort them by their length or,
better yet, by their frequency in the English language ( which is highly
correlated with length ).

 (For instance, Zipfian distributions seem to be pretty common in database 
 work, from what I've seen.)

This should work well for any power law distribution. If people are
curious, I can rerun my previous example using a power law
distribution instead of normal.

However, the easy way of thinking about all of this is that we're
building a linear model between ndistinct and histogram bucket width.
It's intuitive to expect there to be a correlation between the two,
and so the model should have some predictive power.

-Nathan


somewhat formal - probably will be difficult to read without some
basic probability theory
To see this, first note that we can alternatively define the uniform
distribution on [a,b] as the distribution whose CDF is a straight line
that passes through both (a,0) and (b,1) ( just integrate the PDF ).
So any non-uniform distribution will have a CDF with slope that is
both below and above 1/(b-a) at some set of points, implying the
existence of an interval [i1, i2] such that ( CDF(i2) - CDF(i1) ) < ( i2 - i1
)/(b-a). Then, under the constraints of the probability measure, there
exists a second disjoint interval such that ( CDF(i2') - CDF(i1') ) > ( i2' -
i1' )/(b-a). In other words, the distribution puts less than a uniform
share of probability mass on the first interval and more on the second.

Next, assume that the number of potential distinct values in our
interval scales linearly with the length of the interval. Although it
seems as if this assumption could be somewhat burdensome, there always
exists a measure under which this is true for sets with a defined
order relation.  ( As remarked earlier by Tom, we are already making
this assumption. ) To see this, consider defining length(i1, i2)
as ( the number of distinct values in [i1, i2] )/( total number of distinct
values ), where the number of distinct values is the size of the set
{ v | v >= i1 and v <= i2 }.

Next, note that the joint distribution of identically distributed,
independent random variables is multinomial with cell probabilities
given by the value of the pdf at each distinct point. I'll
state without proof that for IID RVs the expected number of
distinct values is maximized for a uniform distribution ( this is
pretty easy to see: think about the binomial case -- do you want your
cell probabilities to be ( 1.0, 0 ) or ( 0.5, 0.5 )? )

Finally, note that the expected number of distinct values decreases
faster than linearly in the length of the interval. This is pretty
clear when we consider the sparse case. As the number of potential
entries ( in this case, the interval length ) approaches infinity, the
probability of a new entry being distinct approaches 1. This means
that, in this limit, every new entry ends up being distinct, i.e. the
number of distinct values scales linearly in the number of new
entries. As the interval shrinks, new entries have some probability of
being repeats. As the interval shrinks to 0, there is a zero
probability of new entries being unique. Since

a) there doesn't exist a linear relationship that contains the two
boundary points,
b) the multinomial distribution of the PDF is continuous, and
c) the relationship is clearly decreasing,

we can surmise that it is sub-linear.

Therefore, we have two intervals that have sub and super linear slopes
that cancel one another. However, 

[HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-08 Thread Nathan Boley
Currently eqsel assumes that, except for the values stored as mcv's,
the number of times that a value appears in a table column is
independent of its value. Unfortunately, this often yields
inappropriate selectivity estimates and frequently leads to
inappropriate plans.

As an example, consider an insurance company that keeps a record of
patient heights. Assume there are 1000000 patient heights in this
column, and that they are distributed normally with a mean of 1.7526 and a
standard deviation of 0.0762. Furthermore, assume that the heights are
only measured to the nearest centimeter. Then we'd expect there to be
about 73 distinct heights, with an SD of 1.5.
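A quick simulation sketch (mine, not from the original mail) to sanity-check those figures; the parameters are the ones given above, and the output should land roughly in the neighborhood of 73 distinct values with an SD near 1.5:

    import random
    import statistics

    random.seed(0)
    distinct_counts = []
    for _ in range(5):
        # 1,000,000 heights ~ Normal(1.7526, 0.0762), rounded to the nearest cm.
        heights = [round(random.gauss(1.7526, 0.0762), 2) for _ in range(1_000_000)]
        distinct_counts.append(len(set(heights)))

    print("distinct heights per trial:", distinct_counts)
    print("mean:", statistics.mean(distinct_counts),
          " SD:", round(statistics.stdev(distinct_counts), 2))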

Ignoring the effects of MCV's, the planner expects
  SELECT height FROM heights WHERE height = 1.75;
to yield roughly 13000 results. However, given that we know the
underlying distribution, we would expect to see ~52000 results.

Similarly, the planner expects to see 13000 results from
  SELECT height FROM heights WHERE height = 2.05;
while we would expect to see about 2.7.

Obviously this example is not totally convincing: if I were to post
this to pg-general looking for advice I'm sure that everyone would
tell me to just increase the size of my mcv stats. However, in cases
where the number of distinct values is higher, this isn't always
feasible. Also, why store a list of 50 values and their frequencies
when 10 extra numbers would provide the same plans without bloating
pg_statistic?

To combat this problem, I have two different proposals.

Idea 1: Keep an array of stadistinct that correspond to each bucket size.

In the example above, ( again ignoring mcv's ) the quantile data is

  0%     10%    20%    30%    40%    50%    60%    70%    80%    90%    100%
  1.38   1.66   1.69   1.71   1.73   1.75   1.77   1.79   1.82   1.85   2.12

with numdistinct values of ( respectively )

  29  2  2  2  2  2  2  3  3  25

For the two above examples, this new approach would yield selectivity
estimates of

(1000000/10)/2 = 50000  ( vs an actual ED of ~52000 )
and
(1000000/10)/25 = 4000  ( vs an actual ED of ~2.7 )

Furthermore, this is done without mcvs. Since mcv's would make the
histogram more sensitive to the edges, the estimates with mcv's should
be correspondingly better.
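A rough Python sketch of Idea 1 on this example (my own simplification: equal-frequency buckets cut from a sorted sample, MCVs not separated out; the exact numbers depend on the random draw and will not match the figures above precisely):

    import random
    from collections import Counter

    random.seed(1)
    N, N_BUCKETS = 1_000_000, 10
    # Heights ~ Normal(1.7526, 0.0762), measured to the nearest centimeter.
    heights = sorted(round(random.gauss(1.7526, 0.0762), 2) for _ in range(N))

    ndistinct = len(set(heights))
    flat_estimate = N / ndistinct            # roughly what eqsel does today (ignoring MCVs)

    # Equal-frequency histogram plus a per-bucket ndistinct (Idea 1).
    per = N // N_BUCKETS
    buckets = []                             # (low, high, ndistinct in bucket)
    for b in range(N_BUCKETS):
        chunk = heights[b * per:(b + 1) * per]
        buckets.append((chunk[0], chunk[-1], len(set(chunk))))

    def idea1_estimate(value):
        for low, high, nd in buckets:
            if low <= value <= high:
                return per / nd              # rows per bucket / distinct values per bucket
        return 0.0                           # outside the histogram range

    actual = Counter(heights)
    for v in (1.75, 2.05):
        print(f"value={v}: flat={flat_estimate:8.0f}  idea1={idea1_estimate(v):8.0f}  actual={actual[v]}")

On a typical run the per-bucket estimate tracks the common value far better than the flat one, and is closer (though still high) for the tail value.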


There are two potential problems that I see with this approach:

1) It assumes that = is equivalent to ( <= and >= ). This is certainly
true for real numbers, but is it true for every equality relation that
eqsel predicts for?

2) It bloats the stats table.

Idea 2: Keep a correlation statistic between ndistinct and bucket size

This addresses problem #2.

In lieu of keeping an actual list of ndistinct per histogram bucket,
we store the linear scaling coefficient between histogram bucket width
and ndistinct/(avg ndistinct). To visualize this, it is easiest to
consider plotting the bucket width versus ndistinct. The scaling
coefficient is the slope of the line that passes through the origin and
minimizes the square of the difference between its estimate for
ndistinct and the actual value.

When I apply this method to the above data I find a coefficient of
13.63 for an average ndistinct of 72/10. This provides selectivity
estimates, for the above two examples, of
(1000000/10)/( 13.63*7.2*(1.77 - 1.75) ) = 50950 ( vs an actual ED of ~52000 )
and
(1000000/10)/( 13.63*7.2*(2.12 - 1.85) ) = 3774  ( vs an actual ED of ~2.7 )
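The Idea 2 arithmetic, worked through in Python on the bucket boundaries and numdistinct values quoted above (the through-origin least-squares slope is the standard sum(w*y)/sum(w*w); nothing else is assumed):

    boundaries = [1.38, 1.66, 1.69, 1.71, 1.73, 1.75, 1.77, 1.79, 1.82, 1.85, 2.12]
    ndistinct = [29, 2, 2, 2, 2, 2, 2, 3, 3, 25]

    widths = [hi - lo for lo, hi in zip(boundaries, boundaries[1:])]
    avg_nd = sum(ndistinct) / len(ndistinct)        # 7.2
    y = [nd / avg_nd for nd in ndistinct]           # per-bucket ndistinct scaled by its average

    # Least-squares line forced through the origin:
    # minimize sum((y_i - c*w_i)^2)  =>  c = sum(w_i*y_i) / sum(w_i^2)
    c = sum(w * yi for w, yi in zip(widths, y)) / sum(w * w for w in widths)
    print(f"scaling coefficient: {c:.2f}")          # ~13.63

    rows, n_buckets = 1_000_000, 10
    def estimate(width):
        return (rows / n_buckets) / (c * avg_nd * width)

    print(f"1.75..1.77 bucket: {estimate(1.77 - 1.75):.0f}")   # ~50950
    print(f"1.85..2.12 bucket: {estimate(2.12 - 1.85):.0f}")   # ~3774

This reproduces the coefficient and the two estimates quoted above, up to rounding.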

Although this yields better results than idea 1 for this particular
example, it will be much more sensitive to weird distributions.

Obviously there are some special cases to consider: we wouldn't want
the stats to be skewed such that they provide really bad plans.
However, with some carefully designed caps I believe that we could
ensure that the estimates are at least as good as they are now. In
fact, I'm not certain that an R^2 penalty is the correct loss
function. Ideally, we want to minimize the extra time that the db
spends by choosing an incorrect plan. Maybe slight overestimations are
better than slight underestimations? Maybe the cost of the occasional
(really) bad plan is less than the cost of a bunch of kinda bad plans?

Finally, we aren't limited to just one coefficient. We could also
store multiple coefficients to improve our estimates, and provide a
compromise between ideas 1 and 2.

Food for future thought...

In addition to the previous benefits, I think that this method has the
potential to make the process by which MCVs are chosen (or not chosen)
smarter. Now the planner chooses a value to be an mcv candidate if
its frequency is greater than 1.25 * the average frequency. Given
that this improved selectivity estimate is implemented, maybe a better
way would be to include a value as an mcv if it's a) above a certain
threshold and b) a value for which the histogram selectivity estimator
does a poor job.

What are peoples thoughts on idea 1 vs idea 2?

Am I missing any relevant details about the planner's operation?

Do people think that the improved estimates would be worth the
additional overhead?

-Nathan


Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-08 Thread Tom Lane
Nathan Boley [EMAIL PROTECTED] writes:
 ... There are two potential problems that I see with this approach:

 1) It assumes that = is equivalent to ( <= and >= ). This is certainly
 true for real numbers, but is it true for every equality relation that
 eqsel predicts for?

The cases that compute_scalar_stats is used in have that property, since
the < and = operators are taken from the same btree opclass.

 Do people think that the improved estimates would be worth the
 additional overhead?

Your argument seems to consider only columns having a normal
distribution.  How badly does it fall apart for non-normal
distributions?  (For instance, Zipfian distributions seem to be pretty
common in database work, from what I've seen.)

regards, tom lane
