Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-12 Thread Tom Lane
"Nathan Boley" <[EMAIL PROTECTED]> writes: > Given that, wouldn't it be smarter to consider a value as an mcv > candidate iff it has a density greater than 0.005, rather than having > a count greater than 1.5*average? Yeah, perhaps ... want to experiment with that? Though I'd be a bit worried abo

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-12 Thread Nathan Boley
> Assuming that the threshold > for switching to an indexscan is somewhere around selectivity 0.005 > (I am not certain offhand, but it's in that general area), this cannot > possibly require more than 200 MCV slots, and for most data > distributions it'd be a whole lot less. Thats a really good p

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-11 Thread Gregory Stark
"Tom Lane" <[EMAIL PROTECTED]> writes: > Gregory Stark <[EMAIL PROTECTED]> writes: >> It's possible that the second option I described -- teaching Append when to >> use something other than sum() -- would only work in the cases where >> constraint exclusion could be fixed though. In which case hav

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Andrew Dunstan
Tom Lane wrote: This gets back to the discussions at PGCon about needing to have a more explicit representation of partitioning. Right now, for a many-partition table we spend huge amounts of time deriving the expected behavior from first principles, each time we make a plan. And even then we

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Tom Lane
Gregory Stark <[EMAIL PROTECTED]> writes: > It's possible that the second option I described -- teaching Append when to > use something other than sum() -- would only work in the cases where > constraint exclusion could be fixed though. In which case having fractional > row counts might actually be

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Gregory Stark
"Tom Lane" <[EMAIL PROTECTED]> writes: > Gregory Stark <[EMAIL PROTECTED]> writes: >> The screw case I've seen is when you have a large partitioned table where >> constraint_exclusion fails to exclude the irrelevant partitions. You're going >> to get 0 rows from all but the one partition which con

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Tom Lane
Gregory Stark <[EMAIL PROTECTED]> writes: > The screw case I've seen is when you have a large partitioned table where > constraint_exclusion fails to exclude the irrelevant partitions. You're going > to get 0 rows from all but the one partition which contains the 1 row you're > looking for. But sin

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Gregory Stark
"Tom Lane" <[EMAIL PROTECTED]> writes: > (In fact, I don't think the plan would change, in this case. The reason > for the clamp to 1 row is to avoid foolish results for join situations.) The screw case I've seen is when you have a large partitioned table where constraint_exclusion fails to excl

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Tom Lane
"Nathan Boley" <[EMAIL PROTECTED]> writes: >> (1) On what grounds do you assert the above? > For a table with 100 non-mcv rows, the planner estimates a result > set of cardinality 100/20 = 5, not 1. The real problem in that situation is that you need another twenty slots in the MCV li

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Nathan Boley
>> Why? What if ( somehow ) we knew that there was a 90% chance that >> query would return an empty result set on a big table with 20 non-mcv >> distinct values. Currently the planner would always choose a seq scan, >> where an index scan might be better. > > (1) On what grounds do you assert the

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Tom Lane
"Nathan Boley" <[EMAIL PROTECTED]> writes: >>> If we query on values that aren't in the table, the planner will >>> always overestimate the expected number of returned rows because it ( >>> implicitly ) assumes that every query will return at least 1 record. >> >> That's intentional and should not

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Nathan Boley
>> If we query on values that aren't in the table, the planner will >> always overestimate the expected number of returned rows because it ( >> implicitly ) assumes that every query will return at least 1 record. > > That's intentional and should not be changed. Why? What if ( somehow ) we knew t

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Tom Lane
"Nathan Boley" <[EMAIL PROTECTED]> writes: > If we query on values that aren't in the table, the planner will > always overestimate the expected number of returned rows because it ( > implicitly ) assumes that every query will return at least 1 record. That's intentional and should not be changed.

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Nathan Boley
>> >> > One more problem with low ndistinct values is that the condition might >> >> > very well >> >> > hit no rows at all. But Idea 1 will largely overestimate the number of >> >> > hits. >> >> Thats a good point, but I don't see a clear solution. Maybe we could > > I think that MCVs are the so

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Jeff Davis
On Tue, 2008-06-10 at 08:51 -0700, Nathan Boley wrote: > >> > One more problem with low ndistinct values is that the condition might > >> > very well > >> > hit no rows at all. But Idea 1 will largely overestimate the number of > >> > hits. > > Thats a good point, but I don't see a clear solutio

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Nathan Boley
>> > One more problem with low ndistinct values is that the condition might >> > very well >> > hit no rows at all. But Idea 1 will largely overestimate the number of >> > hits. Thats a good point, but I don't see a clear solution. Maybe we could look at past queries and keep track of how often

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Gregory Stark
"Zeugswetter Andreas OSB sIT" <[EMAIL PROTECTED]> writes: > I think for low ndistinct values we will want to know the exact > value + counts and not a bin. So I think we will want additional stats rows > that represent "value 'a1' stats". Isn't that what our most frequent values list does? --

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Zeugswetter Andreas OSB sIT
> Obviously we run into problems when > a) we have a poor estimate for ndistinct - but then we have > worse problems > b) our length measure doesn't correspond well with ndistinct > in an interval One more problem with low ndistinct values is that the condition might very well hit no rows at all.

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-09 Thread Nathan Boley
> Your argument seems to consider only columns having a normal > distribution. My example was based upon normally distributed data because people usually know what they are and they are reasonably common. > How badly does it fall apart for non-normal distributions? This should work to the extent

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-09 Thread Jeff Davis
On Sun, 2008-06-08 at 19:03 -0400, Tom Lane wrote: > Your argument seems to consider only columns having a normal > distribution. How badly does it fall apart for non-normal > distributions? (For instance, Zipfian distributions seem to be pretty > common in database work, from what I've seen.) >

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-08 Thread Tom Lane
"Nathan Boley" <[EMAIL PROTECTED]> writes: > ... There are two potential problems that I see with this approach: > 1) It assumes the = is equivalent to <= and >= . This is certainly > true for real numbers, but is it true for every equality relation that > eqsel predicts for? The cases that compu

[HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-08 Thread Nathan Boley
Currently eqsel assumes that, except for the values stored as mcv's, the number of times that a value appears in a table column is independent of it's value. Unfortunately, this often yields inappropriate selectivity estimates and frequently leads to inappropriate plans. As an example, consider an