Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-12 Thread Nathan Boley
Assuming that the threshold for switching to an indexscan is somewhere around selectivity 0.005 (I am not certain offhand, but it's in that general area), this cannot possibly require more than 200 MCV slots, and for most data distributions it'd be a whole lot less. Thats a really good

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-12 Thread Tom Lane
Nathan Boley [EMAIL PROTECTED] writes: Given that, wouldn't it be smarter to consider a value as an mcv candidate iff it has a density greater than 0.005, rather than having a count greater than 1.5*average? Yeah, perhaps ... want to experiment with that? Though I'd be a bit worried about how

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-11 Thread Gregory Stark
Tom Lane [EMAIL PROTECTED] writes: Gregory Stark [EMAIL PROTECTED] writes: It's possible that the second option I described -- teaching Append when to use something other than sum() -- would only work in the cases where constraint exclusion could be fixed though. In which case having

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Zeugswetter Andreas OSB sIT
Obviously we run into problems when a) we have a poor estimate for ndistinct - but then we have worse problems b) our length measure doesn't correspond well with ndistinct in an interval One more problem with low ndistinct values is that the condition might very well hit no rows at all. But

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Gregory Stark
Zeugswetter Andreas OSB sIT [EMAIL PROTECTED] writes: I think for low ndistinct values we will want to know the exact value + counts and not a bin. So I think we will want additional stats rows that represent value 'a1' stats. Isn't that what our most frequent values list does? -- Gregory

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Nathan Boley
One more problem with low ndistinct values is that the condition might very well hit no rows at all. But Idea 1 will largely overestimate the number of hits. Thats a good point, but I don't see a clear solution. Maybe we could look at past queries and keep track of how often they return

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Jeff Davis
On Tue, 2008-06-10 at 08:51 -0700, Nathan Boley wrote: One more problem with low ndistinct values is that the condition might very well hit no rows at all. But Idea 1 will largely overestimate the number of hits. Thats a good point, but I don't see a clear solution. Maybe we could

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Nathan Boley
One more problem with low ndistinct values is that the condition might very well hit no rows at all. But Idea 1 will largely overestimate the number of hits. Thats a good point, but I don't see a clear solution. Maybe we could I think that MCVs are the solution, right? Only if

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Tom Lane
Nathan Boley [EMAIL PROTECTED] writes: If we query on values that aren't in the table, the planner will always overestimate the expected number of returned rows because it ( implicitly ) assumes that every query will return at least 1 record. That's intentional and should not be changed. I

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Nathan Boley
If we query on values that aren't in the table, the planner will always overestimate the expected number of returned rows because it ( implicitly ) assumes that every query will return at least 1 record. That's intentional and should not be changed. Why? What if ( somehow ) we knew that

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Tom Lane
Nathan Boley [EMAIL PROTECTED] writes: If we query on values that aren't in the table, the planner will always overestimate the expected number of returned rows because it ( implicitly ) assumes that every query will return at least 1 record. That's intentional and should not be changed.

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Nathan Boley
Why? What if ( somehow ) we knew that there was a 90% chance that query would return an empty result set on a big table with 20 non-mcv distinct values. Currently the planner would always choose a seq scan, where an index scan might be better. (1) On what grounds do you assert the above?

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Tom Lane
Nathan Boley [EMAIL PROTECTED] writes: (1) On what grounds do you assert the above? For a table with 100 non-mcv rows, the planner estimates a result set of cardinality 100/20 = 5, not 1. The real problem in that situation is that you need another twenty slots in the MCV list.

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Gregory Stark
Tom Lane [EMAIL PROTECTED] writes: (In fact, I don't think the plan would change, in this case. The reason for the clamp to 1 row is to avoid foolish results for join situations.) The screw case I've seen is when you have a large partitioned table where constraint_exclusion fails to exclude

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Tom Lane
Gregory Stark [EMAIL PROTECTED] writes: The screw case I've seen is when you have a large partitioned table where constraint_exclusion fails to exclude the irrelevant partitions. You're going to get 0 rows from all but the one partition which contains the 1 row you're looking for. But since

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Gregory Stark
Tom Lane [EMAIL PROTECTED] writes: Gregory Stark [EMAIL PROTECTED] writes: The screw case I've seen is when you have a large partitioned table where constraint_exclusion fails to exclude the irrelevant partitions. You're going to get 0 rows from all but the one partition which contains the 1

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Tom Lane
Gregory Stark [EMAIL PROTECTED] writes: It's possible that the second option I described -- teaching Append when to use something other than sum() -- would only work in the cases where constraint exclusion could be fixed though. In which case having fractional row counts might actually be

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Andrew Dunstan
Tom Lane wrote: This gets back to the discussions at PGCon about needing to have a more explicit representation of partitioning. Right now, for a many-partition table we spend huge amounts of time deriving the expected behavior from first principles, each time we make a plan. And even then

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-09 Thread Jeff Davis
On Sun, 2008-06-08 at 19:03 -0400, Tom Lane wrote: Your argument seems to consider only columns having a normal distribution. How badly does it fall apart for non-normal distributions? (For instance, Zipfian distributions seem to be pretty common in database work, from what I've seen.) If

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-09 Thread Nathan Boley
Your argument seems to consider only columns having a normal distribution. My example was based upon normally distributed data because people usually know what they are and they are reasonably common. How badly does it fall apart for non-normal distributions? This should work to the extent

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-08 Thread Tom Lane
Nathan Boley [EMAIL PROTECTED] writes: ... There are two potential problems that I see with this approach: 1) It assumes the = is equivalent to = and = . This is certainly true for real numbers, but is it true for every equality relation that eqsel predicts for? The cases that