Re: [HACKERS] Group-count estimation statistics
On Mon, 31 Jan 2005 14:40:08 -0500, Tom Lane [EMAIL PROTECTED] wrote:
> Manfred Koizar [EMAIL PROTECTED] writes:
>> That's not what I meant. I tried to say that if we have a GROUP BY
>> over several columns and one of these columns alone has more than
>> N/10 distinct values, there's no way to get fewer than that many
>> groups.
>
> Oh, I see, you want a max calculation in there too. Seems reasonable.
> Any objections?

Yes. :-( What I said is only true in the absence of any WHERE clause
(or join). Otherwise the same cross-column correlation issues you
tried to work around with the N/10 clamping might come back through
the back door. I'm not sure whether coding for such a narrow use case
is worth the trouble. Forget my idea.

Servus
 Manfred

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if
your joining column's datatypes do not match
Re: [HACKERS] Group-count estimation statistics
Manfred Koizar [EMAIL PROTECTED] writes:
> On Mon, 31 Jan 2005 14:40:08 -0500, Tom Lane [EMAIL PROTECTED] wrote:
>> Oh, I see, you want a max calculation in there too. Seems reasonable.
>> Any objections?
>
> Yes. :-( What I said is only true in the absence of any WHERE clause
> (or join). Otherwise the same cross-column correlation issues you
> tried to work around with the N/10 clamping might come back through
> the back door. I'm not sure whether coding for such a narrow use case
> is worth the trouble. Forget my idea.

No, I think it's still good. The WHERE clauses are factored in
separately (essentially by assuming their selectivity on the grouped
rows is the same as it would be on the raw rows, which is pretty
bogus, but it's hard to do better). The important point is that the
group count before WHERE filtering certainly does behave as you
suggest, and so the clamp is going to be overoptimistic if it clamps
to less than the largest individual number-of-distinct-values.

			regards, tom lane
Re: [HACKERS] Group-count estimation statistics
On Fri, 28 Jan 2005 10:53:33 -0500, Tom Lane [EMAIL PROTECTED] wrote:
> we should consider something like "clamp to size of table / 10"
> instead.

... unless a *single* grouping column is estimated to have more than
N/10 distinct values, which should be easy to check.

Servus
 Manfred
Re: [HACKERS] Group-count estimation statistics
Manfred Koizar [EMAIL PROTECTED] writes:
> On Fri, 28 Jan 2005 10:53:33 -0500, Tom Lane [EMAIL PROTECTED] wrote:
>> we should consider something like "clamp to size of table / 10"
>> instead.
>
> ... unless a *single* grouping column is estimated to have more than
> N/10 distinct values, which should be easy to check.

Already done that way:

	/*
	 * Clamp to size of rel, or size of rel / 10 if multiple Vars.
	 * The fudge factor is because the Vars are probably correlated
	 * but we don't know by how much.
	 */
	double		clamp = rel->tuples;

	if (relvarcount > 1)
		clamp *= 0.1;
	if (reldistinct > clamp)
		reldistinct = clamp;

			regards, tom lane
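[A stand-alone rendering of the clamp above, for readers following along. This is illustrative only: the real planner pulls `rel->tuples` and the Var list out of its own data structures, and the parameter names here are stand-ins, not the actual PostgreSQL API.]

```c
/*
 * Illustrative stand-in for the planner's clamp: cap the estimated
 * number of distinct groups at the table size, or at a tenth of it
 * when more than one grouping Var is involved (the fudge factor for
 * presumed-correlated columns).
 */
static double
clamp_group_estimate(double reldistinct, double rel_tuples, int relvarcount)
{
    double clamp = rel_tuples;

    if (relvarcount > 1)
        clamp *= 0.1;           /* fudge factor for correlated Vars */
    if (reldistinct > clamp)
        reldistinct = clamp;
    return reldistinct;
}
```

For example, grouping a 1,000,000-row table by two columns whose per-column distinct counts multiply out to 5,000,000 combinations yields an estimate of 100,000 (tuples / 10).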
Re: [HACKERS] Group-count estimation statistics
On Mon, 31 Jan 2005 11:20:31 -0500, Tom Lane [EMAIL PROTECTED] wrote:
> Already done that way:
>
> 	if (relvarcount > 1)
> 		clamp *= 0.1;

That's not what I meant. I tried to say that if we have a GROUP BY
over several columns and one of these columns alone has more than N/10
distinct values, there's no way to get fewer than that many groups.

Servus
 Manfred
Re: [HACKERS] Group-count estimation statistics
Manfred Koizar [EMAIL PROTECTED] writes:
> That's not what I meant. I tried to say that if we have a GROUP BY
> over several columns and one of these columns alone has more than
> N/10 distinct values, there's no way to get fewer than that many
> groups.

Oh, I see, you want a max calculation in there too. Seems reasonable.
Any objections?

			regards, tom lane
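[A sketch of how the "max calculation" agreed on here might combine with the existing clamp: the multi-column estimate is capped at tuples / 10, but never pushed below the largest single-column distinct count, since GROUP BY a, b cannot produce fewer groups than GROUP BY a alone. The function and parameter names are hypothetical, not the committed PostgreSQL code.]

```c
/*
 * Sketch (not the actual planner code): clamp the multi-column group
 * estimate to tuples/10, but floor the clamp at the largest individual
 * column's distinct count, and never let it exceed the table size.
 */
static double
clamped_with_floor(double reldistinct, double max_single_ndistinct,
                   double rel_tuples, int relvarcount)
{
    double clamp = rel_tuples;

    if (relvarcount > 1)
    {
        clamp *= 0.1;
        if (clamp < max_single_ndistinct)   /* the proposed max calculation */
            clamp = max_single_ndistinct;
        if (clamp > rel_tuples)             /* still bounded by the table */
            clamp = rel_tuples;
    }
    if (reldistinct > clamp)
        reldistinct = clamp;
    return reldistinct;
}
```

With a 1,000,000-row table where one grouping column alone has 300,000 distinct values, the plain clamp would answer 100,000; the floor raises that to 300,000.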
Re: [HACKERS] Group-count estimation statistics
From: Sailesh Krishnamurthy [EMAIL PROTECTED]

> "Tom" == Tom Lane [EMAIL PROTECTED] writes:
>
>   Tom The only real solution, of course, is to acquire cross-column
>   Tom statistics, but I don't see that happening in the near
>   Tom future.
>
> Another approach is a hybrid hashing scheme where we use a hash table
> until we run out of memory, at which time we start spilling to disk.
> In other words, no longer use SortAgg at all ...
>
> Under what circumstances will a SortAgg consume more I/Os than a
> hybrid hash strategy?

Goetz Graefe did a heck of a lot of analysis of this, prior to his
being snapped up by Microsoft. He also worked out a lot of the
nitty-gritty for hybrid hash algorithms, extending the Grace hash for
spill-to-disk, and adding a kind of recursion for really huge sets.
The figures say that hybrid hash beats sort-aggregate across the
board.
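[The spill-and-recurse scheme described above can be sketched in miniature. This is a toy, not PostgreSQL's executor: the "work_mem" hash table is eight slots, partitions are plain arrays rather than temp files, and all the names are made up. It shows the shape of the idea: aggregate in memory while there is room, route overflow rows to hash partitions, then aggregate each partition in a later pass, recursing for really huge sets.]

```c
#include <stdlib.h>

#define TABLE_SLOTS 8   /* deliberately tiny stand-in for work_mem */
#define NPARTS      4

typedef struct { int key; long count; int used; } Slot;

/* Emitted groups are collected here so results can be inspected. */
static int  res_keys[1024];
static long res_counts[1024];
static int  res_n = 0;

static void
emit(int key, long count)
{
    res_keys[res_n] = key;
    res_counts[res_n] = count;
    res_n++;
}

static long
get_count(int key)
{
    for (int i = 0; i < res_n; i++)
        if (res_keys[i] == key)
            return res_counts[i];
    return 0;
}

/*
 * One pass of a toy hybrid hash aggregation (COUNT per integer key).
 * Groups live in a small in-memory hash table; once it fills, rows
 * whose key is not already resident are routed to hash partitions
 * (arrays here, temp files in a real executor), and each non-empty
 * partition is aggregated by a later, recursive pass.  Different key
 * bits choose the partition at each level, so a partition's keys
 * spread out again when it is reprocessed.
 */
static void
agg_pass(const int *keys, int n, int level)
{
    Slot table[TABLE_SLOTS] = {{0}};
    int *parts[NPARTS];
    int  psize[NPARTS] = {0};

    for (int p = 0; p < NPARTS; p++)
        parts[p] = malloc(n * sizeof(int));

    for (int i = 0; i < n; i++)
    {
        int key = keys[i];
        int h = (unsigned) key % TABLE_SLOTS;
        int probes = 0;

        /* linear probing: stop at the key, a free slot, or a full table */
        while (table[h].used && table[h].key != key && probes < TABLE_SLOTS)
        {
            h = (h + 1) % TABLE_SLOTS;
            probes++;
        }
        if (table[h].used && table[h].key == key)
            table[h].count++;               /* group already in memory */
        else if (!table[h].used)
        {
            table[h].used = 1;              /* start a new in-memory group */
            table[h].key = key;
            table[h].count = 1;
        }
        else
        {
            /* table full and key absent: spill the row to a partition */
            int p = ((unsigned) key >> (2 * (level % 16))) % NPARTS;
            parts[p][psize[p]++] = key;
        }
    }

    for (int i = 0; i < TABLE_SLOTS; i++)
        if (table[i].used)
            emit(table[i].key, table[i].count);

    for (int p = 0; p < NPARTS; p++)
    {
        if (psize[p] > 0)
            agg_pass(parts[p], psize[p], level + 1);   /* later pass */
        free(parts[p]);
    }
}
```

Because the probe sequence covers the whole table, any key that entered memory absorbs all of its later rows there, so each distinct key is emitted exactly once, either from the in-memory pass or from exactly one partition pass.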
Re: [HACKERS] Group-count estimation statistics
Greg Stark [EMAIL PROTECTED] writes:
> So why is it any more reasonable for Postgres to assume 0 correlation
> than any other value? Perhaps Postgres should calculate these cases
> assuming some arbitrary level of correlation.

[ shrug... ] Sure, if you want to do the legwork to develop something
credible. But I think I'd still throw in the number-of-rows-over-10
clamp, or something much like it.

> As the total number of records goes up, the expected number of
> distinct values should approach the total number of records, even if
> the number of distinct values of each column doesn't change.

Well, that's what I thought when I wrote the existing code, but it's
wrong: you don't GROUP BY unique combinations of columns over huge
tables --- or at least, you shouldn't expect great performance if you
do. It'd probably be more reasonable to use a heuristic that expects a
*smaller* fraction of distinct combinations, instead of a larger one,
as the table size goes up.

> There's another possible solution: if Postgres kept statistics on the
> actual results of the query, it could later use that feedback to come
> up with better guesses even if it doesn't know *why* they're better.

That's been proposed before, but I think it's a blind alley. In most
cases (certainly with anything as complex as a multiply grouped query)
you're not going to be able to derive any trustworthy corrections to
your original statistical estimates. There are too many variables, and
their relationships to the end costs are not simple.

			regards, tom lane
Re: [HACKERS] Group-count estimation statistics
> "Tom" == Tom Lane [EMAIL PROTECTED] writes:
>
>   Tom The only real solution, of course, is to acquire cross-column
>   Tom statistics, but I don't see that happening in the near
>   Tom future.

Another approach is a hybrid hashing scheme where we use a hash table
until we run out of memory, at which time we start spilling to disk.
In other words, no longer use SortAgg at all ...

Under what circumstances will a SortAgg consume more I/Os than a
hybrid hash strategy?

-- 
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh
Re: [HACKERS] Group-count estimation statistics
Tom Lane [EMAIL PROTECTED] writes:
> Greg Stark's thought about a power correction seemed interesting too,
> though again far too optimistic to trust without some good math to
> back it up.

Fwiw, I'm pretty sure good math is not going to back up my
off-the-cuff algorithm. But I did like the answer it gave in this
instance.

I'm told an actual solution to the problem is hard and probably not
even solvable in closed form. I'm still looking around, but I suspect
we would need some pretty severe simplifying assumptions to make it
work.

-- 
greg