Re: [HACKERS] Group-count estimation statistics

2005-02-01 Thread Manfred Koizar
On Mon, 31 Jan 2005 14:40:08 -0500, Tom Lane [EMAIL PROTECTED] wrote:
> Manfred Koizar [EMAIL PROTECTED] writes:
>> That's not what I meant.  I tried to say that if we have a GROUP BY
>> several columns and one of these columns alone has more than N/10
>> distinct values, there's no way to get less than that many groups.
>
> Oh, I see, you want a max calculation in there too.  Seems reasonable.
> Any objections?

Yes.  :-(  What I said is only true in the absence of any WHERE clause
(or join).  Otherwise the same cross-column correlation issues you tried
to work around with the N/10 clamping might come back through the
backdoor.  I'm not sure whether coding for such a narrow use case is
worth the trouble.  Forget my idea.

Servus
 Manfred



Re: [HACKERS] Group-count estimation statistics

2005-02-01 Thread Tom Lane
Manfred Koizar [EMAIL PROTECTED] writes:
> On Mon, 31 Jan 2005 14:40:08 -0500, Tom Lane [EMAIL PROTECTED] wrote:
>> Oh, I see, you want a max calculation in there too.  Seems reasonable.
>> Any objections?

> Yes.  :-(  What I said is only true in the absence of any WHERE clause
> (or join).  Otherwise the same cross-column correlation issues you tried
> to work around with the N/10 clamping might come back through the
> backdoor.  I'm not sure whether coding for such a narrow use case is
> worth the trouble.  Forget my idea.

No, I think it's still good.  The WHERE clauses are factored in
separately (essentially by assuming their selectivity on the grouped
rows is the same as it would be on the raw rows, which is pretty bogus
but it's hard to do better).  The important point is that the group
count before WHERE filtering certainly does behave as you suggest,
and so the clamp is going to be overoptimistic if it clamps to less than
the largest individual number-of-distinct-values.
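
For illustration, a minimal sketch of the combined rule (naive independence
product, N/10 clamp, and the max floor discussed above).  The names
estimate_groups, ndistincts, and nvars are invented for the sketch; this is
not the actual planner code:

/*
 * Hypothetical sketch, not the actual planner code.  Each distinct value
 * of any single grouping column must yield at least one group before
 * WHERE filtering, so the largest per-column ndistinct is a hard lower
 * bound on the group-count estimate.
 */
static double
estimate_groups(const double *ndistincts, int nvars, double ntuples)
{
    double      reldistinct = 1.0;
    double      clamp = ntuples;
    double      maxdistinct = 0.0;
    int         i;

    for (i = 0; i < nvars; i++)
    {
        reldistinct *= ndistincts[i];   /* assumes independent columns */
        if (ndistincts[i] > maxdistinct)
            maxdistinct = ndistincts[i];
    }

    if (nvars > 1)
        clamp *= 0.1;                   /* fudge for cross-column correlation */
    if (reldistinct > clamp)
        reldistinct = clamp;

    if (reldistinct < maxdistinct)      /* the max calculation discussed here */
        reldistinct = maxdistinct;

    return reldistinct;
}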

regards, tom lane



Re: [HACKERS] Group-count estimation statistics

2005-01-31 Thread Manfred Koizar
On Fri, 28 Jan 2005 10:53:33 -0500, Tom Lane [EMAIL PROTECTED] wrote:
> we should consider
> something like clamp to size of table / 10 instead.

... unless a *single* grouping column is estimated to have more than
N/10 distinct values, which should be easy to check.

Servus
 Manfred



Re: [HACKERS] Group-count estimation statistics

2005-01-31 Thread Tom Lane
Manfred Koizar [EMAIL PROTECTED] writes:
> On Fri, 28 Jan 2005 10:53:33 -0500, Tom Lane [EMAIL PROTECTED] wrote:
>> we should consider
>> something like clamp to size of table / 10 instead.

> ... unless a *single* grouping column is estimated to have more than
> N/10 distinct values, which should be easy to check.

Already done that way.

/*
 * Clamp to size of rel, or size of rel / 10 if multiple Vars.
 * The fudge factor is because the Vars are probably correlated
 * but we don't know by how much.
 */
double      clamp = rel->tuples;

if (relvarcount > 1)
    clamp *= 0.1;
if (reldistinct > clamp)
    reldistinct = clamp;


regards, tom lane



Re: [HACKERS] Group-count estimation statistics

2005-01-31 Thread Manfred Koizar
On Mon, 31 Jan 2005 11:20:31 -0500, Tom Lane [EMAIL PROTECTED] wrote:
> Already done that way.
> if (relvarcount > 1)
>     clamp *= 0.1;

That's not what I meant.  I tried to say that if we have a GROUP BY
several columns and one of these columns alone has more than N/10
distinct values, there's no way to get less than that many groups.
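
(Concrete illustration with invented numbers: in a table of N = 1,000,000
rows, the clamp allows at most N/10 = 100,000 groups; but if we GROUP BY
(a, b) and column b alone has 400,000 distinct values, each of those values
must form at least one group, so the clamped estimate is low by at least a
factor of four.)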

Servus
 Manfred



Re: [HACKERS] Group-count estimation statistics

2005-01-31 Thread Tom Lane
Manfred Koizar [EMAIL PROTECTED] writes:
> That's not what I meant.  I tried to say that if we have a GROUP BY
> several columns and one of these columns alone has more than N/10
> distinct values, there's no way to get less than that many groups.

Oh, I see, you want a max calculation in there too.  Seems reasonable.
Any objections?

regards, tom lane



Re: [HACKERS] Group-count estimation statistics

2005-01-29 Thread Mischa
> From: Sailesh Krishnamurthy [EMAIL PROTECTED]
> >>>>> "Tom" == Tom Lane [EMAIL PROTECTED] writes:
>
>     Tom> The only real solution, of course, is to acquire cross-column
>     Tom> statistics, but I don't see that happening in the near
>     Tom> future.
>
> Another approach is a hybrid hashing scheme where we use a hash table
> until we run out of memory, at which time we start spilling to disk. In
> other words, no longer use SortAgg at all ..
>
> Under what circumstances will a SortAgg consume more IOs than a
> hybrid hash strategy?

Goetz Graefe did a heck of a lot of analysis of this, prior to his being snapped
up by Microsoft. He also worked out a lot of the nitty-gritty for hybrid hash
algorithms, extending the Grace hash for spill-to-disk, and adding a kind of
recursion for really huge sets. The figures say that hybrid hash beats
sort-aggregate, across the board. 
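
As a rough sketch of that spill-and-recurse scheme (toy standalone C that
counts duplicate int64 keys; the table size, fan-out, and level-salted hash
are arbitrary choices, not anything from PostgreSQL or Graefe's papers):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define TABLE_SLOTS 4096        /* stand-in for work_mem: max in-memory groups */
#define NPARTITIONS 8           /* spill fan-out per recursion level */

typedef struct { int64_t key; int64_t count; int used; } Slot;

/* Count duplicate int64 keys read from 'in', hybrid-hash style. */
static void
hash_aggregate(FILE *in, int level)
{
    Slot       *table = calloc(TABLE_SLOTS, sizeof(Slot));
    FILE       *part[NPARTITIONS] = {0};
    int64_t     key;
    int         nused = 0, i;

    while (fread(&key, sizeof(key), 1, in) == 1)
    {
        /* salt the hash with the recursion level so each level repartitions */
        uint64_t    h = (uint64_t) (key + level) * 2654435761u;
        int         s = h % TABLE_SLOTS;

        while (table[s].used && table[s].key != key)
            s = (s + 1) % TABLE_SLOTS;          /* linear probing */

        if (table[s].used)
            table[s].count++;                   /* group lives in memory */
        else if (nused < TABLE_SLOTS - 1)
        {
            table[s].used = 1;                  /* room for a new group */
            table[s].key = key;
            table[s].count = 1;
            nused++;
        }
        else
        {
            /* "memory" exhausted: spill this key to a disk partition */
            int         p = (h >> 16) % NPARTITIONS;

            if (part[p] == NULL)
                part[p] = tmpfile();
            fwrite(&key, sizeof(key), 1, part[p]);
        }
    }

    for (i = 0; i < TABLE_SLOTS; i++)           /* emit finished groups */
        if (table[i].used)
            printf("%lld: %lld\n", (long long) table[i].key,
                   (long long) table[i].count);
    free(table);

    for (i = 0; i < NPARTITIONS; i++)           /* Graefe-style recursion */
        if (part[i] != NULL)
        {
            rewind(part[i]);
            hash_aggregate(part[i], level + 1);
            fclose(part[i]);
        }
}

A key seen before the table fills keeps aggregating in memory; only keys
first seen after that point are spilled, so no group is split between
memory and disk.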




Re: [HACKERS] Group-count estimation statistics

2005-01-28 Thread Tom Lane
Greg Stark [EMAIL PROTECTED] writes:
> So why is it any more reasonable for Postgres to assume 0 correlation than any
> other value?  Perhaps Postgres should calculate these cases assuming some
> arbitrary level of correlation.

[ shrug... ]  Sure, if you want to do the legwork to develop something
credible.  But I think I'd still throw in the number-of-rows-over-10
clamp, or something much like it.

> As the total number of records
> goes up the expected number of distinct values should approach the total
> number of records, even if the number of distinct values of each column
> doesn't change.

Well, that's what I thought when I wrote the existing code, but it's
wrong: you don't GROUP BY unique combinations of columns over huge
tables --- or at least, you shouldn't expect great performance if you do.
It'd probably be more reasonable to use a heuristic that expects a
*smaller* fraction of distinct combinations, instead of a larger one,
as the table size goes up.
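
A purely illustrative shape for such a heuristic; the 0.75 exponent is an
arbitrary placeholder, not a proposed value:

#include <math.h>

/*
 * Illustrative only: cap the naive distinct-combination estimate at a
 * sublinear power of the row count, so that as the table grows the
 * estimated groups become an ever-smaller *fraction* of the rows.
 */
static double
clamp_groups_sublinear(double naive_product, double ntuples)
{
    double      cap = pow(ntuples, 0.75);

    return (naive_product < cap) ? naive_product : cap;
}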

> There's another possible solution, if Postgres kept statistics on the actual
> results of the query it could later use that feedback to come up with better
> guesses even if it doesn't know *why* they're better.

That's been proposed before but I think it's a blind alley.  In most
cases (certainly with anything as complex as a multiply grouped query)
you're not going to be able to derive any trustworthy corrections to
your original statistical estimates.  There are too many variables and
their relationships to the end costs are not simple.

regards, tom lane



Re: [HACKERS] Group-count estimation statistics

2005-01-28 Thread Sailesh Krishnamurthy
>>>>> "Tom" == Tom Lane [EMAIL PROTECTED] writes:

    Tom> The only real solution, of course, is to acquire cross-column
    Tom> statistics, but I don't see that happening in the near
    Tom> future.

Another approach is a hybrid hashing scheme where we use a hash table
until we run out of memory, at which time we start spilling to disk. In
other words, no longer use SortAgg at all ..

Under what circumstances will a SortAgg consume more IOs than a
hybrid hash strategy?

-- 
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh
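
A rough way to frame the IO question, using textbook cost formulas over N
data pages and M buffer pages (back-of-envelope assumptions, not
measurements of either executor path; assumes M > 2):

#include <math.h>

/* External merge sort: ~2N pages of IO per pass (write + read), with one
 * run-formation pass plus ceil(log_{M-1}(N/M)) merge passes. */
static double
sortagg_io(double n_pages, double m_buffers)
{
    double      merge_passes;

    if (n_pages <= m_buffers)
        return n_pages;                 /* fits: read once, sort in memory */
    merge_passes = ceil(log(n_pages / m_buffers) / log(m_buffers - 1.0));
    return 2.0 * n_pages * (1.0 + merge_passes);
}

/* One-level hybrid hash: read the input once; the fraction that does not
 * fit in memory is written to partitions and read back exactly once. */
static double
hybridhash_io(double n_pages, double m_buffers)
{
    double      spill_frac;

    if (n_pages <= m_buffers)
        return n_pages;                 /* fits: pure in-memory hash agg */
    spill_frac = (n_pages - m_buffers) / n_pages;
    return n_pages + 2.0 * n_pages * spill_frac;
}

Under these toy formulas the hybrid hash never costs more than about 3N
page IOs, while the sort adds roughly 2N per extra merge pass, which is
consistent with the across-the-board figures Mischa cites in his reply.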





Re: [HACKERS] Group-count estimation statistics

2005-01-28 Thread Greg Stark

Tom Lane [EMAIL PROTECTED] writes:

> Greg Stark's thought about a power correction seemed interesting too, though
> again far too optimistic to trust without some good math to back it up.

Fwiw, I'm pretty sure good math is not going to back up my off-the-cuff
algorithm. But I did like the answer it gave in this instance.

I'm told an actual solution to the problem is hard and probably not even
solvable in closed form. I'm still looking around but I suspect we would need
some pretty severe simplifying assumptions to make it work.

-- 
greg

