Re: [HACKERS] improving GROUP BY estimation

2016-04-02 Thread Tom Lane
Dean Rasheed  writes:
> Here's an updated patch with references to both papers, and a more
> detailed justification for the formula, along with the other changes
> discussed. Note that although equation (2) in the Dell'Era paper looks
> different from the Yao formula, it's actually the same.

Looks good to me.

regards, tom lane




Re: [HACKERS] improving GROUP BY estimation

2016-04-02 Thread Dean Rasheed
On 31 March 2016 at 22:02, Tom Lane  wrote:
> I'm just concerned about what happens when the Dellera paper stops being
> available.  I don't mind including that URL as a backup to the written-out
> argument I just suggested.
>

Here's an updated patch with references to both papers, and a more
detailed justification for the formula, along with the other changes
discussed. Note that although equation (2) in the Dell'Era paper looks
different from the Yao formula, it's actually the same.
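
For anyone who wants to sanity-check that without digging out the papers, here
is a small throwaway C program (arbitrary N, n and p, not part of the patch)
that evaluates Yao's product form, the re-arranged product, and the
approximation used in the code side by side:

    #include <math.h>
    #include <stdio.h>

    int
    main(void)
    {
        double      N = 10000;      /* rows in the table */
        double      n = 100;        /* distinct values (N/n assumed integral) */
        double      p = 1000;       /* rows selected */
        double      prod1 = 1.0;    /* product((N-N/n-i)/(N-i), i=0..p-1) */
        double      prod2 = 1.0;    /* product((N-p-i)/(N-i), i=0..N/n-1) */
        int         i;

        for (i = 0; i < (int) p; i++)
            prod1 *= (N - N / n - i) / (N - i);
        for (i = 0; i < (int) (N / n); i++)
            prod2 *= (N - p - i) / (N - i);

        printf("Yao form:         %f\n", n * (1.0 - prod1));
        printf("re-arranged form: %f\n", n * (1.0 - prod2));
        printf("approximation:    %f\n", n * (1.0 - pow((N - p) / N, N / n)));
        return 0;
    }

The first two agree exactly, since they are the same product re-arranged, and
for these values the approximation agrees with them to three decimal places.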

Regards,
Dean
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index a6555e9..99f5f7c 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3439,9 +3439,51 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
             reldistinct = clamp;
 
             /*
-             * Multiply by restriction selectivity.
+             * Update the estimate based on the restriction selectivity,
+             * guarding against division by zero when reldistinct is zero.
+             * Also skip this if we know that we are returning all rows.
              */
-            reldistinct *= rel->rows / rel->tuples;
+            if (reldistinct > 0 && rel->rows < rel->tuples)
+            {
+                /*
+                 * Given a table containing N rows with n distinct values in a
+                 * uniform distribution, if we select p rows at random then
+                 * the expected number of distinct values selected is
+                 *
+                 * n * (1 - product((N-N/n-i)/(N-i), i=0..p-1))
+                 *
+                 * = n * (1 - (N-N/n)! / (N-N/n-p)! * (N-p)! / N!)
+                 *
+                 * See "Approximating block accesses in database
+                 * organizations", S. B. Yao, Communications of the ACM,
+                 * Volume 20 Issue 4, April 1977 Pages 260-261.
+                 *
+                 * Alternatively, re-arranging the terms from the factorials,
+                 * this may be written as
+                 *
+                 * n * (1 - product((N-p-i)/(N-i), i=0..N/n-1))
+                 *
+                 * This form of the formula is more efficient to compute in
+                 * the common case where p is larger than N/n.  Additionally,
+                 * as pointed out by Dell'Era, if i << N for all terms in the
+                 * product, it can be approximated by
+                 *
+                 * n * (1 - ((N-p)/N)^(N/n))
+                 *
+                 * See "Expected distinct values when selecting from a bag
+                 * without replacement", Alberto Dell'Era,
+                 * http://www.adellera.it/investigations/distinct_balls/.
+                 *
+                 * The condition i << N is equivalent to n >> 1, so this is a
+                 * good approximation when the number of distinct values in
+                 * the table is large.  It turns out that this formula also
+                 * works well even when n is small.
+                 */
+                reldistinct *=
+                    (1 - pow((rel->tuples - rel->rows) / rel->tuples,
+                             rel->tuples / reldistinct));
+            }
+            reldistinct = clamp_row_est(reldistinct);
 
             /*
              * Update estimate of total distinct groups.
diff --git a/src/test/regress/expected/subselect.out b/src/test/regress/expected/subselect.out
index de64ca7..0fc93d9 100644
--- a/src/test/regress/expected/subselect.out
+++ b/src/test/regress/expected/subselect.out
@@ -807,27 +807,24 @@ select * from int4_tbl where
 explain (verbose, costs off)
 select * from int4_tbl o where (f1, f1) in
   (select f1, generate_series(1,2) / 10 g from int4_tbl i group by f1);
-                               QUERY PLAN                               
------------------------------------------------------------------------
- Hash Join
+                            QUERY PLAN                            
+------------------------------------------------------------------
+ Hash Semi Join
    Output: o.f1
    Hash Cond: (o.f1 = "ANY_subquery".f1)
    ->  Seq Scan on public.int4_tbl o
         Output: o.f1
    ->  Hash
         Output: "ANY_subquery".f1, "ANY_subquery".g
-        ->  HashAggregate
+        ->  Subquery Scan on "ANY_subquery"
              Output: "ANY_subquery".f1, "ANY_subquery".g
-             Group Key: "ANY_subquery".f1, "ANY_subquery".g
-             ->  Subquery Scan on "ANY_subquery"
-                   Output: "ANY_subquery".f1, "ANY_subquery".g
-                   Filter: ("ANY_subquery".f1 = "ANY_subquery".g)
-                   ->  HashAggregate
-                         Output: i.f1, (generate_series(1, 2) / 10)
-                         Group Key: i.f1
-                         ->  Seq Scan on public.int4_tbl i
-                               Output: i.f1
-(18 rows)
+             Filter: ("ANY_subquery".f1 = "ANY_subquery".g)
+             ->  HashAggregate
+                   Output: i.f1, (generate_series(1, 2) / 10)
+                   Group Key: i.f1
+                   ->  Seq Scan on public.int4_tbl i
+                         Output: i.f1
+(15 rows)
 
 select * from int4_tbl o where (f1, f1) in
   (select f1, generate_series(1,2) / 10 g from int4_tbl i group by f1);
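
For readers skimming the hunk above, here is a standalone sketch of the
adjustment it introduces, using made-up row counts and a simplified stand-in
for clamp_row_est(); it is not part of the patch:

    #include <math.h>
    #include <stdio.h>

    /*
     * Standalone copy of the adjustment made above; clamp_row_est() is
     * approximated here by rounding and flooring the result at one.
     */
    static double
    adjusted_ndistinct(double reldistinct, double rows, double tuples)
    {
        if (reldistinct > 0 && rows < tuples)
            reldistinct *= (1 - pow((tuples - rows) / tuples,
                                    tuples / reldistinct));
        return (reldistinct < 1) ? 1 : rint(reldistinct);
    }

    int
    main(void)
    {
        /* 10000 distinct values, selecting 10000 of 1000000 rows */
        printf("%.0f\n", adjusted_ndistinct(10000, 10000, 1000000));    /* ~6340 */
        /* selecting every row leaves the raw ndistinct estimate unchanged */
        printf("%.0f\n", adjusted_ndistinct(10000, 1000000, 1000000));  /* 10000 */
        return 0;
    }

The old code would have estimated 10000 * (10000 / 1000000) = 100 distinct
values for the first case, which is the under-estimate this patch addresses.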



Re: [HACKERS] improving GROUP BY estimation

2016-03-31 Thread Tom Lane
Alvaro Herrera  writes:
> Tom Lane wrote:
>> See "Approximating block accesses in database organizations", S. B. Yao,
>> Communications of the ACM, Volume 20 Issue 4, April 1977 Pages 260-261

> That sounds nice all right, but I'm not sure it's actually helpful,
> because the article text is not available anywhere.  I doubt most people
> will spend 15 bucks to buy that paper ... so we don't actually know
> whether the paper supports the chosen formula :-)  unless you have a
> CACM subscription and can verify it.

Well, I do and I did, see my previous response.

> I think it's good to have the ACM reference anyway, for posterity, but
> it'd be good to (additionally) have something that people can read.

I'm just concerned about what happens when the Dellera paper stops being
available.  I don't mind including that URL as a backup to the written-out
argument I just suggested.

regards, tom lane




Re: [HACKERS] improving GROUP BY estimation

2016-03-31 Thread Tom Lane
Dean Rasheed  writes:
> Yeah, that makes sense. In fact, if we only apply the adjustment when
> reldistinct > 0 and rel->rows < rel->tuples, and rewrite the first
> argument to pow() as (rel->tuples - rel->rows) / rel->tuples, then it
> is guaranteed to be non-negative. If rel->rows >= rel->tuples (not
> sure if it can be greater), then we just want the original
> reldistinct.

Works for me.

regards, tom lane




Re: [HACKERS] improving GROUP BY estimation

2016-03-31 Thread Tom Lane
Dean Rasheed  writes:
> On 31 March 2016 at 21:40, Tom Lane  wrote:
>> Let's use something like this:
>> See "Approximating block accesses in database organizations", S. B. Yao,
>> Communications of the ACM, Volume 20 Issue 4, April 1977 Pages 260-261

Actually, having now looked at both the Dellera paper and the Yao one,
what the latter shows seems to be equivalent to Dellera's equation (2)
(the first one in his section 2.2).  But what the code is actually
using is the approximation in the second equation in 2.2, and that
one is certainly not in Yao, although it's not hard to see how you
get from the first to the second if you assume i << Nr.

I think it'd be worth writing out equation (2), attributing it to the Yao
paper, and then saying something like "Alberto Dellera points out that if
Nd is large, so that all the values of i are much less than Nr, then this
formula can be approximated by [the formula used in the code].  It turns
out that that formula also works well even when Nd is small."

regards, tom lane




Re: [HACKERS] improving GROUP BY estimation

2016-03-31 Thread Alvaro Herrera
Tom Lane wrote:

> > The article text refers to this 1977 S. B. Yao paper "Approximating
> > block accesses in database organizations" which doesn't appear to be
> > available online, except behind ACM's paywall at
> > http://dl.acm.org/citation.cfm?id=359475 
> 
> Well, a CACM citation is perfectly fine by my lights (especially one
> that's that far back and therefore certainly patent-free ...)
> 
> Let's use something like this:
> 
> See "Approximating block accesses in database organizations", S. B. Yao,
> Communications of the ACM, Volume 20 Issue 4, April 1977 Pages 260-261

That sounds nice all right, but I'm not sure it's actually helpful,
because the article text is not available anywhere.  I doubt most people
will spend 15 bucks to buy that paper ... so we don't actually know
whether the paper supports the chosen formula :-)  unless you have a
CACM subscription and can verify it.

I think it's good to have the ACM reference anyway, for posterity, but
it'd be good to (additionally) have something that people can read.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] improving GROUP BY estimation

2016-03-31 Thread Dean Rasheed
On 31 March 2016 at 21:40, Tom Lane  wrote:
> Alvaro Herrera  writes:
>> Tom Lane wrote:
>>> Another minor gripe is the use of a random URL as justification.  This
>>> code will still be around when that URL exists nowhere but the Wayback
>>> Machine.  Can't we find a more formal citation to use?
>
>> The article text refers to this 1977 S. B. Yao paper "Approximating
>> block accesses in database organizations" which doesn't appear to be
>> available online, except behind ACM's paywall at
>> http://dl.acm.org/citation.cfm?id=359475
>
> Well, a CACM citation is perfectly fine by my lights (especially one
> that's that far back and therefore certainly patent-free ...)
>
> Let's use something like this:
>
> See "Approximating block accesses in database organizations", S. B. Yao,
> Communications of the ACM, Volume 20 Issue 4, April 1977 Pages 260-261
>

Sounds good.

Regards,
Dean




Re: [HACKERS] improving GROUP BY estimation

2016-03-31 Thread Dean Rasheed
On 31 March 2016 at 20:18, Tom Lane  wrote:
> Also, I wonder if it'd be a good idea to provide a guard against division
> by zero --- we know rel->tuples > 0 at this point, but I'm less sure that
> reldistinct can't be zero.  In the same vein, I'm worried about the first
> argument of pow() being slightly negative due to roundoff error, leading
> to a NaN result.

Yeah, that makes sense. In fact, if we only apply the adjustment when
reldistinct > 0 and rel->rows < rel->tuples, and rewrite the first
argument to pow() as (rel->tuples - rel->rows) / rel->tuples, then it
is guaranteed to be non-negative. If rel->rows >= rel->tuples (not
sure if it can be greater), then we just want the original
reldistinct.
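
Purely as an illustration of those boundary conditions, a self-contained
sketch with invented numbers (not proposed code):

    #include <math.h>
    #include <stdio.h>

    int
    main(void)
    {
        double      tuples = 1000000;
        double      reldistinct = 50000;
        double      rows;

        for (rows = tuples / 4; rows <= tuples; rows += tuples / 4)
        {
            double      d = reldistinct;

            /*
             * Guarded form: with rows < tuples the pow() base
             * (tuples - rows) / tuples lies strictly between 0 and 1, so
             * roundoff can never hand pow() a negative base, and the
             * rows >= tuples case simply leaves the estimate untouched.
             */
            if (d > 0 && rows < tuples)
                d *= (1 - pow((tuples - rows) / tuples, tuples / d));
            printf("rows = %7.0f -> %.1f\n", rows, d);
        }
        return 0;
    }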

> Maybe we should also consider clamping the final reldistinct estimate to
> an integer with clamp_row_est().  The existing code doesn't do that but
> it seems like a good idea on general principles.

OK, that seems sensible.

Regards,
Dean




Re: [HACKERS] improving GROUP BY estimation

2016-03-31 Thread Tom Lane
Alvaro Herrera  writes:
> Tom Lane wrote:
>> Another minor gripe is the use of a random URL as justification.  This
>> code will still be around when that URL exists nowhere but the Wayback
>> Machine.  Can't we find a more formal citation to use?

> The article text refers to this 1977 S. B. Yao paper "Approximating
> block accesses in database organizations" which doesn't appear to be
> available online, except behind ACM's paywall at
> http://dl.acm.org/citation.cfm?id=359475 

Well, a CACM citation is perfectly fine by my lights (especially one
that's that far back and therefore certainly patent-free ...)

Let's use something like this:

See "Approximating block accesses in database organizations", S. B. Yao,
Communications of the ACM, Volume 20 Issue 4, April 1977 Pages 260-261

regards, tom lane




Re: [HACKERS] improving GROUP BY estimation

2016-03-31 Thread Alvaro Herrera
Tom Lane wrote:

> Another minor gripe is the use of a random URL as justification.  This
> code will still be around when that URL exists nowhere but the Wayback
> Machine.  Can't we find a more formal citation to use?

The article text refers to this 1977 S. B. Yao paper "Approximating
block accesses in database organizations" which doesn't appear to be
available online, except behind ACM's paywall at
http://dl.acm.org/citation.cfm?id=359475 

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] improving GROUP BY estimation

2016-03-31 Thread Tom Lane
Dean Rasheed  writes:
> On 30 March 2016 at 14:03, Tomas Vondra  wrote:
>> Attached is v4 of the patch

> Thanks, I think this is good to go, except that I think we need to use
> pow() rather than powl() because AIUI powl() is new to C99, and so
> won't necessarily be available on all supported platforms. I don't
> think we need worry about loss of precision, since that would only be
> an issue if rel->rows / rel->tuples were smaller than maybe 10^-14 or
> so, and it seems unlikely we'll get anywhere near that any time soon.

I took a quick look.  I concur with using pow() not powl(); the latter
is not in SUS v2 which is our baseline portability expectation, and in
fact there is *noplace* where we expect long double to work.  Moreover,
I don't believe that any of the estimates we're working with are so
accurate that a double-width power result would be a useful improvement.

Also, I wonder if it'd be a good idea to provide a guard against division
by zero --- we know rel->tuples > 0 at this point, but I'm less sure that
reldistinct can't be zero.  In the same vein, I'm worried about the first
argument of pow() being slightly negative due to roundoff error, leading
to a NaN result.

Maybe we should also consider clamping the final reldistinct estimate to
an integer with clamp_row_est().  The existing code doesn't do that but
it seems like a good idea on general principles.

Another minor gripe is the use of a random URL as justification.  This
code will still be around when that URL exists nowhere but the Wayback
Machine.  Can't we find a more formal citation to use?

regards, tom lane




Re: [HACKERS] improving GROUP BY estimation

2016-03-31 Thread Dean Rasheed
On 30 March 2016 at 14:03, Tomas Vondra  wrote:
> Attached is v4 of the patch

Thanks, I think this is good to go, except that I think we need to use
pow() rather than powl() because AIUI powl() is new to C99, and so
won't necessarily be available on all supported platforms. I don't
think we need worry about loss of precision, since that would only be
an issue if rel->rows / rel->tuples were smaller than maybe 10^-14 or
so, and it seems unlikely we'll get anywhere near that any time soon.
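
A quick throwaway check of that claim (not proposed for the patch) is to
compare the naive double computation against a log1p/expm1 formulation that
sidesteps the 1 - x cancellation:

    #include <math.h>
    #include <stdio.h>

    int
    main(void)
    {
        double      x = 10.0;       /* stands in for rel->tuples / reldistinct */
        int         i;

        for (i = 12; i <= 17; i++)
        {
            double      sel = pow(10.0, -(double) i);   /* rel->rows / rel->tuples */
            double      naive = 1.0 - pow(1.0 - sel, x);
            double      careful = -expm1(x * log1p(-sel));  /* avoids forming 1 - sel */

            printf("sel = 1e-%d  naive = %.6e  careful = %.6e\n",
                   i, naive, careful);
        }
        return 0;
    }

On a typical IEEE double setup the two columns only start to drift apart once
the selectivity gets down toward machine epsilon, which is consistent with the
10^-14 ballpark above.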

I think this is a good, well thought-out change, so unless anyone
objects I'll push it (probably this weekend).

Regards,
Dean




Re: [HACKERS] improving GROUP BY estimation

2016-03-30 Thread Tomas Vondra

Hi,

On 03/22/2016 03:40 PM, Alexander Korotkov wrote:


I think you should send a revision of patch including comments proposed
by Dean Rasheed.
I'm switching patch status to waiting on author in commitfest.



Attached is v4 of the patch - the only difference w.r.t. v3 is that I've 
used the comment as proposed by Dean.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


estimate-num-groups-v4.patch
Description: binary/octet-stream



Re: [HACKERS] improving GROUP BY estimation

2016-03-29 Thread David Steele

Hi Tomas,

On 3/22/16 10:40 AM, Alexander Korotkov wrote:


I think you should send a revision of patch including comments proposed
by Dean Rasheed.
I'm switching patch status to waiting on author in commitfest.


We're getting pretty close to the end of the CF.  Do you know when you 
will have a new patch ready?


Thanks,
--
-David
da...@pgmasters.net




Re: [HACKERS] improving GROUP BY estimation

2016-03-22 Thread Alexander Korotkov
Hi, Tomas!

On Mon, Mar 21, 2016 at 2:39 PM, Alexander Korotkov <
a.korot...@postgrespro.ru> wrote:

> On Fri, Mar 18, 2016 at 1:20 PM, Dean Rasheed 
> wrote:
>
>> Probably a better URL to give is
>> http://www.adellera.it/investigations/distinct_balls/ which has a link
>> to the PDF version of the paper and also some supporting material.
>>
>> However, while that paper is in general very clear, I don't think it
>> gives a very clear explanation of that particular formula, and it
>> doesn't explain what it represents. It merely says that if "i" can be
>> ignored "for some reason (e.g. i << Nr)", then that formula is an
>> approximation to the exact "without replacement" formula, which is the
>> subject of that paper.
>>
>> But actually, that formula is the exact formula for the expected
>> number of distinct values when selecting *with replacement* from a
>> collection where each of the distinct values occurs an equal number of
>> times. So I think we should say that.
>>
>> Perhaps something along the lines of:
>>
>> /*
>>  * Update the estimate based on the restriction selectivity.
>>  *
>>  * This uses the formula for the expected number of distinct
>> values
>>  * when selecting with replacement from a collection where
>> each of
>>  * the distinct values occurs an equal number of times (a
>> uniform
>>  * distribution of values). This is a very close
>> approximation to
>>  * the more correct method of selecting without replacement
>> when
>>  * the number of distinct values is quite large --- for
>> example,
>>  * see http://www.adellera.it/investigations/distinct_balls/.
>>  * Additionally, it also turns out to work very well even
>> when the
>>  * number of distinct values is small.
>>  */
>>
>
> +1
> Thank you for work on this patch. The formula you propose and explanation
> look great!
>

I think you should send a revision of patch including comments proposed by
Dean Rasheed.
I'm switching patch status to waiting on author in commitfest.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] improving GROUP BY estimation

2016-03-21 Thread Alexander Korotkov
Hi, Dean!

On Fri, Mar 18, 2016 at 1:20 PM, Dean Rasheed 
wrote:

> Probably a better URL to give is
> http://www.adellera.it/investigations/distinct_balls/ which has a link
> to the PDF version of the paper and also some supporting material.
>
> However, while that paper is in general very clear, I don't think it
> gives a very clear explanation of that particular formula, and it
> doesn't explain what it represents. It merely says that if "i" can be
> ignored "for some reason (e.g. i << Nr)", then that formula is an
> approximation to the exact "without replacement" formula, which is the
> subject of that paper.
>
> But actually, that formula is the exact formula for the expected
> number of distinct values when selecting *with replacement* from a
> collection where each of the distinct values occurs an equal number of
> times. So I think we should say that.
>
> Perhaps something along the lines of:
>
> /*
>  * Update the estimate based on the restriction selectivity.
>  *
>  * This uses the formula for the expected number of distinct
> values
>  * when selecting with replacement from a collection where
> each of
>  * the distinct values occurs an equal number of times (a
> uniform
>  * distribution of values). This is a very close approximation
> to
>  * the more correct method of selecting without replacement
> when
>  * the number of distinct values is quite large --- for
> example,
>  * see http://www.adellera.it/investigations/distinct_balls/.
>  * Additionally, it also turns out to work very well even when
> the
>  * number of distinct values is small.
>  */
>

+1
Thank you for work on this patch. The formula you propose and explanation
look great!

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] improving GROUP BY estimation

2016-03-19 Thread Dean Rasheed
On 18 March 2016 at 00:37, Tomas Vondra  wrote:
>> On Sun, 2016-03-13 at 15:24 +, Dean Rasheed wrote:
>>> I think that a better formula to use would be
>>>
>>> reldistinct *= (1 - powl(1 - rel->rows / rel->tuples, rel->tuples /
>>> reldistinct))
>
> Attached is a v3 of the patch using this formula instead of the original
> one. Interestingly, that apparently reduces the number of regression tests
> that get broken to a single one.
>

Cool.


> I'm not sure whether we need to provide a link to the PDF the formula comes
> from - perhaps we should?
>

Probably a better URL to give is
http://www.adellera.it/investigations/distinct_balls/ which has a link
to the PDF version of the paper and also some supporting material.

However, while that paper is in general very clear, I don't think it
gives a very clear explanation of that particular formula, and it
doesn't explain what it represents. It merely says that if "i" can be
ignored "for some reason (e.g. i << Nr)", then that formula is an
approximation to the exact "without replacement" formula, which is the
subject of that paper.

But actually, that formula is the exact formula for the expected
number of distinct values when selecting *with replacement* from a
collection where each of the distinct values occurs an equal number of
times. So I think we should say that.

Perhaps something along the lines of:

/*
 * Update the estimate based on the restriction selectivity.
 *
 * This uses the formula for the expected number of distinct values
 * when selecting with replacement from a collection where each of
 * the distinct values occurs an equal number of times (a uniform
 * distribution of values). This is a very close approximation to
 * the more correct method of selecting without replacement when
 * the number of distinct values is quite large --- for example,
 * see http://www.adellera.it/investigations/distinct_balls/.
 * Additionally, it also turns out to work very well even when the
 * number of distinct values is small.
 */

Regards,
Dean




Re: [HACKERS] improving GROUP BY estimation

2016-03-19 Thread Tomas Vondra

On 03/13/2016 11:09 PM, Tomas Vondra wrote:

Hi,

On Sun, 2016-03-13 at 15:24 +, Dean Rasheed wrote:

On 4 March 2016 at 13:10, Tomas Vondra 
wrote:



...
>>>


I think that a better formula to use would be

reldistinct *= (1 - powl(1 - rel->rows / rel->tuples, rel->tuples /
reldistinct))


Attached is a v3 of the patch using this formula instead of the original 
one. Interestingly, that apparently reduces the number of regression 
tests that get broken to a single one.


I'm not sure whether we need to provide a link to the PDF the formula 
comes from - perhaps we should?


I've also repeated the tests for the two tables (dependent and 
independent columns), comparing the actual number of groups and 
different estimates, and the results look like this (v3 is the formula 
used in this patch):



1) independent

              |   10 |   50 |  100 |  500 |  1000 |  5000
   -----------+------+------+------+------+-------+-------
    actual    |  919 | 3829 | 6244 | 9944 | 10001 | 10001
    current   |   10 |   50 |  102 |  516 |  1018 |  4996
    new (v1)  |  973 | 4001 | 6382 | 9897 |  9951 |  9951
    new (v3)  | 1117 | 3852 | 6229 | 9943 | 10004 | 10004


2) dependent

              |  10 |   50 |  100 |  500 |  1000 |  5000
   -----------+-----+------+------+------+-------+-------
    actual    |  10 |   50 |  100 |  500 |  1000 |  5000
    current   |  10 |   53 |  105 |  508 |  1016 |  5014
    new (v1)  | 880 | 4105 | 6472 | 9955 | 10018 | 10018
    new (v3)  | 807 | 3680 | 6050 | 9916 |  9983 |  9983

I only collected numbers for the new estimator, the other numbers are 
just a copy from the previous message. So there might be minor 
differences due to slightly different ndistinct estimates etc.


Anyway, the numbers are obviously quite close to the formula from v1 of 
the patch, plus the formula gives better estimates when scanning nearly 
all rows.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


estimate-num-groups-v3.patch
Description: binary/octet-stream



Re: [HACKERS] improving GROUP BY estimation

2016-03-13 Thread Tomas Vondra
Hi,

On Sun, 2016-03-13 at 15:24 +, Dean Rasheed wrote:
> On 4 March 2016 at 13:10, Tomas Vondra 
> wrote:
> > 
> > The risk associated with over-estimation is that we may switch from
> > HashAggregate to GroupAggregate, and generally select paths better
> > suited for large number of rows. Not great, because the plan may
> > not be
> > optimal and thus run much slower - but that usually only happens on
> > small tables, because on large tables we would probably end up
> > using
> > those same paths anyway.
> > 
> > With under-estimation, the main risks are much graver, as we may
> > end up
> > using HashAggregate only to get killed by OOM. When we're lucky not
> > to
> > hit that, in my experience this often leads to a cascade of NestedLoop
> > nodes processing orders of magnitude more tuples than expected.
> > Awful.
> > 
> > So I think the risk is asymmetric, and that's why I like the new
> > estimator more, because it tends to over-estimate, but in a very
> > reasonable way.
> > 
> Agreed. Under-estimating is worse than over-estimating.
> 
> 
> -   reldistinct *= rel->rows / rel->tuples;
> +   reldistinct *= (1 - powl((reldistinct - 1) / reldistinct, rel->rows));
> 
> Looking at this, I agree that this new formula from [1] is generally
> better than the old one in most (but not all) cases.
> 
> One problem with it is that it no longer takes into account
> rel->tuples, which means that when returning all the tuples from the
> table, it no longer gives an estimate equal to the total number of
> distinct values in the table. In fact it tends to underestimate when
> returning nearly all the rows from the table.

Yes, that's a good point.

> 
> The old formula is clearly not a good general-purpose estimate.
> However, there is one case where it does return an accurate estimate
> -- when all the values in the underlying table are distinct. In this
> case the new formula consistently produces underestimates, while the
> old formula works well. For example:
> 
> CREATE TABLE t AS SELECT (1 * random())::int a,
>  i::int b
> FROM generate_series(1,100) s(i);
> ANALYSE t;
> 
> EXPLAIN ANALYSE
> SELECT a FROM t GROUP BY a;
> 
> So there are clearly around 1 million distinct values for "a", but
> the new formula gives an estimate of around 630,000. That's not a
> massive underestimate, but it's sufficiently obvious and visible that
> it would be good to do better if we can.
> 
> 
> I think that a better formula to use would be
> 
> reldistinct *= (1 - powl(1 - rel->rows / rel->tuples, rel->tuples /
> reldistinct))
> 
> This comes from [2], which gives a general formula for selection
> without replacement, and then gives the above as an approximation to
> the uniform distribution case. It's not really made clear in that
> paper, but that approximation is actually the "with replacement"
> approximation, but starting from different initial assumptions to
> give a formula that has better boundary conditions and special-case
> behaviour, as described below.
> 
> ...
> 
> For most values of N and n, the approximation from [2] is almost
> indistinguishable from the exact formula, whereas the formula from
> [1] tends to underestimate when selecting a large number of distinct
> values (e.g., try setting n=90 in the plot above).

Yes, I agree that formula you propose seems to behave better.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: [HACKERS] improving GROUP BY estimation

2016-03-13 Thread Dean Rasheed
On 4 March 2016 at 13:10, Tomas Vondra  wrote:
> The risk associated with over-estimation is that we may switch from
> HashAggregate to GroupAggregate, and generally select paths better
> suited for large number of rows. Not great, because the plan may not be
> optimal and thus run much slower - but that usually only happens on
> small tables, because on large tables we would probably end up using
> those same paths anyway.
>
> With under-estimation, the main risks are much graver, as we may end up
> using HashAggregate only to get killed by OOM. When we're lucky not to
> hit that, in my experience this often leads to a cascade of NestedLoop
> nodes processing orders of magnitude more tuples than expected. Awful.
>
> So I think the risk is asymmetric, and that's why I like the new
> estimator more, because it tends to over-estimate, but in a very
> reasonable way.
>

Agreed. Under-estimating is worse than over-estimating.


-   reldistinct *= rel->rows / rel->tuples;
+   reldistinct *= (1 - powl((reldistinct - 1) / reldistinct, rel->rows));

Looking at this, I agree that this new formula from [1] is generally
better than the old one in most (but not all) cases.

One problem with it is that it no longer takes into account
rel->tuples, which means that when returning all the tuples from the
table, it no longer gives an estimate equal to the total number of
distinct values in the table. In fact it tends to underestimate when
returning nearly all the rows from the table.

The old formula is clearly not a good general-purpose estimate.
However, there is one case where it does return an accurate estimate
-- when all the values in the underlying table are distinct. In this
case the new formula consistently produces underestimates, while the
old formula works well. For example:

CREATE TABLE t AS SELECT (1 * random())::int a,
 i::int b
FROM generate_series(1,100) s(i);
ANALYSE t;

EXPLAIN ANALYSE
SELECT a FROM t GROUP BY a;

So there are clearly around 1 million distinct values for "a", but the
new formula gives an estimate of around 630,000. That's not a massive
underestimate, but it's sufficiently obvious and visible that it would
be good to do better if we can.


I think that a better formula to use would be

reldistinct *= (1 - powl(1 - rel->rows / rel->tuples, rel->tuples / reldistinct))

This comes from [2], which gives a general formula for selection
without replacement, and then gives the above as an approximation to
the uniform distribution case. It's not really made clear in that
paper, but that approximation is actually the "with replacement"
approximation, but starting from different initial assumptions to give
a formula that has better boundary conditions and special-case
behaviour, as described below.
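
To make the difference concrete, here is a throwaway C comparison of the three
formulas for the boundary case discussed above (every row distinct, selecting
the whole table, N = n = p = 1,000,000); the numbers are mine, not from the
patch:

    #include <math.h>
    #include <stdio.h>

    int
    main(void)
    {
        double      N = 1000000;    /* rows in the table */
        double      n = 1000000;    /* distinct values: every row distinct */
        double      p = 1000000;    /* rows selected: the whole table */

        /* old formula: scale ndistinct by the selectivity */
        printf("old:     %.0f\n", n * (p / N));
        /* formula from [1]: n * (1 - ((n-1)/n)^p), roughly 632000 here */
        printf("[1]:     %.0f\n", n * (1 - pow((n - 1) / n, p)));
        /* proposed formula from [2]: n * (1 - (1 - p/N)^(N/n)) */
        printf("[2]:     %.0f\n", n * (1 - pow(1 - p / N, N / n)));
        return 0;
    }

The formula from [1] lands close to the roughly 630,000 figure mentioned
above, while the old formula and the proposed formula from [2] both return the
exact answer in this particular boundary case.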


For the record, here's a quick analysis of how the 2 formulae come about:

Assume there are:
  N rows in the table
  n distinct values (n <= N)
  p rows are chosen at random (p <= N)

so the problem is to estimate how many distinct values there will be
in the p rows chosen. Both approaches work by first estimating the
probability that some particular value x is *not* chosen.

[1] starts from the assumption that each row of the table has a
probability of 1/n of being equal to x. So the probability that x is
not the first value chosen is

  P(not x, 1) = 1 - 1/n

Then, if the value chosen first is replaced, the probability that x is
not the second value chosen is the same. The "with replacement"
approximation makes each choice an independent probability, so the
overall probability that x is not in any of the p rows chosen is just
the product of the p individual probabilities, which is just

  P(not x, p) = (1 - 1/n)^p

Then of course the probability that x *is* chosen at least once in the p rows is

  P(x, p) = 1 - (1 - 1/n)^p

Finally, since there are n possible values of x, and they're all
equally likely in the table, the expected number of distinct values is

  D(p) = n * (1 - (1 - 1/n)^p)

The flaw in this approach is that for large p the "with replacement"
approximation gets worse and worse, and the probabilities P(x, p)
systematically under-estimate the actual probability that x is chosen,
which increases as more values not equal to x are chosen. By the time
the last value is chosen P(x, p=N) should actually be 1.


[2] starts from a different initial assumption (uniform distribution
case) -- for any given value x, assume that the table contains N/n
instances of x (ignoring the fact that that might not be an integer).
Note that with this assumption there is guaranteed to be at least one
instance of x, which is not the case with the above approach.

Now consider the first instance of x in the table. If we choose p rows
from the table, the probability that that first instance of x is not
chosen is

  P(not x[1], p) = 1 - p / N

Making the same "with replacement" simplification, the probability
that the 

Re: [HACKERS] improving GROUP BY estimation

2016-03-04 Thread Tomas Vondra
On Thu, 2016-03-03 at 11:42 -0800, Mark Dilger wrote:
> > On Mar 3, 2016, at 11:27 AM, Alexander Korotkov  
> > wrote:
> > 
> > On Thu, Mar 3, 2016 at 10:16 PM, Tomas Vondra 
> >  wrote:
> > So yes, each estimator works great for exactly the opposite cases. But 
> > notice that typically, the results of the new formula is much higher than 
> > the old one, sometimes by two orders of magnitude (and it shouldn't be 
> > difficult to construct examples of much higher differences).
> > 
> > The table also includes the 'average' estimator you propose, but it's 
> > rather obvious that the result is always much closer to the new value, 
> > simply because
> > 
> >    (small number) + (huge number)
> >    ------------------------------
> >                  2
> > 
> > is always much closer to the huge number. We're usually quite happy
> > when the estimates are within the same order of magnitude, so whether
> > it's K or K/2 makes pretty much no difference.
> > 
> > I believe that Mark means geometrical average, i.e. sqrt((small number) * 
> > (huge number)).

Ah, OK. Haven't realized you meant geometric mean. With that it looks
like this:

1) independent

              |     10 |     50 |    100 |    500 |   1000 |   5000
   -----------+--------+--------+--------+--------+--------+--------
   actual     |    919 |   3829 |   6244 |   9944 |  10001 |  10001
   current    |     10 |     50 |    102 |    516 |   1018 |   4996
   new        |    973 |   4001 |   6382 |   9897 |   9951 |   9951
   average    |    491 |   2025 |   3205 |   5206 |   5484 |   7473
   geom. mean |     99 |    447 |    807 |   2260 |   3183 |   7051

2) dependent

              |     10 |     50 |    100 |    500 |   1000 |   5000
   -----------+--------+--------+--------+--------+--------+--------
   actual     |     10 |     50 |    100 |    500 |   1000 |   5000
   current    |     10 |     53 |    105 |    508 |   1016 |   5014
   new        |    880 |   4105 |   6472 |   9955 |  10018 |  10018
   average    |    445 |   2079 |   3288 |   5231 |   5517 |   7516
   geom. mean |     94 |    466 |    824 |   2249 |   3190 |   7087

To better illustrate the differences, we may use the "ratio error"
defined as

err(a,b) = MAX(a/b, b/a)

where 'a' is the actual value and 'b' is the estimate. That gives us:

1) independent

              |    10 |    50 |   100 |   500 |  1000 |  5000
   -----------+-------+-------+-------+-------+-------+-------
   current    | 91.90 | 76.58 | 61.22 | 19.27 |  9.82 |  2.00
   new        |  1.06 |  1.04 |  1.02 |  1.00 |  1.01 |  1.01
   average    |  1.87 |  1.89 |  1.95 |  1.91 |  1.82 |  1.34
   geom. mean |  9.32 |  8.56 |  7.74 |  4.40 |  3.14 |  1.42

2) dependent

              |    10 |    50 |   100 |   500 |  1000 |  5000
   -----------+-------+-------+-------+-------+-------+-------
   current    |  1.00 |  1.06 |  1.05 |  1.02 |  1.02 |  1.00
   new        | 88.00 | 82.10 | 64.72 | 19.91 | 10.02 |  2.00
   average    | 44.50 | 41.58 | 32.88 | 10.46 |  5.52 |  1.50
   geom. mean |  9.38 |  9.33 |  8.24 |  4.50 |  3.19 |  1.42

So the geometric mean seems to be working better than plain average. But
I don't think we can simply conclude we should use the geometric mean of
the estimates, as it assumes the risks of over- and under-estimating the
cardinality are the same. Because they aren't.

> 
> Yes, that is what I proposed upthread.  I'm not wedded to that, though.
> In general, I am with Tomas on this one, believing that his estimate
> will be much better than the current estimate.  But I believe the *best*
> estimate will be somewhere between his and the current one, and I'm
> fishing for any decent way of coming up with a weighted average that
> is closer to his than to the current, but not simply equal to his proposal.
> 
> The reason I want the formula to be closer to Tomas's than to the
> current is that I think that on average, across all tables, across all
> databases, in practice it will be closer to the right estimate than the
> current formula.  That's just my intuition, and so I can't defend it.
> But if my intuition is right, the best formula we can adopt would be one
> that is moderated from his by a little bit, and in the direction of the
> estimate that the current code generates.

I think looking merely at what fraction of data sets (or queries) uses
dependent GROUP BY and WHERE clauses is not sufficient.

The general behavior of the two GROUP BY estimators is that the current
one tends to under-estimate, while the new one tends to over-estimate.
(It's not difficult to construct counter-examples by using more complex
dependencies between the columns etc. but let's ignore that.)

The risk associated with over-estimation is that we may switch from
HashAggregate to GroupAggregate, and generally 

Re: [HACKERS] improving GROUP BY estimation

2016-03-03 Thread Mark Dilger

> On Mar 3, 2016, at 11:27 AM, Alexander Korotkov  
> wrote:
> 
> On Thu, Mar 3, 2016 at 10:16 PM, Tomas Vondra  
> wrote:
> So yes, each estimator works great for exactly the opposite cases. But notice 
> that typically, the results of the new formula is much higher than the old 
> one, sometimes by two orders of magnitude (and it shouldn't be difficult to 
> construct examples of much higher differences).
> 
> The table also includes the 'average' estimator you propose, but it's rather 
> obvious that the result is always much closer to the new value, simply because
> 
>    (small number) + (huge number)
>    ------------------------------
>                  2
> 
> is always much closer to the huge number. We're usually quite happy when the 
> estimates are within the same order of magnitude, so whether it's K or K/2 
> makes pretty much no difference.
> 
> I believe that Mark means geometrical average, i.e. sqrt((small number) * 
> (huge number)).

Yes, that is what I proposed upthread.  I'm not wedded to that, though.
In general, I am with Tomas on this one, believing that his estimate
will be much better than the current estimate.  But I believe the *best*
estimate will be somewhere between his and the current one, and I'm
fishing for any decent way of coming up with a weighted average that
is closer to his than to the current, but not simply equal to his proposal.

The reason I want the formula to be closer to Tomas's than to the
current is that I think that on average, across all tables, across all
databases, in practice it will be closer to the right estimate than the
current formula.  That's just my intuition, and so I can't defend it.
But if my intuition is right, the best formula we can adopt would be one
that is moderated from his by a little bit, and in the direction of the
estimate that the current code generates.

I could easily lose this debate merely for lack of a principled basis
for saying how far toward the current estimate the new estimate should
be adjusted.  The geometric average is one suggestion, but I don't have
a principled argument for it.

Like I said above, I'm fishing for a decent formula here.

Mark Dilger



Re: [HACKERS] improving GROUP BY estimation

2016-03-03 Thread Alexander Korotkov
On Thu, Mar 3, 2016 at 10:16 PM, Tomas Vondra 
wrote:

> So yes, each estimator works great for exactly the opposite cases. But
> notice that typically, the results of the new formula is much higher than
> the old one, sometimes by two orders of magnitude (and it shouldn't be
> difficult to construct examples of much higher differences).
>
> The table also includes the 'average' estimator you propose, but it's
> rather obvious that the result is always much closer to the new value,
> simply because
>
>    (small number) + (huge number)
>    ------------------------------
>                  2
>
> is always much closer to the huge number. We're usually quite happy when
> the estimates are within the same order of magnitude, so whether it's K or
> K/2 makes pretty much no difference.


I believe that Mark means geometrical average, i.e. sqrt((small number) *
(huge number)).

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] improving GROUP BY estimation

2016-03-03 Thread Tomas Vondra

Hi,

On 03/03/2016 06:40 PM, Mark Dilger wrote:



On Mar 3, 2016, at 8:35 AM, Tomas Vondra  wrote:

Hi,

On 03/03/2016 12:53 PM, Alexander Korotkov wrote:

Hi, Tomas!

I've assigned to review this patch.

I've checked version estimate-num-groups-v2.txt by Mark Dilger.
It applies to head cleanly, passes corrected regression tests.

About correlated/uncorrelated cases. I think now optimizer mostly assumes
all distributions to be independent. I think we should follow this
assumption in this case also until we have fundamentally better option (like
your multivariate statistics).

@@ -3438,9 +3438,9 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs,
double input_rows,
reldistinct = clamp;

/*
-* Multiply by restriction selectivity.
+* Estimate number of distinct values expected in given number of rows.
*/
-reldistinct *= rel->rows / rel->tuples;
+reldistinct *= (1 - powl((reldistinct - 1) / reldistinct, rel->rows));

/*
* Update estimate of total distinct groups.

I think we need way more explanation in comments here (possibly with
external links). For now, it looks like formula which appears from nowhere.


I thought the details (particularly the link to stackexchange, discussing
the formula) would be enough, but let me elaborate.

The current formula

reldistinct *= rel->rows / rel->tuples;

simply multiplies the ndistinct estimate with selectivity. So when you
select 1% of the table, the estimate says we'll see 1% of ndistinct
values. But clearly that's naive, because for example when you have 10k
distinct values and you select 10M rows, you should expect nearly all of
them in the sample.


The current formula assumes total dependence between the columns.  You can
create tables where this is true, and the formula gives precisely the right
estimate.  I'm not claiming this is common, or that it should be the default
assumption, but it is one end of the spectrum of possible correct
estimates.


I'm not really sure what you mean by dependence between columns, as both the 
old and new formulas work only with total cardinality and selectivity, and have 
absolutely no idea about multiple columns.


In case you mean dependence between columns in the GROUP BY and columns used to 
compute the selectivity (i.e. referenced in WHERE), then perhaps you could say 
it like that, although I think there's no such explicit assumption.





And that's what the formula does - it gives you the expected number of
distinct values, when selecting 'k' values from 'd' distinct values with
replacements:

count(k, d) = d * (1 - ((d-1)/d) ^ k)

It's assuming the distinct values are equally probable (uniform
distribution) and that the probabilities do not change (that's the "with
replacement").


The new formula assumes total independence between the columns. You can
likewise create tables where this is true, and you did so upthread, and for
those tables it gives precisely the right estimate. This is the other end of
the spectrum of possible correct estimates.

In the absence of multivariate statistics, either because you haven't
committed that work yet, or because the multivariate data the planner needs
hasn't been collected, choosing an estimate exactly in the middle of the
spectrum (whatever that would mean mathematically) would minimize the
maximum possible error in the estimate.


FWIW the current version of multivariate statistics can't really help in this 
case. Perhaps it will help in the future, but it's way more complicated than it 
might seem, as it requires transferring information from the WHERE clause to the 
GROUP BY clause.




I have the sense that my point of view is in the minority, because the
expectation is the problems caused by independence assumptions will only be
fixed when multivariate statistics are implemented and available, and so we
should just keep the independence assumptions everywhere until that time. I
 tend to disagree with that, on the grounds that even when that work is
finished, we'll never have complete coverage of every possible set of
columns and what their degree of dependence is.


Well, in a sense you're right that those estimates work best in different 
situations. The problem with constructing an estimator the way you propose 
(simply returning an average) is that in both the extreme cases (perfect 
dependence or independence) one of those estimates is really bad. Moreover the 
new formula typically produces higher values than the old one,


Consider for example this:

CREATE TABLE t AS SELECT (1 * random())::int a,
 (1 * random())::int b
FROM generate_series(1,100) s(i);

and let's see estimates for queries like this:

SELECT a FROM t WHERE b < $1 GROUP BY a;

Then for different values of $1 we get this:

              |  10 |   50 |  100 |  500 |  1000 |  5000
   -----------+-----+------+------+------+-------+-------
   actual     | 919 | 3829 | 6244 | 9944 | 10001 | 

Re: [HACKERS] improving GROUP BY estimation

2016-03-03 Thread Alexander Korotkov
On Thu, Mar 3, 2016 at 7:35 PM, Tomas Vondra 
wrote:

> On 03/03/2016 12:53 PM, Alexander Korotkov wrote:
>
>> I've assigned to review this patch.
>>
>> I've checked version estimate-num-groups-v2.txt by Mark Dilger.
>> It applies to head cleanly, passes corrected regression tests.
>>
>> About correlated/uncorrelated cases. I think now optimizer mostly assumes
>> all distributions to be independent. I think we should follow this
>> assumption in this case also until we have fundamentally better option
>> (like
>> your multivariate statistics).
>>
>> @@ -3438,9 +3438,9 @@ estimate_num_groups(PlannerInfo *root, List
>> *groupExprs,
>> double input_rows,
>> reldistinct = clamp;
>>
>> /*
>> -* Multiply by restriction selectivity.
>> +* Estimate number of distinct values expected in given number of rows.
>> */
>> -reldistinct *= rel->rows / rel->tuples;
>> +reldistinct *= (1 - powl((reldistinct - 1) / reldistinct, rel->rows));
>>
>> /*
>> * Update estimate of total distinct groups.
>>
>> I think we need way more explanation in comments here (possibly with
>> external links). For now, it looks like formula which appears from
>> nowhere.
>>
>
> I thought the details (particularly the link to stackexchange, discussing
> the formula) would be enough, but let me elaborate.
>
> The current formula
>
> reldistinct *= rel->rows / rel->tuples;
>
> simply multiplies the ndistinct estimate with selectivity. So when you
> select 1% of the table, the estimate says we'll see 1% of ndistinct values.
> But clearly that's naive, because for example when you have 10k distinct
> values and you select 10M rows, you should expect nearly all of them in the
> sample.
>
> And that's what the formula does - it gives you the expected number of
> distinct values, when selecting 'k' values from 'd' distinct values with
> replacements:
>
> count(k, d) = d * (1 - ((d-1)/d) ^ k)
>
> It's assuming the distinct values are equally probable (uniform
> distribution) and that the probabilities do not change (that's the "with
> replacement").
>
> I guess we could relax those assumptions and for example use the MCV
> statistics to further improve the estimate, and also relax the 'with
> replacement' but that'd make the formula way more complicated.
>
> [1]
> http://math.stackexchange.com/questions/72223/finding-expected-number-of-distinct-values-selected-from-a-set-of-integers


Your explanation in the first mail was good enough. Now it's even better.
But actually I mean that we need to include some brief explanation in the
source code (either comments or a README). It would be good if someone who wants
to understand this could do so without searching the mailing list archives or git
history.


>> Also, note that rel->tuples has gone away from the formula parameters. So we
>> actually don't care about how many tuples are in the table. This is because
>> this formula assumes that the same tuple could be selected multiple times. For
>> low numbers this can lead to some errors. Consider selecting 2 from 3
>> distinct tuples. If you can't select the same tuple twice then you always
>> select 2 distinct tuples. But according to the formula you will select 5/3
>> on average. I think this error in small numbers is negligible, especially
>> because getting rid of this error would require way more computations. But
>> it's worth mentioning in the comments though.
>>
>
> Well, yeah. That's the consequence of 'with replacement' assumption.
>
> I guess we could handle this somehow, though. For example we can compute
> the minimum possible number of distinct values like this:
>
> average_rows_per_value = (tuples / ndistinct);
> min_estimate = ceil(rows / average_rows_per_value);
>
> and use that as a minimum for the estimate. In the example you've given
> this would say
>
> average_rows_per_value = (3 / 3) = 1;
> min_estimate = ceil(2 / 1) = 2;
>
> So it serves as a correction for the 'with replacement' assumption. With
> only 2 distinct values we'd get this:
>
> average_rows_per_value = (3 / 2) = 1.5;
> min_estimate = ceil(2 / 1.5) = 2;
>
> Of course, it's just a heuristic, so this may fail in some other cases.


I'm not sure we actually need it. Even in the worst case the error doesn't seem to
be very big. But feel free to add this heuristic; it looks OK to me.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] improving GROUP BY estimation

2016-03-03 Thread Mark Dilger

> On Mar 3, 2016, at 8:35 AM, Tomas Vondra  wrote:
> 
> Hi,
> 
> On 03/03/2016 12:53 PM, Alexander Korotkov wrote:
>> Hi, Tomas!
>> 
>> I've assigned to review this patch.
>> 
>> I've checked version estimate-num-groups-v2.txt by Mark Dilger.
>> It applies to head cleanly, passes corrected regression tests.
>> 
>> About correlated/uncorrelated cases. I think now optimizer mostly assumes
>> all distributions to be independent. I think we should follow this
>> assumption in this case also until we have fundamentally better option (like
>> your multivariate statistics).
>> 
>> @@ -3438,9 +3438,9 @@ estimate_num_groups(PlannerInfo *root, List 
>> *groupExprs,
>> double input_rows,
>> reldistinct = clamp;
>> 
>> /*
>> -* Multiply by restriction selectivity.
>> +* Estimate number of distinct values expected in given number of rows.
>> */
>> -reldistinct *= rel->rows / rel->tuples;
>> +reldistinct *= (1 - powl((reldistinct - 1) / reldistinct, rel->rows));
>> 
>> /*
>> * Update estimate of total distinct groups.
>> 
>> I think we need way more explanation in comments here (possibly with
>> external links). For now, it looks like formula which appears from nowhere.
> 
> I thought the details (particularly the link to stackexchange, discussing the 
> formula) would be enough, but let me elaborate.
> 
> The current formula
> 
>reldistinct *= rel->rows / rel->tuples;
> 
> simply multiplies the ndistinct estimate with selectivity. So when you select 
> 1% of the table, the estimate says we'll see 1% of ndistinct values. But 
> clearly that's naive, because for example when you have 10k distinct values 
> and you select 10M rows, you should expect nearly all of them in the sample.

The current formula assumes total dependence between the
columns.  You can create tables where this is true, and the
formula gives precisely the right estimate.  I'm not claiming this
is common, or that it should be the default assumption, but it
is one end of the spectrum of possible correct estimates.

> And that's what the formula does - it gives you the expected number of 
> distinct values, when selecting 'k' values from 'd' distinct values with 
> replacements:
> 
>count(k, d) = d * (1 - ((d-1)/d) ^ k)
> 
> It's assuming the distinct values are equally probable (uniform distribution) 
> and that the probabilities do not change (that's the "with replacement").

The new formula assumes total independence between the
columns.  You can likewise create tables where this is true,
and you did so upthread, and for those tables it gives precisely
the right estimate.  This is the other end of the spectrum of
possible correct estimates.

In the absence of multivariate statistics, either because you
haven't committed that work yet, or because the multivariate
data the planner needs hasn't been collected, choosing an
estimate exactly in the middle of the spectrum (whatever
that would mean mathematically) would minimize the
maximum possible error in the estimate.

I have the sense that my point of view is in the minority, because
the expectation is the problems caused by independence
assumptions will only be fixed when multivariate statistics are
implemented and available, and so we should just keep the
independence assumptions everywhere until that time.  I
tend to disagree with that, on the grounds that even when that
work is finished, we'll never have complete coverage of every
possible set of columns and what their degree of dependence
is.

Perhaps I am badly mistaken.

Can somebody more familiar with the issues than I am please
disabuse me of my wrongheadedness?

Mark Dilger





Re: [HACKERS] improving GROUP BY estimation

2016-03-03 Thread Tomas Vondra

Hi,

On 03/03/2016 12:53 PM, Alexander Korotkov wrote:

Hi, Tomas!

I've assigned to review this patch.

I've checked version estimate-num-groups-v2.txt by Mark Dilger.
It applies to head cleanly, passes corrected regression tests.

About correlated/uncorrelated cases. I think now optimizer mostly assumes
all distributions to be independent. I think we should follow this
assumption in this case also until we have fundamentally better option (like
your multivariate statistics).

@@ -3438,9 +3438,9 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs,
double input_rows,
reldistinct = clamp;

/*
-* Multiply by restriction selectivity.
+* Estimate number of distinct values expected in given number of rows.
*/
-reldistinct *= rel->rows / rel->tuples;
+reldistinct *= (1 - powl((reldistinct - 1) / reldistinct, rel->rows));

/*
* Update estimate of total distinct groups.

I think we need way more explanation in comments here (possibly with
external links). For now, it looks like formula which appears from nowhere.


I thought the details (particularly the link to stackexchange, discussing the 
formula) would be enough, but let me elaborate.


The current formula

reldistinct *= rel->rows / rel->tuples;

simply multiplies the ndistinct estimate with selectivity. So when you select 
1% of the table, the estimate says we'll see 1% of ndistinct values. But 
clearly that's naive, because for example when you have 10k distinct values and 
you select 10M rows, you should expect nearly all of them in the sample.


And that's what the formula does - it gives you the expected number of distinct 
values, when selecting 'k' values from 'd' distinct values with replacements:


count(k, d) = d * (1 - ((d-1)/d) ^ k)

It's assuming the distinct values are equally probable (uniform distribution) 
and that the probabilities do not change (that's the "with replacement").
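
Just to illustrate (this is not part of the patch, and the function name is
made up), here's a minimal stand-alone C sketch of the formula:

    #include <math.h>
    #include <stdio.h>

    /* Expected number of distinct values seen when drawing k values,
     * with replacement, from d equally likely distinct values. */
    static double
    expected_distinct(double d, double k)
    {
        return d * (1.0 - pow((d - 1.0) / d, k));
    }

    int
    main(void)
    {
        /* 10k distinct values, selecting 10M rows: we expect to see
         * essentially all of the distinct values ... */
        printf("%.1f\n", expected_distinct(10000, 10000000));  /* ~10000.0 */

        /* ... while selecting only 100 rows gives roughly 100 of them,
         * close to what plain selectivity scaling would say. */
        printf("%.1f\n", expected_distinct(10000, 100));       /* ~99.5 */

        return 0;
    }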


I guess we could relax those assumptions and for example use the MCV statistics 
to further improve the estimate, and also relax the 'with replacement'
assumption, but that'd make the formula way more complicated.


[1] 
http://math.stackexchange.com/questions/72223/finding-expected-number-of-distinct-values-selected-from-a-set-of-integers




Also, note that rel->tuples has gone away from the formula's parameters. So we
actually don't care how many tuples are in the table. This is because the
formula assumes that the same tuple could be selected multiple times. For
low numbers this can lead to some errors. Consider selecting 2 from 3
distinct tuples. If you can't select the same tuple twice, then you always
select 2 distinct tuples. But according to the formula you will select 5/3 on
average. I think this error for small numbers is negligible, especially
because getting rid of it would require way more computation. But it's
worth mentioning in the comments though.


Well, yeah. That's a consequence of the 'with replacement' assumption.

I guess we could handle this somehow, though. For example we can compute the 
minimum possible number of distinct values like this:


average_rows_per_value = (tuples / ndistinct);
min_estimate = ceil(rows / average_rows_per_value);

and use that as a minimum for the estimate. In the example you've given this 
would say


average_rows_per_value = (3 / 3) = 1;
min_estimate = ceil(2 / 1) = 2;

So it serves as a correction for the 'with replacement' assumption. With only 2 
distinct values we'd get this:


average_rows_per_value = (3 / 2) = 1.5;
min_estimate = ceil(2 / 1.5) = 2;

Of course, it's just a heuristic, so it may fail in some other cases.
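
A minimal stand-alone sketch of that clamp (illustrative only, the function
name is made up):

    #include <math.h>

    /* Clamp the "with replacement" estimate by the minimum number of
     * distinct values that 'rows' rows can possibly contain, assuming
     * each value occupies tuples/ndistinct rows on average. */
    static double
    estimate_distinct_clamped(double ndistinct, double tuples, double rows)
    {
        double  estimate = ndistinct *
            (1.0 - pow((ndistinct - 1.0) / ndistinct, rows));
        double  avg_rows_per_value = tuples / ndistinct;
        double  min_estimate = ceil(rows / avg_rows_per_value);

        return (estimate < min_estimate) ? min_estimate : estimate;
    }

With ndistinct=3, tuples=3, rows=2 this returns 2 instead of 5/3, and with
ndistinct=2, tuples=3, rows=2 it also returns 2, matching the numbers above.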

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] improving GROUP BY estimation

2016-03-03 Thread Alexander Korotkov
Hi, Tomas!

I've signed up to review this patch.

I've checked version estimate-num-groups-v2.txt by Mark Dilger.
It applies to head cleanly and passes the corrected regression tests.

About correlated/uncorrelated cases. I think the optimizer currently mostly
assumes all distributions to be independent.
I think we should follow this assumption in this case as well, until we have
a fundamentally better option (like your multivariate statistics).

@@ -3438,9 +3438,9 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
  reldistinct = clamp;

  /*
- * Multiply by restriction selectivity.
+ * Estimate number of distinct values expected in given number of rows.
  */
- reldistinct *= rel->rows / rel->tuples;
+ reldistinct *= (1 - powl((reldistinct - 1) / reldistinct, rel->rows));

  /*
  * Update estimate of total distinct groups.

I think we need way more explanation in comments here (possibly with
external links). For now, it looks like a formula which appears out of nowhere.

Also, note that rel->tuples has gone away from the formula's parameters. So we
actually don't care how many tuples are in the table. This is because the
formula assumes that the same tuple could be selected multiple times. For
low numbers this can lead to some errors. Consider selecting 2 from 3
distinct tuples. If you can't select the same tuple twice, then you always
select 2 distinct tuples. But according to the formula you will select 5/3
on average. I think this error for small numbers is negligible, especially
because getting rid of it would require way more computation. But it's
worth mentioning in the comments though.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] improving GROUP BY estimation

2016-02-25 Thread Tomas Vondra

Hi,

On 02/26/2016 04:32 AM, Mark Dilger wrote:



On Feb 25, 2016, at 4:59 PM, Tomas Vondra  wrote:


...


The culprit here is that the two columns are not independent, but
are (rather strongly) correlated due to the way you've generated
the data.


Yes, that was intentional. Your formula is exactly correct, so far as
I can tell, for completely uncorrelated data. I don't have any tables
with completely uncorrelated data, and was therefore interested in
what happens when the data is correlated and your patch is applied.

BTW, the whole reason I responded to your post is that I think I would like
to have your changes in the code base.  I'm just playing Devil's Advocate
here, to see if there is room for any improvement.


Sure, that's how I understood it. I just wanted to point out the 
correlation, as that might not have been obvious to others.



Thanks for the patch submission. I hope my effort to review it is on
the whole more helpful than harmful.


Thanks for the feedback!

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] improving GROUP BY estimation

2016-02-25 Thread Mark Dilger

> On Feb 25, 2016, at 4:59 PM, Tomas Vondra  
> wrote:
> 
> Hi,
> 
> On 02/26/2016 12:16 AM, Mark Dilger wrote:
>> 
>> It is not hard to write test cases where your patched version overestimates
>> the number of rows by a very similar factor as the old code underestimates
>> them.  My very first test, which was not specifically designed to demonstrate
>> this, happens to be one such example:
>> 
>> 
>> CREATE TABLE t (a INT, b int);
>> INSERT INTO t SELECT sqrt(gs)::int, gs FROM generate_series(1,10000000) gs;
>> ANALYZE t;
>> EXPLAIN SELECT a FROM t WHERE b < 1000 GROUP BY a;
>>   QUERY PLAN
>> ---
>>  HashAggregate  (cost=169250.21..169258.71 rows=850 width=4)
>>Group Key: a
>>->  Seq Scan on t  (cost=0.00..169247.71 rows=1000 width=4)
>>  Filter: (b < 1000)
>> (4 rows)
>> 
>> SELECT COUNT(*) FROM (SELECT a FROM t WHERE b < 1000 GROUP BY a) AS ss;
>>  count
>> ---
>> 32
>> (1 row)
>> 
>> 
>> 
>> So, it estimates 850 rows where only 32 are returned. Without
>> applying your patch, it estimates just 1 row where 32 are returned.
>> That's an overestimate of roughly 26 times, rather than an
>> underestimate of 32 times.
> 
> The culprit here is that the two columns are not independent, but are (rather 
> strongly) correlated due to the way you've generated the data.

Yes, that was intentional.  Your formula is exactly correct, so far as I can tell,
for completely uncorrelated data.  I don't have any tables with completely
uncorrelated data, and was therefore interested in what happens when the
data is correlated and your patch is applied.

BTW, the whole reason I responded to your post is that I think I would like
to have your changes in the code base.  I'm just playing Devil's Advocate
here, to see if there is room for any improvement.

> In cases like this (with correlated columns), it's mostly just luck how 
> accurate estimates you get, no matter which of the estimators you use. It's 
> pretty easy to generate arbitrary errors by breaking the independence 
> assumption, and it's not just this particular place of the planner. And it 
> won't change until we add some sort of smartness about dependencies between 
> columns.
> 
> I think over-estimates are a bit more acceptable in this case, e.g. because 
> of the HashAggregate OOM risk. Also, I've seen too many nested loop cascades 
> due to under-estimates recently, but that's a bit unrelated.

I have seen similar problems in systems I maintain, hence my interest
in your patch.


>> As a patch review, I'd say that your patch does what you claim it
>> does, and it applies cleanly, and passes the regression tests with
>> my included modifications. I think there needs to be some discussion
on the list about whether the patch is a good idea.
> 
> Yes, that's why I haven't bothered with fixing the regression tests in the 
> patch, actually.

Right, but for the group as a whole, I thought it might generate some
feedback if people saw what else changed in the regression results.
If you look, the changes have to do with plans chosen and row estimates --
exactly the sort of stuff you would expect to change.

Thanks for the patch submission.  I hope my effort to review it is on the
whole more helpful than harmful.

Mark Dilger

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] improving GROUP BY estimation

2016-02-25 Thread Tomas Vondra

Hi,

On 02/26/2016 12:16 AM, Mark Dilger wrote:


It is not hard to write test cases where your patched version overestimates
the number of rows by a very similar factor as the old code underestimates
them.  My very first test, which was not specifically designed to demonstrate
this, happens to be one such example:


CREATE TABLE t (a INT, b int);
INSERT INTO t SELECT sqrt(gs)::int, gs FROM generate_series(1,10000000) gs;
ANALYZE t;
EXPLAIN SELECT a FROM t WHERE b < 1000 GROUP BY a;
   QUERY PLAN
---
  HashAggregate  (cost=169250.21..169258.71 rows=850 width=4)
Group Key: a
->  Seq Scan on t  (cost=0.00..169247.71 rows=1000 width=4)
  Filter: (b < 1000)
(4 rows)

SELECT COUNT(*) FROM (SELECT a FROM t WHERE b < 1000 GROUP BY a) AS ss;
  count
---
 32
(1 row)



So, it estimates 850 rows where only 32 are returned. Without
applying your patch, it estimates just 1 row where 32 are returned.
That's an overestimate of roughly 26 times, rather than an
underestimate of 32 times.


The culprit here is that the two columns are not independent, but are 
(rather strongly) correlated due to the way you've generated the data.


In cases like this (with correlated columns), it's mostly just luck how 
accurate estimates you get, no matter which of the estimators you use. 
It's pretty easy to generate arbitrary errors by breaking the 
independence assumption, and it's not just this particular place of the 
planner. And it won't change until we add some sort of smartness about 
dependencies between columns.


I think over-estimates are a bit more acceptable in this case, e.g. 
because of the HashAggregate OOM risk. Also, I've seen too many nested 
loop cascades due to under-estimates recently, but that's a bit unrelated.



As a patch review, I'd say that your patch does what you claim it
does, and it applies cleanly, and passes the regression tests with
my included modifications. I think there needs to be some discussion
on the list about whether the patch is a good idea.


Yes, that's why I haven't bothered with fixing the regression tests in 
the patch, actually.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] improving GROUP BY estimation

2016-02-25 Thread Mark Dilger

> On Feb 25, 2016, at 3:16 PM, Mark Dilger  wrote:
> 
> 
>> On Feb 23, 2016, at 5:12 PM, Tomas Vondra  
>> wrote:
>> 
>> 
>> 
>> So much better. Clearly, there are cases where this will over-estimate the 
>> cardinality - for example when the values are somehow correlated.
>> 
> 
> I applied your patch, which caused a few regression tests to fail.  Attached
> is a patch that includes the necessary changes to the expected test results.
> 
> It is not hard to write test cases where your patched version overestimates
> the number of rows by a very similar factor as the old code underestimates
> them.  My very first test, which was not specifically designed to demonstrate
> this, happens to be one such example:
> 
> 
> CREATE TABLE t (a INT, b int);
> INSERT INTO t SELECT sqrt(gs)::int, gs FROM generate_series(1,10000000) gs;
> ANALYZE t;
> EXPLAIN SELECT a FROM t WHERE b < 1000 GROUP BY a;
>  QUERY PLAN
> ---
> HashAggregate  (cost=169250.21..169258.71 rows=850 width=4)
>   Group Key: a
>   ->  Seq Scan on t  (cost=0.00..169247.71 rows=1000 width=4)
> Filter: (b < 1000)
> (4 rows)
> 
> SELECT COUNT(*) FROM (SELECT a FROM t WHERE b < 1000 GROUP BY a) AS ss;
> count 
> --- 
>32
> (1 row)
> 
> 
> 
> So, it estimates 850 rows where only 32 are returned.  Without applying your
> patch, it estimates just 1 row where 32 are returned.  That's an overestimate
> of roughly 26 times, rather than an underestimate of 32 times.
> 
> As a patch review, I'd say that your patch does what you claim it does, and
> it applies cleanly, and passes the regression tests with my included
> modifications.  I think there needs to be some discussion on the list about
> whether the patch is a good idea.
> 
> Mark Dilger
> 
> 
> 

Assuming the goal is to minimize the degree to which the estimates are
inaccurate, I get better results by a kind of averaging of the old values from
the current code base and the new values from Tomas's patch.  I tried this and,
at least for the few examples we've been throwing around, I found the results
were not as wildly inaccurate in the worst-case examples as in either of the
other two approaches:

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 46c95b0..c83135a 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3414,6 +3414,8 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
Assert(rel->reloptkind == RELOPT_BASEREL);
if (rel->tuples > 0)
{
+   double  old_factor; /* The way it is currently done on master */
+   double  new_factor; /* The way Tomas Vondra proposes doing it */
/*
 * Clamp to size of rel, or size of rel / 10 if multiple Vars. The
 * fudge factor is because the Vars are probably correlated but we
@@ -3440,7 +3442,10 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
/*
 * Multiply by restriction selectivity.
 */
-   reldistinct *= rel->rows / rel->tuples;
+   old_factor = rel->rows / rel->tuples;   /* old way */
+   new_factor = (1 - powl((reldistinct - 1) / reldistinct, rel->rows));  /* Tomas's way */
+
+   reldistinct *= sqrt(old_factor * new_factor);   /* average of old way and new way, sort of */
 
/*
 * Update estimate of total distinct groups.
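
For what it's worth, here's a tiny stand-alone comparison of the three scaling
factors (the inputs are made up, roughly matching the sqrt(gs) example quoted
above, and assuming the planner's ndistinct estimate for "a" was about 3000):

    #include <math.h>
    #include <stdio.h>

    int
    main(void)
    {
        double  tuples = 10000000.0;    /* rows in the table */
        double  ndistinct = 3000.0;     /* assumed estimate of distinct "a" */
        double  rows = 1000.0;          /* estimated rows passing b < 1000 */

        double  old_factor = rows / tuples;                 /* master */
        double  new_factor = 1.0 -
            pow((ndistinct - 1.0) / ndistinct, rows);       /* Tomas's patch */
        double  mix_factor = sqrt(old_factor * new_factor); /* this hybrid */

        printf("old: %.1f  new: %.1f  mix: %.1f\n",
               ndistinct * old_factor,      /* ~0.3 (the plan showed 1) */
               ndistinct * new_factor,      /* ~850 (the plan showed 850) */
               ndistinct * mix_factor);     /* ~16 (actual count is 32) */
        return 0;
    }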




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] improving GROUP BY estimation

2016-02-25 Thread Mark Dilger

> On Feb 23, 2016, at 5:12 PM, Tomas Vondra  
> wrote:
> 
> 
> 
> So much better. Clearly, there are cases where this will over-estimate the 
> cardinality - for example when the values are somehow correlated.
> 

I applied your patch, which caused a few regression tests to fail.  Attached
is a patch that includes the necessary changes to the expected test results.

It is not hard to write test cases where your patched version overestimates
the number of rows by a very similar factor as the old code underestimates
them.  My very first test, which was not specifically designed to demonstrate
this, happens to be one such example:


CREATE TABLE t (a INT, b int);
INSERT INTO t SELECT sqrt(gs)::int, gs FROM generate_series(1,10000000) gs;
ANALYZE t;
EXPLAIN SELECT a FROM t WHERE b < 1000 GROUP BY a;
  QUERY PLAN
---
 HashAggregate  (cost=169250.21..169258.71 rows=850 width=4)
   Group Key: a
   ->  Seq Scan on t  (cost=0.00..169247.71 rows=1000 width=4)
 Filter: (b < 1000)
(4 rows)

SELECT COUNT(*) FROM (SELECT a FROM t WHERE b < 1000 GROUP BY a) AS ss;
 count 
--- 
32
(1 row)



So, it estimates 850 rows where only 32 are returned.  Without applying your
patch, it estimates just 1 row where 32 are returned.  That's an overestimate
of roughly 26 times, rather than an underestimate of 32 times.

As a patch review, I'd say that your patch does what you claim it does, and it
applies cleanly, and passes the regression tests with my included
modifications.  I think there needs to be some discussion on the list about
whether the patch is a good idea.

Mark Dilger


diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 46c95b0..82a7f7f 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3438,9 +3438,9 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
reldistinct = clamp;
 
/*
-* Multiply by restriction selectivity.
+* Estimate number of distinct values expected in given number of rows.
 */
-   reldistinct *= rel->rows / rel->tuples;
+   reldistinct *= (1 - powl((reldistinct - 1) / reldistinct, rel->rows));
 
/*
 * Update estimate of total distinct groups.
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 59d7877..d9dd5ca 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -3951,17 +3951,17 @@ select d.* from d left join (select * from b group by b.id, b.c_id) s
   on d.a = s.id;
   QUERY PLAN   
 ---
- Merge Left Join
-   Merge Cond: (d.a = s.id)
-   ->  Sort
- Sort Key: d.a
- ->  Seq Scan on d
+ Merge Right Join
+   Merge Cond: (s.id = d.a)
->  Sort
  Sort Key: s.id
  ->  Subquery Scan on s
->  HashAggregate
  Group Key: b.id
  ->  Seq Scan on b
+   ->  Sort
+ Sort Key: d.a
+ ->  Seq Scan on d
 (11 rows)
 
 -- similarly, but keying off a DISTINCT clause
@@ -3970,17 +3970,17 @@ select d.* from d left join (select distinct * from b) s
   on d.a = s.id;
  QUERY PLAN  
 -
- Merge Left Join
-   Merge Cond: (d.a = s.id)
-   ->  Sort
- Sort Key: d.a
- ->  Seq Scan on d
+ Merge Right Join
+   Merge Cond: (s.id = d.a)
->  Sort
  Sort Key: s.id
  ->  Subquery Scan on s
->  HashAggregate
  Group Key: b.id, b.c_id
  ->  Seq Scan on b
+   ->  Sort
+ Sort Key: d.a
+ ->  Seq Scan on d
 (11 rows)
 
 -- check join removal works when uniqueness of the join condition is enforced
diff --git a/src/test/regress/expected/subselect.out b/src/test/regress/expected/subselect.out
index de64ca7..0fc93d9 100644
--- a/src/test/regress/expected/subselect.out
+++ b/src/test/regress/expected/subselect.out
@@ -807,27 +807,24 @@ select * from int4_tbl where
 explain (verbose, costs off)
 select * from int4_tbl o where (f1, f1) in
   (select f1, generate_series(1,2) / 10 g from int4_tbl i group by f1);
-  QUERY PLAN  
---
- Hash Join
+   QUERY PLAN   
+
+ Hash Semi Join
Output: o.f1
Hash Cond: (o.f1 = "ANY_subquery".f1)
->  Seq Scan on public.int4_tbl o
  Output: o.f1
->  Hash
  Output: 

[HACKERS] improving GROUP BY estimation

2016-02-23 Thread Tomas Vondra

Hi,

while working on using correlation statistics when estimating GROUP BY 
(or generally whatever uses estimate_num_groups), I've noticed that we 
apply the selectivity in a rather naive way.


After estimating number of groups in the whole table - either by reading 
pg_stats.n_distinct for a single column, or multiplying (and doing some 
additional magic) for a group of columns, we do this:


   reldistinct *= rel->rows / rel->tuples;

which essentially applies the selectivity to the ndistinct estimate.

So when you have a table with 1000 distinct values in a column, and we 
read 10% of the table, we expect to get 100 distinct values.


But consider for example a table that has 100M rows, where the column has
n_distinct=10000 and we're reading 1% of the table (i.e. 1M rows). The
current code will estimate the GROUP BY cardinality to be 100 (as 1% of
the n_distinct), but in reality we're more likely to see all 10000
distinct values.


Example:

CREATE TABLE t (a INT, b FLOAT);
INSERT INTO t SELECT (100000 * random())::int, random()
FROM generate_series(1,10000000) s(i);
ANALYZE t;

SELECT n_distinct FROM pg_stats WHERE attname = 'a';

 n_distinct
------------
      98882

-- 1% of the table
EXPLAIN ANALYZE SELECT a FROM t WHERE b < 0.01 GROUP BY a;

   QUERY PLAN
-
 HashAggregate  (cost=179514.33..179523.45 rows=912 width=4)
(actual time=916.224..955.136 rows=63492 loops=1)
   Group Key: a
   ->  Seq Scan on t  (cost=0.00..179053.25 rows=92216 width=4)
  (actual time=0.103..830.104 rows=100268 loops=1)
 Filter: (b < '0.01'::double precision)
 Rows Removed by Filter: 9899732
 Planning time: 1.237 ms
 Execution time: 982.600 ms
(7 rows)

-- 5% of the table
EXPLAIN ANALYZE SELECT a FROM t WHERE b < 0.05 GROUP BY a;

QUERY PLAN
-
 HashAggregate  (cost=180296.06..180345.22 rows=4916 width=4)
(actual time=1379.519..1440.005 rows=99348 loops=1)
   Group Key: a
   ->  Seq Scan on t  (cost=0.00..179053.25 rows=497123 width=4)
  (actual time=0.096..1000.494 rows=500320 loops=1)
 Filter: (b < '0.05'::double precision)
 Rows Removed by Filter: 9499680
 Planning time: 0.129 ms
 Execution time: 1480.986 ms
(7 rows)

This is getting especially bad when reading only a small fraction of a
huge table - it's easy to construct cases with arbitrarily large errors.
And it's worth noting that the error is almost exclusively a massive
under-estimate, i.e. in the "wrong" direction, as HashAggregate is still
vulnerable to OOM (and the larger the table, the worse it gets).


So I think a more appropriate way to estimate this would be to find the 
expected number of distinct values when sampling with replacement, as 
explained for example in [1].


Applied to the code, it means changing the line from

  reldistinct *= rel->rows / rel->tuples;

to

  reldistinct *= (1 - powl((reldistinct - 1) / reldistinct, rel->rows));
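
To put rough numbers on the hypothetical example above (a quick stand-alone
check, not part of the patch):

    #include <math.h>
    #include <stdio.h>

    int
    main(void)
    {
        double  tuples = 100000000.0;   /* 100M rows in the table */
        double  ndistinct = 10000.0;    /* n_distinct of the column */
        double  rows = 1000000.0;       /* reading 1% of the table */

        /* current code: scale ndistinct by the selectivity */
        printf("old: %.0f\n", ndistinct * rows / tuples);   /* prints 100 */

        /* proposed: expected distinct values when sampling with replacement */
        printf("new: %.0f\n",
               ndistinct * (1.0 - pow((ndistinct - 1.0) / ndistinct, rows)));
        /* prints ~10000, i.e. practically all distinct values */
        return 0;
    }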

With this change, the estimates for the examples look like this:

   QUERY PLAN
-
 HashAggregate  (cost=179283.79..179883.48 rows=59969 width=4)
(actual time=897.071..934.764 rows=63492 loops=1)
   Group Key: a
   ->  Seq Scan on t  (cost=0.00..179053.25 rows=92216 width=4)
  (actual time=0.104..820.276 rows=100268 loops=1)
 Filter: (b < '0.01'::double precision)
 Rows Removed by Filter: 9899732
 Planning time: 1.261 ms
 Execution time: 962.812 ms
(7 rows)

and

QUERY PLAN
-
 Group  (cost=232886.15..235371.76 rows=98234 width=4)
(actual time=1519.545..2104.447 rows=99348 loops=1)
   Group Key: a
   ->  Sort  (cost=232886.15..234128.95 rows=497123 width=4)
 (actual time=1519.540..1838.575 rows=500320 loops=1)
 Sort Key: a
 Sort Method: external merge  Disk: 6824kB
 ->  Seq Scan on t  (cost=0.00..179053.25 rows=497123 width=4)
 (actual time=0.090..1044.099 rows=500320 loops=1)
   Filter: (b < '0.05'::double precision)
   Rows Removed by Filter: 9499680
 Planning time: 0.133 ms
 Execution time: 2147.340 ms
(10 rows)

So much better. Clearly, there are cases where this will over-estimate 
the cardinality - for example when the values are somehow correlated.


But I'd argue over-estimating is better because of the OOM issues in 
Hash Aggregate.


regards

[1] 
http://math.stackexchange.com/questions/72223/finding-expected-number-of-distinct-values-selected-from-a-set-of-integers


--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7