Re: [HACKERS] Bogus ANALYZE results for an otherwise-unique column with many nulls

2016-08-14 Thread Andreas Joseph Krogh
On Sunday, 7 August 2016 at 18:35:56, Tom Lane wrote:
Dean Rasheed  writes:
 > On 5 August 2016 at 21:48, Tom Lane  wrote:
 >> OK, thanks.  What shall we do about Andreas' request to back-patch this?
 >> I'm personally willing to do it, but there is the old bugaboo of "maybe
 >> it will destabilize a plan that someone is happy with".

 > My inclination would be to back-patch it because arguably it's a
 > bug-fix -- at the very least the old behaviour didn't match the docs
 > for stadistinct:

 Yeah.  I suspect that situations like this are pretty rare, or we'd have
 recognized the problem sooner.  And at least for Andreas, it'd be fixing
 a real problem.  I'll apply the back-patch unless I hear objections in
 the next couple of hours.

 > Additionally, I think that example is misleading because it's only
 > really true if there are no null values in the column. Perhaps it
 > would help to have a more explicit example to illustrate how nulls
 > affect stadistinct, for example:

 Good idea, will do.

 regards, tom lane
 
 
(Sorry for the HTML-formatting, I'm using a web-based client which isn't very 
good at converting the HTML to plain-text)
 
Just wanted to thank you for fixing this.
 
Take this query (pg-9.5):
 
explain analyze
SELECT com.entity_id AS company_id
     , sum((coalesce(log.adjusted_time, log.duration) / (3600 * 1000)::numeric)
           * coalesce(log.adjusted_hourly_rate, log.hourly_rate)) AS not_invoiced_hours_amount_adjusted
     , sum((log.duration / (3600 * 1000)::numeric) * log.hourly_rate) AS not_invoiced_hours_amount_original
     , sum(coalesce(log.adjusted_time, log.duration)) AS not_invoiced_hours_duration_adjusted
     , sum(log.duration) AS not_invoiced_hours_duration_original
FROM onp_crm_relation com
JOIN onp_crm_activity_log log ON com.entity_id = log.relation_id
JOIN onp_crm_person logfor ON logfor.onp_user_id = log.logged_for AND logfor.is_resource = FALSE
LEFT OUTER JOIN (onp_crm_activity act
                 INNER JOIN onp_crm_project proj ON act.project_id = proj.id)
    ON log.activity_id = act.id
WHERE 1 = 1
  AND log.entry_type = 'TIMESHEET_ENTRY'
  AND log.start_date IS NOT NULL
  AND log.is_chargable = TRUE
  AND log.status IN (1, 2, 3, 5)
  AND log.start_date <= '2016-06-27'
  -- AND com.entity_id = 2246068
  AND NOT EXISTS(
      SELECT *
      FROM onp_crm_calendarentry_invoice_membership cem
      JOIN onp_crm_invoice_line il ON cem.invoice_line_id = il.id
      JOIN onp_crm_invoice inv ON il.invoice_id = inv.entity_id
      WHERE cem.calendar_entry_id = log.id
        AND inv.status_key = 'INVOICE_STATUS_INVOICED'
        AND inv.sent_date <= '2016-06-27'
        AND NOT EXISTS(
            SELECT *
            FROM onp_crm_invoice creditnote
            WHERE il.invoice_id = creditnote.credit_against
              AND creditnote.status_key = 'INVOICE_STATUS_INVOICED'
              AND creditnote.sent_date <= '2016-06-27'
        )
  )
GROUP BY com.entity_id;
 
Before the fix, the plan was like this:
 
QUERY PLAN 
---
 HashAggregate  (cost=13874.94..13894.38 rows=972 width=32) (actual time=420188.465..420189.232 rows=462 loops=1)
   Group Key: com.entity_id
   ->  Nested Loop Left Join  (cost=1710.21..13848.21 rows=972 width=32) (actual time=1004.858..420166.736 rows=3882 loops=1)
         ->  Nested Loop Anti Join  (cost=1709.51..12990.02 rows=972 width=36) (actual time=1003.001..419356.749 rows=3882 loops=1)
               Join Filter: (cem.calendar_entry_id = log.id)
               Rows Removed by Join Filter: 3996874113
               ->  Hash Join  (cost=1172.69..10612.53 rows=972 width=44) (actual time=29.085..333.323 rows=89431 loops=1)
                     Hash Cond: (log.relation_id = com.entity_id)
                     ->  Hash Join  (cost=1063.39..10489.85 rows=976 width=40) (actual time=27.598..261.822 rows=89431 loops=1)
                           Hash Cond: (log.logged_for = logfor.onp_user_id)
                           ->  Bitmap Heap Scan on onp_crm_activity_log log  (cost=976.83..10055.99 rows=90010 width=44) (actual time=27.402..223.407 rows=89486 loops=1)
                                 Recheck Cond: ((status = ANY ('{1,2,3,5}'::integer[])) AND is_chargable AND (start_date IS NOT NULL) AND ((entry_type)::text = 'TIMESHEET_ENTRY'::text))
                                 Filter: (start_date <= '2016-06-27'::date)
                                 Rows Removed by Filter: 991
                                 Heap Blocks: exact=6095
                                 ->  Bitmap Index Scan on origo_calendar_entry_status_1_idx  (cost=0.00..954.33 rows=93008 width=0) (actual time=26.339..26.339 rows=101274 loops=1)
                                       Index Cond: (status = ANY ('{1,2,3,5}'::integer[]))
                     ->  Hash  (cost=39.46..39.46 rows=3768 width=8) (actual time=0.148..0.148 rows=36 loops=1)
                           Buckets: 4096  Batches: 1  Memory Usage: 34kB
                           ->  Bitmap Heap Scan on onp_crm_person logfor  (cost=3.69..39.46 rows=3768 width=8) (actual time=0.049..0.137 rows=36 loops=1)
                                 Recheck Cond: (onp_user_id IS NOT NULL)
                                 Filter: (NOT is_resource)
                                 Rows Removed by Filter: 5
                                 Heap Blocks: exact=11
                                 ->  Bitmap Index Scan on onp_crm_person_onp_id_idx  (cost=0.00..2.75 rows=41 width=0)

Re: [HACKERS] Bogus ANALYZE results for an otherwise-unique column with many nulls

2016-08-07 Thread Tom Lane
Dean Rasheed  writes:
> On 5 August 2016 at 21:48, Tom Lane  wrote:
>> OK, thanks.  What shall we do about Andreas' request to back-patch this?
>> I'm personally willing to do it, but there is the old bugaboo of "maybe
>> it will destabilize a plan that someone is happy with".

> My inclination would be to back-patch it because arguably it's a
> bug-fix -- at the very least the old behaviour didn't match the docs
> for stadistinct:

Yeah.  I suspect that situations like this are pretty rare, or we'd have
recognized the problem sooner.  And at least for Andreas, it'd be fixing
a real problem.  I'll apply the back-patch unless I hear objections in
the next couple of hours.

> Additionally, I think that example is misleading because it's only
> really true if there are no null values in the column. Perhaps it
> would help to have a more explicit example to illustrate how nulls
> affect stadistinct, for example:

Good idea, will do.

regards, tom lane




Re: [HACKERS] Bogus ANALYZE results for an otherwise-unique column with many nulls

2016-08-07 Thread Andreas Joseph Krogh
On Sunday, 7 August 2016 at 09:01:40, Dean Rasheed <dean.a.rash...@gmail.com> wrote:
On 5 August 2016 at 21:48, Tom Lane  wrote:
 > OK, thanks.  What shall we do about Andreas' request to back-patch this?
 > I'm personally willing to do it, but there is the old bugaboo of "maybe
 > it will destabilize a plan that someone is happy with".
 >

 My inclination would be to back-patch it because arguably it's a
 bug-fix -- at the very least the old behaviour didn't match the docs
 for stadistinct:
 [snip]
 
 
Will this then make it into the soon-to-be-released 9.5.4?
 
-- Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andr...@visena.com 
www.visena.com 
 


 


Re: [HACKERS] Bogus ANALYZE results for an otherwise-unique column with many nulls

2016-08-07 Thread Dean Rasheed
On 5 August 2016 at 21:48, Tom Lane  wrote:
> OK, thanks.  What shall we do about Andreas' request to back-patch this?
> I'm personally willing to do it, but there is the old bugaboo of "maybe
> it will destabilize a plan that someone is happy with".
>

My inclination would be to back-patch it because arguably it's a
bug-fix -- at the very least the old behaviour didn't match the docs
for stadistinct:

  The number of distinct nonnull data values in the column.
  A value greater than zero is the actual number of distinct values.
  A value less than zero is the negative of a multiplier for the number
  of rows in the table; for example, a column in which values appear about
  twice on the average could be represented by
  stadistinct = -0.5.

Additionally, I think that example is misleading because it's only
really true if there are no null values in the column. Perhaps it
would help to have a more explicit example to illustrate how nulls
affect stadistinct, for example:

  ... for example, a column in which about 80% of the values are nonnull
  and each nonnull value appears about twice on average could be
  represented by stadistinct = -0.4.
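
For concreteness, here's a rough way to build a column of that shape and see
what ANALYZE makes of it (table and column names are invented for
illustration, and the sampled estimate will land somewhere around -0.4 rather
than exactly on it):

  -- 100000 rows, ~80% nonnull, each nonnull value appearing exactly twice,
  -- so the "ideal" relative form is -(40000 / 100000) = -0.4.
  create table nulltwice as
    select case when i % 10 < 8 then i / 2 end as val
    from generate_series(0, 99999) as s(i);
  analyze nulltwice;
  select null_frac, n_distinct from pg_stats where tablename = 'nulltwice';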

Regards,
Dean




Re: [HACKERS] Bogus ANALYZE results for an otherwise-unique column with many nulls

2016-08-05 Thread Tom Lane
Andrew Gierth  writes:
> Objection withdrawn.

OK, thanks.  What shall we do about Andreas' request to back-patch this?
I'm personally willing to do it, but there is the old bugaboo of "maybe
it will destabilize a plan that someone is happy with".

regards, tom lane




Re: [HACKERS] Bogus ANALYZE results for an otherwise-unique column with many nulls

2016-08-05 Thread Andrew Gierth
> "Tom" == Tom Lane  writes:

 Tom> Also, the way that the value is calculated in the
 Tom> samples-not-all-distinct case corresponds to the way I have it in
 Tom> the patch.

Ahh, gotcha.  You're referring to this:

/*
 * If we estimated the number of distinct values at more than 10% of
 * the total row count (a very arbitrary limit), then assume that
 * stadistinct should scale with the row count rather than be a fixed
 * value.
 */
if (stats->stadistinct > 0.1 * totalrows)
stats->stadistinct = -(stats->stadistinct / totalrows);

where "totalrows" includes nulls obviously. So this expects negative
stadistinct to be scaled by the total table size, and the all-distinct
case should do the same.
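
To make the scaling concrete, a quick sketch with invented names: the
estimated distinct count is well above 10% of the row count, so ANALYZE takes
this branch and stores the negative, scale-with-rowcount form (the exact
value depends on the sample it takes):

  create table halfdup as
    select i / 2 as val from generate_series(0, 99999) as s(i);  -- 50000 distinct values in 100000 rows
  analyze halfdup;
  select n_distinct from pg_stats where tablename = 'halfdup';   -- roughly -0.5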

Objection withdrawn.

-- 
Andrew (irc:RhodiumToad)




Re: [HACKERS] Bogus ANALYZE results for an otherwise-unique column with many nulls

2016-08-05 Thread Tom Lane
Andrew Gierth  writes:
> Hm. I am wrong about this, since it's the fact that consumers are taking
> stanullfrac into account that makes the value wrong in the first place.

Also, the way that the value is calculated in the samples-not-all-distinct
case corresponds to the way I have it in the patch.  What you want to do
would correspond to leaving these edge cases alone and changing all the
other ANALYZE cases instead (*plus* changing the consumers).  I find that
a tad scary.

> But I think the fix is still wrong, because it changes the meaning of
> ALTER TABLE ... ALTER col SET (n_distinct=...)  in a non-useful way; it
> is no longer possible to nail down a useful negative n_distinct value if
> the null fraction of the column is variable.

I think that argument is bogus.  If we change the way that
get_variable_numdistinct (and other consumers) use the value, that will
break all existing custom settings of n_distinct, because they will no
longer mean what they did before.  There have been exactly zero field
complaints that people could not get the results they wanted, so I do
not think that's justified.

In short, what you want to do constitutes a redefinition of stadistinct,
while my patch doesn't.  That is far more invasive and I fear it will
break things that are not broken today.

regards, tom lane




Re: [HACKERS] Bogus ANALYZE results for an otherwise-unique column with many nulls

2016-08-05 Thread Andrew Gierth
> "Andrew" == Andrew Gierth  writes:
> "Tom" == Tom Lane  writes:

 Tom> What I did in the patch is to scale the formerly fixed "-1.0"
 Tom> stadistinct estimate to discount the fraction of nulls we found.

 Andrew> This seems quite dubious to me. stadistinct representing only
 Andrew> the non-null values seems to me to be substantially more useful
 Andrew> and less confusing; it should be up to consumers to take
 Andrew> stanullfrac into account (in general they already do) since in
 Andrew> many cases we explicitly do _not_ want to count nulls.

Hm. I am wrong about this, since it's the fact that consumers are taking
stanullfrac into account that makes the value wrong in the first place.
For example, if a million-row table has stanullfrac=0.9 and
stadistinct=-1, then get_variable_numdistinct is returning 1 million,
and (for example) var_eq_non_const divides 0.1 by that to give a
selectivity of 1 in 10 million, which is obviously wrong.
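
Spelling that out (assuming the usual (1 - stanullfrac) / ndistinct form that
var_eq_non_const applies, and sticking with the same made-up round numbers):

  -- stadistinct = -1 is turned into 1.0 * 1000000 = 1000000 "distinct" values,
  -- so the estimated selectivity is (1 - 0.9) / 1000000 = 1e-7, i.e. 0.1 rows,
  -- whereas treating only the 100000 nonnull values as distinct would give
  -- (1 - 0.9) / 100000 = 1e-6, i.e. about one matching row, which is right.
  select (1 - 0.9) / 1000000.0 as selectivity_as_computed,
         (1 - 0.9) / 100000.0  as selectivity_wanted;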

But I think the fix is still wrong, because it changes the meaning of
ALTER TABLE ... ALTER col SET (n_distinct=...)  in a non-useful way; it
is no longer possible to nail down a useful negative n_distinct value if
the null fraction of the column is variable. Would it not make more
sense to do the adjustment in get_variable_numdistinct, instead?

-- 
Andrew (irc:RhodiumToad)




Re: [HACKERS] Bogus ANALYZE results for an otherwise-unique column with many nulls

2016-08-05 Thread Andreas Joseph Krogh
On Friday, 5 August 2016 at 01:01:06, Tom Lane wrote:
I wrote:
 > Looking around, there are a couple of places outside commands/analyze.c
 > that are making the same mistake, so this patch isn't complete, but it
 > illustrates what needs to be done.

 Here's a more complete patch.

 regards, tom lane
 
(Sorry for HTML-email, but we live in 2016, and I think it improves 
readability)
 
Great!
 
I have tested it (with PG-9.6 from HEAD, 
e7caacf733f3ee77a555aa715ab6fbf4434e6b52) and it sure looks like it fixes the 
problem for my query.
 
The query:
 
explain analyze
SELECT log.relation_id AS company_id
     , sum(log.duration) AS duration
FROM onp_crm_activity_log log
JOIN onp_crm_person logfor ON logfor.onp_user_id = log.logged_for AND logfor.is_resource = FALSE
WHERE 1 = 1
  -- Filter out already invoiced
  AND NOT EXISTS(
      SELECT *
      FROM onp_crm_calendarentry_invoice_membership cem
      JOIN onp_crm_invoice_line il ON cem.invoice_line_id = il.id
      JOIN onp_crm_invoice inv ON il.invoice_id = inv.entity_id
      WHERE cem.calendar_entry_id = log.id
        AND NOT EXISTS(
            SELECT *
            FROM onp_crm_invoice creditnote
            WHERE il.invoice_id = creditnote.credit_against
        )
  )
GROUP BY log.relation_id;
 
This gave the following explain plan (before the patch), with enable_nestloop=off 
(needed to prevent a nested-loop anti-join which caused this query to execute in 
~26 min.):
 
QUERY PLAN 
---
 HashAggregate  (cost=25201.62..25220.06 rows=1475 width=36) (actual time=381.191..381.281 rows=404 loops=1)
   Group Key: log.relation_id
   ->  Hash Anti Join  (cost=4782.34..25148.28 rows=10667 width=12) (actual time=103.683..361.409 rows=148330 loops=1)
         Hash Cond: (log.id = cem.calendar_entry_id)
         ->  Hash Join  (cost=81.46..20312.73 rows=10667 width=20) (actual time=0.100..156.432 rows=380617 loops=1)
               Hash Cond: (log.logged_for = logfor.onp_user_id)
               ->  Seq Scan on onp_crm_activity_log log  (cost=0.00..18696.71 rows=380771 width=24) (actual time=0.005..49.195 rows=380774 loops=1)
               ->  Hash  (cost=25.02..25.02 rows=4515 width=8) (actual time=0.078..0.078 rows=119 loops=1)
                     Buckets: 8192  Batches: 1  Memory Usage: 69kB
                     ->  Index Scan using onp_crm_person_onp_id_idx on onp_crm_person logfor  (cost=0.14..25.02 rows=4515 width=8) (actual time=0.005..0.063 rows=119 loops=1)
                           Filter: (NOT is_resource)
                           Rows Removed by Filter: 8
         ->  Hash  (cost=4700.87..4700.87 rows=1 width=4) (actual time=103.575..103.575 rows=232412 loops=1)
               Buckets: 262144 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 10219kB
               ->  Hash Join  (cost=474.41..4700.87 rows=1 width=4) (actual time=9.724..76.015 rows=232412 loops=1)
                     Hash Cond: (cem.invoice_line_id = il.id)
                     ->  Seq Scan on onp_crm_calendarentry_invoice_membership cem  (cost=0.00..3354.28 rows=232528 width=8) (actual time=0.004..17.626 rows=232528 loops=1)
                     ->  Hash  (cost=474.40..474.40 rows=1 width=4) (actual time=9.710..9.710 rows=11535 loops=1)
                           Buckets: 16384 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 534kB
                           ->  Merge Join  (cost=415.33..474.40 rows=1 width=4) (actual time=4.149..8.467 rows=11535 loops=1)
                                 Merge Cond: (il.invoice_id = inv.entity_id)
                                 ->  Sort  (cost=415.05..415.06 rows=1 width=8) (actual time=4.138..4.701 rows=11535 loops=1)
                                       Sort Key: il.invoice_id
                                       Sort Method: quicksort  Memory: 925kB
                                       ->  Hash Anti Join  (cost=77.13..415.04 rows=1 width=8) (actual time=0.257..2.716 rows=11535 loops=1)
                                             Hash Cond: (il.invoice_id = creditnote.credit_against)
                                             ->  Seq Scan on onp_crm_invoice_line il  (cost=0.00..294.30 rows=11630 width=8) (actual time=0.003..0.995 rows=11630 loops=1)
                                             ->  Hash  (cost=50.38..50.38 rows=2140 width=4) (actual time=0.247..0.247 rows=37 loops=1)
                                                   Buckets: 4096  Batches: 1  Memory Usage: 34kB
                                                   ->  Index Only Scan using origo_invoice_credit_against_idx on onp_crm_invoice creditnote  (cost=0.28..50.38 rows=2140 width=4) (actual time=0.013..0.182 rows=2140 loops=1)
                                                         Heap Fetches: 0
                                 ->  Index Only Scan using origo_invoice_id_status_sent_idx on onp_crm_invoice inv  (cost=0.28..53.98 rows=2140 width=8) (actual time=0.008..1.274 rows=11576 loops=1)
                                       Heap Fetches: 0
 Planning time: 0.955 ms
 Execution time: 381.884 ms
(35 rows)
 
AFAIU, this is the problematic estimate:
  ->  Hash Anti Join  (cost=77.13..415.04 rows=1 width=8) (actual time=0.257..2.716 rows=11535 loops=1)
Right?
 
Now (after the patch, and without needing to disable nested loops) it gives the 
following explain plan:
QUERY PLAN

 HashAggregate  (cost=44409.89..44428.47 rows=1486 width=36) (actual time=366.502..366.594 rows=404 loops=1)
   Group Key: log.relation_id
   ->  Hash Anti Join  (cost=10157.30..43650.11 

Re: [HACKERS] Bogus ANALYZE results for an otherwise-unique column with many nulls

2016-08-05 Thread Andrew Gierth
> "Tom" == Tom Lane  writes:

 Tom> What I did in the patch is to scale the formerly fixed "-1.0"
 Tom> stadistinct estimate to discount the fraction of nulls we found.

This seems quite dubious to me. stadistinct representing only the
non-null values seems to me to be substantially more useful and less
confusing; it should be up to consumers to take stanullfrac into account
(in general they already do) since in many cases we explicitly do _not_
want to count nulls.

-- 
Andrew (irc:RhodiumToad)




Re: [HACKERS] Bogus ANALYZE results for an otherwise-unique column with many nulls

2016-08-04 Thread Tom Lane
I wrote:
> Looking around, there are a couple of places outside commands/analyze.c
> that are making the same mistake, so this patch isn't complete, but it
> illustrates what needs to be done.

Here's a more complete patch.

regards, tom lane

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 5fcedd7..9ac7122 100644
*** a/src/backend/commands/analyze.c
--- b/src/backend/commands/analyze.c
*** compute_distinct_stats(VacAttrStatsP sta
*** 2049,2056 
  
  		if (nmultiple == 0)
  		{
! 			/* If we found no repeated values, assume it's a unique column */
! 			stats->stadistinct = -1.0;
  		}
  		else if (track_cnt < track_max && toowide_cnt == 0 &&
   nmultiple == track_cnt)
--- 2049,2059 
  
  		if (nmultiple == 0)
  		{
! 			/*
! 			 * If we found no repeated non-null values, assume it's a unique
! 			 * column; but be sure to discount for any nulls we found.
! 			 */
! 			stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
  		}
  		else if (track_cnt < track_max && toowide_cnt == 0 &&
   nmultiple == track_cnt)
*** compute_scalar_stats(VacAttrStatsP stats
*** 2426,2433 
  
  		if (nmultiple == 0)
  		{
! 			/* If we found no repeated values, assume it's a unique column */
! 			stats->stadistinct = -1.0;
  		}
  		else if (toowide_cnt == 0 && nmultiple == ndistinct)
  		{
--- 2429,2439 
  
  		if (nmultiple == 0)
  		{
! 			/*
! 			 * If we found no repeated non-null values, assume it's a unique
! 			 * column; but be sure to discount for any nulls we found.
! 			 */
! 			stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
  		}
  		else if (toowide_cnt == 0 && nmultiple == ndistinct)
  		{
*** compute_scalar_stats(VacAttrStatsP stats
*** 2753,2759 
  		else
  			stats->stawidth = stats->attrtype->typlen;
  		/* Assume all too-wide values are distinct, so it's a unique column */
! 		stats->stadistinct = -1.0;
  	}
  	else if (null_cnt > 0)
  	{
--- 2759,2765 
  		else
  			stats->stawidth = stats->attrtype->typlen;
  		/* Assume all too-wide values are distinct, so it's a unique column */
! 		stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
  	}
  	else if (null_cnt > 0)
  	{
diff --git a/src/backend/tsearch/ts_typanalyze.c b/src/backend/tsearch/ts_typanalyze.c
index 0f851ea..817453c 100644
*** a/src/backend/tsearch/ts_typanalyze.c
--- b/src/backend/tsearch/ts_typanalyze.c
*** compute_tsvector_stats(VacAttrStats *sta
*** 295,301 
  		stats->stawidth = total_width / (double) nonnull_cnt;
  
  		/* Assume it's a unique column (see notes above) */
! 		stats->stadistinct = -1.0;
  
  		/*
  		 * Construct an array of the interesting hashtable items, that is,
--- 295,301 
  		stats->stawidth = total_width / (double) nonnull_cnt;
  
  		/* Assume it's a unique column (see notes above) */
! 		stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
  
  		/*
  		 * Construct an array of the interesting hashtable items, that is,
diff --git a/src/backend/utils/adt/rangetypes_typanalyze.c b/src/backend/utils/adt/rangetypes_typanalyze.c
index fcb71d3..56504fc 100644
*** a/src/backend/utils/adt/rangetypes_typanalyze.c
--- b/src/backend/utils/adt/rangetypes_typanalyze.c
*** compute_range_stats(VacAttrStats *stats,
*** 203,209 
  		/* Do the simple null-frac and width stats */
  		stats->stanullfrac = (double) null_cnt / (double) samplerows;
  		stats->stawidth = total_width / (double) non_null_cnt;
! 		stats->stadistinct = -1.0;
  
  		/* Must copy the target values into anl_context */
  		old_cxt = MemoryContextSwitchTo(stats->anl_context);
--- 203,211 
  		/* Do the simple null-frac and width stats */
  		stats->stanullfrac = (double) null_cnt / (double) samplerows;
  		stats->stawidth = total_width / (double) non_null_cnt;
! 
! 		/* Estimate that non-null values are unique */
! 		stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
  
  		/* Must copy the target values into anl_context */
  		old_cxt = MemoryContextSwitchTo(stats->anl_context);
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index cc2a9a1..56943f2 100644
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
*** double
*** 4738,4743 
--- 4738,4744 
  get_variable_numdistinct(VariableStatData *vardata, bool *isdefault)
  {
  	double		stadistinct;
+ 	double		stanullfrac = 0.0;
  	double		ntuples;
  
  	*isdefault = false;
*** get_variable_numdistinct(VariableStatDat
*** 4745,4751 
  	/*
  	 * Determine the stadistinct value to use.  There are cases where we can
  	 * get an estimate even without a pg_statistic entry, or can get a better
! 	 * value than is in pg_statistic.
  	 */
  	if (HeapTupleIsValid(vardata->statsTuple))
  	{
--- 4746,4753 
  	/*
  	 * Determine the stadistinct value to use.  There are cases where we can
  	 * get an estimate even without a pg_statistic 

[HACKERS] Bogus ANALYZE results for an otherwise-unique column with many nulls

2016-08-04 Thread Tom Lane
I looked into the problem described at
https://www.postgresql.org/message-id/flat/VisenaEmail.26.df42f82acae38a58.156463942b8%40tc7-visena
and I believe I've reproduced it: the requirement is that the inner join
column for the antijoin must contain a lot of NULL values, and what isn't
NULL must be unique or nearly so.  If ANALYZE doesn't come across any
duplicated values, it will set the column's stadistinct value to "-1",
which causes the planner to believe that each row of the inner table
produces a unique value, resulting in a bogus answer from
get_variable_numdistinct() and thence a bogus join size estimate.

Here's an example in the regression database, making use of the existing
unique column tenk1.unique1:

regression=# create table manynulls as select case when random() < 0.1 then 
unique1 else null end as unique1 from tenk1;
SELECT 1
regression=# analyze manynulls; 
ANALYZE
regression=# explain analyze select * from tenk1 t where not exists(select 1 
from manynulls m where m.unique1 = t.unique1);
QUERY PLAN  
   
---
 Hash Anti Join  (cost=261.00..756.50 rows=1 width=244) (actual 
time=4.632..14.729 rows=8973 loops=1)
   Hash Cond: (t.unique1 = m.unique1)
   ->  Seq Scan on tenk1 t  (cost=0.00..458.00 rows=1 width=244) (actual 
time=0.015..2.683 rows=1 loops=1)
   ->  Hash  (cost=136.00..136.00 rows=1 width=4) (actual time=4.553..4.553 
rows=1027 loops=1)
 Buckets: 16384  Batches: 1  Memory Usage: 165kB
 ->  Seq Scan on manynulls m  (cost=0.00..136.00 rows=1 width=4) 
(actual time=0.019..2.668 rows=1 loops=1)
 Planning time: 0.808 ms
 Execution time: 15.670 ms
(8 rows)

So the antijoin size estimate is way off, but it's hardly the planner's
fault because the stats are insane:

regression=# select attname,null_frac,n_distinct from pg_stats where tablename 
= 'manynulls';
 attname | null_frac | n_distinct 
---------+-----------+------------
 unique1 |    0.8973 |         -1
(1 row)

With the patch attached below, ANALYZE produces

regression=# analyze manynulls;
ANALYZE
regression=# select attname,null_frac,n_distinct from pg_stats where tablename 
= 'manynulls';
 attname | null_frac | n_distinct 
---------+-----------+------------
 unique1 |    0.8973 |    -0.1027
(1 row)
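
That is just the patched formula applied to the observed null fraction, as a
quick check confirms:

  select -1.0 * (1.0 - 0.8973) as stadistinct;   -- -0.1027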

and now the join size estimate is dead on:

regression=# explain analyze select * from tenk1 t where not exists(select 1 
from manynulls m where m.unique1 = t.unique1);
QUERY PLAN  
   
---
 Hash Anti Join  (cost=261.00..847.69 rows=8973 width=244) (actual 
time=4.501..13.888 rows=8973 loops=1)
   Hash Cond: (t.unique1 = m.unique1)
   ->  Seq Scan on tenk1 t  (cost=0.00..458.00 rows=1 width=244) (actual 
time=0.031..4.959 rows=1 loops=1)
   ->  Hash  (cost=136.00..136.00 rows=1 width=4) (actual time=4.404..4.404 
rows=1027 loops=1)
 Buckets: 16384  Batches: 1  Memory Usage: 165kB
 ->  Seq Scan on manynulls m  (cost=0.00..136.00 rows=1 width=4) 
(actual time=0.034..2.576 rows=1 loops=1)
 Planning time: 1.388 ms
 Execution time: 14.542 ms
(8 rows)

What I did in the patch is to scale the formerly fixed "-1.0" stadistinct
estimate to discount the fraction of nulls we found.  An alternative
possibility might be to decree that a fractional stadistinct considers
only non-nulls, but then all the other paths through ANALYZE would be
wrong.  The spec for it in pg_statistic.h doesn't suggest any such
interpretation, either.

Looking around, there are a couple of places outside commands/analyze.c
that are making the same mistake, so this patch isn't complete, but it
illustrates what needs to be done.

This is a bit reminiscent of the nulls-accounting problem we fixed in
commit be4b4dc75, though that tended to result in underestimates not
overestimates of the number of distinct values.  We didn't back-patch
that fix, so probably we shouldn't back-patch this either.  On the other
hand, it is far more open-and-shut that this is wrong.  Thoughts?

regards, tom lane

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 5fcedd7..9ac7122 100644
*** a/src/backend/commands/analyze.c
--- b/src/backend/commands/analyze.c
*** compute_distinct_stats(VacAttrStatsP sta
*** 2049,2056 
  
  		if (nmultiple == 0)
  		{
! 			/* If we found no repeated values, assume it's a unique column */
! 			stats->stadistinct = -1.0;
  		}
  		else if (track_cnt < track_max && toowide_cnt == 0 &&
   nmultiple == track_cnt)
--- 2049,2059 
  
  		if (nmultiple == 0)
  		{
!