Re: [HACKERS] Really bad blowups with hash outer join and nulls

2015-06-01 Thread Andrew Gierth
 Jim == Jim Nasby jim.na...@bluetreble.com writes:

 Jim Anything happen with this, or the patch Andrew posted?

No.

And my attention has just been drawn to this, which looks like the same
issue:

http://www.postgresql.org/message-id/52b47b47-0926-4e15-b25e-212df52fe...@oseberg.io

-- 
Andrew (irc:RhodiumToad)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Really bad blowups with hash outer join and nulls

2015-04-07 Thread Jim Nasby

On 2/15/15 7:16 PM, Tomas Vondra wrote:

Hi,

On 16.2.2015 00:50, Andrew Gierth wrote:

Tom == Tom Lane t...@sss.pgh.pa.us writes:


I've now tried the attached patch to correct the bucketsize
estimates, and it does indeed stop the planner from considering the
offending path (in this case it just does the join the other way
round).

One thing I couldn't immediately see how to do was account for the
case where there are a lot of nulls in the table but a strict qual
(or an IS NOT NULL) filters them out; this patch will be overly
pessimistic about such cases. Do estimates normally try and take
things like this into account? I didn't find any other relevant
examples.


Improving the estimates is always good, but it's not going to fix the
case of non-NULL values (it shouldn't be all that difficult to create
such examples with a value whose hash starts with a bunch of zeroes).


  Tom There may also be something we can do in the executor, but it
  Tom would take closer analysis to figure out what's going wrong.  I
  Tom don't think kluging the behavior for NULL in particular is the
  Tom answer.

The point with nulls is that a hash value of 0 is currently special
in two distinct ways: it's always in batch 0 and bucket 0 regardless
of how many batches and buckets there are, and it's the result of
hashing a null. These two special cases interact in a worst-case
manner, so it seems worthwhile to avoid that.


I think you're right that this is a flaw in general - all it takes is a
sufficiently common value with a hash value falling into the first batch
(i.e. either 0 or starting with a lot of 0 bits, thus giving batchno == 0).

I think this might be solved by relaxing the check a bit. Currently we
do this (see nodeHash.c:735):

 if (nfreed == 0 || nfreed == ninmemory)
 {
 hashtable->growEnabled = false;
 }

which only disables nbatch growth if *all* the tuples remain in the
batch. That's rather strict, and it takes a single tuple to break this.

With each nbatch increase we expect a 50:50 split, i.e. 1/2 the tuples
staying in the batch and 1/2 moved to the new one. So a much higher ratio
is suspicious and most likely means the tuples share the same hash value,
so what if we did something like this:

 if ((nfreed >= 0.9 * (nfreed + ninmemory)) ||
     (nfreed <= 0.1 * (nfreed + ninmemory)))
 {
     hashtable->growEnabled = false;
 }

which disables nbatch growth if either >= 90% of the tuples remained in
the first batch, or >= 90% of the tuples were moved from it. The exact
thresholds might be set a bit differently, but I think 10:90 sounds about
right.

Trivial patch implementing this attached - with it the explain analyze
looks like this:

test=# explain analyze select status, count(*) from q3 left join m3 on
(m3.id=q3.id) group by status;

QUERY PLAN
--
  HashAggregate  (cost=64717.63..64717.67 rows=4 width=4) (actual
time=514.619..514.630 rows=5 loops=1)
Group Key: m3.status
->  Hash Right Join  (cost=18100.00..62217.63 rows=50 width=4)
 (actual time=75.260..467.911 rows=500108 loops=1)
  Hash Cond: (m3.id = q3.id)
  ->  Seq Scan on m3  (cost=0.00..18334.00 rows=100 width=37)
 (actual time=0.003..91.799 rows=100 loops=1)
  ->  Hash  (cost=7943.00..7943.00 rows=50 width=33)
(actual time=74.916..74.916 rows=50 loops=1)
Buckets: 65536 (originally 65536)
Batches: 32 (originally 16)  Memory Usage: 8824kB
->  Seq Scan on q3
(cost=0.00..7943.00 rows=50 width=33)
(actual time=0.005..27.395 rows=50 loops=1)
  Planning time: 0.172 ms
  Execution time: 514.663 ms
(10 rows)

Without the patch this runs in ~240 seconds and the number of batches
explodes to ~131k.

In theory it might happen that there are just a few hash values and all
of them are exactly the same within the first N bits (thus falling into
the first batch), but differ at bit N+1, so the next split would
actually help. But I think we can leave that up to probability, just
like all the hash code.


Anything happen with this, or the patch Andrew posted?
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Really bad blowups with hash outer join and nulls

2015-02-16 Thread Tomas Vondra
On 16.2.2015 03:38, Andrew Gierth wrote:
 Tomas == Tomas Vondra tomas.von...@2ndquadrant.com writes:

 Tomas Improving the estimates is always good, but it's not going to
 Tomas fix the case of non-NULL values (it shouldn't be all that
 Tomas difficult to create such examples with a value whose hash starts
 Tomas with a bunch of zeroes).

 Right now, I can't get it to plan such an example, because (a) if
 there are no stats to work from then the planner makes fairly
 pessimistic assumptions about hash bucket filling, and (b) if
 there _are_ stats to work from, then a frequently-occurring
 non-null value shows up as an MCV and the planner takes that into
 account to calculate bucketsize.

 The problem could only be demonstrated for NULLs because the
 planner was ignoring NULL for the purposes of estimating
 bucketsize, which is correct for all join types except RIGHT and
 FULL (which, iirc, are more recent additions to the hashjoin
 repertoire).

Oh, right, the estimate fix is probably sufficient then.


 If you want to try testing it, you may find this useful:

 select i, hashint8(i) from unnest(array[1474049294, -1779024306,
-1329041947]) u(i);
       i      | hashint8 
  -------------+----------
   1474049294 |        0
  -1779024306 |        0
  -1329041947 |        0
  (3 rows)

 (those are the only three int4 values that hash to exactly 0)

 It's probably possible to construct pathological cases by finding
 a lot of different values with zeros in the high bits of the hash,
 but that's something that wouldn't be likely to happen by chance.

Yeah, it's probably possible, but it's admittedly considerably harder
than I initially thought. For example it could be possible to create
the table with no MCV values but sorted so that all the initial values
have hashvalue=0, triggering (growEnabled=false). But that's rather
unlikely to happen in practice I guess.

A more likely failure scenario is a hash join higher up the plan,
processing results of other joins etc. In that case the estimates will
be tricky, although the planner chooses quite pessimistic defaults in
those cases.

-- 
Tomas Vondra                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Really bad blowups with hash outer join and nulls

2015-02-15 Thread Tom Lane
Andrew Gierth and...@tao11.riddles.org.uk writes:
 A quick test suggests that initializing the hash value to ~0 rather than
 0 has a curious effect: the number of batches still explodes, but the
 performance does not suffer the same way. (I think because almost all
 the batches end up empty.) I think this is worth doing even in the
 absence of a more general solution; nulls are common enough and
 important enough that they shouldn't be the worst-case value if it can
 be avoided.

I think that's unlikely to be a path to a good solution.

At least part of the problem here is that estimate_hash_bucketsize()
supposes that nulls can be ignored --- which is normally true, and
invalidates your claim that they're common.  But in a RIGHT JOIN
situation, they need to be considered as if they were regular keys.
That would probably be sufficient to dissuade the planner from choosing
a hash join in this example.
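To make the idea of treating nulls as regular keys concrete, here is a toy,
self-contained calculation (my own illustration, not PostgreSQL's actual
estimate_hash_bucketsize logic): if nulls are kept, they behave like one more
hash key whose frequency is the column's stanullfrac, so a large null fraction
immediately dominates the worst-case bucket fraction.

#include <stdio.h>

/*
 * Toy worst-case bucket-fraction estimate (illustrative only):
 * the average bucket holds about 1/min(ndistinct, nbuckets) of the
 * rows, and the worst bucket holds at least the frequency of the most
 * common value.  If nulls must be kept (RIGHT/FULL join), NULL acts as
 * one extra value whose frequency is nullfrac.
 */
static double
toy_bucketsize(double ndistinct, double nbuckets,
               double mcvfreq, double nullfrac, int keep_nulls)
{
    double      avgfreq;
    double      maxfreq = mcvfreq;

    if (keep_nulls)
    {
        ndistinct += 1.0;       /* NULL counts as one more value */
        if (nullfrac > maxfreq)
            maxfreq = nullfrac;
    }

    avgfreq = 1.0 / (ndistinct < nbuckets ? ndistinct : nbuckets);
    return (maxfreq > avgfreq) ? maxfreq : avgfreq;
}

int
main(void)
{
    /* q3 from the example: 250000 distinct ids plus 250000 nulls */
    printf("ignoring nulls: %g\n",
           toy_bucketsize(250000, 131072, 1.0 / 250000, 0.5, 0));
    printf("keeping nulls:  %g\n",
           toy_bucketsize(250000, 131072, 1.0 / 250000, 0.5, 1));
    return 0;
}

With nulls ignored the estimated bucket fraction stays tiny, but once the 50%
null fraction is counted it jumps to 0.5, which is the kind of number that
would dissuade the planner from hashing that input.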

There may also be something we can do in the executor, but it would
take closer analysis to figure out what's going wrong.  I don't think
kluging the behavior for NULL in particular is the answer.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Really bad blowups with hash outer join and nulls

2015-02-15 Thread Andrew Gierth
 Tom == Tom Lane t...@sss.pgh.pa.us writes:

  A quick test suggests that initializing the hash value to ~0 rather
  than 0 has a curious effect: the number of batches still explodes,
  but the performance does not suffer the same way. (I think because
  almost all the batches end up empty.) I think this is worth doing
  even in the absence of a more general solution; nulls are common
  enough and important enough that they shouldn't be the worst-case
  value if it can be avoided.

 Tom I think that's unlikely to be a path to a good solution.

It wasn't really intended to be.

 Tom At least part of the problem here is that
 Tom estimate_hash_bucketsize() supposes that nulls can be ignored ---
 Tom which is normally true, and invalidates your claim that they're
 Tom common.  But in a RIGHT JOIN situation, they need to be considered
 Tom as if they were regular keys.  That would probably be sufficient
 Tom to dissuade the planner from choosing a hash join in this example.

I've now tried the attached patch to correct the bucketsize estimates,
and it does indeed stop the planner from considering the offending path
(in this case it just does the join the other way round).

One thing I couldn't immediately see how to do was account for the case
where there are a lot of nulls in the table but a strict qual (or an IS
NOT NULL) filters them out; this patch will be overly pessimistic about
such cases. Do estimates normally try and take things like this into
account? I didn't find any other relevant examples.

 Tom There may also be something we can do in the executor, but it
 Tom would take closer analysis to figure out what's going wrong.  I
 Tom don't think kluging the behavior for NULL in particular is the
 Tom answer.

The point with nulls is that a hash value of 0 is currently special in
two distinct ways: it's always in batch 0 and bucket 0 regardless of how
many batches and buckets there are, and it's the result of hashing a
null.  These two special cases interact in a worst-case manner, so it
seems worthwhile to avoid that.
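To illustrate why, here is a minimal standalone sketch (modeled loosely on
ExecHashGetBucketAndBatch in nodeHash.c, with my own names; not the actual
source) of how a 32-bit hash value is split into a bucket number and a batch
number when nbuckets and nbatch are powers of two. With a hash of 0, every
doubling of nbatch still yields bucket 0 in batch 0:

#include <stdio.h>
#include <stdint.h>

/*
 * Simplified mapping of a hash value to (bucketno, batchno): the low
 * log2_nbuckets bits pick the bucket, the next bits pick the batch.
 * Both counts are powers of two, so modulo is a mask and division is
 * a shift.
 */
static void
get_bucket_and_batch(uint32_t hashvalue,
                     uint32_t nbuckets, int log2_nbuckets,
                     uint32_t nbatch,
                     uint32_t *bucketno, uint32_t *batchno)
{
    *bucketno = hashvalue & (nbuckets - 1);
    *batchno = (nbatch > 1) ?
        ((hashvalue >> log2_nbuckets) & (nbatch - 1)) : 0;
}

int
main(void)
{
    uint32_t    bucketno, batchno, nbatch;

    /* A hash value of 0 (e.g. from a null key) never leaves batch 0. */
    for (nbatch = 8; nbatch <= 262144; nbatch *= 2)
    {
        get_bucket_and_batch(0, 131072, 17, nbatch, &bucketno, &batchno);
        printf("nbatch = %6u: bucket %u, batch %u\n",
               (unsigned) nbatch, (unsigned) bucketno, (unsigned) batchno);
    }
    return 0;
}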

-- 
Andrew (irc:RhodiumToad)

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 020558b..1723fd3 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2505,6 +2505,7 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
 	SpecialJoinInfo *sjinfo,
 	SemiAntiJoinFactors *semifactors)
 {
+	JoinType	jointype = path->jpath.jointype;
 	Path	   *outer_path = path->jpath.outerjoinpath;
 	Path	   *inner_path = path->jpath.innerjoinpath;
 	double		outer_path_rows = outer_path->rows;
@@ -2560,6 +2561,9 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
 		foreach(hcl, hashclauses)
 		{
 			RestrictInfo *restrictinfo = (RestrictInfo *) lfirst(hcl);
+			bool		keep_nulls = (jointype == JOIN_RIGHT
+	  || jointype == JOIN_FULL
+	  || !op_strict(restrictinfo->hashjoinoperator));
 			Selectivity thisbucketsize;
 
 			Assert(IsA(restrictinfo, RestrictInfo));
@@ -2583,7 +2587,8 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
 	thisbucketsize =
 		estimate_hash_bucketsize(root,
 		   get_rightop(restrictinfo->clause),
- virtualbuckets);
+ virtualbuckets,
+ keep_nulls);
 	restrictinfo->right_bucketsize = thisbucketsize;
 }
 			}
@@ -2599,7 +2604,8 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
 	thisbucketsize =
 		estimate_hash_bucketsize(root,
 			get_leftop(restrictinfo->clause),
- virtualbuckets);
+ virtualbuckets,
+ keep_nulls);
 	restrictinfo->left_bucketsize = thisbucketsize;
 }
 			}
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 1ba103c..e8660bc 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3435,9 +3435,13 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows)
  * discourage use of a hash rather strongly if the inner relation is large,
  * which is what we want.  We do not want to hash unless we know that the
  * inner rel is well-dispersed (or the alternatives seem much worse).
+ *
+ * If keep_nulls is true, then we have to treat nulls as a real value, so
+ * adjust all results according to stanullfrac if available.
  */
 Selectivity
-estimate_hash_bucketsize(PlannerInfo *root, Node *hashkey, double nbuckets)
+estimate_hash_bucketsize(PlannerInfo *root, Node *hashkey, double nbuckets,
+		 bool keep_nulls)
 {
 	VariableStatData vardata;
 	double		estfract,
@@ -3454,13 +3458,6 @@ estimate_hash_bucketsize(PlannerInfo *root, Node *hashkey, double nbuckets)
 	/* Get number of distinct values */
 	ndistinct = get_variable_numdistinct(&vardata, &isdefault);
 
-	/* If ndistinct isn't real, punt and return 0.1, per comments above */
-	if (isdefault)
-	{
-		ReleaseVariableStats(vardata);
-		return (Selectivity) 0.1;
-	}
-
 	/* Get fraction that 

Re: [HACKERS] Really bad blowups with hash outer join and nulls

2015-02-15 Thread Tomas Vondra
Hi,

On 16.2.2015 00:50, Andrew Gierth wrote:
 Tom == Tom Lane t...@sss.pgh.pa.us writes:

 I've now tried the attached patch to correct the bucketsize
 estimates, and it does indeed stop the planner from considering the
 offending path (in this case it just does the join the other way
 round).
 
 One thing I couldn't immediately see how to do was account for the
 case where there are a lot of nulls in the table but a strict qual
 (or an IS NOT NULL) filters them out; this patch will be overly
 pessimistic about such cases. Do estimates normally try and take
 things like this into account? I didn't find any other relevant
 examples.

Improving the estimates is always good, but it's not going to fix the
case of non-NULL values (it shouldn't be all that difficult to create
such examples with a value whose hash starts with a bunch of zeroes).

  Tom There may also be something we can do in the executor, but it
  Tom would take closer analysis to figure out what's going wrong.  I
  Tom don't think kluging the behavior for NULL in particular is the
  Tom answer.
 
 The point with nulls is that a hash value of 0 is currently special
 in two distinct ways: it's always in batch 0 and bucket 0 regardless
 of how many batches and buckets there are, and it's the result of
 hashing a null. These two special cases interact in a worst-case
 manner, so it seems worthwhile to avoid that.

I think you're right that this is a flaw in general - all it takes is a
sufficiently common value with a hash value falling into the first batch
(i.e. either 0 or starting with a lot of 0 bits, thus giving batchno == 0).

I think this might be solved by relaxing the check a bit. Currently we
do this (see nodeHash.c:735):

    if (nfreed == 0 || nfreed == ninmemory)
    {
        hashtable->growEnabled = false;
    }

which only disables nbatch growth if *all* the tuples remain in the
batch. That's rather strict, and it takes a single tuple to break this.

With each nbatch increase we expect a 50:50 split, i.e. 1/2 the tuples
staying in the batch and 1/2 moved to the new one. So a much higher ratio
is suspicious and most likely means the tuples share the same hash value,
so what if we did something like this:

    if ((nfreed >= 0.9 * (nfreed + ninmemory)) ||
        (nfreed <= 0.1 * (nfreed + ninmemory)))
    {
        hashtable->growEnabled = false;
    }

which disables nbatch growth if either >= 90% of the tuples remained in
the first batch, or >= 90% of the tuples were moved from it. The exact
thresholds might be set a bit differently, but I think 10:90 sounds about
right.

Trivial patch implementing this attached - with it the explain analyze
looks like this:

test=# explain analyze select status, count(*) from q3 left join m3 on
(m3.id=q3.id) group by status;

   QUERY PLAN
--
 HashAggregate  (cost=64717.63..64717.67 rows=4 width=4) (actual
time=514.619..514.630 rows=5 loops=1)
   Group Key: m3.status
   ->  Hash Right Join  (cost=18100.00..62217.63 rows=50 width=4)
(actual time=75.260..467.911 rows=500108 loops=1)
 Hash Cond: (m3.id = q3.id)
 ->  Seq Scan on m3  (cost=0.00..18334.00 rows=100 width=37)
(actual time=0.003..91.799 rows=100 loops=1)
 ->  Hash  (cost=7943.00..7943.00 rows=50 width=33)
   (actual time=74.916..74.916 rows=50 loops=1)
   Buckets: 65536 (originally 65536)
   Batches: 32 (originally 16)  Memory Usage: 8824kB
   ->  Seq Scan on q3
   (cost=0.00..7943.00 rows=50 width=33)
   (actual time=0.005..27.395 rows=50 loops=1)
 Planning time: 0.172 ms
 Execution time: 514.663 ms
(10 rows)

Without the patch this runs in ~240 seconds and the number of batches
explodes to ~131k.

In theory it might happen that there are just a few hash values and all
of them are exactly the same within the first N bits (thus falling into
the first batch), but differ at bit N+1, so the next split would
actually help. But I think we can leave that up to probability, just
like all the hash code.


-- 
Tomas Vondra                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index abd70b3..6b3f471 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -732,7 +732,8 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	 * group any more finely. We have to just gut it out and hope the server
 	 * has enough RAM.
 	 */
-	if (nfreed == 0 || nfreed == ninmemory)
+	if ((nfreed >= 0.9 * (nfreed + ninmemory)) ||
+		(nfreed <= 0.1 * (nfreed + ninmemory)))
 	{
 		hashtable->growEnabled = false;
 #ifdef HJDEBUG

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Really bad blowups with hash outer join and nulls

2015-02-15 Thread Andrew Gierth
This came up today on IRC, though I suspect the general problem has been
seen before:

create table m3 (id uuid, status integer);
create table q3 (id uuid);
insert into m3
  select uuid_generate_v4(), floor(random() * 4)::integer
    from generate_series(1,1000000);
insert into q3
  select id
from (select id, row_number() over () as rnum from m3) s
   where (rnum % 4)=0;
insert into q3 select null from generate_series(1,250000);

-- at this point m3 has a million unique rows, while q3 has
-- a quarter of the ids of m3, plus as many nulls

-- note that work_mem is at the default 4MB

explain analyze select status, count(*) from q3 left join m3 on (m3.id=q3.id) 
group by status;
  QUERY PLAN
  
--
 HashAggregate  (cost=51553.62..51553.67 rows=4 width=4) (actual 
time=259580.168..259580.179 rows=5 loops=1)
   Group Key: m3.status
   ->  Hash Right Join  (cost=12927.31..49596.89 rows=391347 width=4) (actual 
time=18804.502..258498.897 rows=50 loops=1)
 Hash Cond: (m3.id = q3.id)
 ->  Seq Scan on m3  (cost=0.00..16753.10 rows=1038310 width=20) 
(actual time=0.011..1618.762 rows=100 loops=1)
 ->  Hash  (cost=6124.47..6124.47 rows=391347 width=16) (actual 
time=1742.340..1742.340 rows=50 loops=1)
   Buckets: 131072 (originally 131072)  Batches: 65536 (originally 
8)  Memory Usage: 8837kB
   ->  Seq Scan on q3  (cost=0.00..6124.47 rows=391347 width=16) 
(actual time=0.033..774.732 rows=50 loops=1)
 Planning time: 0.178 ms
 Execution time: 259583.185 ms

The memory usage of this query runs to hundreds of megs despite the
small work_mem.

On 9.3.5, which the user was running, the problem was much more extreme
(this is with the default 1MB work_mem, and a data set only a quarter of
the size):

explain analyze select status, count(*) from q3 left join m3 on (m3.id=q3.id) 
group by status;
  QUERY PLAN
   
---
 HashAggregate  (cost=15020.11..15022.11 rows=200 width=4) (actual 
time=3733245.942..3733245.952 rows=5 loops=1)
   ->  Hash Right Join  (cost=3976.50..14395.11 rows=125000 width=4) (actual 
time=227870.000..3732686.692 rows=125000 loops=1)
 Hash Cond: (m3.id = q3.id)
 ->  Seq Scan on m3  (cost=0.00..4189.59 rows=259659 width=20) (actual 
time=0.024..807.162 rows=25 loops=1)
 ->  Hash  (cost=1803.00..1803.00 rows=125000 width=16) (actual 
time=437.991..437.991 rows=125000 loops=1)
   Buckets: 4096  Batches: 262144 (originally 8)  Memory Usage: 
1954kB
   ->  Seq Scan on q3  (cost=0.00..1803.00 rows=125000 width=16) 
(actual time=0.011..191.208 rows=125000 loops=1)
 Total runtime: 3733269.725 ms
(8 rows)

This query on 9.3.5 allocated over 2.7GB of RAM on my system.

Note the explosion in the number of batches. Tracing execution showed
that as the memory limit was reached while processing the large number
of nulls, the number of batches would be doubled, then only a tiny
number of rows could be written out as being no longer in the current
batch, so the limit was then quickly reached again.

A key part of the problem here may be the fact that nulls hash to
exactly 0, and are therefore always part of the initial in-memory batch
regardless of how much splitting happens. If a large subset of the data
had some other constant value, it would likely have some non-zero bits
in its hash value and therefore stand a good chance of being written out
to a batch file before things get too desperate, avoiding the mass
explosion.
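
As a side note, here is a much-simplified, self-contained sketch (in the
spirit of ExecHashGetHashValue in nodeHash.c, with my own names and none of
the executor machinery) of why a null key ends up with a hash of exactly 0:
the running hash starts at 0, each non-null column rotates it and XORs in
that column's hash, and a null column contributes nothing, so a single null
join key leaves the value at 0.

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/*
 * Simplified multi-column hash combination: start at 0, rotate left
 * one bit per column, and XOR in the column's hash unless the column
 * is null.  A single-column null key therefore yields exactly 0.
 */
static uint32_t
combine_hashkeys(const uint32_t *colhashes, const bool *isnull, int ncols)
{
    uint32_t    hashkey = 0;
    int         i;

    for (i = 0; i < ncols; i++)
    {
        /* rotate hashkey left 1 bit at each step */
        hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);

        if (!isnull[i])
            hashkey ^= colhashes[i];
        /* a null column leaves hashkey unmodified */
    }
    return hashkey;
}

int
main(void)
{
    uint32_t    h[1] = {0};
    bool        n[1] = {true};      /* the single join key is NULL */

    printf("hash of a null key: %u\n",
           (unsigned) combine_hashkeys(h, n, 1));
    return 0;
}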

A quick test suggests that initializing the hash value to ~0 rather than
0 has a curious effect: the number of batches still explodes, but the
performance does not suffer the same way. (I think because almost all
the batches end up empty.) I think this is worth doing even in the
absence of a more general solution; nulls are common enough and
important enough that they shouldn't be the worst-case value if it can
be avoided.

(Testing this idea revealed that a regression test for rowsecurity is
missing an order by clause, so its result changes based on hash values.)

-- 
Andrew (irc:RhodiumToad)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Really bad blowups with hash outer join and nulls

2015-02-15 Thread Andrew Gierth
 Tomas == Tomas Vondra tomas.von...@2ndquadrant.com writes:

 Tomas Improving the estimates is always good, but it's not going to
 Tomas fix the case of non-NULL values (it shouldn't be all that
 Tomas difficult to create such examples with a value whose hash starts
 Tomas with a bunch of zeroes).

Right now, I can't get it to plan such an example, because (a) if there
are no stats to work from then the planner makes fairly pessimistic
assumptions about hash bucket filling, and (b) if there _are_ stats to
work from, then a frequently-occurring non-null value shows up as an MCV
and the planner takes that into account to calculate bucketsize.

The problem could only be demonstrated for NULLs because the planner was
ignoring NULL for the purposes of estimating bucketsize, which is
correct for all join types except RIGHT and FULL (which, iirc, are more
recent additions to the hashjoin repertoire).

If you want to try testing it, you may find this useful:

select i, hashint8(i) from unnest(array[1474049294, -1779024306, -1329041947]) 
u(i);
      i      | hashint8 
-------------+----------
  1474049294 |        0
 -1779024306 |        0
 -1329041947 |        0
(3 rows)

(those are the only three int4 values that hash to exactly 0)

It's probably possible to construct pathological cases by finding a lot
of different values with zeros in the high bits of the hash, but that's
something that wouldn't be likely to happen by chance.

 Tomas I think this might be solved by relaxing the check a bit.

Yeah, that looks potentially useful.

-- 
Andrew (irc:RhodiumToad)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers