Re: [HACKERS] PATCH: postpone building buckets to the end of Hash (in HashJoin)

2016-01-19 Thread Tomas Vondra



On 01/19/2016 08:34 PM, Robert Haas wrote:

On Mon, Jan 18, 2016 at 10:57 PM, Tomas Vondra wrote:

If this doesn't regress performance in the case where the number of
buckets is estimated accurately to begin with, then I think this is
a great idea. Can you supply some performance tests results for that
case, and maybe some of the other cases also?


I don't see how it could regress performance, and the benchmarks I've
done confirm that. I'll do more thorough benchmarking and post the
results here, but not now as this patch is in 2016-01 CF and I want to
put all my time into reviewing patches from the open commitfest.


I've finally got to do more thorough benchmarks, and surprisingly there
seems to be a minor regression.

...


So even if we entirely skipped the bucket build, we would not recover the
difference. Not sure what's going on here.

I've also done some profiling using perf, but I haven't really found
anything suspicious there.

Any ideas what might be the cause of this?


My guess is that memory locality is not as good with this patch.  Consider this:

copyTuple->next = hashtable->buckets[bucketno];

We've only just populated copytuple via memcpy, so that cache line is
presumably in whatever place cache lines go when they are really hot.
We basically make one pass over the allocated space for the hash
table, filling in the hash tuples and linking things into bucket
chains.  OTOH, with the patch, we don't do that on the first pass,
just writing the tuples.  Then when we come back through we still have
to do that part:

hashTuple->next = hashtable->buckets[bucketno];

By the time we're doing this, especially at larger work_mem settings,
we've probably pushed those cache lines out of the faster levels of
the CPU cache, and we have to pull them back in again, and that takes
time.   If the tuples are small, we'll dirty every cache line twice;
if they're larger, we might not dirty every cache line on the second
pass, but just some fraction of them.

I could be totally off base, but this is why I was concerned about the patch.


I can totally see why this would slow down the BuildBuckets function, 
but I don't quite see why it should make the other code significantly 
slower. Yet BuildBuckets takes just ~25ms while the total duration 
increases by ~200ms (for the 1x10 dataset).


I do understand calling BuildBuckets may affect the code executed after 
that, as it keeps other data in the cache, but I would not expect ~175ms.


Yet another thing is that BuildBuckets accesses the tuples in a mostly 
sequential way, so I'd expect the hardware prefetcher to take care of that.
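
For illustration, here is a minimal self-contained sketch of what a 
function like ExecHashBuildBuckets() plausibly does (simplified stand-in 
types, alignment padding omitted - not the actual nodeHash.c code). The 
tuple reads walk the dense chunks sequentially, which the prefetcher 
should handle well, but each iteration still reads and writes one 
effectively random bucket head:

#include <stddef.h>
#include <stdint.h>

/* simplified stand-ins for the executor's structs */
typedef struct HashTuple
{
	struct HashTuple *next;		/* bucket chain link */
	uint32_t	hashvalue;
	uint32_t	len;			/* length of the tuple body */
	char		data[];			/* tuple body, densely packed */
} HashTuple;

typedef struct Chunk
{
	struct Chunk *next;			/* next chunk of the dense allocator */
	size_t		used;			/* bytes used in data[] */
	char		data[];
} Chunk;

static void
build_buckets(HashTuple **buckets, uint32_t nbuckets, Chunk *chunks)
{
	for (Chunk *chunk = chunks; chunk != NULL; chunk = chunk->next)
	{
		size_t		idx = 0;

		while (idx < chunk->used)
		{
			/* sequential, prefetch-friendly read of the next tuple */
			HashTuple  *tup = (HashTuple *) (chunk->data + idx);
			uint32_t	bucketno = tup->hashvalue & (nbuckets - 1);

			/* ... but a random read + write of the bucket head */
			tup->next = buckets[bucketno];
			buckets[bucketno] = tup;

			idx += sizeof(HashTuple) + tup->len;
		}
	}
}

So the open question is whether the random bucket-head updates alone can 
account for the slowdown.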


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] PATCH: postpone building buckets to the end of Hash (in HashJoin)

2016-01-19 Thread Robert Haas
On Mon, Jan 18, 2016 at 10:57 PM, Tomas Vondra wrote:
>>> If this doesn't regress performance in the case where the number of
>>> buckets is estimated accurately to begin with, then I think this is
>>> a great idea. Can you supply some performance tests results for that
>>> case, and maybe some of the other cases also?
>>
>> I don't see how it could regress performance, and the benchmarks I've
>> done confirm that. I'll do more thorough benchmarking and post the
>> results here, but not now as this patch is in 2016-01 CF and I want to
>> put all my time into reviewing patches from the open commitfest.
>
> I've finally got to do more thorough benchmarks, and surprisingly there
> seems to be a minor regression.
...
>
> So even if we entirely skipped the bucket build, we would not recover the
> difference. Not sure what's going on here.
>
> I've also done some profiling using perf, but I haven't really found
> anything suspicious there.
>
> Any ideas what might be the cause of this?

My guess is that memory locality is not as good with this patch.  Consider this:

copyTuple->next = hashtable->buckets[bucketno];

We've only just populated copytuple via memcpy, so that cache line is
presumably in whatever place cache lines go when they are really hot.
We basically make one pass over the allocated space for the hash
table, filling in the hash tuples and linking things into bucket
chains.  OTOH, with the patch, we don't do that on the first pass,
just writing the tuples.  Then when we come back through we still have
to do that part:

hashTuple->next = hashtable->buckets[bucketno];

By the time we're doing this, especially at larger work_mem settings,
we've probably pushed those cache lines out of the faster levels of
the CPU cache, and we have to pull them back in again, and that takes
time.   If the tuples are small, we'll dirty every cache line twice;
if they're larger, we might not dirty every cache line on the second
pass, but just some fraction of them.
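
To make the two patterns concrete, here is a minimal sketch (simplified, 
hypothetical types - not the actual executor code):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct HashTuple
{
	struct HashTuple *next;		/* bucket chain link */
	uint32_t	hashvalue;
	char		data[];			/* tuple body */
} HashTuple;

/*
 * master: the tuple is linked into its bucket right after the memcpy,
 * while its cache line is still hot.
 */
static void
insert_eager(HashTuple **buckets, uint32_t nbuckets,
			 HashTuple *copyTuple, const void *src, size_t len)
{
	uint32_t	bucketno = copyTuple->hashvalue & (nbuckets - 1);

	memcpy(copyTuple->data, src, len);		/* the line is hot here ... */
	copyTuple->next = buckets[bucketno];	/* ... so this write is cheap */
	buckets[bucketno] = copyTuple;
}

/*
 * patched: the linking pass runs long after the tuples were written, so
 * at larger work_mem each line has likely been evicted and gets dirtied
 * a second time.
 */
static void
link_deferred(HashTuple **buckets, uint32_t nbuckets,
			  HashTuple **tuples, size_t ntuples)
{
	for (size_t i = 0; i < ntuples; i++)
	{
		uint32_t	bucketno = tuples[i]->hashvalue & (nbuckets - 1);

		tuples[i]->next = buckets[bucketno];	/* likely a cold line by now */
		buckets[bucketno] = tuples[i];
	}
}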

I could be totally off base, but this is why I was concerned about the patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] PATCH: postpone building buckets to the end of Hash (in HashJoin)

2016-01-19 Thread Robert Haas
On Tue, Jan 19, 2016 at 4:49 PM, Tomas Vondra wrote:
> I can totally see why this would slow down the BuildBuckets function, but I
> don't quite see why it should make the other code significantly slower. Yet
> BuildBuckets takes just ~25ms while the total duration increases by ~200ms
> (for the 1x10 dataset).
>
> I do understand calling BuildBuckets may affect the code executed after that,
> as it keeps other data in the cache, but I would not expect ~175ms.

I don't know.  There could be some other explanation, but I don't have
a guess as to what it is.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] PATCH: postpone building buckets to the end of Hash (in HashJoin)

2016-01-18 Thread Tomas Vondra

Hi,

On 12/17/2015 10:28 PM, Tomas Vondra wrote:

Hi,

On 12/17/2015 07:20 PM, Robert Haas wrote:
...


If this doesn't regress performance in the case where the number of
buckets is estimated accurately to begin with, then I think this is
a great idea. Can you supply some performance tests results for that
case, and maybe some of the other cases also?


I don't see how it could regress performance, and the benchmarks I've
done confirm that. I'll do more thorough benchmarking and post the
results here, but not now as this patch is in 2016-01 CF and I want to
put all my time into reviewing patches from the open commitfest.


I've finally got to do more thorough benchmarks, and surprisingly there 
seems to be a minor regression. The scripts I've used are attached, 
along with the results; in short, the benchmark joins two tables in 
different size combinations:


  1M x 10M
  10M x 100M
  5M x 250M

with different work_mem sizes (so that the smallest value uses a bunch 
of batches while the largest value uses a single batch).


The 1x10 and 10x100 are designed to fit into RAM on the machine 
(including the temporary files in batching mode), while 5x250 is 
designed to specifically force the temp files to disk (but it's on a 
fast SSD array, so not a big deal).


Average timings for current master and the first two patches of the 
series (0001 postpones the build of buckets and 0002 always starts 
without batching) look like this (the timings are in milliseconds):


    size   work_mem    master    postpone    no-batch
  ----------------------------------------------------
    1x10          8      5735        5760        6107
                 32      5939        5955        6124
                128      4852        5038        5175
  ----------------------------------------------------
   5x250         64    158512      158429      159008
                256    144046      144584      144994
               1024    111478      117529      116933
  ----------------------------------------------------
  10x100         64     68389       68278       68530
                256     66693       66415       66605
               1024     48606       50654       50490

So compared to master, the differences look like this:

    size   work_mem    postpone    no-batch
  -------------------------------------------
    1x10          8       0.44%       6.50%
                 32       0.28%       3.13%
                128       3.83%       6.65%
  -------------------------------------------
   5x250         64      -0.05%       0.31%
                256       0.37%       0.66%
               1024       5.43%       4.89%
  -------------------------------------------
  10x100         64      -0.16%       0.21%
                256      -0.42%      -0.13%
               1024       4.21%       3.88%

So for the first patch (postponing the build of buckets) with batching, 
the difference is rather negligible, possibly within noise (~0.5%). When 
the hash join runs with a single batch, however, the difference gets 
much more significant, 4-5%.


I'm not going to discuss the second patch for now, but naturally it has 
the same issue and apparently also some additional ones (at least on the 
1x10 data set).


I have spent a fair amount of time profiling this, and I can't wrap my 
head around where the difference comes from, though. Essentially we 
don't do any additional work, we've just moved some of it to a 
different place in the code.


It might change the memory access pattern a bit, but the original 
pattern is not exactly sequential either, as we're doing random updates 
of an array of pointers. The patch merely moves those updates into the 
BuildBuckets() method, so if something consumes more time, it should be 
that method, I believe.


So I've measured how much time the function takes for the 1x10 and 
10x100 datasets, and it's ~25ms and ~290ms respectively. That's way less 
than the actual difference in query runtime, which is ~190ms and ~2050ms.


So even if we entirely skipped the bucket build, we would not recover 
the difference. Not sure what's going on here.


I've also done some profiling using perf, but I haven't really found 
anything suspicious there.


Any ideas what might be the cause of this?

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


hash-scripts.tgz
Description: application/compressed-tar


hashes-delayed.ods
Description: application/vnd.oasis.opendocument.spreadsheet



Re: [HACKERS] PATCH: postpone building buckets to the end of Hash (in HashJoin)

2015-12-31 Thread Tomas Vondra

Hi,

attached is v2 of the patch, with a bugfix and two significant improvements:

1) bugfix: forgotten memset() in ExecHashIncreaseNumBatches()

   Caused segfaults whenever we started with a single batch and then
   had to increase the number of batches.

2) 0002: postpone the batching (not just the build of buckets)

   When ExecChooseHashTableSize finds out we'll probably need batching,
   we start with nbatch=1 anyway, and only switch to batching (with the
   estimated number of batches) after filling work_mem. A minimal
   sketch of this is included after this list.

   This helps in two basic cases - firstly, when we over-estimate, we
   may end up batching even when we could do with a single batch (i.e.
   without writing to temp files at all). That's really expensive, and
   this helps with that - not entirely, because we can only distinguish
   the "no batching vs. batching" case, not the number of batches, but
   that's the more interesting case anyway (the difference between
   batching with 16 and 32 batches is not large, at least in my
   experience).

   The other use case is the patch adding bloom filters, because this
   change allows properly sizing the bloom filter, which needs the
   number of distinct values. While the filter might be sized using the
   estimated number of tuples passed to the Hash, it's likely to be
   larger than needed and thus more expensive. Proper sizing of the
   bloom filter is crucial, so this change allows first accumulating
   enough data to estimate its size. More discussion in the other
   patch.

3) 0003: eliminate the MaxAlloc limit

   Currently the size of the bucket array is limited by MaxAllocSize,
   which means it can't exceed 512MB (assuming I've done my math
   correctly), i.e. ~67M buckets at 8 bytes per bucket pointer. This
   removes the MaxAllocSize limit and just keeps the (INT_MAX/2) limit,
   so ~2B rows. I don't think it's worth extending further at this
   point.

   There was a discussion about this, quoting the f2fc98fb message:

  Limit the size of the hashtable pointer array to not more than
  MaxAllocSize, per reports from Kouhei Kaigai and others of
  "invalid memory alloc request size" failures.  There was
  discussion of allowing the array to get larger than that by using
  the "huge" palloc API, but so far no proof that that is actually
  a good idea, and at this point in the 9.5 cycle major changes
  from old behavior don't seem like the way to go.

   I believe the objections were correct, because it'd really waste a
   lot of memory in case of over-estimates. I do however think that
   doing (1) and (2) fixes this issue, because with those changes we
   never allocate the buckets based on the initial estimate. We pretty
   much get the exact number of buckets necessary.
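
To make the idea in (2) concrete, here is a minimal self-contained
sketch of the deferred switch to batching (hypothetical names and a
stubbed-out spill routine - not the actual patch):

#include <stddef.h>

typedef struct HashTableState
{
	size_t		spaceUsed;		/* bytes of tuples currently in memory */
	size_t		spaceAllowed;	/* the work_mem budget */
	int			nbatch;			/* stays 1 until we actually overflow */
	int			nbatch_planned;	/* the planner's estimate, held in reserve */
} HashTableState;

/* stub: move tuples that hash to non-current batches out to temp files */
static void
repartition_to_batches(HashTableState *ht)
{
	(void) ht;					/* the real logic lives in nodeHash.c */
}

/*
 * Called for every inserted tuple.  Batching kicks in only when the
 * budget is really exceeded, so an over-estimated input that in fact
 * fits in work_mem never pays the temp-file cost.
 */
static void
account_for_tuple(HashTableState *ht, size_t tuple_size)
{
	ht->spaceUsed += tuple_size;

	if (ht->nbatch == 1 &&
		ht->nbatch_planned > 1 &&
		ht->spaceUsed > ht->spaceAllowed)
	{
		ht->nbatch = ht->nbatch_planned;	/* fall back to the estimate... */
		repartition_to_batches(ht);			/* ...and spill the overflow */
	}
}

The planner's estimate is kept only as a fallback; the decision to pay
the temp-file cost is driven by the observed size of the input.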

I haven't done any performance testing yet (well, I did, but not 
reliable enough for sharing here). I'll post more data early January, 
once the machine completes other tests it's running right now.


I've also been thinking about what other optimizations might be 
possible, and the one thing I'm wondering about is adding a HyperLogLog 
counter. The thing is that we size the buckets based on the number of 
rows, but that may be nonsense - to size the hash table, we actually 
need the number of distinct values. So when the Hash contains duplicate 
rows (say 10 rows for each value), we end up with only ~10% of the 
buckets containing any data (each holding a list of about 10 tuples).


If we knew the number of distinct values, we could use fewer buckets 
and thus reduce the memory requirements significantly.


I mention this because the patch with bloom filters actually uses HLL 
to size the bloom filters, so this would not really be introducing any 
new code.
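
A minimal sketch of how that could look, reusing the existing
hyperloglog.c API (initHyperLogLog / addHyperLogLog /
estimateHyperLogLog / freeHyperLogLog); the surrounding function names
are made up, and the caps applied by the real sizing code are omitted:

#include "postgres.h"
#include "lib/hyperloglog.h"

static hyperLogLogState hll;

/* before the build: 2^10 registers give roughly 3% relative error */
void
ndistinct_start(void)
{
	initHyperLogLog(&hll, 10);
}

/* for every tuple hashed into the table */
void
ndistinct_add(uint32 hashvalue)
{
	addHyperLogLog(&hll, hashvalue);
}

/* at the end of the build: size buckets from ndistinct, not ntuples */
int
ndistinct_nbuckets(void)
{
	double		ndistinct = estimateHyperLogLog(&hll);
	int			nbuckets = 1024;	/* the usual lower bound */

	while (nbuckets < ndistinct)	/* round up to a power of two */
		nbuckets <<= 1;

	freeHyperLogLog(&hll);
	return nbuckets;
}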



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 3aaf27f722e6c90702af8f51fd11a95b8c173c38 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Sun, 27 Dec 2015 18:25:51 +0100
Subject: [PATCH 1/5] delayed build of hash buckets v2

Removes forgotten memset in ExecHashIncreaseNumBatches() causing
crashes when we start without batching and find out we need to
do batching later.
---
 src/backend/commands/explain.c  |  8 +---
 src/backend/executor/nodeHash.c | 94 +
 src/include/executor/hashjoin.h |  5 ---
 3 files changed, 32 insertions(+), 75 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 12dae77..8fd9c9f 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2116,21 +2116,17 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		if (es->format != EXPLAIN_FORMAT_TEXT)
 		{
 			ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es);
-			ExplainPropertyLong("Original Hash Buckets",
-hashtable->nbuckets_original, es);
 			ExplainPropertyLong("Hash Batches", hashtable->nbatch, es);
 			ExplainPropertyLong("Original Hash Batches",
 	

Re: [HACKERS] PATCH: postpone building buckets to the end of Hash (in HashJoin)

2015-12-17 Thread Robert Haas
On Mon, Dec 14, 2015 at 3:04 PM, Tomas Vondra wrote:
> attached is v1 of one of the hashjoin improvements mentioned in September in
> the lengthy thread [1].
>
> The main objection against simply removing the MaxAllocSize check (and
> switching to MemoryContextAllocHuge) is that if the number of rows is
> overestimated, we may consume significantly more memory than necessary.
>
> We run into this issue because we allocate the buckets at the very
> beginning, based on the estimate. I've noticed we don't really need to do
> that - we don't really use the buckets until after the Hash node completes,
> and we don't even use them when incrementing the number of batches (we use the
> dense allocation for that).
>
> So this patch removes this - it postpones allocating the buckets to the end
> of MultiExecHash(), and at that point we know pretty well what is the
> optimal number of buckets.
>
> This makes tracking nbuckets_optimal/log2_nbuckets_optimal unnecessary, as
> we can simply use nbuckets/log2_nbuckets for that purpose. I've also removed
> nbuckets_original, but maybe that's not a good idea and we want to keep that
> information (OTOH we never use that number of buckets).
>
> This patch does not change the estimation in ExecChooseHashTableSize() at
> all, because we still need to do that to get nbucket/nbatch. Maybe this is
> another opportunity for improvement in case of overestimates, because in
> that case it may happen that we do batching even when we could do without
> it. So we might start with nbuckets=1024 and nbatches=1, and only switch to
> the estimated number of batches if really needed.
>
> This patch also does not mess with the allocation, i.e. it still uses the
> MaxAllocSize limit (which amounts to ~256MB due to the doubling, IIRC), but
> it should make it easier to do that change.

If this doesn't regress performance in the case where the number of
buckets is estimated accurately to begin with, then I think this is a
great idea.  Can you supply some performance tests results for that
case, and maybe some of the other cases also?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] PATCH: postpone building buckets to the end of Hash (in HashJoin)

2015-12-17 Thread Tomas Vondra

Hi,

On 12/17/2015 07:20 PM, Robert Haas wrote:
...


If this doesn't regress performance in the case where the number of
buckets is estimated accurately to begin with, then I think this is
a great idea. Can you supply some performance tests results for that
case, and maybe some of the other cases also?


I don't see how it could regress performance, and the benchmarks I've 
done confirm that. I'll do more thorough benchmarking and post the 
results here, but not now as this patch is in 2016-01 CF and I want to 
put all my time into reviewing patches from the open commitfest.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




[HACKERS] PATCH: postpone building buckets to the end of Hash (in HashJoin)

2015-12-14 Thread Tomas Vondra

Hi,

attached is v1 of one of the hashjoin improvements mentioned in 
September in the lengthy thread [1].


The main objection against simply removing the MaxAllocSize check (and 
switching to MemoryContextAllocHuge) is that if the number of rows is 
overestimated, we may consume significantly more memory than necessary.


We run into this issue because we allocate the buckets at the very 
beginning, based on the estimate. I've noticed we don't really need to 
do that - we don't really use the buckets until after the Hash node 
completes, and we don't even use them when incrementing the number of 
batches (we use the dense allocation for that).


So this patch removes this - it postpones allocating the buckets to the 
end of MultiExecHash(), and at that point we know pretty well what is 
the optimal number of buckets.


This makes tracking nbuckets_optimal/log2_nbuckets_optimal unnecessary, 
as we can simply use nbuckets/log2_nbuckets for that purpose. I've also 
removed nbuckets_original, but maybe that's not a good idea and we want 
to keep that information (OTOH we never use that number of buckets).


This patch does not change the estimation in ExecChooseHashTableSize() 
at all, because we still need to do that to get nbucket/nbatch. Maybe 
this is another opportunity for improvement in case of overestimates, 
because in that case it may happen that we do batching even when we 
could do without it. So we might start with nbuckets=1024 and 
nbatches=1, and only switch to the estimated number of batches if really 
needed.


This patch also does not mess with the allocation, i.e. it still uses 
the MaxAllocSize limit (which amounts to ~256MB due to the doubling, 
IIRC), but it should make it easier to do that change.


[1] http://www.postgresql.org/message-id/flat/19746.1443983...@sss.pgh.pa.us

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 12dae77..8fd9c9f 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2116,21 +2116,17 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		if (es->format != EXPLAIN_FORMAT_TEXT)
 		{
 			ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es);
-			ExplainPropertyLong("Original Hash Buckets",
-hashtable->nbuckets_original, es);
 			ExplainPropertyLong("Hash Batches", hashtable->nbatch, es);
 			ExplainPropertyLong("Original Hash Batches",
 hashtable->nbatch_original, es);
 			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 		}
-		else if (hashtable->nbatch_original != hashtable->nbatch ||
- hashtable->nbuckets_original != hashtable->nbuckets)
+		else if (hashtable->nbatch_original != hashtable->nbatch)
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-			 "Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+			 "Buckets: %d Batches: %d (originally %d)  Memory Usage: %ldkB\n",
 			 hashtable->nbuckets,
-			 hashtable->nbuckets_original,
 			 hashtable->nbatch,
 			 hashtable->nbatch_original,
 			 spacePeakKb);
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 5e05ec3..b0cf97d 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -39,7 +39,7 @@
 
 
 static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
-static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
+static void ExecHashBuildBuckets(HashJoinTable hashtable);
 static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node,
 	  int mcvsToUse);
 static void ExecHashSkewTableInsert(HashJoinTable hashtable,
@@ -129,9 +129,8 @@ MultiExecHash(HashState *node)
 		}
 	}
 
-	/* resize the hash table if needed (NTUP_PER_BUCKET exceeded) */
-	if (hashtable->nbuckets != hashtable->nbuckets_optimal)
-		ExecHashIncreaseNumBuckets(hashtable);
+	/* Construct the actual hash table (using the optimal number of buckets). */
+	ExecHashBuildBuckets(hashtable);
 
 	/* Account for the buckets in spaceUsed (reported in EXPLAIN ANALYZE) */
 	hashtable->spaceUsed += hashtable->nbuckets * sizeof(HashJoinTuple);
@@ -283,10 +282,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	 */
 	hashtable = (HashJoinTable) palloc(sizeof(HashJoinTableData));
 	hashtable->nbuckets = nbuckets;
-	hashtable->nbuckets_original = nbuckets;
-	hashtable->nbuckets_optimal = nbuckets;
 	hashtable->log2_nbuckets = log2_nbuckets;
-	hashtable->log2_nbuckets_optimal = log2_nbuckets;
 	hashtable->buckets = NULL;
 	hashtable->keepNulls = keepNulls;
 	hashtable->skewEnabled = false;
@@ -372,18 +368,11 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	}
 
 	/*
-	 * Prepare context for the first-scan space allocations; allocate the
-	 * hashbucket array therein, and set each bucket