Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-10-13 Thread Kevin Grittner
Kevin Grittner kgri...@ymail.com wrote:

 Since both Heikki and Robert spent time on this patch earlier,
 I'll give either of them a shot at committing it if they want;
 otherwise I'll do it.

Done.  Thanks, Tomas!

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-10-10 Thread Tomas Vondra
Hi!

On 9.10.2014 22:28, Kevin Grittner wrote:
 Tomas Vondra t...@fuzzy.cz wrote:

 The only case I've been able to come up with is when the hash table
 fits into work_mem only thanks to not counting the buckets. The new
 code will start batching in this case.
 
 Hmm. If you look at the timings in my posting from 2014-10-01 I had
 cases where the patched version started with one batch and went to
 two, while the unpatched used just one batch, and the patched version
 was still more than twice as fast. I'm sure the on disk batch was
 fully cached, however; it might work out differently if disk speed
 actually came into the picture.

I think the case with large outer table and memory pressure, forcing the
batches to be actually written to disk (instead of just being in the
page cache) is the main concern here.


 That is mostly luck, however, because it depends on the cardinality
 estimates, and the best way to fix it is increasing work_mem (which is
 safer thanks to reducing the overhead).

 Also, Robert proposed a way to mitigate this: if we realize we'd
 have to do batching during the initial sizing, we can check whether
 reducing the number of buckets (to 1/2 or maybe 1/4) would help. I
 believe this is a good approach, and will look into that after
 pgconf.eu (i.e. early November), unless someone else is interested.
 
 Sure, but it would be good to confirm that it's actually needed first.

Yeah. But with a cleverly crafted test case it's possible to confirm
almost anything ;-)

 When given a generous work_mem setting the patched version often uses
 more of what it is allowed than the unpatched version (which is
 probably one of the reasons it tends to do better). If someone has
 set a high work_mem and has gotten by only because the configured
 amount is not all being used when it would benefit performance, they
 may need to adjust work_mem down to avoid memory problems. That
 doesn't seem to me to be a reason to reject the patch.

 I'm not entirely sure I understand this paragraph. What do you mean by
 "configured amount is not all being used when it would benefit
 performance"? Can you give an example?
 
 Again, the earlier post showed that while the unpatched used 3516kB
 whether it had work_mem set to 4MB or 1GB, the patched version used
 3073kB when work_mem was set to 4MB and 4540kB when work_mem was
 set to 1GB.  The extra memory allowed the patched version to stay
 at 1 batch, improving performance over the other setting.

OK, so with 4MB the new version was batching, while the original code
kept a single batch (likely due to not counting buckets into work_mem).
I believe that's expected behavior.

Also, in the e-mail from Oct 2 that got lost [1], I pointed out that by
testing only with small hash tables, the results are likely influenced by
a high CPU cache hit ratio. I think benchmarking with a larger inner
relation (thus increasing the memory occupied by the hash table) might
be interesting.

[1]
http://www.postgresql.org/message-id/c84680e818f6a0f4a26223cd750ff252.squir...@2.emaily.eu

 
 The only thing I can think of is the batching behavior described above.

 This is in "Waiting on Author" status only because I never got an
 answer about why the debug code used printf() rather than elog() at
 a DEBUG level.  Other than that, I would say this patch is "Ready
 for Committer".  Tomas?  You there?

 I think I responded to that on October 2, quoting:
...
 Ah, I never received that email.  That tends to happen every now
 and then.  :-(
 
 I believe it's safe to switch the logging to elog(). IMHO the printf
 logging is there from some very early version of the code, before elog
 was introduced. Or something like that.
 
 I guess it might be best to remain consistent in this patch and
 change that in a separate patch.  I just wanted to make sure you
 didn't see any reason not to do so.

OK, I'll put that on my TODO list for the next CF.

 With that addressed I will move this to Ready for Committer.  Since
 both Heikki and Robert spent time on this patch earlier, I'll give
 either of them a shot at committing it if they want; otherwise I'll
 do it.

Great, thanks!
Tomas





Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-10-09 Thread Kevin Grittner
Heikki Linnakangas hlinnakan...@vmware.com wrote:
 On 10/02/2014 03:20 AM, Kevin Grittner wrote:
 My only concern from the benchmarks is that it seemed like there
 was a statistically significant increase in planning time:

 unpatched plan time average: 0.450 ms
 patched plan time average:  0.536 ms

 That *might* just be noise, but it seems likely to be real.  For
 the improvement in run time, I'd put up with an extra 86us in
 planning time per hash join; but if there's any way to shave some
 of that off, all the better.

 The patch doesn't modify the planner at all, so it would be rather
 surprising if it increased planning time. I'm willing to just write that
 off as noise.

Fair enough.  I have seen much larger variations caused by how
the executable code happened to line up relative to cacheline
boundaries; and we've never made any effort to manage that.

I've tried various other tests using \timing rather than EXPLAIN,
and the patched version looks even better in those cases.  I have
seen up to 4x the performance for a query using the patched
version, higher variability in run time without the patch, and have
yet to devise a benchmark where the patched version came out slower
(although I admit to not being as good at creating such cases as
some here).

When given a generous work_mem setting the patched version often
uses more of what it is allowed than the unpatched version (which
is probably one of the reasons it tends to do better).  If someone
has set a high work_mem and has gotten by only because the
configured amount is not all being used when it would benefit
performance, they may need to adjust work_mem down to avoid memory
problems.  That doesn't seem to me to be a reason to reject the
patch.

This is in "Waiting on Author" status only because I never got an
answer about why the debug code used printf() rather than elog() at
a DEBUG level.  Other than that, I would say this patch is "Ready
for Committer".  Tomas?  You there?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-10-09 Thread Kevin Grittner
Tomas Vondra t...@fuzzy.cz wrote:
 On 9.10.2014 16:55, Kevin Grittner wrote:

 I've tried various other tests using \timing rather than EXPLAIN, and
 the patched version looks even better in those cases. I have seen up
 to 4x the performance for a query using the patched version, higher
 variability in run time without the patch, and have yet to devise a
 benchmark where the patched version came out slower (although I admit
 to not being as good at creating such cases as some here).

 Nice. Thanks for the testing!

 The only case I've been able to come up with is when the hash table fits
 into work_mem only thanks to not counting the buckets. The new code will
 start batching in this case.

Hmm.  If you look at the timings in my posting from 2014-10-01 I
had cases where the patched version started with one batch and went
to two, while the unpatched used just one batch, and the patched
version was still more than twice as fast.  I'm sure the on disk
batch was fully cached, however; it might work out differently if
disk speed actually came into the picture.

 That is mostly luck, however, because it depends on the cardinality
 estimates, and the best way to fix it is increasing work_mem (which is
 safer thanks to reducing the overhead).

 Also, Robert proposed a way to mitigate this: if we realize we'd have to
 do batching during the initial sizing, we can check whether reducing the
 number of buckets (to 1/2 or maybe 1/4) would help. I believe this is a
 good approach, and will look into that after pgconf.eu (i.e. early
 November), unless someone else is interested.

Sure, but it would be good to confirm that it's actually needed first.

 When given a generous work_mem setting the patched version often uses
 more of what it is allowed than the unpatched version (which is
 probably one of the reasons it tends to do better). If someone has
 set a high work_mem and has gotten by only because the configured
 amount is not all being used when it would benefit performance, they
 may need to adjust work_mem down to avoid memory problems. That
 doesn't seem to me to be a reason to reject the patch.

 I'm not entirely sure I understand this paragraph. What do you mean by
 "configured amount is not all being used when it would benefit
 performance"? Can you give an example?

Again, the earlier post showed that while the unpatched used 3516kB
whether it had work_mem set to 4MB or 1GB, the patched version used
3073kB when work_mem was set to 4MB and 4540kB when work_mem was
set to 1GB.  The extra memory allowed the patched version to stay
at 1 batch, improving performance over the other setting.

 The only thing I can think of is the batching behavior described above.

 This is in "Waiting on Author" status only because I never got an
 answer about why the debug code used printf() rather than elog() at
 a DEBUG level.  Other than that, I would say this patch is "Ready
 for Committer".  Tomas?  You there?

 I think I responded to that on October 2, quoting:

 ===
 On 2.10.2014 09:50, Tomas Vondra wrote:
 On 2.10.2014, 2:20, Kevin Grittner wrote:

 The patch applied and built without problem, and passes `make
 check-world`. The only thing that caught my eye was the addition of
 debug code using printf() instead of logging at a DEBUG level. Is
 there any reason for that?

 Not really. IIRC the main reason is that the other code in
 nodeHash.c uses the same approach.
===

Ah, I never received that email.  That tends to happen every now
and then.  :-(

 I believe it's safe to switch the logging to elog(). IMHO the printf
 logging is there from some very early version of the code, before elog
 was introduced. Or something like that.

I guess it might be best to remain consistent in this patch and
change that in a separate patch.  I just wanted to make sure you
didn't see any reason not to do so.

With that addressed I will move this to Ready for Committer.  Since
both Heikki and Robert spent time on this patch earlier, I'll give
either of them a shot at committing it if they want; otherwise I'll
do it.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-10-02 Thread Heikki Linnakangas

On 10/02/2014 03:20 AM, Kevin Grittner wrote:

My only concern from the benchmarks is that it seemed like there
was a statistically significant increase in planning time:

unpatched plan time average: 0.450 ms
patched plan time average:   0.536 ms

That *might* just be noise, but it seems likely to be real.  For
the improvement in run time, I'd put up with an extra 86us in
planning time per hash join; but if there's any way to shave some
of that off, all the better.


The patch doesn't modify the planner at all, so it would be rather 
surprising if it increased planning time. I'm willing to just write that 
off as noise.


- Heikki




Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-10-02 Thread Tomas Vondra
On 2 October 2014, 2:20, Kevin Grittner wrote:
 Tomas Vondra t...@fuzzy.cz wrote:
 On 12.9.2014 23:22, Robert Haas wrote:

 My first thought is to revert to NTUP_PER_BUCKET=1, but it's
 certainly arguable. Either method, though, figures to be better than
 doing nothing, so let's do something.

 OK, but can we commit the remaining part first? Because it'll certainly
 interact with this proposed part, and it's easier to tweak when the code
 is already there. Instead of rebasing over and over.

 The patch applied and built without problem, and passes `make
 check-world`.  The only thing that caught my eye was the addition
 of debug code using printf() instead of logging at a DEBUG level.
 Is there any reason for that?

Not really. IIRC the main reason is that the other code in nodeHash.c uses
the same approach.

 I still need to work through the (rather voluminous) email threads
 and the code to be totally comfortable with all the issues, but
 if Robert and Heikki are comfortable with it, I'm not gonna argue.

;-)

 Preliminary benchmarks look outstanding!  I tried to control
 carefully to ensure consistent results (by dropping, recreating,
 vacuuming, analyzing, and checkpointing before each test), but
 there were surprising outliers in the unpatched version.  It turned
 out that it was because slight differences in the random samples
 caused different numbers of buckets for both unpatched and patched,
 but the patched version dealt with the difference gracefully while
 the unpatched version's performance fluctuated randomly.

Can we get a bit more details on the test cases? I did a number of tests
while working on the patch, and I generally tested two cases:

(a) well-estimated queries (i.e. with nbucket matching NTUP_PER_BUCKET)

(b) mis-estimated queries, with various levels of accuracy (say, 10x -
1000x misestimates)

From the description, it seems you only tested (b) - is that correct?

The thing is that while the resize is very fast and happens only once,
it's not perfectly free. In my tests, this was always more than
compensated by the speedups (even in the weird cases reported by Stephen
Frost), so I think we're safe here.

Also, I definitely recommend testing with different hash table sizes (say,
work_mem=256MB and the hash table just enough to fit in without batching).
The thing is the effect of CPU caches is very different for small and
large hash tables. (This is not about work_mem alone, but about how much
memory is used by the hash table - according to the results you posted it
never gets over ~4.5MB)


You tested against current HEAD, right? This patch was split into three
parts, two of which were already committed (45f6240a and 8cce08f1). The
logic of the patch was "this might take a bit of time/memory, but it's
compensated by these other changes." Maybe running the tests on the
original code would be interesting?

Although, if this last part of the patch actually improves the performance
on its own, we're fine - it'll improve it even more compared to the old
code (especially before lowering NTUP_PER_BUCKET from 10 to 1).

 My only concern from the benchmarks is that it seemed like there
 was a statistically significant increase in planning time:

 unpatched plan time average: 0.450 ms
 patched plan time average:   0.536 ms

 That *might* just be noise, but it seems likely to be real.  For
 the improvement in run time, I'd put up with an extra 86us in
 planning time per hash join; but if there's any way to shave some
 of that off, all the better.

I agree with Heikki that this is probably noise, because the patch does
not mess with planner at all.

The only thing I can think of is adding a few fields into
HashJoinTableData. Maybe this makes the structure too large to fit into a
cacheline, or something?

Tomas






Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-10-01 Thread Kevin Grittner
Tomas Vondra t...@fuzzy.cz wrote:
 On 12.9.2014 23:22, Robert Haas wrote:

 My first thought is to revert to NTUP_PER_BUCKET=1, but it's
 certainly arguable. Either method, though, figures to be better than
 doing nothing, so let's do something.

 OK, but can we commit the remaining part first? Because it'll certainly
 interact with this proposed part, and it's easier to tweak when the code
 is already there. Instead of rebasing over and over.

The patch applied and built without problem, and passes `make
check-world`.  The only thing that caught my eye was the addition
of debug code using printf() instead of logging at a DEBUG level.
Is there any reason for that?

I still need to work through the (rather voluminous) email threads
and the code to be totally comfortable with all the issues, but
if Robert and Heikki are comfortable with it, I'm not gonna argue.

Preliminary benchmarks look outstanding!  I tried to control
carefully to ensure consistent results (by dropping, recreating,
vacuuming, analyzing, and checkpointing before each test), but
there were surprising outliers in the unpatched version.  It turned
out that it was because slight differences in the random samples
caused different numbers of buckets for both unpatched and patched,
but the patched version dealt with the difference gracefully while
the unpatched version's performance fluctuated randomly.

My thinking is that we should get this much committed and then
discuss further optimizations.  I tried to throw something at it
that would be something of a worst case because with the new code
it would start with one batch and go to two.

Five runs per test, alternating between unpatched and patched.

First I tried with the default work_mem size of 4MB:

Planning time / Execution time (ms)

unpatched, work_mem = 4MB

0.694 / 10392.930
0.327 / 10388.787
0.412 / 10533.036
0.650 / 17186.719
0.338 / 10707.423

Fast: Buckets: 2048  Batches: 1  Memory Usage: 3516kB
Slow: Buckets: 1024  Batches: 1  Memory Usage: 3516kB

patched, work_mem = 4MB

0.764 / 5021.792
0.655 / 4991.887
0.436 / 5068.777
0.410 / 5057.902
0.444 / 5021.543

1st & 5th:  Buckets: 131072 (originally 1024)  Batches: 2 (originally 1)  Memory Usage: 3073kB
all others: Buckets: 131072 (originally 2048)  Batches: 2 (originally 1)  Memory Usage: 3073kB

Then, just to see how both dealt with extra memory I did it again with 1GB:

unpatched, work_mem = 1GB

0.328 / 10407.836
0.319 / 10465.348
0.324 / 16606.948
0.569 / 10408.671
0.542 / 16996.209

Memory usage same as before.  Guess which runs started with 1024 buckets.  ;-)

patched, work_mem = 1GB

0.556 / 3974.741
0.325 / 4012.613
0.334 / 3941.734
0.834 / 3894.391
0.603 / 3959.440

2nd & 4th:  Buckets: 131072 (originally 2048)  Batches: 1 (originally 1)  Memory Usage: 4540kB
all others: Buckets: 131072 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 4540kB

My only concern from the benchmarks is that it seemed like there
was a statistically significant increase in planning time:

unpatched plan time average: 0.450 ms
patched plan time average:   0.536 ms

That *might* just be noise, but it seems likely to be real.  For
the improvement in run time, I'd put up with an extra 86us in
planning time per hash join; but if there's any way to shave some
of that off, all the better.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-12 Thread Robert Haas
On Thu, Sep 11, 2014 at 5:57 PM, Tomas Vondra t...@fuzzy.cz wrote:
 Attached is the patch split as suggested:

 (a) hashjoin-nbuckets-v14a-size.patch

 * NTUP_PER_BUCKET=1
 * counting buckets towards work_mem
 * changes in ExecChooseHashTableSize (due to the other changes)

OK, I'm going to work on this one now.  I have some ideas about how to
simplify this a bit and will post an update in a few hours once I see
how that pans out.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-12 Thread Robert Haas
On Fri, Sep 12, 2014 at 8:28 AM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Sep 11, 2014 at 5:57 PM, Tomas Vondra t...@fuzzy.cz wrote:
 Attached is the patch split as suggested:

 (a) hashjoin-nbuckets-v14a-size.patch

 * NTUP_PER_BUCKET=1
 * counting buckets towards work_mem
 * changes in ExecChooseHashTableSize (due to the other changes)

 OK, I'm going to work on this one now.  I have some ideas about how to
 simplify this a bit and will post an update in a few hours once I see
 how that pans out.

Here's what I came up with.  I believe this has the same semantics as
your patch for less code, but please check, because I might be wrong.
The main simplifications I made were:

(1) In master, we decide whether or not to batch based on the size of
the data, without knowing the number of buckets.  If we want to
account for the bucket overhead, that doesn't work.  But in your
patch, we basically computed the number of buckets we'd need for the
single-batch case, then decide whether to do a single batch, and if
yes, computed the same values over again with the same formula phrased
differently.  I revised that so that the single-batch case is
calculated only once.  If everything fits, we're all set, and don't
need to recalculate anything.

(2) In your patch, once we decide we need to batch, you set nbatch as
now, and then check whether the computed value is still adequate after
subtracting bucket_bytes from hash_table_bytes.  I revised that so
that we make the initial batch estimate of dbatch by dividing
inner_rel_bytes by hash_table_bytes - bucket_bytes rather than
hash_table_bytes.  I think that avoids the need to maybe increase
nbatch again afterwards.

I'm comfortable with this version if you are, but (maybe as a
follow-on commit) I think we could make this even a bit smarter.  If
inner_rel_bytes + bucket_bytes > hash_table_bytes but inner_rel_bytes
+ bucket_bytes / 4 < hash_table_bytes, then reduce the number of
buckets by 2x or 4x to make it fit.  That would provide a bit of
additional safety margin against cases where this patch might
unluckily lose.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 3ef7cfb..b3203ba 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -386,7 +386,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
  */
 
 /* Target bucket loading (tuples per bucket) */
-#define NTUP_PER_BUCKET			10
+#define NTUP_PER_BUCKET			1
 
 void
 ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
@@ -396,12 +396,13 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 {
 	int			tupsize;
 	double		inner_rel_bytes;
+	long		bucket_bytes;
 	long		hash_table_bytes;
 	long		skew_table_bytes;
 	long		max_pointers;
-	int			nbatch;
+	int			nbatch = 1;
 	int			nbuckets;
-	int			i;
+	double		dbuckets;
 
 	/* Force a plausible relation size if no info */
 	if (ntuples <= 0.0)
@@ -460,56 +461,61 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 
 	/*
 	 * Set nbuckets to achieve an average bucket load of NTUP_PER_BUCKET when
-	 * memory is filled.  Set nbatch to the smallest power of 2 that appears
-	 * sufficient.  The Min() steps limit the results so that the pointer
-	 * arrays we'll try to allocate do not exceed work_mem.
+	 * memory is filled, assuming a single batch.  The Min() step limits the
+	 * results so that the pointer arrays we'll try to allocate do not exceed
+	 * work_mem.
 	 */
 	max_pointers = (work_mem * 1024L) / sizeof(void *);
 	/* also ensure we avoid integer overflow in nbatch and nbuckets */
 	max_pointers = Min(max_pointers, INT_MAX / 2);
+	dbuckets = ceil(ntuples / NTUP_PER_BUCKET);
+	dbuckets = Min(dbuckets, max_pointers);
+	nbuckets = Max((int) dbuckets, 1024);
+	nbuckets = 1 << my_log2(nbuckets);
+	bucket_bytes = sizeof(HashJoinTuple) * nbuckets;
 
-	if (inner_rel_bytes > hash_table_bytes)
+	/*
+	 * If there's not enough space to store the projected number of tuples
+	 * and the required bucket headers, we will need multiple batches.
+	 */
+	if (inner_rel_bytes + bucket_bytes > hash_table_bytes)
 	{
 		/* We'll need multiple batches */
 		long		lbuckets;
 		double		dbatch;
 		int			minbatch;
+		long		bucket_size;
 
-		lbuckets = (hash_table_bytes / tupsize) / NTUP_PER_BUCKET;
+		/*
+		 * Estimate the number of buckets we'll want to have when work_mem
+		 * is entirely full.  Each bucket will contain a bucket pointer plus
+		 * NTUP_PER_BUCKET tuples, whose projected size already includes
+		 * overhead for the hash code, pointer to the next tuple, etc.
+		 */
+		bucket_size = (tupsize * NTUP_PER_BUCKET + sizeof(HashJoinTuple));
+		lbuckets = 1 << my_log2(hash_table_bytes / bucket_size);
 		lbuckets = Min(lbuckets, max_pointers);
 		nbuckets = (int) lbuckets;
+		bucket_bytes = nbuckets * sizeof(HashJoinTuple);
+
+		/*
+		 * Buckets are 

Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-12 Thread Tomas Vondra
On 12.9.2014 18:49, Robert Haas wrote:
 On Fri, Sep 12, 2014 at 8:28 AM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Sep 11, 2014 at 5:57 PM, Tomas Vondra t...@fuzzy.cz wrote:
 Attached is the patch split as suggested:

 (a) hashjoin-nbuckets-v14a-size.patch

 * NTUP_PER_BUCKET=1
 * counting buckets towards work_mem
 * changes in ExecChooseHashTableSize (due to the other changes)

 OK, I'm going to work on this one now.  I have some ideas about how to
 simplify this a bit and will post an update in a few hours once I see
 how that pans out.
 
 Here's what I came up with. I believe this has the same semantics as 
 your patch for less code, but please check, because I might be
 wrong. The main simplifications I made were:
 
 (1) In master, we decide whether or not to batch based on the size
 of the data, without knowing the number of buckets. If we want to 
 account for the bucket overhead, that doesn't work. But in your 
 patch, we basically computed the number of buckets we'd need for the 
 single-batch case, then decide whether to do a single batch, and if 
 yes, computed the same values over again with the same formula
 phrased differently. I revised that so that the single-batch case is 
 calculated only once. If everything fits, we're all set, and don't 
 need to recalculate anything.
 
 (2) In your patch, once we decide we need to batch, you set nbatch
 as now, and then check whether the computed value is still adequate
 after subtracting bucket_bytes from hash_table_bytes. I revised that
 so that we make the initial batch estimate of dbatch by dividing 
 inner_rel_bytes by hash_table_bytes - bucket_bytes rather than 
 hash_table_bytes. I think that avoids the need to maybe increase 
 nbatch again afterwards.

Yes, I like those changes and I think your reasoning is correct in both
cases. It certainly makes the method shorter and more readable - I was
too stuck in the original logic, so thanks for improving this.

So +1 from me to those changes.

 I'm comfortable with this version if you are, but (maybe as a 
 follow-on commit) I think we could make this even a bit smarter. If 
 inner_rel_bytes + bucket_bytes > hash_table_bytes but
 inner_rel_bytes + bucket_bytes / 4 < hash_table_bytes, then reduce
 the number of buckets by 2x or 4x to make it fit. That would provide
 a bit of additional safety margin against cases where this patch
 might unluckily lose.

I don't think that's a good idea. That essentially says 'we're shooting
for NTUP_PER_BUCKET but not really'. Moreover, I often see cases where
batching (because of smaller work_mem) actually significantly improves
performance. If we want to make this decision properly, deciding
whether to add batches or reduce buckets, we need a better heuristic -
that's quite complex and I'd expect it to result in a quite complex patch.

So -1 from me to this at this moment (it certainly needs much more
thought than a mere follow-on commit).

regards
Tomas




Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-12 Thread Robert Haas
On Fri, Sep 12, 2014 at 3:39 PM, Tomas Vondra t...@fuzzy.cz wrote:
 On 12.9.2014 18:49, Robert Haas wrote:
 On Fri, Sep 12, 2014 at 8:28 AM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Sep 11, 2014 at 5:57 PM, Tomas Vondra t...@fuzzy.cz wrote:
 Attached is the patch split as suggested:

 (a) hashjoin-nbuckets-v14a-size.patch

 * NTUP_PER_BUCKET=1
 * counting buckets towards work_mem
 * changes in ExecChooseHashTableSize (due to the other changes)

 OK, I'm going to work on this one now.  I have some ideas about how to
 simplify this a bit and will post an update in a few hours once I see
 how that pans out.

 Here's what I came up with. I believe this has the same semantics as
 your patch for less code, but please check, because I might be
 wrong. The main simplifications I made were:

 (1) In master, we decide whether or not to batch based on the size
 of the data, without knowing the number of buckets. If we want to
 account for the bucket overhead, that doesn't work. But in your
 patch, we basically computed the number of buckets we'd need for the
 single-batch case, then decide whether to do a single batch, and if
 yes, computed the same values over again with the same formula
 phrased differently. I revised that so that the single-batch case is
 calculated only once. If everything fits, we're all set, and don't
 need to recalculate anything.

 (2) In your patch, once we decide we need to batch, you set nbatch
 as now, and then check whether the computed value is still adequate
 after subtracting bucket_bytes from hash_table_bytes. I revised that
 so that we make the initial batch estimate of dbatch by dividing
 inner_rel_bytes by hash_table_bytes - bucket_bytes rather than
 hash_table_bytes. I think that avoids the need to maybe increase
 nbatch again afterwards.

 Yes, I like those changes and I think your reasoning is correct in both
 cases. It certainly makes the method shorter and more readable - I was
 too stuck in the original logic, so thanks for improving this.

 So +1 from me to those changes.

OK, committed.  Please rebase the rest.

 I'm comfortable with this version if you are, but (maybe as a
 follow-on commit) I think we could make this even a bit smarter. If
 inner_rel_bytes + bucket_bytes > hash_table_bytes but
 inner_rel_bytes + bucket_bytes / 4 < hash_table_bytes, then reduce
 the number of buckets by 2x or 4x to make it fit. That would provide
 a bit of additional safety margin against cases where this patch
 might unluckily lose.

 I don't think that's a good idea. That essentially says 'we're shooting
 for NTUP_PER_BUCKET but not really'. Moreover, I often see cases where
 batching (because of smaller work_mem) actually significantly improves
 performance. If we want to make this reasoning properly, deciding
 whether to add batches or reduce buckets, we need a better heuristics -
 that's quite complex and I'd expect it to result in a quite complex patch.

 So -1 from me to this at this moment (it certainly needs much more
 thought than a mere follow-on commit).

OK, but let's discuss further.  You make it sound like treating
NTUP_PER_BUCKET as a soft limit is a bad thing, but I'm not convinced.
I think it's perfectly reasonable to treat NTUP_PER_BUCKET as a range:
if we can get NTUP_PER_BUCKET=1, great, but if not, NTUP_PER_BUCKET=2
or NTUP_PER_BUCKET=4 may be perfectly reasonable.

I'm actually quite surprised that you find batching to be a better
strategy than skimping on buckets, because I would have expected the
opposite, almost categorically.  Batching means having to write out
the tuples we can't process right away and read them back in.  If that
involves disk I/O, I think the cost of that I/O is going to be FAR
more than the overhead we would have incurred by searching slightly
longer bucket chains.  If it didn't, then you could've set work_mem
higher and avoided batching in the first place.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-12 Thread Tomas Vondra
On 12.9.2014 22:24, Robert Haas wrote:
 On Fri, Sep 12, 2014 at 3:39 PM, Tomas Vondra t...@fuzzy.cz wrote:
 On 12.9.2014 18:49, Robert Haas wrote:
 I'm comfortable with this version if you are, but (maybe as a
 follow-on commit) I think we could make this even a bit smarter. If
  inner_rel_bytes + bucket_bytes > hash_table_bytes but
  inner_rel_bytes + bucket_bytes / 4 < hash_table_bytes, then reduce
 the number of buckets by 2x or 4x to make it fit. That would provide
 a bit of additional safety margin against cases where this patch
 might unluckily lose.

 I don't think that's a good idea. That essentially says 'we're shooting
 for NTUP_PER_BUCKET but not really'. Moreover, I often see cases where
 batching (because of smaller work_mem) actually significantly improves
 performance. If we want to make this reasoning properly, deciding
 whether to add batches or reduce buckets, we need a better heuristics -
  that's quite complex and I'd expect it to result in a quite complex patch.

 So -1 from me to this at this moment (it certainly needs much more
 thought than a mere follow-on commit).
 
 OK, but let's discuss further.  You make it sound like treating
 NTUP_PER_BUCKET as a soft limit is a bad thing, but I'm not convinced.
 I think it's perfectly reasonable to treat NTUP_PER_BUCKET as a range:
 if we can get NTUP_PER_BUCKET=1, great, but if not, NTUP_PER_BUCKET=2
 or NTUP_PER_BUCKET=4 may be perfectly reasonable.

I think this really depends on various factors. For example what may
work for small work_mem (fitting into L2 cache) may not work for large
values, etc.

 I'm actually quite surprised that you find batching to be a better 
  strategy than skimping on buckets, because I would have expected the
 opposite, almost categorically. Batching means having to write out 
 the tuples we can't process right away and read them back in. If
 that involves disk I/O, I think the cost of that I/O is going to be
 FAR more than the overhead we would have incurred by searching
 slightly longer bucket chains. If it didn't, then you could've set
 work_mem higher and avoided batching in the first place.

No, I don't find batching to be a better strategy. I just think this
really needs more discussion than a simple "let's use NTUP_PER_BUCKET=4
to avoid batching" follow-up patch.

For example, let's say we switch to NTUP_PER_BUCKET=4 to avoid batching,
and then discover we need to start batching anyway. Should we keep the
settings, or should we revert NTUP_PER_BUCKET=1? Or maybe not doing that
for nbatch=2, but for nbatch=16?

I'm not saying it's wrong, but I think it's more complicated than it
seems from your description.

regards
Tomas




Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-12 Thread Tomas Vondra
On 12.9.2014 22:24, Robert Haas wrote:
 On Fri, Sep 12, 2014 at 3:39 PM, Tomas Vondra t...@fuzzy.cz wrote:

 Yes, I like those changes and I think your reasoning is correct in both
 cases. It certainly makes the method shorter and more readable - I was
 too stuck in the original logic, so thanks for improving this.

 So +1 from me to those changes.
 
 OK, committed.  Please rebase the rest.

Attached is v15 of the patch, rebased to current HEAD.

Tomas
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 781a736..d128e08 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1900,18 +1900,21 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		if (es->format != EXPLAIN_FORMAT_TEXT)
 		{
 			ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es);
+			ExplainPropertyLong("Original Hash Buckets",
+								hashtable->nbuckets_original, es);
 			ExplainPropertyLong("Hash Batches", hashtable->nbatch, es);
 			ExplainPropertyLong("Original Hash Batches",
 								hashtable->nbatch_original, es);
 			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 		}
-		else if (hashtable->nbatch_original != hashtable->nbatch)
+		else if ((hashtable->nbatch_original != hashtable->nbatch) ||
+				 (hashtable->nbuckets_original != hashtable->nbuckets))
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-							 "Buckets: %d  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
-							 hashtable->nbuckets, hashtable->nbatch,
-							 hashtable->nbatch_original, spacePeakKb);
+							 "Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+							 hashtable->nbuckets, hashtable->nbuckets_original,
+							 hashtable->nbatch, hashtable->nbatch_original, spacePeakKb);
 		}
 		else
 		{
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index b3203ba..33994c5 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -39,6 +39,7 @@
 
 
 static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
+static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
 static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node,
 	  int mcvsToUse);
 static void ExecHashSkewTableInsert(HashJoinTable hashtable,
@@ -117,6 +118,7 @@ MultiExecHash(HashState *node)
 /* It's a skew tuple, so put it into that hash table */
 ExecHashSkewTableInsert(hashtable, slot, hashvalue,
 		bucketNumber);
+				hashtable->skewTuples += 1;
 			}
 			else
 			{
@@ -127,6 +129,25 @@ MultiExecHash(HashState *node)
 		}
 	}
 
+	/* resize the hash table if needed (NTUP_PER_BUCKET exceeded) */
+	if (hashtable->nbuckets != hashtable->nbuckets_optimal)
+	{
+		/* We never decrease the number of buckets. */
+		Assert(hashtable->nbuckets_optimal > hashtable->nbuckets);
+
+#ifdef HJDEBUG
+		printf("Increasing nbuckets %d => %d\n",
+			   hashtable->nbuckets, hashtable->nbuckets_optimal);
+#endif
+
+		ExecHashIncreaseNumBuckets(hashtable);
+	}
+
+	/* Account for the buckets in spaceUsed (reported in EXPLAIN ANALYZE) */
+	hashtable->spaceUsed += hashtable->nbuckets * sizeof(HashJoinTuple);
+	if (hashtable->spaceUsed > hashtable->spacePeak)
+			hashtable->spacePeak = hashtable->spaceUsed;
+
 	/* must provide our own instrumentation support */
 	if (node->ps.instrument)
 		InstrStopNode(node->ps.instrument, hashtable->totalTuples);
@@ -272,7 +293,10 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	 */
 	hashtable = (HashJoinTable) palloc(sizeof(HashJoinTableData));
 	hashtable->nbuckets = nbuckets;
+	hashtable->nbuckets_original = nbuckets;
+	hashtable->nbuckets_optimal = nbuckets;
 	hashtable->log2_nbuckets = log2_nbuckets;
+	hashtable->log2_nbuckets_optimal = log2_nbuckets;
 	hashtable->buckets = NULL;
 	hashtable->keepNulls = keepNulls;
 	hashtable->skewEnabled = false;
@@ -286,6 +310,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	hashtable->nbatch_outstart = nbatch;
 	hashtable->growEnabled = true;
 	hashtable->totalTuples = 0;
+	hashtable->skewTuples = 0;
 	hashtable->innerBatchFile = NULL;
 	hashtable->outerBatchFile = NULL;
 	hashtable->spaceUsed = 0;
@@ -515,6 +540,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		nbatch = 2;
 		while (nbatch < minbatch)
 			nbatch <<= 1;
+
 	}
 
 	*numbuckets = nbuckets;
@@ -620,6 +646,19 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	 */
 	ninmemory = nfreed = 0;
 
+	/* If we know we need to resize nbuckets, we can do it while rebatching. */
+	if (hashtable->nbuckets_optimal != hashtable->nbuckets)
+	{
+		/* we never decrease the number of buckets */
+		Assert(hashtable->nbuckets_optimal > hashtable->nbuckets);
+
+		hashtable->nbuckets = hashtable->nbuckets_optimal;
+		hashtable->log2_nbuckets = hashtable->log2_nbuckets_optimal;
+
+		hashtable->buckets = repalloc(hashtable->buckets,
+									  sizeof(HashJoinTuple) * hashtable->nbuckets);
+	}
+
 	/*
 	 * We will scan through the chunks directly, so that we 

Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-12 Thread Robert Haas
On Fri, Sep 12, 2014 at 4:55 PM, Tomas Vondra t...@fuzzy.cz wrote:
 I'm actually quite surprised that you find batching to be a better
  strategy than skimping on buckets, because I would have expected the
 opposite, almost categorically. Batching means having to write out
 the tuples we can't process right away and read them back in. If
 that involves disk I/O, I think the cost of that I/O is going to be
 FAR more than the overhead we would have incurred by searching
 slightly longer bucket chains. If it didn't, then you could've set
 work_mem higher and avoided batching in the first place.

 No, I don't find batching to be a better strategy. I just think this
  really needs more discussion than a simple "let's use NTUP_PER_BUCKET=4
  to avoid batching" follow-up patch.

 For example, let's say we switch to NTUP_PER_BUCKET=4 to avoid batching,
 and then discover we need to start batching anyway. Should we keep the
 settings, or should we revert NTUP_PER_BUCKET=1? Or maybe not doing that
 for nbatch=2, but for nbatch=16?

My first thought is to revert to NTUP_PER_BUCKET=1, but it's certainly
arguable.  Either method, though, figures to be better than doing
nothing, so let's do something.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-12 Thread Tomas Vondra
On 12.9.2014 23:22, Robert Haas wrote:
 On Fri, Sep 12, 2014 at 4:55 PM, Tomas Vondra t...@fuzzy.cz wrote:
 I'm actually quite surprised that you find batching to be a
 better strategy than skimping on buckets, because I would have
  expected the opposite, almost categorically. Batching means having
 to write out the tuples we can't process right away and read them
 back in. If that involves disk I/O, I think the cost of that I/O
 is going to be FAR more than the overhead we would have incurred
 by searching slightly longer bucket chains. If it didn't, then
 you could've set work_mem higher and avoided batching in the
 first place.

 No, I don't find batching to be a better strategy. I just think
  this really needs more discussion than a simple "let's use
  NTUP_PER_BUCKET=4 to avoid batching" follow-up patch.

 For example, let's say we switch to NTUP_PER_BUCKET=4 to avoid
 batching, and then discover we need to start batching anyway.
 Should we keep the settings, or should we revert NTUP_PER_BUCKET=1?
 Or maybe not doing that for nbatch=2, but for nbatch=16?
 
 My first thought is to revert to NTUP_PER_BUCKET=1, but it's
 certainly arguable. Either method, though, figures to be better than
 doing nothing, so let's do something.

OK, but can we commit the remaining part first? Because it'll certainly
interact with this proposed part, and it's easier to tweak when the code
is already there. Instead of rebasing over and over.

Tomas




Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-11 Thread Robert Haas
On Wed, Sep 10, 2014 at 5:09 PM, Tomas Vondra t...@fuzzy.cz wrote:
 OK. So here's v13 of the patch, reflecting this change.

With the exception of ExecChooseHashTableSize() and a lot of stylistic
issues along the lines of what I've already complained about, this
patch seems pretty good to me.  It does three things:

(1) It changes NTUP_PER_BUCKET to 1.  Although this increases memory
utilization, it also improves hash table performance, and the
additional memory we use is less than what we saved as a result of
commit 45f6240a8fa9d35548eb2ef23dba2c11540aa02a.

(2) It changes ExecChooseHashTableSize() to include the effect of the
memory consumed by the bucket array.  If we care about properly
respecting work_mem, this is clearly right for any NTUP_PER_BUCKET
setting, but more important for a small setting like 1 than for the
current value of 10.  I'm not sure the logic in this function is as
clean and simple as it can be just yet.

(3) It allows the number of batches to increase on the fly while the
hash join is in process.  This case arises when we initially estimate
that we only need a small hash table, and then it turns out that there
are more tuples than we expect.  Without this code, the hash table's
load factor gets too high and things start to suck.

I recommend separating this patch in two patches, the first covering
items (1) and (2) and the second covering item (3), which really seems
like a separate improvement.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-11 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 With the exception of ExecChooseHashTableSize() and a lot of stylistic
 issues along the lines of what I've already complained about, this
 patch seems pretty good to me.  It does three things:
 ...
 (3) It allows the number of batches to increase on the fly while the
 hash join is in process.  This case arises when we initially estimate
 that we only need a small hash table, and then it turns out that there
 are more tuples than we expect.  Without this code, the hash table's
 load factor gets too high and things start to suck.

Pardon me for not having read the patch yet, but what part of (3)
wasn't there already?

regards, tom lane




Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-11 Thread Robert Haas
On Thu, Sep 11, 2014 at 9:59 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 With the exception of ExecChooseHashTableSize() and a lot of stylistic
 issues along the lines of what I've already complained about, this
 patch seems pretty good to me.  It does three things:
 ...
 (3) It allows the number of batches to increase on the fly while the
 hash join is in process.  This case arises when we initially estimate
 that we only need a small hash table, and then it turns out that there
 are more tuples than we expect.  Without this code, the hash table's
 load factor gets too high and things start to suck.

 Pardon me for not having read the patch yet, but what part of (3)
 wasn't there already?

EINSUFFICIENTCAFFEINE.

It allows the number of BUCKETS to increase, not the number of
batches.  As you say, the number of batches could already increase.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-11 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Thu, Sep 11, 2014 at 9:59 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 (3) It allows the number of batches to increase on the fly while the
 hash join is in process.

 Pardon me for not having read the patch yet, but what part of (3)
 wasn't there already?

 EINSUFFICIENTCAFFEINE.

 It allows the number of BUCKETS to increase, not the number of
 batches.  As you say, the number of batches could already increase.

Ah.  Well, that would mean that we need a heuristic for deciding when to
increase the number of buckets versus the number of batches ... seems
like a difficult decision.

regards, tom lane




Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-11 Thread Tomas Vondra
On 11 September 2014, 15:31, Robert Haas wrote:
 On Wed, Sep 10, 2014 at 5:09 PM, Tomas Vondra t...@fuzzy.cz wrote:
 OK. So here's v13 of the patch, reflecting this change.

 With the exception of ExecChooseHashTableSize() and a lot of stylistic
 issues along the lines of what I've already complained about, this
 patch seems pretty good to me.  It does three things:

I believe I fixed the stylistic issues you pointed out, except maybe the
bracketing (which seems fine to me, though). So if you could point the
issues out, that'd be helpful.


 (1) It changes NTUP_PER_BUCKET to 1.  Although this increases memory
 utilization, it also improves hash table performance, and the
 additional memory we use is less than what we saved as a result of
 commit 45f6240a8fa9d35548eb2ef23dba2c11540aa02a.

 (2) It changes ExecChooseHashTableSize() to include the effect of the
 memory consumed by the bucket array.  If we care about properly
 respecting work_mem, this is clearly right for any NTUP_PER_BUCKET
 setting, but more important for a small setting like 1 than for the
 current value of 10.  I'm not sure the logic in this function is as
 clean and simple as it can be just yet.

I made that as clear as I was able to (based on your feedback), but I'm
stuck, as I'm too familiar with the logic. That is not to say it can't be
improved, but this needs to be reviewed by someone else.


 (3) It allows the number of batches to increase on the fly while the
 hash join is in process.  This case arises when we initially estimate
 that we only need a small hash table, and then it turns out that there
 are more tuples than we expect.  Without this code, the hash table's
 load factor gets too high and things start to suck.

 I recommend separating this patch in two patches, the first covering
 items (1) and (2) and the second covering item (3), which really seems
 like a separate improvement.

That probably makes sense.

regards
Tomas





Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-11 Thread Tomas Vondra
On 11 September 2014, 16:11, Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Thu, Sep 11, 2014 at 9:59 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 (3) It allows the number of batches to increase on the fly while the
 hash join is in process.

 Pardon me for not having read the patch yet, but what part of (3)
 wasn't there already?

 EINSUFFICIENTCAFFEINE.

 It allows the number of BUCKETS to increase, not the number of
 batches.  As you say, the number of batches could already increase.

 Ah.  Well, that would mean that we need a heuristic for deciding when to
 increase the number of buckets versus the number of batches ... seems
 like a difficult decision.

That's true, but that's not the aim of this patch. The patch simply
increases the number of buckets if the load happens to get too high, and
does not try to decide between increasing nbuckets and nbatch.

It's true that we can often get better performance with more batches, but
the cases I was able to inspect were caused either by (a) underestimates
resulting in inappropriate nbucket count, or (b) caching effects. This
patch solves (a), not even trying to fix (b).

regards
Tomas





Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-11 Thread Robert Haas
On Thu, Sep 11, 2014 at 10:11 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Thu, Sep 11, 2014 at 9:59 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 (3) It allows the number of batches to increase on the fly while the
 hash join is in process.

 Pardon me for not having read the patch yet, but what part of (3)
 wasn't there already?

 EINSUFFICIENTCAFFEINE.

 It allows the number of BUCKETS to increase, not the number of
 batches.  As you say, the number of batches could already increase.

 Ah.  Well, that would mean that we need a heuristic for deciding when to
 increase the number of buckets versus the number of batches ... seems
 like a difficult decision.

Well, the patch takes the approach of increasing the number of buckets
when (1) there's only one batch and (2) the load factor exceeds
NTUP_PER_BUCKET.  As Tomas points out, it's just increasing the number
of buckets to the exact same number which we would have allocated had
we known that this many tuples would show up.

Now, there is a possibility that we could lose.  Let's suppose that
the tuples overflow work_mem, but just barely.  If we've accurately
estimate how many tuples there are, the patch changes nothing: either
way we're going to need two batches.  But let's say we've
underestimated the number of tuples by 3x.  Then it could be that
without the patch we'd allocate fewer buckets, never increase the
number of buckets, and avoid batching; whereas with the patch, we'll
discover that our tuple estimate was wrong, increase the number of
buckets on the fly, and then be forced by the slightly-increased
memory consumption that results from increasing the number of buckets
to do two batches.  That would suck.

But catering to that case is basically hoping that we'll fall into a
septic tank and come out smelling like a rose - that is, we're hoping
that our initial poor estimate will cause us to accidentally do the
right thing later.  I don't think that's the way to go.   It's much
more important to get the case where things are bigger than we
expected but still fit within work_mem right; and we're currently
taking a huge run-time penalty in that case, as Tomas' results show.
If we wanted to cater to the other case in the future, we could
consider the opposite approach: if work_mem is exhausted, throw the
bucket headers away and keep reading tuples into the dense-packed
chunks added by yesterday's commit.  If we again run out of work_mem,
then we *definitely* need to increase the batch count.  If we manage
to make everything fit, then we know exactly how many bucket headers
we can afford, and can use some heuristic to decide between that and
using more batches.

I don't think that should be a requirement for this patch, though: I
think the cumulative effect of Tomas's patches is going to be a win
*much* more often than it's a loss.  In 100% of cases, the
dense-packing stuff will make a hash table containing the same tuples
significantly smaller.  Except when we've radically overestimated the
tuple count, the change to NTUP_PER_BUCKET will mean less time spent
chasing hash chains.  It does use more memory, but that's more than
paid for by the first change.  Both the dense-packing stuff and the
changes to include bucket headers in the work_mem calculation will
have the effect of making the work_mem bound on memory utilization far
more accurate than it's ever been previously.  And the change to
increase the number of buckets at runtime will mean that we're
significantly more resilient against the planner underestimating the
number of tuples.  That's clearly good.  The fact that we can
construct borderline cases where it loses shouldn't deter us from
pushing forward.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-11 Thread Tom Lane
Tomas Vondra t...@fuzzy.cz writes:
 On 11 September 2014, 16:11, Tom Lane wrote:
 Ah.  Well, that would mean that we need a heuristic for deciding when to
 increase the number of buckets versus the number of batches ... seems
 like a difficult decision.

 That's true, but that's not the aim of this patch. The patch simply
 increases the number of buckets if the load happens to get too high, and
 does not try to decide between increasing nbuckets and nbatch.

On further thought, increasing nbuckets without changing the batch
boundaries would not get us out of an out-of-work_mem situation, in fact
it makes memory consumption worse not better (assuming you count the
bucket headers towards work_mem ;-)).

So in principle, the rule seems like it ought to go "if load (defined as
max bucket chain length, I imagine?) gets too high, but we are still
well below work_mem, increase nbuckets; else increase nbatch".  And
perhaps we reset nbuckets again for the next batch, not sure.  If we
are dealing with an out-of-work_mem situation then only increasing nbatch
would be a suitable response.

Because of the risk that increasing nbuckets would itself lead to a
work_mem violation, I don't think it's sane to ignore the interaction
entirely, even in a first patch.
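[For illustration, Tom's proposed rule could be sketched roughly like this. The names and the doubling step are hypothetical, chosen for this example only; as the surrounding discussion notes, this is not what the patch actually implements, since the patch only ever grows nbuckets.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum
{
	GROW_NBUCKETS,
	GROW_NBATCH
} GrowAction;

/*
 * Hypothetical sketch of the rule above, called at the moment we have
 * decided that *something* must grow: double nbuckets only if the
 * chains are too long AND the doubled bucket array still fits under
 * work_mem; otherwise respond by increasing nbatch.
 */
static GrowAction
choose_growth(size_t ntuples, size_t nbuckets, size_t max_chain,
			  size_t space_used, size_t work_mem_bytes)
{
	bool		overloaded = ntuples > nbuckets * max_chain;
	size_t		grown_bucket_bytes = 2 * nbuckets * sizeof(void *);

	if (overloaded &&
		space_used + grown_bucket_bytes <= work_mem_bytes)
		return GROW_NBUCKETS;	/* chains too long, memory to spare */

	return GROW_NBATCH;			/* out of work_mem: only batching helps */
}
```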

regards, tom lane




Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-11 Thread Tomas Vondra
On 11 September 2014, 17:28, Tom Lane wrote:
 Tomas Vondra t...@fuzzy.cz writes:
 On 11 September 2014, 16:11, Tom Lane wrote:
 Ah.  Well, that would mean that we need a heuristic for deciding when to
 increase the number of buckets versus the number of batches ... seems
 like a difficult decision.

 That's true, but that's not the aim of this patch. The patch simply
 increases the number of buckets if the load happens to get too high, and
 does not try to decide between increasing nbuckets and nbatch.

 On further thought, increasing nbuckets without changing the batch
 boundaries would not get us out of an out-of-work_mem situation, in fact
 it makes memory consumption worse not better (assuming you count the
 bucket headers towards work_mem ;-)).

Yes, the patch counts the bucket headers towards work_mem, while the
original code didn't. The reasoning is that due to changing
NTUP_PER_BUCKET from 10 to 1, the memory occupied by buckets is not
negligible and should be accounted for.

 So in principle, the rule seems like it ought to go "if load (defined as
 max bucket chain length, I imagine?) gets too high, but we are still
 well below work_mem, increase nbuckets; else increase nbatch".  And
 perhaps we reset nbuckets again for the next batch, not sure.  If we
 are dealing with an out-of-work_mem situation then only increasing nbatch
 would be a suitable response.

Almost.

If we expect batching at the very beginning, we size nbuckets for full
work_mem (see how many tuples we can get into work_mem, while not
breaking NTUP_PER_BUCKET threshold).

If we expect to be fine without batching, we start with the 'right'
nbuckets and track the optimal nbuckets as we go (without actually
resizing the hash table). Once we hit work_mem (considering the optimal
nbuckets value), we keep the value.

Only at the end, we check whether (nbuckets != nbuckets_optimal) and
resize the hash table if needed. Also, we keep this value for all batches
(it's OK because it assumes full work_mem, and it makes the batchno
evaluation trivial). So the resize happens only once and only for the
first batch.
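[In simplified pseudo-C, the tracking described above amounts to something like the following sketch. The struct and function names are invented for illustration; in the patch the tracking lives in the insert path and the one-time resize happens via ExecHashIncreaseNumBuckets().]

```c
#include <assert.h>
#include <stddef.h>

#define NTUP_PER_BUCKET 1

/*
 * Simplified sketch of tracking the optimal bucket count during the
 * build.  Only the tracked value changes while tuples are inserted;
 * the physical bucket array is resized once, after the build.
 */
typedef struct
{
	size_t		ntuples;
	size_t		nbuckets;			/* physical array, fixed during build */
	size_t		nbuckets_optimal;	/* doubled whenever the load factor
									 * would exceed NTUP_PER_BUCKET */
} BuildState;

static void
note_tuple(BuildState *st)
{
	st->ntuples++;
	while (st->ntuples > st->nbuckets_optimal * NTUP_PER_BUCKET)
		st->nbuckets_optimal <<= 1;		/* never decreases */
}
```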

 Because of the risk that increasing nbuckets would itself lead to a
 work_mem violation, I don't think it's sane to ignore the interaction
 entirely, even in a first patch.

This possibility of increasing the number of batches is the consequence of
counting the buckets towards work_mem. I believe that's the right thing to
do here, to make the work_mem guarantee stronger, but it's not something
this patch depends on. If this is the only concern, we can leave this out.

It's also worth mentioning that the dense allocation of tuples makes a
huge difference. palloc easily results in 100% overhead for narrow tuples
(that is, allocating 2x the amount of needed memory). For example over
here [1] is an example of a hashjoin consuming 1.4GB with work_mem=800MB.

And the dense allocation patch pretty much makes this go away (as
demonstrated in [1]). For wider tuples, the difference is smaller, but
that's also where the buckets matter less.
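[As a back-of-envelope illustration of the overhead: the model below assumes the allocator rounds small requests up to a power of two and adds a chunk header of roughly 16 bytes, which is approximately what aset.c does; the exact constants are assumptions for this example, not its precise accounting.]

```c
#include <assert.h>
#include <stddef.h>

/*
 * Rough model of per-palloc footprint: power-of-two rounding plus a
 * ~16-byte chunk header.  Illustrative only.
 */
static size_t
palloc_footprint(size_t request)
{
	const size_t chunk_header = 16;
	size_t		rounded = 8;

	while (rounded < request)
		rounded <<= 1;
	return chunk_header + rounded;
}
```

A 72-byte tuple then occupies 16 + 128 = 144 bytes, i.e. 2x the payload, while a dense-packed chunk stores it at essentially its raw size.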

What I'm trying to say is that it's only a matter of work_mem values.
Currently, because of the weak guarantee, people tend to set the value
lower than necessary. The dense allocation makes this unnecessary,
allowing higher work_mem values, and thus eliminating the issue.

If this is not enough, we can stop counting the buckets towards work_mem.
It'll still be more memory efficient than the old approach (as a pointer
is smaller than palloc overhead), and it won't cause additional batches.
But I don't like this - I think we should make work_mem a stronger
guarantee.

[1] http://www.postgresql.org/message-id/53beea9e.2080...@fuzzy.cz

regards
Tomas





Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-11 Thread Tomas Vondra
On 11.9.2014 16:33, Tomas Vondra wrote:
 On 11 September 2014, 15:31, Robert Haas wrote:
 On Wed, Sep 10, 2014 at 5:09 PM, Tomas Vondra t...@fuzzy.cz wrote:
 OK. So here's v13 of the patch, reflecting this change.

 [...] It does three things:

 (1) It changes NTUP_PER_BUCKET to 1.  Although this increases memory
 utilization, it also improves hash table performance, and the
 additional memory we use is less than what we saved as a result of
 commit 45f6240a8fa9d35548eb2ef23dba2c11540aa02a.

 (2) It changes ExecChooseHashTableSize() to include the effect of the
 memory consumed by the bucket array.  If we care about properly
 respecting work_mem, this is clearly right for any NTUP_PER_BUCKET
 setting, but more important for a small setting like 1 than for the
 current value of 10.  I'm not sure the logic in this function is as
 clean and simple as it can be just yet.
 (3) It allows the number of batches to increase on the fly while the
 hash join is in process.  This case arises when we initially estimate
 that we only need a small hash table, and then it turns out that there
 are more tuples than we expect.  Without this code, the hash table's
 load factor gets too high and things start to suck.

 I recommend separating this patch in two patches, the first covering
 items (1) and (2) and the second covering item (3), which really seems
 like a separate improvement.
 
 That probably makes sense.

Attached is the patch split as suggested:

(a) hashjoin-nbuckets-v14a-size.patch

* NTUP_PER_BUCKET=1
* counting buckets towards work_mem
* changes in ExecChooseHashTableSize (due to the other changes)

(b) hashjoin-nbuckets-v14a-resize.patch

* rest of the patch, that is ...
* tracking optimal number of buckets
* dynamic resize of the hash table
* adding info to the EXPLAIN ANALYZE output
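The resize step in (b) amounts to rebuilding the bucket array and relinking the tuples. A simplified sketch (illustrative, not the patch's code - the real ExecHashIncreaseNumBuckets walks the dense-allocated chunks rather than the old buckets, and each HashJoinTuple caches its hash value, so only the bucket index has to be recomputed):

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal stand-in for a hash-join tuple: chain pointer plus the
 * cached hash value. */
typedef struct Tuple
{
    struct Tuple *next;
    unsigned      hashvalue;
} Tuple;

/* Grow the bucket array from nbuckets_old to nbuckets_new (a power of
 * two) and redistribute the tuples; no key is ever rehashed. */
static Tuple **
increase_num_buckets(Tuple **buckets, int nbuckets_old, int nbuckets_new)
{
    Tuple **new_buckets = calloc(nbuckets_new, sizeof(Tuple *));

    for (int i = 0; i < nbuckets_old; i++)
    {
        Tuple *tuple = buckets[i];

        while (tuple != NULL)
        {
            Tuple *next = tuple->next;
            /* nbuckets_new is a power of two, so mask == modulo */
            int    b = tuple->hashvalue & (nbuckets_new - 1);

            tuple->next = new_buckets[b];
            new_buckets[b] = tuple;
            tuple = next;
        }
    }
    free(buckets);
    return new_buckets;
}
```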

regards
Tomas
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 781a736..d128e08 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1900,18 +1900,21 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		if (es->format != EXPLAIN_FORMAT_TEXT)
 		{
 			ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es);
+			ExplainPropertyLong("Original Hash Buckets",
+								hashtable->nbuckets_original, es);
 			ExplainPropertyLong("Hash Batches", hashtable->nbatch, es);
 			ExplainPropertyLong("Original Hash Batches",
 								hashtable->nbatch_original, es);
 			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 		}
-		else if (hashtable->nbatch_original != hashtable->nbatch)
+		else if ((hashtable->nbatch_original != hashtable->nbatch) ||
+				 (hashtable->nbuckets_original != hashtable->nbuckets))
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-							 "Buckets: %d  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
-							 hashtable->nbuckets, hashtable->nbatch,
-							 hashtable->nbatch_original, spacePeakKb);
+							 "Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+							 hashtable->nbuckets, hashtable->nbuckets_original,
+							 hashtable->nbatch, hashtable->nbatch_original, spacePeakKb);
 		}
 		else
 		{
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index fa0e700a..4f00426 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -39,6 +39,7 @@
 
 
 static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
+static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
 static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node,
 	  int mcvsToUse);
 static void ExecHashSkewTableInsert(HashJoinTable hashtable,
@@ -49,8 +50,8 @@ static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
 
 static void *dense_alloc(HashJoinTable hashtable, Size size);
 
-/* Memory needed for buckets. */
-#define BUCKETS_SPACE(htab)	((htab)->nbuckets * sizeof(HashJoinTuple))
+/* Memory needed for optimal number of buckets. */
+#define BUCKETS_SPACE(htab)	((htab)->nbuckets_optimal * sizeof(HashJoinTuple))
 
 /* 
  *		ExecHash
@@ -120,6 +121,7 @@ MultiExecHash(HashState *node)
 				/* It's a skew tuple, so put it into that hash table */
 				ExecHashSkewTableInsert(hashtable, slot, hashvalue,
 										bucketNumber);
+				hashtable->skewTuples += 1;
 			}
 			else
 			{
@@ -130,6 +132,25 @@ MultiExecHash(HashState *node)
 		}
 	}
 
+	/* resize the hash table if needed (NTUP_PER_BUCKET exceeded) */
+	if (hashtable->nbuckets != hashtable->nbuckets_optimal)
+	{
+		/* We never decrease the number of buckets. */
+		Assert(hashtable->nbuckets_optimal > hashtable->nbuckets);
+
+#ifdef HJDEBUG
+		printf("Increasing nbuckets %d => %d\n",
+			   hashtable->nbuckets, hashtable->nbuckets_optimal);
+#endif
+
+		ExecHashIncreaseNumBuckets(hashtable);
+	}
+
+	/* Account for the buckets in spaceUsed (reported in EXPLAIN ANALYZE) */
+	hashtable->spaceUsed += 

Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-10 Thread Heikki Linnakangas

On 09/10/2014 01:49 AM, Tomas Vondra wrote:

On 9.9.2014 16:09, Robert Haas wrote:

On Mon, Sep 8, 2014 at 5:53 PM, Tomas Vondra t...@fuzzy.cz wrote:

So I only posted the separate patch for those who want to do a review,
and then a complete patch with both parts combined. But it sure may be
a bit confusing.


Let's do this: post each new version of the patches only on this
thread, as a series of patches that can be applied one after another.


OK, attached. Apply in this order

1) dense-alloc-v5.patch
2) hashjoin-nbuckets-v12.patch


The dense-alloc-v5.patch looks good to me. I have committed that with 
minor cleanup (more comments below). I have not looked at the second patch.



I also did a few 'minor' changes to the dense allocation patch, most
notably:

* renamed HashChunk/HashChunkData to MemoryChunk/MemoryChunkData
   The original naming seemed a bit awkward.


That's too easy to confuse with regular MemoryContext and AllocChunk 
stuff. I renamed it to HashMemoryChunk.



* the chunks size is 32kB (instead of 16kB), and we're using 1/4
   threshold for 'oversized' items

   We need the threshold to be >=8kB, to trigger the special case
   within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION.


Should we care about the fact that if there are only a few tuples, we 
will nevertheless waste 32kB of memory for the chunk? I guess not, but I 
thought I'd mention it. The smallest allowed value for work_mem is 64kB.



I also considered Heikki's suggestion that while rebatching, we can
reuse chunks with a single large tuple, instead of copying it
needlessly. My attempt however made the code much uglier (I almost used
a GOTO, but my hands refused to type it ...). I'll look into that.


I tried constructing a test case where the rehashing time would be 
significant enough for this to matter, but I couldn't find one. Even if 
the original estimate on the number of batches is way too small, we ramp 
up quickly to a reasonable size and the rehashing time is completely 
insignificant compared to all the other work. So I think we can drop 
that idea.


- Heikki





Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-10 Thread Robert Haas
On Wed, Sep 10, 2014 at 2:25 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 The dense-alloc-v5.patch looks good to me. I have committed that with minor
 cleanup (more comments below). I have not looked at the second patch.

Gah.  I was in the middle of doing this.  Sigh.

 * the chunks size is 32kB (instead of 16kB), and we're using 1/4
threshold for 'oversized' items

 We need the threshold to be >=8kB, to trigger the special case
within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION.

 Should we care about the fact that if there are only a few tuples, we will
 nevertheless waste 32kB of memory for the chunk? I guess not, but I thought
 I'd mention it. The smallest allowed value for work_mem is 64kB.

I think we should change the threshold here to 1/8th.  The worst case
memory wastage as-is ~32k/5 > 6k.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-10 Thread Heikki Linnakangas

On 09/10/2014 09:31 PM, Robert Haas wrote:

On Wed, Sep 10, 2014 at 2:25 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

The dense-alloc-v5.patch looks good to me. I have committed that with minor
cleanup (more comments below). I have not looked at the second patch.


Gah.  I was in the middle of doing this.  Sigh.


Sorry.


* the chunks size is 32kB (instead of 16kB), and we're using 1/4
threshold for 'oversized' items

 We need the threshold to be >=8kB, to trigger the special case
within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION.


Should we care about the fact that if there are only a few tuples, we will
nevertheless waste 32kB of memory for the chunk? I guess not, but I thought
I'd mention it. The smallest allowed value for work_mem is 64kB.


I think we should change the threshold here to 1/8th.  The worst case
 memory wastage as-is ~32k/5 > 6k.


Not sure how you arrived at that number. The worst case is that each 32k 
chunk is filled up to 24k+1 bytes, i.e. 8k-1 bytes is wasted in each 
chunk. That's 25% wastage. That's not too bad IMHO - the worst case 
waste of AllocSet is 50%.
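The 25% figure can be checked with a few lines of arithmetic (a sketch assuming the 32kB chunk size and the 1/4 oversize threshold from the patch; names are illustrative):

```c
#include <assert.h>

/* Worst case for the dense allocator: a 32kB chunk already holding
 * 24k+1 bytes cannot fit another tuple just under the 8kB oversize
 * threshold, so just under 8kB of the chunk is wasted. */
enum
{
    HASH_CHUNK_SIZE      = 32 * 1024,
    HASH_CHUNK_THRESHOLD = HASH_CHUNK_SIZE / 4   /* oversized cutoff */
};

static int
worst_case_waste(void)
{
    int used = 3 * HASH_CHUNK_THRESHOLD + 1;     /* 24k + 1 bytes */

    return HASH_CHUNK_SIZE - used;               /* 8k - 1 bytes */
}
```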


But if you want to twiddle it, feel free. Note that there's some 
interplay with allocset code, like Tomas mentioned. If the threshold is 
< 8k, palloc() will round up request sizes smaller than 8k anyway. That 
wastes some memory, nothing more serious than that, but it seems like a 
good idea to avoid it.


- Heikki





Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-10 Thread Tomas Vondra
On 10.9.2014 20:25, Heikki Linnakangas wrote:
 On 09/10/2014 01:49 AM, Tomas Vondra wrote:
 I also did a few 'minor' changes to the dense allocation patch, most
 notably:

 * renamed HashChunk/HashChunkData to MemoryChunk/MemoryChunkData
The original naming seemed a bit awkward.
 
 That's too easy to confuse with regular MemoryContext and AllocChunk
 stuff. I renamed it to HashMemoryChunk.

OK.

 
 * the chunks size is 32kB (instead of 16kB), and we're using 1/4
threshold for 'oversized' items

    We need the threshold to be >=8kB, to trigger the special case
within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION.
 
 Should we care about the fact that if there are only a few tuples, we
 will nevertheless waste 32kB of memory for the chunk? I guess not, but I
 thought I'd mention it. The smallest allowed value for work_mem is 64kB.

I don't think that's a problem.

 I also considered Heikki's suggestion that while rebatching, we can
 reuse chunks with a single large tuple, instead of copying it
 needlessly. My attempt however made the code much uglier (I almost used
 a GOTO, but my hands refused to type it ...). I'll look into that.
 
 I tried constructing a test case where the rehashing time would be 
 significant enough for this to matter, but I couldn't find one. Even
 if the original estimate on the number of batches is way too small,
 we ramp up quickly to a reasonable size and the rehashing time is
 completely insignificant compared to all the other work. So I think
 we can drop that idea.

I don't think that had anything to do with rehashing. What you pointed
out is that when rebatching, we have to keep ~50% of the tuples, which
means 'copy into a new chunk, then throw away the old ones'.

For large tuples (you mentioned 100MB tuples, IIRC), we needlessly
allocate this amount of memory only to copy the single tuple and then
throw away the old tuple. So (a) that's an additional memcpy, and (b) it
allocates additional 100MB of memory for a short period of time.

Now, I'd guess when dealing with tuples this large, there will be more
serious trouble elsewhere, but I don't want to neglect it. Maybe it's
worth optimizing?

Tomas




Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-10 Thread Tomas Vondra
On 10.9.2014 20:31, Robert Haas wrote:
 On Wed, Sep 10, 2014 at 2:25 PM, Heikki Linnakangas
 hlinnakan...@vmware.com wrote:
 The dense-alloc-v5.patch looks good to me. I have committed that with minor
 cleanup (more comments below). I have not looked at the second patch.
 
 Gah.  I was in the middle of doing this.  Sigh.
 
 * the chunks size is 32kB (instead of 16kB), and we're using 1/4
threshold for 'oversized' items

    We need the threshold to be >=8kB, to trigger the special case
within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION.

 Should we care about the fact that if there are only a few tuples, we will
 nevertheless waste 32kB of memory for the chunk? I guess not, but I thought
 I'd mention it. The smallest allowed value for work_mem is 64kB.
 
 I think we should change the threshold here to 1/8th.  The worst case
  memory wastage as-is ~32k/5 > 6k.

So you'd lower the threshold to 4kB? That may lower the wastage in the
chunks, but palloc will actually allocate 8kB anyway, wasting up to
additional 4kB. So I don't see how lowering the threshold to 1/8th
improves the situation ...

Tomas




Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-10 Thread Robert Haas
On Wed, Sep 10, 2014 at 3:02 PM, Tomas Vondra t...@fuzzy.cz wrote:
 On 10.9.2014 20:31, Robert Haas wrote:
 On Wed, Sep 10, 2014 at 2:25 PM, Heikki Linnakangas
 hlinnakan...@vmware.com wrote:
 The dense-alloc-v5.patch looks good to me. I have committed that with minor
 cleanup (more comments below). I have not looked at the second patch.

 Gah.  I was in the middle of doing this.  Sigh.

 * the chunks size is 32kB (instead of 16kB), and we're using 1/4
threshold for 'oversized' items

    We need the threshold to be >=8kB, to trigger the special case
within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION.

 Should we care about the fact that if there are only a few tuples, we will
 nevertheless waste 32kB of memory for the chunk? I guess not, but I thought
 I'd mention it. The smallest allowed value for work_mem is 64kB.

 I think we should change the threshold here to 1/8th.  The worst case
  memory wastage as-is ~32k/5 > 6k.

 So you'd lower the threshold to 4kB? That may lower the wastage in the
 chunks, but palloc will actually allocate 8kB anyway, wasting up to
 additional 4kB. So I don't see how lowering the threshold to 1/8th
 improves the situation ...

Ah, OK.  Well, never mind then.  :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-10 Thread Tomas Vondra
On 10.9.2014 20:55, Heikki Linnakangas wrote:
 On 09/10/2014 09:31 PM, Robert Haas wrote:
 * the chunks size is 32kB (instead of 16kB), and we're using 1/4
 threshold for 'oversized' items

 We need the threshold to be >=8kB, to trigger the special case
 within AllocSet. The 1/4 rule is consistent with
 ALLOC_CHUNK_FRACTION.

 Should we care about the fact that if there are only a few tuples, we
 will
 nevertheless waste 32kB of memory for the chunk? I guess not, but I
 thought
 I'd mention it. The smallest allowed value for work_mem is 64kB.

 I think we should change the threshold here to 1/8th.  The worst case
  memory wastage as-is ~32k/5 > 6k.
 
 Not sure how you arrived at that number. The worst case is that each 32k
 chunk is filled up to 24k+1 bytes, i.e. 8k-1 bytes is wasted in each
 chunk. That's 25% wastage. That's not too bad IMHO - the worst case
 waste of AllocSet is 50%.
 
 But if you want to twiddle it, feel free. Note that there's some
 interplay with allocset code, like Tomas mentioned. If the threshold is
  < 8k, palloc() will round up request sizes smaller than 8k anyway. That
 wastes some memory, nothing more serious than that, but it seems like a
 good idea to avoid it.

Yes, and pfree then keeps these blocks around, which risks retaining
memory that won't be reused. That's why I chose the 8kB threshold. So
I'd keep the 32kB/8kB, as proposed in the patch.

But I don't see this as a big issue - my experience is that tuples are
either much smaller than 4kB, or much larger than 8kB. And
even for the tuples between 4kB and 8kB, the waste is not that bad.

regards
Tomas




Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-10 Thread Tomas Vondra
On 10.9.2014 20:25, Heikki Linnakangas wrote:
 On 09/10/2014 01:49 AM, Tomas Vondra wrote:
 I also did a few 'minor' changes to the dense allocation patch, most
 notably:

 * renamed HashChunk/HashChunkData to MemoryChunk/MemoryChunkData
The original naming seemed a bit awkward.
 
 That's too easy to confuse with regular MemoryContext and AllocChunk
 stuff. I renamed it to HashMemoryChunk.

BTW this breaks the second patch, which is allocating new chunks when
resizing the hash table. Should I rebase the patch, or can the committer
do s/MemoryChunk/HashMemoryChunk/ ?

Assuming there are no more fixes needed, of course.

Tomas





Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-10 Thread Robert Haas
On Wed, Sep 10, 2014 at 3:12 PM, Tomas Vondra t...@fuzzy.cz wrote:
 On 10.9.2014 20:25, Heikki Linnakangas wrote:
 On 09/10/2014 01:49 AM, Tomas Vondra wrote:
 I also did a few 'minor' changes to the dense allocation patch, most
 notably:

 * renamed HashChunk/HashChunkData to MemoryChunk/MemoryChunkData
The original naming seemed a bit awkward.

 That's too easy to confuse with regular MemoryContext and AllocChunk
 stuff. I renamed it to HashMemoryChunk.

 BTW this breaks the second patch, which is allocating new chunks when
 resizing the hash table. Should I rebase the patch, or can the committer
 do s/MemoryChunk/HashMemoryChunk/ ?

 Assuming there are no more fixes needed, of course.

Rebasing it will save the committer time, which will increase the
chances that one will look at your patch.  So it's highly recommended.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-10 Thread Tomas Vondra
On 10.9.2014 21:34, Robert Haas wrote:
 On Wed, Sep 10, 2014 at 3:12 PM, Tomas Vondra t...@fuzzy.cz wrote:
 On 10.9.2014 20:25, Heikki Linnakangas wrote:
 On 09/10/2014 01:49 AM, Tomas Vondra wrote:
 I also did a few 'minor' changes to the dense allocation patch, most
 notably:

 * renamed HashChunk/HashChunkData to MemoryChunk/MemoryChunkData
The original naming seemed a bit awkward.

 That's too easy to confuse with regular MemoryContext and AllocChunk
 stuff. I renamed it to HashMemoryChunk.

 BTW this breaks the second patch, which is allocating new chunks when
 resizing the hash table. Should I rebase the patch, or can the committer
 do s/MemoryChunk/HashMemoryChunk/ ?

 Assuming there are no more fixes needed, of course.
 
 Rebasing it will save the committer time, which will increase the
 chances that one will look at your patch.  So it's highly recommended.

OK. So here's v13 of the patch, reflecting this change.

regards
Tomas
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 781a736..d128e08 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1900,18 +1900,21 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		if (es->format != EXPLAIN_FORMAT_TEXT)
 		{
 			ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es);
+			ExplainPropertyLong("Original Hash Buckets",
+								hashtable->nbuckets_original, es);
 			ExplainPropertyLong("Hash Batches", hashtable->nbatch, es);
 			ExplainPropertyLong("Original Hash Batches",
 								hashtable->nbatch_original, es);
 			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 		}
-		else if (hashtable->nbatch_original != hashtable->nbatch)
+		else if ((hashtable->nbatch_original != hashtable->nbatch) ||
+				 (hashtable->nbuckets_original != hashtable->nbuckets))
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-							 "Buckets: %d  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
-							 hashtable->nbuckets, hashtable->nbatch,
-							 hashtable->nbatch_original, spacePeakKb);
+							 "Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+							 hashtable->nbuckets, hashtable->nbuckets_original,
+							 hashtable->nbatch, hashtable->nbatch_original, spacePeakKb);
 		}
 		else
 		{
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 3ef7cfb..4f00426 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -39,6 +39,7 @@
 
 
 static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
+static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
 static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node,
 	  int mcvsToUse);
 static void ExecHashSkewTableInsert(HashJoinTable hashtable,
@@ -49,6 +50,9 @@ static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
 
 static void *dense_alloc(HashJoinTable hashtable, Size size);
 
+/* Memory needed for optimal number of buckets. */
+#define BUCKETS_SPACE(htab)	((htab)->nbuckets_optimal * sizeof(HashJoinTuple))
+
 /* 
  *		ExecHash
  *
@@ -117,6 +121,7 @@ MultiExecHash(HashState *node)
 				/* It's a skew tuple, so put it into that hash table */
 				ExecHashSkewTableInsert(hashtable, slot, hashvalue,
 										bucketNumber);
+				hashtable->skewTuples += 1;
 			}
 			else
 			{
@@ -127,6 +132,25 @@ MultiExecHash(HashState *node)
 		}
 	}
 
+	/* resize the hash table if needed (NTUP_PER_BUCKET exceeded) */
+	if (hashtable->nbuckets != hashtable->nbuckets_optimal)
+	{
+		/* We never decrease the number of buckets. */
+		Assert(hashtable->nbuckets_optimal > hashtable->nbuckets);
+
+#ifdef HJDEBUG
+		printf("Increasing nbuckets %d => %d\n",
+			   hashtable->nbuckets, hashtable->nbuckets_optimal);
+#endif
+
+		ExecHashIncreaseNumBuckets(hashtable);
+	}
+
+	/* Account for the buckets in spaceUsed (reported in EXPLAIN ANALYZE) */
+	hashtable->spaceUsed += BUCKETS_SPACE(hashtable);
+	if (hashtable->spaceUsed > hashtable->spacePeak)
+			hashtable->spacePeak = hashtable->spaceUsed;
+
 	/* must provide our own instrumentation support */
 	if (node->ps.instrument)
 		InstrStopNode(node->ps.instrument, hashtable->totalTuples);
@@ -272,7 +296,10 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	 */
 	hashtable = (HashJoinTable) palloc(sizeof(HashJoinTableData));
 	hashtable->nbuckets = nbuckets;
+	hashtable->nbuckets_original = nbuckets;
+	hashtable->nbuckets_optimal = nbuckets;
 	hashtable->log2_nbuckets = log2_nbuckets;
+	hashtable->log2_nbuckets_optimal = log2_nbuckets;
 	hashtable->buckets = NULL;
 	hashtable->keepNulls = keepNulls;
 	hashtable->skewEnabled = false;
@@ -286,6 +313,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	hashtable->nbatch_outstart = nbatch;
 	hashtable->growEnabled = true;
 	hashtable->totalTuples = 0;
+	hashtable->skewTuples = 0;
 	hashtable->innerBatchFile = NULL;
 	hashtable->outerBatchFile = NULL;
 	

Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-09 Thread Robert Haas
On Mon, Sep 8, 2014 at 5:53 PM, Tomas Vondra t...@fuzzy.cz wrote:
 So I only posted the separate patch for those who want to do a review,
 and then a complete patch with both parts combined. But it sure may be
 a bit confusing.

Let's do this: post each new version of the patches only on this
thread, as a series of patches that can be applied one after another.

 In ExecChooseHashTableSize(), if buckets_bytes is independent of
 nbatch, could we just compute it before working out dbatch, and just
 deduct it from hash_table_bytes? That seems like it'd do the same
 thing for less work.

 Well, we can do that. But I really don't think that's going to make
 measurable difference. And I think the code is more clear this way.

Really?  It seems to me that you're doing more or less the same
calculation that's already been done a second time.  It took me 20
minutes of staring to figure out what it was really doing.  Now
admittedly I was a little low on caffeine, but...

 Either way, what happens if buckets_bytes is already bigger than
 hash_table_bytes?

 Hm, I don't see how that could happen.

How about an Assert() and a comment, then?

 A few more stylistic issues that I see:

  +   if ((hashtable->nbatch == 1) &&
  +   if (hashtable->nbuckets_optimal <= (INT_MAX/2))
  +   if (size > (HASH_CHUNK_SIZE/8))

 While I'm all in favor of using parentheses generously where it's
 needed to add clarity, these and similar usages seem over the top to
 me.

 It seemed more readable for me at the time of coding it, and it still
 seems better this way, but ...

 https://www.youtube.com/watch?v=CxK_nA2iVXw

Heh.  Well, at any rate, I think PostgreSQL style is to not use
parentheses in that situation.

 +char * chunk_alloc(HashJoinTable hashtable, int size) {

 Eh... no.

 +   /* XXX maybe we should use MAXALIGN(size) here ... */

 +1.

 I'm not sure what the 'no' is pointing at? Maybe that the parenthesis
 should be on the next line? And is the +1 about doing MAXALIGN?

The no is about the fact that you have the return type, the function
name, and the opening brace on one line instead of three lines as is
project style.  The +1 is for applying MAXALIGN to the size.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-09 Thread Tomas Vondra
On 9.9.2014 16:09, Robert Haas wrote:
 On Mon, Sep 8, 2014 at 5:53 PM, Tomas Vondra t...@fuzzy.cz wrote:
 So I only posted the separate patch for those who want to do a review,
 and then a complete patch with both parts combined. But it sure may be
 a bit confusing.
 
 Let's do this: post each new version of the patches only on this
 thread, as a series of patches that can be applied one after another.

OK, attached. Apply in this order

1) dense-alloc-v5.patch
2) hashjoin-nbuckets-v12.patch

 In ExecChooseHashTableSize(), if buckets_bytes is independent of
 nbatch, could we just compute it before working out dbatch, and just
 deduct it from hash_table_bytes? That seems like it'd do the same
 thing for less work.

 Well, we can do that. But I really don't think that's going to make
 measurable difference. And I think the code is more clear this way.
 
 Really?  It seems to me that you're doing more or less the same
 calculation that's already been done a second time.  It took me 20
 minutes of staring to figure out what it was really doing.  Now
 admittedly I was a little low on caffeine, but...

I reworked this part a bit, to make it easier to understand. Mostly
following your recommendations.

 
 Either way, what happens if buckets_bytes is already bigger than
 hash_table_bytes?

 Hm, I don't see how that could happen.
 
 How about an Assert() and a comment, then?

Done. According to the comment, we should never really get over 25%,
except maybe for strange work_mem values, because we're keeping nbuckets
as 2^N. But even then we should not reach 50%, so I added this assert:

   Assert(buckets_bytes <= hash_table_bytes/2);

And then we subtract the buckets_bytes, and continue with the loop.
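In code form, the sizing logic described here is roughly the following (a sketch with illustrative names and structure, not the actual ExecChooseHashTableSize):

```c
#include <assert.h>
#include <stddef.h>

/* Sketch: reserve the bucket array out of the memory budget first,
 * then pick the number of batches from what is left. Assumes
 * NTUP_PER_BUCKET = 1 and nbuckets already rounded to a power of two. */
static int
choose_nbatch(size_t ntuples, size_t tuple_size,
              size_t nbuckets, size_t hash_table_bytes)
{
    size_t buckets_bytes = nbuckets * sizeof(void *);

    /* rounding nbuckets to 2^N keeps the array under half the budget */
    assert(buckets_bytes <= hash_table_bytes / 2);

    size_t tuple_space = hash_table_bytes - buckets_bytes;
    size_t inner_bytes = ntuples * tuple_size;

    int nbatch = 1;
    while ((size_t) nbatch * tuple_space < inner_bytes)
        nbatch <<= 1;               /* nbatch kept a power of two */
    return nbatch;
}
```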

 
 A few more stylistic issues that I see:

 +   if ((hashtable->nbatch == 1) &&
 +   if (hashtable->nbuckets_optimal <= (INT_MAX/2))
 +   if (size > (HASH_CHUNK_SIZE/8))

 While I'm all in favor of using parentheses generously where it's
 needed to add clarity, these and similar usages seem over the top to
 me.

 It seemed more readable for me at the time of coding it, and it still
 seems better this way, but ...

 https://www.youtube.com/watch?v=CxK_nA2iVXw
 
 Heh.  Well, at any rate, I think PostgreSQL style is to not use
 parentheses in that situation.

FWIW, I added HASH_CHUNK_THRESHOLD macro, specifying the threshold. I
also simplified the conditions a bit, and removed some of the parentheses.

 
 +char * chunk_alloc(HashJoinTable hashtable, int size) {

 Eh... no.

 +   /* XXX maybe we should use MAXALIGN(size) here ... */

 +1.

 I'm not sure what the 'no' is pointing at? Maybe that the parenthesis
 should be on the next line? And is the +1 about doing MAXALIGN?
 
 The no is about the fact that you have the return type, the function
 name, and the opening brace on one line instead of three lines as is
 project style.  The +1 is for applying MAXALIGN to the size.

OK, fixed.

I also did a few 'minor' changes to the dense allocation patch, most
notably:

* renamed HashChunk/HashChunkData to MemoryChunk/MemoryChunkData
  The original naming seemed a bit awkward.

* renamed the function to 'dense_alloc' (instead of 'chunk_alloc')
  Seems like a better fit to what the function does.

* the chunks size is 32kB (instead of 16kB), and we're using 1/4
  threshold for 'oversized' items

  We need the threshold to be >=8kB, to trigger the special case
  within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION.

I also considered Heikki's suggestion that while rebatching, we can
reuse chunks with a single large tuple, instead of copying it
needlessly. My attempt however made the code much uglier (I almost used
a GOTO, but my hands refused to type it ...). I'll look into that.


Tomas

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 589b2f1..d5acfb1 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -47,6 +47,7 @@ static void ExecHashSkewTableInsert(HashJoinTable hashtable,
 		int bucketNumber);
 static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
 
+static char * dense_alloc(HashJoinTable hashtable, int tupleSize);
 
 /* 
  *		ExecHash
@@ -294,6 +295,8 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	hashtable->spaceAllowedSkew =
 		hashtable->spaceAllowed * SKEW_WORK_MEM_PERCENT / 100;
 
+	hashtable->chunks = NULL;
+
 	/*
 	 * Get info about the hash functions to be used for each hash key. Also
 	 * remember whether the join operators are strict.
@@ -556,10 +559,10 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	int			oldnbatch = hashtable->nbatch;
 	int			curbatch = hashtable->curbatch;
 	int			nbatch;
-	int			i;
 	MemoryContext oldcxt;
 	long		ninmemory;
 	long		nfreed;
+	MemoryChunk	oldchunks = hashtable->chunks;
 
 	/* do nothing if we've decided to shut off growth */
 	if 

Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-08 Thread Robert Haas
On Fri, Sep 5, 2014 at 3:23 PM, Tomas Vondra t...@fuzzy.cz wrote:
 as Heikki mentioned in his commitfest status message, this patch still
 has no reviewers :-( Is there anyone willing to pick up that, together
 with the 'dense allocation' patch (as those two are closely related)?

 I'm looking in Robert's direction, as he's the one who came up with the
 dense allocation idea, and also commented on the hashjoin bucket resizing
 patch ...

 I'd ask Pavel Stehule, but as he's sitting next to me in the office he's
 not really independent ;-) I'll do some reviews on the other patches
 over the weekend, to balance this of course.

Will any of those patches be, heh heh, mine?

I am a bit confused by the relationship between the two patches you
posted.  The combined patch applies, but the other one does not.  If
this is a sequence of two patches, it seems like it would be more
helpful to post A and B rather than B and A+B.  Perhaps the other
patch is on a different thread but there's not an obvious reference to
said thread here.

In ExecChooseHashTableSize(), if buckets_bytes is independent of
nbatch, could we just compute it before working out nbatch, and just
deduct it from hash_table_bytes?  That seems like it'd do the same
thing for less work.  Either way, what happens if buckets_bytes is
already bigger than hash_table_bytes?

A few more stylistic issues that I see:

+   if ((hashtable->nbatch == 1) &&
+   if (hashtable->nbuckets_optimal <= (INT_MAX/2))
+   if (size > (HASH_CHUNK_SIZE/8))

While I'm all in favor of using parentheses generously where it's
needed to add clarity, these and similar usages seem over the top to
me.

On a related note, the first of these reads like this if (stuff) { if
(more stuff) { do things } } which makes one wonder why we need two if
statements.

+
+   /* used for dense allocation of tuples (into linked chunks) */
+   HashChunk   chunks_batch;   /*  one list for the whole batch */
+
 }  HashJoinTableData;

If the original coding didn't have a blank line between the last
structure member and the closing brace, it's probably not right to
make it that way now.  There are similar issues at the end of some
functions you modified, and a few other places (like the new code in
ExecChooseHashTableSize and chunk_alloc) where there are extra blank
lines at the starts and ends of blocks.

+char * chunk_alloc(HashJoinTable hashtable, int size) {

Eh... no.

+   /* XXX maybe we should use MAXALIGN(size) here ... */

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-08 Thread Tomas Vondra
On 8.9.2014 22:44, Robert Haas wrote:
 On Fri, Sep 5, 2014 at 3:23 PM, Tomas Vondra t...@fuzzy.cz wrote:
 as Heikki mentioned in his commitfest status message, this patch
 still has no reviewers :-( Is there anyone willing to pick up that,
 together with the 'dense allocation' patch (as those two are
 closely related)?

 I'm looking in Robert's direction, as he's the one who came up with
 the dense allocation idea, and also commented on the hashjoin bucket
 resizing patch ...

 I'd ask Pavel Stehule, but as he's sitting next to me in the office
 he's not really independent ;-) I'll do some reviews on the other
 patches over the weekend, to balance this of course.
 
 Will any of those patches be, heh heh, mine?

I'll exchange some of the credit for +1. Deal? ;-)

 I am a bit confused by the relationship between the two patches you 
 posted. The combined patch applies, but the other one does not. If 
 this is a sequence of two patches, it seems like it would be more 
 helpful to post A and B rather than B and A+B. Perhaps the other 
 patch is on a different thread but there's not an obvious reference
 to said thread here.

Yeah, it's probably a bit confusing. The thing is the dense allocation
idea was discussed in a different thread, so I posted the patch there:

  http://www.postgresql.org/message-id/53cbeb8a@fuzzy.cz

The patch tweaking hash join buckets builds on the dense allocation,
because it (a) makes up for the memory used for more buckets and (b)
it's actually easier to rebuild the buckets this way.

So I only posted the separate patch for those who want to do a review,
and then a complete patch with both parts combined. But it sure may be
a bit confusing.


 In ExecChooseHashTableSize(), if buckets_bytes is independent of 
 nbatch, could we just compute it before working out nbatch, and just 
 deduct it from hash_table_bytes? That seems like it'd do the same 
 thing for less work.

Well, we can do that. But I really don't think that's going to make
measurable difference. And I think the code is more clear this way.

 Either way, what happens if buckets_bytes is already bigger than 
 hash_table_bytes?

Hm, I don't see how that could happen.

FWIW, I think the first buckets_bytes formula is actually wrong:

   buckets_bytes
 = sizeof(HashJoinTuple) * my_log2(ntuples / NTUP_PER_BUCKET);

and should be

   buckets_bytes =
 sizeof(HashJoinTuple) * (1 << my_log2(ntuples / NTUP_PER_BUCKET));

instead. Also, we might consider that we never use less than 1024
buckets (but that's minor issue, I guess).
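To see the difference between the two formulas, here is a small self-contained sketch; `my_log2` is modeled on PostgreSQL's ceil-log2 helper, and `HashJoinTuple` is declared as an opaque pointer type standing in for the real one:

```c
#include <assert.h>
#include <stddef.h>

typedef struct HashJoinTupleData *HashJoinTuple;  /* opaque, pointer-sized */

/* ceil(log2(num)), modeled on PostgreSQL's my_log2() */
static int
my_log2(long num)
{
    int  i;
    long limit;

    for (i = 0, limit = 1; limit < num; i++, limit <<= 1)
        ;
    return i;
}

/* the formula as posted: multiplies by the *exponent*, not the bucket count */
static long
buckets_bytes_wrong(long ntuples, int ntup_per_bucket)
{
    return sizeof(HashJoinTuple) * my_log2(ntuples / ntup_per_bucket);
}

/* the intended size: one pointer per bucket, nbuckets rounded up to 2^k */
static long
buckets_bytes_right(long ntuples, int ntup_per_bucket)
{
    return sizeof(HashJoinTuple) * (1L << my_log2(ntuples / ntup_per_bucket));
}
```

For a million tuples at NTUP_PER_BUCKET=1 the first formula yields a few hundred bytes where roughly 8MB of bucket pointers is intended, which is why the missing shift matters.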


But once we get into the need batching branch, we do this:

   lbuckets = 1 << my_log2(hash_table_bytes / (tupsize * NTUP_PER_BUCKET
+ sizeof(HashJoinTuple)));

which includes both the bucket (pointer) and tuple size, and IMHO
guarantees that bucket_bytes will never be over work_mem (which is what
hash_table_bytes is).

The only case where this might happen is (tupsize < 8), thanks to
my_log2 (getting 50% of work_mem + epsilon, doubled to 100% of work_mem).

But tupsize is defined as this:

tupsize = HJTUPLE_OVERHEAD +
MAXALIGN(sizeof(MinimalTupleData)) +
MAXALIGN(tupwidth);

and HJTUPLE_OVERHEAD alone is 12B, MinimalTupleData is 11B (ignoring
alignment) etc.

So I believe this is safe ...
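A quick sanity check of that argument, with `MAXALIGN` written out and the 12B/11B figures quoted above plugged in as stand-in constants (the real values are platform-dependent, so treat the exact numbers as assumptions):

```c
#include <assert.h>

/* MAXALIGN in the style of PostgreSQL's c.h, assuming 8-byte max alignment */
#define MAXIMUM_ALIGNOF 8
#define MAXALIGN(LEN) (((unsigned long) (LEN) + (MAXIMUM_ALIGNOF - 1)) \
                       & ~((unsigned long) (MAXIMUM_ALIGNOF - 1)))

/* stand-ins for the sizes quoted above: 12B overhead, 11B MinimalTupleData */
#define HJTUPLE_OVERHEAD 12
#define MINIMAL_TUPLE_DATA_SIZE 11

/* per-tuple size as computed in ExecChooseHashTableSize */
static unsigned long
tupsize(unsigned long tupwidth)
{
    return HJTUPLE_OVERHEAD + MAXALIGN(MINIMAL_TUPLE_DATA_SIZE)
         + MAXALIGN(tupwidth);
}
```

Even a zero-width tuple comes out well above 8 bytes, so the (tupsize < 8) corner case cannot actually be reached.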


 A few more stylistic issues that I see:
 
 +   if ((hashtable->nbatch == 1) &&
 +   if (hashtable->nbuckets_optimal <= (INT_MAX/2))
 +   if (size > (HASH_CHUNK_SIZE/8))
 
 While I'm all in favor of using parentheses generously where it's
 needed to add clarity, these and similar usages seem over the top to
 me.

It seemed more readable for me at the time of coding it, and it still
seems better this way, but ...

https://www.youtube.com/watch?v=CxK_nA2iVXw

 On a related note, the first of these reads like this if (stuff) { if
 (more stuff) { do things } } which makes one wonder why we need two if
 statements.

We probably don't ...

 
 +
 +   /* used for dense allocation of tuples (into linked chunks) */
 +   HashChunk   chunks_batch;   /*  one list for the whole batch */
 +
  }  HashJoinTableData;
 
 If the original coding didn't have a blank line between the last
 structure member and the closing brace, it's probably not right to
 make it that way now.  There are similar issues at the end of some
 functions you modified, and a few other places (like the new code in
 ExecChooseHashTableSize and chunk_alloc) where there are extra blank
 lines at the starts and ends of blocks.

I'll fix that. FWIW, those issues seem to be in the 'dense allocation'
patch.

 
 +char * chunk_alloc(HashJoinTable hashtable, int size) {
 
 Eh... no.
 
 +   /* XXX maybe we should use MAXALIGN(size) here ... */
 
 +1.

I'm not sure what the 'no' is pointing at. Maybe that the brace should
be on the next line? And is the +1 about doing MAXALIGN?

regards
Tomas



Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-05 Thread Tomas Vondra
Hi everyone,

as Heikki mentioned in his commitfest status message, this patch still
has no reviewers :-( Is there anyone willing to pick up that, together
with the 'dense allocation' patch (as those two are closely related)?

I'm looking in Robert's direction, as he's the one who came up with the
dense allocation idea, and also commented on the hashjoin bucket resizing
patch ...

I'd ask Pavel Stehule, but as he's sitting next to me in the office he's
not really independent ;-) I'll do some reviews on the other patches
over the weekend, to balance this of course.

regards
Tomas




Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-08-19 Thread Robert Haas
On Sat, Aug 16, 2014 at 9:31 AM, Tomas Vondra t...@fuzzy.cz wrote:
 On 12.8.2014 00:30, Tomas Vondra wrote:
 On 11.8.2014 20:25, Robert Haas wrote:
 It also strikes me that when there's only 1 batch, the set of bits
 that map onto the batch number is zero-width, and one zero-width bit
 range is as good as another.  In other words, if you're only planning
 to do one batch, you can easily grow the number of buckets on the fly.
 Growing the number of buckets only becomes difficult once you have
 more than one batch.

 ...

  I was considering reversing the bits of the hash, because that's
 pretty much the simplest solution. But I think you're right it might
 actually work like this:

   * Are more batches needed?

   (yes) => just use nbuckets = my_log2(work_mem / tuple_size)

   (no) => go ahead, until processing all tuples or hitting work_mem

   (work_mem) => meh, use the same nbuckets above

   (all tuples) => compute optimal nbuckets / resize


 But I need to think about this a bit. So far it seems to me there's no
 way additional batches might benefit from increasing nbuckets further.

 I think this is a simple and solid solution, solving the batchno
 computation issues quite nicely. Attached is v10 patch (bare and
 combined with the dense allocation), that does this:

 1) when we know we'll need batching, buckets are sized for full work_mem
(using the estimated tuple width, etc.)

 2) without the batching, we estimate the 'right number of buckets' for
the estimated number of tuples, and keep track of the optimal number
as tuples are added to the hash table

- if we discover we need to start batching, we keep the current
  optimal value (which should be the same as the max number of
  buckets) and don't mess with it anymore (making it possible to
  compute batch IDs just like before)

- also, on the first rebatch (nbatch=1 => nbatch=2) the hash table
  is resized as part of the rebatch

- if the hash build completes without batching, we do the resize

 I believe the patch is pretty much perfect. I plan to do more thorough
 testing on a wide range of queries in the next few days.

 I also removed the 'enable_hash_resize' GUC, because it would be more
 complex to implement this properly after doing the resize as part of
 rebatch etc.. So either it would make the patch more complex, or it
 wouldn't do what the name promises.

A variety of trivial comments on this:

PostgreSQL style is un-cuddled curly braces.  Also, multi-line
comments need to start with a line containing only /* and end with a
line containing only */.  In one place you've added curly braces
around a single-line block that is otherwise unmodified; please don't
do that.  In one place, you have "becase" instead of "because".  In
another place, you write "add if after it" but it should say "add it
after it" or maybe better "add the new one after it".  Avoid using
punctuation like "=>" in comments to illustrate the connection between
sentences; instead, use a connecting word like "then" or "therefore"
or whatever is appropriate; in this instance, a period followed by the
start of a new sentence seems sufficient.  Revert the removal of a
single line of whitespace near the top of nodeHash.c.

There are too many things marked XXX in this patch.  They should
either be fixed, if they are real problems, or they should be
commented in a way that doesn't give rise to the idea that they're
problems if they aren't.

OK, now on to some more substantive stuff:

1. It's not clear to me what the overall effect of this patch on
memory utilization is.  Reducing NTUP_PER_BUCKET from 10 to 1 is going
to use, on the average, 10x as much bucket-header memory per tuple.
Specifically, I think it means we'll use about 8 bytes of
bucket-header memory per tuple instead of 0.8 bytes per tuple.  If the
tuples are narrow, that could be significant; concerns have been
expressed about that here in the past.  Increasing the number of
buckets could also increase memory usage.  On the other hand, the
dense allocation stuff probably saves a ton of memory, so maybe we end
up ahead overall, but I'm not sure.  Your thoughts, and maybe some test
results with narrow and wide tuples, would be appreciated.
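For a rough feel of the numbers in point 1, a tiny sketch of the per-tuple bucket-header arithmetic, assuming 8-byte bucket pointers and a bucket array sized as ntuples / NTUP_PER_BUCKET (both assumptions, not measurements):

```c
#include <assert.h>

/* Bucket-header bytes charged to each tuple: one pointer per bucket,
 * spread over the NTUP_PER_BUCKET tuples a bucket is expected to hold. */
static double
header_bytes_per_tuple(int ntup_per_bucket)
{
    return (double) sizeof(void *) / ntup_per_bucket;
}
```

For a narrow 40B tuple the header overhead thus grows from ~2% to ~20% of the tuple size; for a 400B tuple it stays under 2% either way, which is why the narrow-tuple case is the one worth benchmarking.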

2. But, on the positive side, modulo the memory utilization questions
mentioned above, I would expect the impact on hash join performance to
be positive.  Going from 10 tuples per bucket to just 1 should help,
and on cases where the actual load factor would have ended up much
higher because of poor estimation, increasing the number of buckets on
the fly should help even more.  I haven't tested this, though.

I haven't had a chance to completely go through this yet, so these are
just some initial thoughts.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-08-19 Thread Tomas Vondra
On 19.8.2014 19:05, Robert Haas wrote:
 On Sat, Aug 16, 2014 at 9:31 AM, Tomas Vondra t...@fuzzy.cz wrote:
 On 12.8.2014 00:30, Tomas Vondra wrote:
 On 11.8.2014 20:25, Robert Haas wrote:
 It also strikes me that when there's only 1 batch, the set of bits
 that map onto the batch number is zero-width, and one zero-width bit
 range is as good as another.  In other words, if you're only planning
 to do one batch, you can easily grow the number of buckets on the fly.
 Growing the number of buckets only becomes difficult once you have
 more than one batch.

 ...

  I was considering reversing the bits of the hash, because that's
 pretty much the simplest solution. But I think you're right it might
 actually work like this:

   * Are more batches needed?

   (yes) => just use nbuckets = my_log2(work_mem / tuple_size)

   (no) => go ahead, until processing all tuples or hitting work_mem

   (work_mem) => meh, use the same nbuckets above

   (all tuples) => compute optimal nbuckets / resize


 But I need to think about this a bit. So far it seems to me there's no
 way additional batches might benefit from increasing nbuckets further.

 I think this is a simple and solid solution, solving the batchno
 computation issues quite nicely. Attached is v10 patch (bare and
 combined with the dense allocation), that does this:

 1) when we know we'll need batching, buckets are sized for full work_mem
(using the estimated tuple width, etc.)

 2) without the batching, we estimate the 'right number of buckets' for
the estimated number of tuples, and keep track of the optimal number
as tuples are added to the hash table

- if we discover we need to start batching, we keep the current
  optimal value (which should be the same as the max number of
  buckets) and don't mess with it anymore (making it possible to
  compute batch IDs just like before)

- also, on the first rebatch (nbatch=1 => nbatch=2) the hash table
  is resized as part of the rebatch

- if the hash build completes without batching, we do the resize

 I believe the patch is pretty much perfect. I plan to do more thorough
 testing on a wide range of queries in the next few days.

 I also removed the 'enable_hash_resize' GUC, because it would be more
 complex to implement this properly after doing the resize as part of
 rebatch etc.. So either it would make the patch more complex, or it
 wouldn't do what the name promises.
 
 A variety of trivial comments on this:
 
 PostgreSQL style is un-cuddled curly braces.  Also, multi-line
 comments need to start with a line containing only /* and end with a
 line containing only */.  In one place you've added curly braces
 around a single-line block that is otherwise unmodified; please don't
 do that.  In one place, you have "becase" instead of "because".  In
 another place, you write "add if after it" but it should say "add it
 after it" or maybe better "add the new one after it".  Avoid using
 punctuation like "=>" in comments to illustrate the connection between
 sentences; instead, use a connecting word like "then" or "therefore"
 or whatever is appropriate; in this instance, a period followed by the
 start of a new sentence seems sufficient.  Revert the removal of a
 single line of whitespace near the top of nodeHash.c.
 
 There are too many things marked XXX in this patch.  They should
 either be fixed, if they are real problems, or they should be
 commented in a way that doesn't give rise to the idea that they're
 problems if they aren't.

OK, thanks for pointing this out. Attached is v11 of the patch (both
separate and combined with the dense allocation, as before).

I fixed as many of those issues as possible. All the XXX items were
obsolete, except for one in the chunk_alloc function.

I have also removed one constant

 
 OK, now on to some more substantive stuff:
 
 1. It's not clear to me what the overall effect of this patch on
 memory utilization is.  Reducing NTUP_PER_BUCKET from 10 to 1 is going
 to use, on the average, 10x as much bucket-header memory per tuple.
 Specifically, I think it means we'll use about 8 bytes of
 bucket-header memory per tuple instead of 0.8 bytes per tuple.  If the
 tuples are narrow, that could be significant; concerns have been
 expressed about that here in the past.  Increasing the number of
 buckets could also increase memory usage.  On the other hand, the
 dense allocation stuff probably saves a ton of memory, so maybe we end
 up ahead overall, but I'm not sure.  Your thoughts, and maybe some test
 results with narrow and wide tuples, would be appreciated.

The effect of the dense allocation was briefly discussed in this thread,
along with some quick measurements:

http://www.postgresql.org/message-id/53beea9e.2080...@fuzzy.cz

The dense allocation removes pretty much all the palloc overhead. For a
40B tuple, I did get this before the dense allocation

   HashBatchContext: 1451221040 total in 182 blocks; 2826592 free (11
chunks); 

Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-08-16 Thread Tomas Vondra


On 12.8.2014 00:30, Tomas Vondra wrote:
 On 11.8.2014 20:25, Robert Haas wrote:
 It also strikes me that when there's only 1 batch, the set of bits
 that map onto the batch number is zero-width, and one zero-width bit
 range is as good as another.  In other words, if you're only planning
 to do one batch, you can easily grow the number of buckets on the fly.
 Growing the number of buckets only becomes difficult once you have
 more than one batch.

...

 I was considering reversing the bits of the hash, because that's
 pretty much the simplest solution. But I think you're right it might
 actually work like this:

   * Are more batches needed?

   (yes) => just use nbuckets = my_log2(work_mem / tuple_size)

   (no) => go ahead, until processing all tuples or hitting work_mem

   (work_mem) => meh, use the same nbuckets above

   (all tuples) => compute optimal nbuckets / resize


 But I need to think about this a bit. So far it seems to me there's no
 way additional batches might benefit from increasing nbuckets further.

I think this is a simple and solid solution, solving the batchno
computation issues quite nicely. Attached is v10 patch (bare and
combined with the dense allocation), that does this:

1) when we know we'll need batching, buckets are sized for full work_mem
   (using the estimated tuple width, etc.)

2) without the batching, we estimate the 'right number of buckets' for
   the estimated number of tuples, and keep track of the optimal number
   as tuples are added to the hash table

   - if we discover we need to start batching, we keep the current
 optimal value (which should be the same as the max number of
 buckets) and don't mess with it anymore (making it possible to
 compute batch IDs just like before)

   - also, on the first rebatch (nbatch=1 => nbatch=2) the hash table
 is resized as part of the rebatch

   - if the hash build completes without batching, we do the resize
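The resize itself can be sketched roughly like this: walk the dense-allocated chunks and re-insert every tuple into a freshly zeroed bucket array, which is what makes the chunk-based storage convenient here. Types and names below are simplified stand-ins for the real HashJoinTable structures (the real code stores MinimalTuples and MAXALIGNs the sizes):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* simplified stand-in for a tuple stored inline in a chunk */
typedef struct Tuple
{
    struct Tuple *next;        /* bucket chain link */
    unsigned int  hashvalue;   /* full hash of the join key */
    int           len;         /* bytes this tuple occupies in the chunk */
} Tuple;

/* simplified stand-in for a dense-allocated chunk */
typedef struct Chunk
{
    struct Chunk *next;        /* linked list of chunks */
    int           used;        /* bytes used in data[] */
    char          data[];
} Chunk;

/* Rebuild the bucket array from scratch by scanning the chunks directly,
 * mirroring the pattern of ExecHashIncreaseNumBuckets in the patch. */
static void
rebuild_buckets(Tuple **buckets, int nbuckets, Chunk *chunks)
{
    memset(buckets, 0, nbuckets * sizeof(Tuple *));

    for (Chunk *c = chunks; c != NULL; c = c->next)
    {
        int pos = 0;

        while (pos < c->used)
        {
            Tuple *t = (Tuple *) (c->data + pos);
            int    bucketno = t->hashvalue & (nbuckets - 1); /* nbuckets = 2^k */

            t->next = buckets[bucketno];
            buckets[bucketno] = t;
            pos += t->len;
        }
    }
}
```

Because the tuples themselves never move, only the bucket array and the chain pointers are touched, so the resize cost is linear in the number of stored tuples.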

I believe the patch is pretty much perfect. I plan to do more thorough
testing on a wide range of queries in the next few days.

I also removed the 'enable_hash_resize' GUC, because it would be more
complex to implement this properly after doing the resize as part of
rebatch etc.. So either it would make the patch more complex, or it
wouldn't do what the name promises.

regards
Tomas

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 781a736..ee3fffb 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1900,18 +1900,20 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		if (es->format != EXPLAIN_FORMAT_TEXT)
 		{
 			ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es);
+			ExplainPropertyLong("Original Hash Buckets",
+hashtable->nbuckets_original, es);
 			ExplainPropertyLong("Hash Batches", hashtable->nbatch, es);
 			ExplainPropertyLong("Original Hash Batches",
 hashtable->nbatch_original, es);
 			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 		}
-		else if (hashtable->nbatch_original != hashtable->nbatch)
+		else if ((hashtable->nbatch_original != hashtable->nbatch) || (hashtable->nbuckets_original != hashtable->nbuckets))
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-			"Buckets: %d  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
-			 hashtable->nbuckets, hashtable->nbatch,
-			 hashtable->nbatch_original, spacePeakKb);
+			"Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+			 hashtable->nbuckets, hashtable->nbuckets_original,
+			 hashtable->nbatch, hashtable->nbatch_original, spacePeakKb);
 		}
 		else
 		{
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 589b2f1..a5d4812 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -37,8 +37,8 @@
 #include "utils/lsyscache.h"
 #include "utils/syscache.h"
 
-
 static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
+static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
 static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node,
 	  int mcvsToUse);
 static void ExecHashSkewTableInsert(HashJoinTable hashtable,
@@ -47,6 +47,10 @@ static void ExecHashSkewTableInsert(HashJoinTable hashtable,
 		int bucketNumber);
 static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
 
+static char * chunk_alloc(HashJoinTable hashtable, int tupleSize);
+
+/* Memory needed for optimal number of buckets. */
+#define BUCKETS_SPACE(htab)	((htab)->nbuckets_optimal * sizeof(HashJoinTuple))
 
 /* 
  *		ExecHash
@@ -116,6 +120,7 @@ MultiExecHash(HashState *node)
 /* It's a skew tuple, so put it into that hash table */
 ExecHashSkewTableInsert(hashtable, slot, hashvalue,
 		bucketNumber);
+hashtable->skewTuples += 1;
 			}
 			else
 			{
@@ -126,10 +131,30 @@ 

Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-08-11 Thread Robert Haas
On Sat, Aug 9, 2014 at 9:13 AM, Tomas Vondra t...@fuzzy.cz wrote:
 Adding the least-significant bit does not work; we need to get back to adding
 the most-significant one. Not sure what's the least complex way to do
 that, though.

 I'm thinking about computing the nbuckets limit (how many buckets may
 fit into work_mem):

 tupsize = HJTUPLE_OVERHEAD +
 MAXALIGN(sizeof(MinimalTupleData)) +
 MAXALIGN(tupwidth);

 nbuckets_bits_max = my_log2(work_mem / tupsize)

 and then start with nbatches from this bit, like this:

  31          23 (max)          15            7            0
   |-----------|----------------|------------|------------|
   |  <- batches  |    free     |       <- buckets        |

That doesn't seem unreasonable provided there's enough bit space, but,
as you point out, the day might not be far off when there isn't.

It also strikes me that when there's only 1 batch, the set of bits
that map onto the batch number is zero-width, and one zero-width bit
range is as good as another.  In other words, if you're only planning
to do one batch, you can easily grow the number of buckets on the fly.
Growing the number of buckets only becomes difficult once you have
more than one batch.

And I mention that because, in the scenario mentioned in your original
post (a hash table with small number of buckets (8192) containing
large number of tuples [~3.3M]), presumably what happens is you start
out thinking you are going to have one batch with 8K buckets.  Then,
as more tuples continue to pour in, you see the load factor rising and
responding by bumping up the size of the hash table.  Now eventually
you realize, gosh, this isn't even going to fit into work_mem, so you
need to start batching it.  But by that point you've presumably done
as much as you're going to do in terms of increasing the number of
buckets; after that, you're just going to add more batches as
necessary.  So not being able to further increase the number of
buckets once the number of batches is no longer 1 wouldn't be a
problem in that case.

Now if you start out planning for multiple batches, and then you
discover you'd like to keep the same number of batches but have more
buckets per batch, that's gonna be hard.  It's not impossible, because
you could steal some of the unused high-order bits above the number of
batches, and then concatenate them with the low-order bits that you
already had, but that seems like kind of an ugly kludge, and AFAICS it
only helps if the new tuples are narrower by enough to justify the
cost of splitting all the buckets.  I won't say that couldn't happen,
but I think it would be better to keep that complexity out of the
patch unless we're absolutely certain it's justified.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-08-11 Thread Tomas Vondra
On 11.8.2014 20:25, Robert Haas wrote:
 On Sat, Aug 9, 2014 at 9:13 AM, Tomas Vondra t...@fuzzy.cz wrote:
 Adding the least-significant bit does not work; we need to get back to adding
 the most-significant one. Not sure what's the least complex way to do
 that, though.

 I'm thinking about computing the nbuckets limit (how many buckets may
 fit into work_mem):

 tupsize = HJTUPLE_OVERHEAD +
 MAXALIGN(sizeof(MinimalTupleData)) +
 MAXALIGN(tupwidth);

 nbuckets_bits_max = my_log2(work_mem / tupsize)

 and then start with nbatches from this bit, like this:

   31          23 (max)          15            7            0
    |-----------|----------------|------------|------------|
    |  <- batches  |    free     |       <- buckets        |
 
 That doesn't seem unreasonable provided there's enough bit space, but,
 as you point out, the day might not be far off when there isn't.

Thinking about this a bit more, I believe the danger is not that
imminent. 32 bits mean the Hash node should handle 2^32 tuples in total
(possibly split into multiple batches).

It used to be 2^32 buckets thanks to NTUP_PER_BUCKET=10, but the patch
lowers this to 1 so it's tuples now.

But while we're we're a bit closer to exhausting the 32 bits, I think
it's not really that big issue. Not only hashing a table with ~4e9 rows
is going to be a hellish experience, but if we really want to support
it, we should probably switch to 64-bit hashes.

Because adding some checks is not going to help - it may detect an
issue, but it won't fix it. And actually, even if the two bit ranges
overlap, it does not mean the hashjoin will stop working. There'll be a
dependency between batches and buckets, causing non-uniform usage of the
buckets, but it won't blow up.

So I don't think we need to worry about exhausting the 32 bits for now.

 It also strikes me that when there's only 1 batch, the set of bits
 that map onto the batch number is zero-width, and one zero-width bit
 range is as good as another.  In other words, if you're only planning
 to do one batch, you can easily grow the number of buckets on the fly.
 Growing the number of buckets only becomes difficult once you have
 more than one batch.

Yes, that's true. It's also roughly what the patch does IMHO.

If you know you'll need more batches, you can start with the maximum
number of buckets right away (because you expect there's more data than
work_mem), so shoot for

nbuckets = my_log2(work_mem / tuple_size)

or something like that. It's also true that if you're starting with a
single batch, you can do whatever you want with nbuckets until you need
to do (nbatches *= 2).

It's also true that once you're done with the first batch, all the other
batches will be just fine with the number of buckets, because the
batches will be about equally sized.

But I think this is pretty much what the patch does, in the end.


 And I mention that because, in the scenario mentioned in your original
 post (a hash table with small number of buckets (8192) containing
 large number of tuples [~3.3M]), presumably what happens is you start
 out thinking you are going to have one batch with 8K buckets.  Then,
 as more tuples continue to pour in, you see the load factor rising and
 responding by bumping up the size of the hash table.  Now eventually
 you realize, gosh, this isn't even going to fit into work_mem, so you
 need to start batching it.  But by that point you've presumably done
 as much as you're going to do in terms of increasing the number of
 buckets; after that, you're just going to add more batches as
 necessary.  So not being able to further increase the number of
 buckets once the number of batches is no longer 1 wouldn't be a
 problem in that case.
 
 Now if you start out planning for multiple batches, and then you
 discover you'd like to keep the same number of batches but have more
 buckets per batch, that's gonna be hard.  It's not impossible, because
 you could steal some of the unused high-order bits above the number of
 batches, and then concatenate them with the low-order bits that you
 already had, but that seems like kind of an ugly kludge, and AFAICS it
 only helps if the new tuples are narrower by enough to justify the
 cost of splitting all the buckets.  I won't say that couldn't happen,
 but I think it would be better to keep that complexity out of the
 patch unless we're absolutely certain it's justified.

I'm definitely in favor of keeping the patch as simple as possible.

I was considering reversing the bits of the hash, because that's
pretty much the simplest solution. But I think you're right it might
actually work like this:

  * Are more batches needed?

  (yes) => just use nbuckets = my_log2(work_mem / tuple_size)

  (no) => go ahead, until processing all tuples or hitting work_mem

  (work_mem) => meh, use the same nbuckets above

  (all tuples) => compute optimal nbuckets / resize

Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-08-09 Thread Tomas Vondra
On 20.7.2014 18:29, Tomas Vondra wrote:
 Attached v9 of the patch. Aside from a few minor fixes, the main  change
 is that this is assumed to be combined with the dense allocation patch.
 
 It also rewrites the ExecHashIncreaseNumBuckets to follow the same
 pattern as ExecHashIncreaseNumBatches (i.e. scanning chunks directly,
 instead of buckets). It's cleaner and more consistent.
 
 hashjoin-nbuckets-growth-v9.patch contains only this patch, so you need
 to grab the hashjoin-alloc-v4.patch from a different thread and apply it
 first)
 
 hashjoin-nbuckets-growth-v9-combined.patch contains both patches combined

I just noticed that the changes made to ExecHashGetBucketAndBatch may
not be perfectly fine - it does not break the hash join (i.e. the
results are still correct), but I believe it may cause more batches. But
I'm not entirely sure about it, or how to fix that.

First, an intro into ExecHashGetBucketAndBatch internals, and how the
patch changes that. If not interested, skip to the problem section below.

The old ExecHashGetBucketAndBatch did this:

  *bucketno = hashvalue & (nbuckets - 1);
  *batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);

i.e. given the 32-bit hash value, the lowest log2_nbuckets bits are used
to determine bucket number. The rest of the hash (after removing the
bits used for buckets) is used for batch. I.e. something like this


 31              23              15              7               0
  |---------------|---------------|---------------|---------------|
  |               |  <- batches   |            buckets            |


This worked fine, because the number of bits for buckets was fixed, and
only the number of batches was growing. This was done by adding
most-significant bits (as illustrated by the tiny arrow). So when there
were 4 bits for the batch number, after adding another bit (doubling
nbatches) batch '0101' would be split either into '00101' or '10101'.

With dynamic bucket count this does not work, because the batches and
buckets need to be independent (i.e. derived from non-overlapping parts
of the hash). The simplest approach of 'moving batches around' does not
work, because that would break this assert:

   Assert(batchno > curbatch);

In other words, don't move tuples to already-processed batches. So the
batch number for a tuple needs to be sufficiently stable, and only ever
increase (never decrease).

So what I did is this:


 31              23              15              7               0
  |---------------|---------------|---------------|---------------|
  |   batches ->  |               |  <- buckets                   |


and this is what happens in ExecHashGetBucketAndBatch:

  *bucketno = hashvalue & (nbuckets - 1);
  *batchno = (hashvalue >> (32 - hashtable->log2_nbatch));

So the bucketno expression is exactly the same (but now the nbuckets
may change dynamically), but batchno is now computed from the other
end of the hash (highest bits), and grows by adding a least-significant bit.


Problem
---

The original implementation had the nice feature, that each increase of
number of batches (nbatches *= 2) split each batch in half. Half the
tuples stayed in the batch, half the tuples moved to one of the newly
created batches, thanks to adding a most-significant bit. The tuples
getting 0 bit stayed in the same batch, tuples getting 1 moved to the
new one (and it was guaranteed to be one of the new ones).

It's slightly more complicated because of the lazy behavior when
repeatedly incrementing the number of batches, but this is the
principle: keep 1/2 the tuples, move 1/2 to another batch.

The new implementation changes this, because the batch number grows in
the opposite direction - it adds a least-significant bit, so for example
batch '0101' gets split into '01011' and '01010'.

It's roughly equal to

   (batchno << 1) + K

where K is either 0 or 1, which is always >= the old batchno. So it
does not violate the Assert (which is why I haven't noticed this earlier).

But it breaks the desirable behavior that 1/2 the tuples remain in the
current batch, and instead moves a lot of them to batches further down
the road, and needlessly increases the number of batches.

At least that's how I understand it ...


Possible solutions
--

Adding least-significant bit does not work, we need get back to adding
the most-significant one. Not sure what's the least complex way to do
that, though.

I'm thinking about computing the nbuckets limit (how many buckets may
fit into work_mem):

tupsize = HJTUPLE_OVERHEAD +
MAXALIGN(sizeof(MinimalTupleData)) +
MAXALIGN(tupwidth);

nbuckets_bits_max = my_log2(work_mem / tupsize)

and then start with nbatches from this bit, like this:

 31              23     max      15              7               0
  |---------------|---------------|---------------|---------------|
  |               |  <- batches   |     free      |  <- buckets

Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-07-20 Thread Tomas Vondra
Attached v9 of the patch. Aside from a few minor fixes, the main  change
is that this is assumed to be combined with the dense allocation patch.

It also rewrites the ExecHashIncreaseNumBuckets to follow the same
pattern as ExecHashIncreaseNumBatches (i.e. scanning chunks directly,
instead of buckets). It's cleaner and more consistent.

hashjoin-nbuckets-growth-v9.patch contains only this patch, so you need
to grab the hashjoin-alloc-v4.patch from a different thread and apply it
first)

hashjoin-nbuckets-growth-v9-combined.patch contains both patches combined

regards
Tomas Vondra
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 781a736..ee3fffb 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1900,18 +1900,20 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		if (es->format != EXPLAIN_FORMAT_TEXT)
 		{
 			ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es);
+			ExplainPropertyLong("Original Hash Buckets",
+								hashtable->nbuckets_original, es);
 			ExplainPropertyLong("Hash Batches", hashtable->nbatch, es);
 			ExplainPropertyLong("Original Hash Batches",
 								hashtable->nbatch_original, es);
 			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 		}
-		else if (hashtable->nbatch_original != hashtable->nbatch)
+		else if ((hashtable->nbatch_original != hashtable->nbatch) || (hashtable->nbuckets_original != hashtable->nbuckets))
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-							 "Buckets: %d  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
-							 hashtable->nbuckets, hashtable->nbatch,
-							 hashtable->nbatch_original, spacePeakKb);
+							 "Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+							 hashtable->nbuckets, hashtable->nbuckets_original,
+							 hashtable->nbatch, hashtable->nbatch_original, spacePeakKb);
 		}
 		else
 		{
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 589b2f1..014b6f1 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -37,8 +37,10 @@
 #include "utils/lsyscache.h"
 #include "utils/syscache.h"
 
+bool enable_hash_resize = true;
 
 static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
+static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
 static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node,
 	  int mcvsToUse);
 static void ExecHashSkewTableInsert(HashJoinTable hashtable,
@@ -47,6 +49,10 @@ static void ExecHashSkewTableInsert(HashJoinTable hashtable,
 		int bucketNumber);
 static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
 
+static char * chunk_alloc(HashJoinTable hashtable, int tupleSize);
+
+/* Memory needed for optimal number of buckets. */
+#define BUCKETS_SPACE(htab)	((htab)->nbuckets_optimal * sizeof(void*))
 
 /* 
  *		ExecHash
@@ -116,6 +122,7 @@ MultiExecHash(HashState *node)
 /* It's a skew tuple, so put it into that hash table */
 ExecHashSkewTableInsert(hashtable, slot, hashvalue,
 		bucketNumber);
+				hashtable->skewTuples += 1;
 			}
 			else
 			{
@@ -126,10 +133,34 @@ MultiExecHash(HashState *node)
 		}
 	}
 
+	/* If average number of tuples per bucket is over the defined threshold,
+	 * increase the number of buckets to get below it. */
+	if (enable_hash_resize) {
+
+		/* 
+		 * Do we actually need to resize? Only scale up, don't scale down (for example
+		 * after adding a batch, it's possible that nbuckets  nbuckets_optimal).
+		 */
+		if (hashtable->nbuckets < hashtable->nbuckets_optimal) {
+
+#ifdef HJDEBUG
+			printf("Increasing nbuckets %d => %d\n",
+				   hashtable->nbuckets, hashtable->nbuckets_optimal);
+#endif
+			ExecHashIncreaseNumBuckets(hashtable);
+		}
+
+	}
+
+	/* Account for the buckets in spaceUsed (reported in EXPLAIN ANALYZE) */
+	hashtable->spaceUsed += BUCKETS_SPACE(hashtable);
+	if (hashtable->spaceUsed > hashtable->spacePeak)
+			hashtable->spacePeak = hashtable->spaceUsed;
+
 	/* must provide our own instrumentation support */
 	if (node->ps.instrument)
 		InstrStopNode(node->ps.instrument, hashtable->totalTuples);
-
+	
 	/*
 	 * We do not return the hash table directly because it's not a subtype of
 	 * Node, and so would violate the MultiExecProcNode API.  Instead, our
@@ -223,7 +254,6 @@ ExecEndHash(HashState *node)
 	ExecEndNode(outerPlan);
 }
 
-
 /* 
  *		ExecHashTableCreate
  *
@@ -239,6 +269,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	int			nbatch;
 	int			num_skew_mcvs;
 	int			log2_nbuckets;
+	int			log2_nbatch;
 	int			nkeys;
 	int			i;
 	ListCell   *ho;
@@ -263,6 +294,10 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	log2_nbuckets = my_log2(nbuckets);
 	Assert(nbuckets == (1 << log2_nbuckets));
 
+	/* nbatch must be a power of 2 */
+	log2_nbatch = my_log2(nbatch);

Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-07-13 Thread Tomas Vondra
On 3.7.2014 19:36, Tomas Vondra wrote:
 On 1.7.2014 01:24, Tomas Vondra wrote:
 On 30.6.2014 23:12, Tomas Vondra wrote:
 Hi,

 Hopefully I got it right this time. At least it seems to be working
 for cases that failed before (no file leaks, proper rowcounts so
 far).
 
 Attached v7, fixing nbatch/ntuples in an assert.
 
 regards
 Tomas

Attached is v8 of this patch, significantly reworked based on the
related discussion (mostly in 'tweaking NTUP_PER_BUCKET').

With the patch applied, the hash table works like this:

initial sizing (ExecChooseHashTableSize)

- size nbuckets for NTUP_PER_BUCKET=1
- account for memory allocated for buckets

building the hash table
---
- while adding tuples, keep track of optimal number of buckets (for the
  current number of tuples per batch)
- don't resize until all the tuples are added (by increasing nbatch the
  number of buckets may decrease)
- after adding all tuples, consider resizing (optimal nbuckets !=
  initial nbuckets)
- do the resize

This means that for well estimated plans (reasonably exact tuple count
for the inner relation), the hash table is properly sized and no resize
is needed.

For badly estimated plans, this works about the same as the previous
patches (we're accounting for memory needed for buckets etc.).


More batches or less buckets?
-
In the related threads, I repeatedly mentioned that I don't know a good
answer to "Should I add more batches or decrease the number of buckets?"
while sizing the hash table. The current code (as in HEAD) does not face
this dilemma because (a) it does not count buckets into work_mem and (b)
does not allow changing nbuckets.

I still don't have a good answer to this question, but I think the patch
can stand even without it. If you want more/less batches, use
smaller/larger work_mem - same as with the current code. Also, with the
'dense allocation' patch [1], it's possible to use larger work_mem
values (and thus compensate for counting buckets into work_mem).


Changes since v7


  (a) remove NTUP_GROW_THRESHOLD (just use NTUP_PER_BUCKET directly)
  (b) set NTUP_PER_BUCKET=1
  (c) include buckets (optimal) when considering nbatch increase
  (d) track optimal number of buckets (for NTUP_PER_BUCKET)
  (e) after loading all the tuples, resize the hash table to get
  nbuckets_optimal (if needed)
  (f) renamed GUC to enable_hash_resize (makes a bit more sense)


Comments


  (a) NTUP_GROW_THRESHOLD was overcomplicated and mostly a result of
  misunderstanding how NTUP_PER_BUCKET works (it's an upper threshold,
  not an average) and is not really needed.

  (b) The value 1 gives the best performance, but also requires more
  memory for the buckets. This should be more than compensated by
  densely allocating the tuples (see hashjoin-alloc-v3.patch
  in the tweaking NTUP_PER_BUCKET thread [1]).

  (c,d) Originally, the memory needed for buckets was not included in
'spaceUsed', but because of the NTUP_PER_BUCKET=1 change this is
not really acceptable (needs much more memory than before).
So now it's included both in the initial sizing of the hash
table, and when adding more tuples (nbuckets_optimal).

Still, spaceUsed does not include palloc overhead (which may be
a significant amount of memory). This is tackled by [1].

  (e) No change here.

  (f) A bit better GUC name. Anyway, this is for easier development
  only, and won't be included in the final patch.


regards
Tomas

[1] http://www.postgresql.org/message-id/53c2decc.2010...@fuzzy.cz
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 0d9663c..db3a953 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1900,18 +1900,20 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		if (es->format != EXPLAIN_FORMAT_TEXT)
 		{
 			ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es);
+			ExplainPropertyLong("Original Hash Buckets",
+								hashtable->nbuckets_original, es);
 			ExplainPropertyLong("Hash Batches", hashtable->nbatch, es);
 			ExplainPropertyLong("Original Hash Batches",
 								hashtable->nbatch_original, es);
 			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 		}
-		else if (hashtable->nbatch_original != hashtable->nbatch)
+		else if ((hashtable->nbatch_original != hashtable->nbatch) || (hashtable->nbuckets_original != hashtable->nbuckets))
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-							 "Buckets: %d  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
-							 hashtable->nbuckets, hashtable->nbatch,
-							 hashtable->nbatch_original, spacePeakKb);
+							 "Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+							 hashtable->nbuckets, hashtable->nbuckets_original,
+							 hashtable->nbatch, hashtable->nbatch_original, spacePeakKb);
 		}
 		else
 		{
diff 

Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-07-03 Thread Atri Sharma
On Tue, Jul 1, 2014 at 4:54 AM, Tomas Vondra t...@fuzzy.cz wrote:

 On 30.6.2014 23:12, Tomas Vondra wrote:
  Hi,
 
  attached is v5 of the patch. The main change is that scaling the number
  of buckets is done only once, after the initial hash table is built. The
  main advantage of this is lower price. This also allowed some cleanup of
  unnecessary code.
 
  However, this new patch causes warning like this:
 
  WARNING:  temporary file leak: File 231 still referenced
 
  I've been eyeballing the code for a while now, but I still fail to see
  where this comes from :-( Any ideas?

 Meh, the patches are wrong - I haven't realized the tight coupling
 between buckets/batches in ExecHashGetBucketAndBatch:

    *bucketno = hashvalue & (nbuckets - 1);
    *batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);

 The previous patches worked mostly by pure luck (the nbuckets was set
 only in the first batch), but once I moved the code at the end, it
 started failing. And by worked I mean didn't throw an error, but
  probably returned bogus results with (nbatch > 1).

 However, ISTM this nbuckets-nbatch coupling is not really necessary,
 it's only constructed this way to assign independent batch/bucket. So I
 went and changed the code like this:

    *bucketno = hashvalue & (nbuckets - 1);
    *batchno = (hashvalue >> (32 - hashtable->log2_nbatch));

 I.e. using the top bits for batchno, low bits for bucketno (as before).

 Hopefully I got it right this time. At least it seems to be working for
 cases that failed before (no file leaks, proper rowcounts so far).


Hi,

I had a look at the patch and I was wondering if there is a way we can set
a cap on the resized size, and instead increase the number of batches
instead? Since this patch evaluates the growth of tuples vs increase of
space, we could look at increasing the number of batches itself instead of
number of buckets, if the resized number of buckets exceeds a threshold. It
shouldn't be too hard; AIUI it would involve a call in MultiExecHash right
after the new cost evaluation the patch adds.

It isn't a target of the current patch, but I think that it is a logical
extension to the same.

Thoughts/Comments?

Regards,

Atri
-- 
Regards,

Atri
*l'apprenant*


Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-07-03 Thread Tomas Vondra
On 1.7.2014 01:24, Tomas Vondra wrote:
 On 30.6.2014 23:12, Tomas Vondra wrote:
 Hi,

 Hopefully I got it right this time. At least it seems to be working for
 cases that failed before (no file leaks, proper rowcounts so far).

Attached v7, fixing nbatch/ntuples in an assert.

regards
Tomas
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 0d9663c..db3a953 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1900,18 +1900,20 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		if (es-format != EXPLAIN_FORMAT_TEXT)
 		{
 			ExplainPropertyLong(Hash Buckets, hashtable-nbuckets, es);
+			ExplainPropertyLong(Original Hash Buckets,
+hashtable-nbuckets_original, es);
 			ExplainPropertyLong(Hash Batches, hashtable-nbatch, es);
 			ExplainPropertyLong(Original Hash Batches,
 hashtable-nbatch_original, es);
 			ExplainPropertyLong(Peak Memory Usage, spacePeakKb, es);
 		}
-		else if (hashtable-nbatch_original != hashtable-nbatch)
+		else if ((hashtable-nbatch_original != hashtable-nbatch) || (hashtable-nbuckets_original != hashtable-nbuckets))
 		{
 			appendStringInfoSpaces(es-str, es-indent * 2);
 			appendStringInfo(es-str,
-			Buckets: %d  Batches: %d (originally %d)  Memory Usage: %ldkB\n,
-			 hashtable-nbuckets, hashtable-nbatch,
-			 hashtable-nbatch_original, spacePeakKb);
+			Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n,
+			 hashtable-nbuckets, hashtable-nbuckets_original,
+			 hashtable-nbatch, hashtable-nbatch_original, spacePeakKb);
 		}
 		else
 		{
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 589b2f1..53642d1 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -37,8 +37,10 @@
 #include "utils/lsyscache.h"
 #include "utils/syscache.h"
 
+bool enable_hashjoin_bucket = true;
 
 static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
+static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
 static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node,
 	  int mcvsToUse);
 static void ExecHashSkewTableInsert(HashJoinTable hashtable,
@@ -48,6 +50,33 @@ static void ExecHashSkewTableInsert(HashJoinTable hashtable,
 static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
 
 
+/*
+ * Compute appropriate size for hashtable given the estimated size of the
+ * relation to be hashed (number of rows and average row width).
+ *
+ * This is exported so that the planner's costsize.c can use it.
+ */
+
+/* Target bucket loading (tuples per bucket) */
+#define NTUP_PER_BUCKET			10
+
+/* Multiple of NTUP_PER_BUCKET triggering the increase of nbuckets.
+ * 
+ * Once we reach the threshold we double the number of buckets, and we
+ * want to get 1.0 on average (to get NTUP_PER_BUCKET on average). That
+ * means these two equations should hold
+ * 
+ *   b = 2a (growth)
+ *   (a + b)/2 = 1  (oscillate around NTUP_PER_BUCKET)
+ * 
+ * which means b=1.333 (a = b/2). If we wanted higher threshold, we
+ * could grow the nbuckets to (4*nbuckets), thus using (b=4a) for
+ * growth, leading to (b=1.6). Or (b=8a) giving 1.777 etc.
+ * 
+ * Let's start with doubling the bucket count, i.e. 1.333. */
+#define NTUP_GROW_COEFFICIENT	1.333
+#define NTUP_GROW_THRESHOLD		(NTUP_PER_BUCKET * NTUP_GROW_COEFFICIENT)
+
 /* 
  *		ExecHash
  *
@@ -116,6 +145,7 @@ MultiExecHash(HashState *node)
 /* It's a skew tuple, so put it into that hash table */
 ExecHashSkewTableInsert(hashtable, slot, hashvalue,
 		bucketNumber);
+				hashtable->skewTuples += 1;
 			}
 			else
 			{
@@ -126,6 +156,23 @@ MultiExecHash(HashState *node)
 		}
 	}
 
+	/* If average number of tuples per bucket is over the defined threshold,
+	 * increase the number of buckets to get below it. */
+	if (enable_hashjoin_bucket) {
+
+		/* consider only tuples in the non-skew buckets */
+		double nonSkewTuples = (hashtable->totalTuples - hashtable->skewTuples);
+
+		if ((nonSkewTuples / hashtable->nbatch) > (hashtable->nbuckets * NTUP_GROW_THRESHOLD)) {
+
+#ifdef HJDEBUG
+			printf("Increasing nbucket to %d (average per bucket = %.1f)\n",
+				   nbuckets,  (batchTuples / hashtable->nbuckets));
+#endif
+			ExecHashIncreaseNumBuckets(hashtable);
+		}
+	}
+
 	/* must provide our own instrumentation support */
 	if (node->ps.instrument)
 		InstrStopNode(node->ps.instrument, hashtable->totalTuples);
@@ -239,6 +286,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	int			nbatch;
 	int			num_skew_mcvs;
 	int			log2_nbuckets;
+	int			log2_nbatch;
 	int			nkeys;
 	int			i;
 	ListCell   *ho;
@@ -263,6 +311,10 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	log2_nbuckets = my_log2(nbuckets);
 	Assert(nbuckets == (1 << log2_nbuckets));
 
+	/* nbatch must be a power of 2 */
+	log2_nbatch = my_log2(nbatch);
+	Assert(nbatch 

Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-07-03 Thread Tomas Vondra
On 3.7.2014 15:42, Atri Sharma wrote:
 
 
 
 On Tue, Jul 1, 2014 at 4:54 AM, Tomas Vondra t...@fuzzy.cz wrote:
 
 On 30.6.2014 23:12, Tomas Vondra wrote:
  Hi,
 
  attached is v5 of the patch. The main change is that scaling the number
  of buckets is done only once, after the initial hash table is built. The
  main advantage of this is lower price. This also allowed some cleanup of
  unnecessary code.
 
  However, this new patch causes warning like this:
 
  WARNING:  temporary file leak: File 231 still referenced
 
  I've been eyeballing the code for a while now, but I still fail to see
  where this comes from :-( Any ideas?
 
 Meh, the patches are wrong - I haven't realized the tight coupling
 between buckets/batches in ExecHashGetBucketAndBatch:
 
    *bucketno = hashvalue & (nbuckets - 1);
    *batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);
 
 The previous patches worked mostly by pure luck (the nbuckets was set
 only in the first batch), but once I moved the code at the end, it
 started failing. And by worked I mean didn't throw an error, but
  probably returned bogus results with (nbatch > 1).
 
 However, ISTM this nbuckets-nbatch coupling is not really necessary,
 it's only constructed this way to assign independent batch/bucket. So I
 went and changed the code like this:
 
    *bucketno = hashvalue & (nbuckets - 1);
    *batchno = (hashvalue >> (32 - hashtable->log2_nbatch));
 
 I.e. using the top bits for batchno, low bits for bucketno (as before).
 
 Hopefully I got it right this time. At least it seems to be working for
 cases that failed before (no file leaks, proper rowcounts so far).
 
 
 Hi,
 
 I had a look at the patch and I was wondering if there is a way we can
 set a cap on the resized size, and instead increase the number of
 batches instead? Since this patch evaluates the growth of tuples vs
 increase of space, we could look at increasing the number of batches
 itself instead of number of buckets, if the resized number of buckets
 exceeds a threshold. It shouldn't be too hard; AIUI it would involve a
 call in MultiExecHash right after the new cost evaluation the patch adds.

So you propose to fill the hash table, and when hitting NTUP_PER_BUCKET
(i.e. after adding 'nbuckets * NTUP_PER_BUCKET' tuples) increase the
number of batches?

Hmmm, that'd be easy to implement. I don't think it's possible to do
that in MultiExecHash (that's too late). But it's a matter of changing
this condition in ExecHashTableInsert:

    if (hashtable->spaceUsed > hashtable->spaceAllowed)
        ExecHashIncreaseNumBatches(hashtable);

to something like this

    if ((hashtable->spaceUsed > hashtable->spaceAllowed) ||
        (hashtable->totalTuples / hashtable->nbatch
          == NTUP_PER_BUCKET * hashtable->nbuckets))
        ExecHashIncreaseNumBatches(hashtable);

But I don't think increasing number of batches is a good approach, as it
means more rescans. I think using memory is more efficient (after all,
that's what work_mem is for, right?).

regards
Tomas




Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-06-30 Thread Tomas Vondra
Hi,

attached is v5 of the patch. The main change is that scaling the number
of buckets is done only once, after the initial hash table is built. The
main advantage of this is lower price. This also allowed some cleanup of
unnecessary code.

However, this new patch causes warning like this:

WARNING:  temporary file leak: File 231 still referenced

I've been eyeballing the code for a while now, but I still fail to see
where this comes from :-( Any ideas?

regards
Tomas
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 0d9663c..db3a953 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1900,18 +1900,20 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		if (es->format != EXPLAIN_FORMAT_TEXT)
 		{
 			ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es);
+			ExplainPropertyLong("Original Hash Buckets",
+								hashtable->nbuckets_original, es);
 			ExplainPropertyLong("Hash Batches", hashtable->nbatch, es);
 			ExplainPropertyLong("Original Hash Batches",
 								hashtable->nbatch_original, es);
 			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 		}
-		else if (hashtable->nbatch_original != hashtable->nbatch)
+		else if ((hashtable->nbatch_original != hashtable->nbatch) || (hashtable->nbuckets_original != hashtable->nbuckets))
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-							 "Buckets: %d  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
-							 hashtable->nbuckets, hashtable->nbatch,
-							 hashtable->nbatch_original, spacePeakKb);
+							 "Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+							 hashtable->nbuckets, hashtable->nbuckets_original,
+							 hashtable->nbatch, hashtable->nbatch_original, spacePeakKb);
 		}
 		else
 		{
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 589b2f1..fbd721d 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -37,8 +37,10 @@
 #include "utils/lsyscache.h"
 #include "utils/syscache.h"
 
+bool enable_hashjoin_bucket = true;
 
 static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
+static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
 static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node,
 	  int mcvsToUse);
 static void ExecHashSkewTableInsert(HashJoinTable hashtable,
@@ -48,6 +50,33 @@ static void ExecHashSkewTableInsert(HashJoinTable hashtable,
 static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
 
 
+/*
+ * Compute appropriate size for hashtable given the estimated size of the
+ * relation to be hashed (number of rows and average row width).
+ *
+ * This is exported so that the planner's costsize.c can use it.
+ */
+
+/* Target bucket loading (tuples per bucket) */
+#define NTUP_PER_BUCKET			10
+
+/* Multiple of NTUP_PER_BUCKET triggering the increase of nbuckets.
+ * 
+ * Once we reach the threshold we double the number of buckets, and we
+ * want to get 1.0 on average (to get NTUP_PER_BUCKET on average). That
+ * means these two equations should hold
+ * 
+ *   b = 2a (growth)
+ *   (a + b)/2 = 1  (oscillate around NTUP_PER_BUCKET)
+ * 
+ * which means b=1.333 (a = b/2). If we wanted higher threshold, we
+ * could grow the nbuckets to (4*nbuckets), thus using (b=4a) for
+ * growth, leading to (b=1.6). Or (b=8a) giving 1.777 etc.
+ * 
+ * Let's start with doubling the bucket count, i.e. 1.333. */
+#define NTUP_GROW_COEFFICIENT	1.333
+#define NTUP_GROW_THRESHOLD		(NTUP_PER_BUCKET * NTUP_GROW_COEFFICIENT)
+
 /* 
  *		ExecHash
  *
@@ -126,6 +155,23 @@ MultiExecHash(HashState *node)
 		}
 	}
 
+	/* If average number of tuples per bucket is over the defined threshold,
+	 * increase the number of buckets to get below it. */
+	if (enable_hashjoin_bucket) {
+
+		double batchTuples = (hashtable->totalTuples / hashtable->nbatch);
+
+		if (batchTuples > (hashtable->nbuckets * NTUP_GROW_THRESHOLD)) {
+
+#ifdef HJDEBUG
+			printf("Increasing nbucket to %d (average per bucket = %.1f)\n",
+				   nbuckets,  (batchTuples / hashtable->nbuckets));
+#endif
+			ExecHashIncreaseNumBuckets(hashtable);
+
+		}
+	}
+
 	/* must provide our own instrumentation support */
 	if (node->ps.instrument)
 		InstrStopNode(node->ps.instrument, hashtable->totalTuples);
@@ -271,6 +317,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	 */
 	hashtable = (HashJoinTable) palloc(sizeof(HashJoinTableData));
 	hashtable-nbuckets = nbuckets;
+	hashtable-nbuckets_original = nbuckets;
 	hashtable-log2_nbuckets = log2_nbuckets;
 	hashtable-buckets = NULL;
 	hashtable-keepNulls = keepNulls;
@@ -375,17 +422,6 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	return hashtable;
 }
 
-
-/*
- * Compute appropriate size for hashtable given the estimated size of the
- * relation to be hashed (number of rows and average row width).
- *
- * This 

Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-06-30 Thread Tomas Vondra
On 30.6.2014 23:12, Tomas Vondra wrote:
 Hi,
 
 attached is v5 of the patch. The main change is that scaling the number
  of buckets is done only once, after the initial hash table is built. The
 main advantage of this is lower price. This also allowed some cleanup of
  unnecessary code.
 
 However, this new patch causes warning like this:
 
 WARNING:  temporary file leak: File 231 still referenced
 
 I've been eyeballing the code for a while now, but I still fail to see
 where this comes from :-( Any ideas?

Meh, the patches are wrong - I haven't realized the tight coupling
between buckets/batches in ExecHashGetBucketAndBatch:

  *bucketno = hashvalue & (nbuckets - 1);
  *batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);

The previous patches worked mostly by pure luck (the nbuckets was set
only in the first batch), but once I moved the code at the end, it
started failing. And by worked I mean didn't throw an error, but
probably returned bogus results with (nbatch > 1).

However, ISTM this nbuckets-nbatch coupling is not really necessary,
it's only constructed this way to assign independent batch/bucket. So I
went and changed the code like this:

  *bucketno = hashvalue & (nbuckets - 1);
  *batchno = (hashvalue >> (32 - hashtable->log2_nbatch));

I.e. using the top bits for batchno, low bits for bucketno (as before).

Hopefully I got it right this time. At least it seems to be working for
cases that failed before (no file leaks, proper rowcounts so far).

regards
Tomas
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 0d9663c..db3a953 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1900,18 +1900,20 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		if (es->format != EXPLAIN_FORMAT_TEXT)
 		{
 			ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es);
+			ExplainPropertyLong("Original Hash Buckets",
+								hashtable->nbuckets_original, es);
 			ExplainPropertyLong("Hash Batches", hashtable->nbatch, es);
 			ExplainPropertyLong("Original Hash Batches",
 								hashtable->nbatch_original, es);
 			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 		}
-		else if (hashtable->nbatch_original != hashtable->nbatch)
+		else if ((hashtable->nbatch_original != hashtable->nbatch) || (hashtable->nbuckets_original != hashtable->nbuckets))
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-							 "Buckets: %d  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
-							 hashtable->nbuckets, hashtable->nbatch,
-							 hashtable->nbatch_original, spacePeakKb);
+							 "Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+							 hashtable->nbuckets, hashtable->nbuckets_original,
+							 hashtable->nbatch, hashtable->nbatch_original, spacePeakKb);
 		}
 		else
 		{
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 589b2f1..48cca62 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -37,8 +37,10 @@
 #include "utils/lsyscache.h"
 #include "utils/syscache.h"
 
+bool enable_hashjoin_bucket = true;
 
 static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
+static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
 static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node,
 	  int mcvsToUse);
 static void ExecHashSkewTableInsert(HashJoinTable hashtable,
@@ -48,6 +50,33 @@ static void ExecHashSkewTableInsert(HashJoinTable hashtable,
 static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
 
 
+/*
+ * Compute appropriate size for hashtable given the estimated size of the
+ * relation to be hashed (number of rows and average row width).
+ *
+ * This is exported so that the planner's costsize.c can use it.
+ */
+
+/* Target bucket loading (tuples per bucket) */
+#define NTUP_PER_BUCKET			10
+
+/* Multiple of NTUP_PER_BUCKET triggering the increase of nbuckets.
+ *
+ * Once we reach the threshold we double the number of buckets, and we
+ * want to get 1.0 on average (to get NTUP_PER_BUCKET on average). That
+ * means these two equations should hold
+ *
+ *   b = 2a         (growth)
+ *   (a + b)/2 = 1  (oscillate around NTUP_PER_BUCKET)
+ *
+ * which means b=1.333 (a = b/2). If we wanted a higher threshold, we
+ * could grow the nbuckets to (4*nbuckets), thus using (b=4a) for
+ * growth, leading to (b=1.6). Or (b=8a) giving 1.777 etc.
+ *
+ * Let's start with doubling the bucket count, i.e. 1.333. */
+#define NTUP_GROW_COEFFICIENT	1.333
+#define NTUP_GROW_THRESHOLD		(NTUP_PER_BUCKET * NTUP_GROW_COEFFICIENT)
+
 /* 
  *		ExecHash
  *
@@ -116,6 +145,7 @@ MultiExecHash(HashState *node)
 /* It's a skew tuple, so put it into that hash table */
 ExecHashSkewTableInsert(hashtable, slot, hashvalue,
 		bucketNumber);
+				hashtable->skewTuples += 1;
 			}
 			else
 			{
@@ -126,6 +156,23 @@ MultiExecHash(HashState *node)
 		

Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-06-29 Thread Tomas Vondra
On 26.6.2014 23:48, Tomas Vondra wrote:
 On 26.6.2014 20:43, Tomas Vondra wrote:
 Attached is v2 of the patch, with some cleanups / minor improvements:

 * there's a single FIXME, related to counting tuples in the
 
 Meh, I couldn't resist resolving this FIXME, so attached is v3 of the
 patch. This just adds a proper 'batch tuples' counter to the hash table.
 
 All comments, measurements on different queries etc. welcome. We'll
 certainly do a lot of testing, because this was a big issue for us.

Attached is v4 of the patch, with a few minor improvements. The only
thing worth mentioning is overflow protection, similar to what's done in
the ExecChooseHashTableSize() function. Otherwise it's mostly about
improving comments.

Also attached is a v4 with a GUC, making it easier to compare the effect
of the patch by simply setting enable_hashjoin_bucket to off (original
behaviour) or on (new behaviour).

And finally there's an SQL script demonstrating the effect of the patch
with various work_mem settings. For example what I see on my desktop is
this (averages from 3 runs):

= SMALL WORK MEM (2MB) =
  no dynamic buckets dynamic buckets
query A   5945 ms5969 ms
query B   6080 ms5943 ms
query C   6531 ms6822 ms
query D   6962 ms6618 ms

= MEDIUM WORK MEM (16MB) =
  no dynamic buckets dynamic buckets
query A   7955 ms7944 ms
query B   9970 ms7569 ms
query C   8643 ms8560 ms
query D  33908 ms7700 ms

= LARGE WORK MEM (64MB) =
  no dynamic buckets dynamic buckets
query A   10235 ms   10233 ms
query B   32229 ms9318 ms
query C   14500 ms   10554 ms
query D  213344 ms9145 ms

Where A is exactly estimated and the other queries suffer from various
underestimates. My observations from this are:

(1) For small work_mem values it does not really matter, thanks to the
caching effects (the whole hash table fits into L2 CPU cache).

(2) For medium work_mem values (not really huge, but exceeding CPU
caches), the differences are negligible, except for the last query
with the most severe underestimate. In that case the new behaviour
is much faster.

(3) For large work_mem values, the speedup is pretty obvious and
dependent on the underestimate.

The question is why one would choose large work_mem values when the
smaller values actually perform better here. Well, the example tables
are not perfectly representative. When the outer table is much larger
and does not fit into RAM as easily (which is the case with large fact
tables or joins), the rescans (caused by more batches) are more
expensive and outweigh the caching benefits.

Also, the work_mem is shared with other nodes, e.g. aggregates, and
decreasing it because of hash joins would hurt them.

regards
Tomas
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 0d9663c..db3a953 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1900,18 +1900,20 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		if (es->format != EXPLAIN_FORMAT_TEXT)
 		{
 			ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es);
+			ExplainPropertyLong("Original Hash Buckets",
+								hashtable->nbuckets_original, es);
 			ExplainPropertyLong("Hash Batches", hashtable->nbatch, es);
 			ExplainPropertyLong("Original Hash Batches",
 								hashtable->nbatch_original, es);
 			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 		}
-		else if (hashtable->nbatch_original != hashtable->nbatch)
+		else if ((hashtable->nbatch_original != hashtable->nbatch) || (hashtable->nbuckets_original != hashtable->nbuckets))
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-							 "Buckets: %d  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
-							 hashtable->nbuckets, hashtable->nbatch,
-							 hashtable->nbatch_original, spacePeakKb);
+							 "Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+							 hashtable->nbuckets, hashtable->nbuckets_original,
+							 hashtable->nbatch, hashtable->nbatch_original, spacePeakKb);
 		}
 		else
 		{
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 589b2f1..96fdd68 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -39,6 +39,7 @@
 
 
 static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
+static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
 static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node,
 	  int mcvsToUse);
 static void ExecHashSkewTableInsert(HashJoinTable hashtable,
@@ -271,6 +272,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	 */
 	hashtable = (HashJoinTable) palloc(sizeof(HashJoinTableData));
 	hashtable->nbuckets = nbuckets;
+	hashtable->nbuckets_original = nbuckets;
 	

[HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-06-26 Thread Pavel Stehule
Hello all,

today I had to work with one slow query - when I checked different
scenarios I found a dependency on work_mem size - but it is atypical -
bigger work_mem increased query execution to 31 minutes (600MB work_mem),
versus 1 minute (1MB).

db_kost07e2d9cdmg20b1takpqntobo6ghj=# set work_mem to '600MB';
SET
db_kost07e2d9cdmg20b1takpqntobo6ghj=# explain analyze SELECT
f_users_aax8qg0e23asx5h.firtsclickutmsource_id AS a_5078, COUNT( (
CASE WHEN ( 89388 = f_emails_aaekn159p5aw3t8.read_id ) THEN
f_emails_aaekn159p5aw3t8.id ELSE CAST( NULL AS INT ) END ) ) AS
c_7a2a545bf904a48a, COUNT( f_emails_aaekn159p5aw3t8.id ) AS
c_9e301e13350cc0dc, ( MAX( ( CASE WHEN ( 89388 =
f_emails_aaekn159p5aw3t8.read_id ) THEN 0 END ) ) IS NOT NULL ) AS
def_7a2a545bf904a48a, TRUE AS def_9e301e13350cc0dc FROM ( (
f_emails_aaekn159p5aw3t8 INNER JOIN f_users_aax8qg0e23asx5h ON (
f_emails_aaekn159p5aw3t8.userid_id =  f_users_aax8qg0e23asx5h.id )
) INNER JOIN dwh_dm_abnblk9j5oagorw ON (
f_users_aax8qg0e23asx5h.dt_registrationdatetime_id =
dwh_dm_abnblk9j5oagorw.id ) ) WHERE ( 2014 =
dwh_dm_abnblk9j5oagorw.id_year ) GROUP BY 1;

QUERY
PLAN

--
 HashAggregate  (cost=2145957.37..2145957.96 rows=59 width=12) (actual
time=1864088.145..1864088.155 rows=31 loops=1)
   ->  Hash Join  (cost=882573.27..2142753.08 rows=213619 width=12) (actual
time=9083.326..1859069.171 rows=7070141 loops=1)
 Hash Cond: (f_emails_aaekn159p5aw3t8.userid_id =
f_users_aax8qg0e23asx5h.id)
 ->  Seq Scan on f_emails_aaekn159p5aw3t8  (cost=0.00..854559.95
rows=32278695 width=12) (actual time=0.006..14271.239 rows=32211565 loops=1)
 ->  Hash  (cost=881801.58..881801.58 rows=61735 width=8) (actual
time=9076.153..9076.153 rows=3310768 loops=1)
   Buckets: 8192  Batches: 1  Memory Usage: 129327kB
   ->  Hash Join  (cost=782.15..881801.58 rows=61735 width=8)
(actual time=1.423..8086.228 rows=3310768 loops=1)
 Hash Cond:
(f_users_aax8qg0e23asx5h.dt_registrationdatetime_id =
dwh_dm_abnblk9j5oagorw.id)
 ->  Seq Scan on f_users_aax8qg0e23asx5h
(cost=0.00..845420.42 rows=9328442 width=12) (actual time=0.015..4098.915
rows=9322096 loops=1)
 ->  Hash  (cost=777.59..777.59 rows=365 width=4)
(actual time=1.400..1.400 rows=365 loops=1)
   Buckets: 1024  Batches: 1  Memory Usage: 13kB
   ->  Bitmap Heap Scan on dwh_dm_abnblk9j5oagorw
(cost=11.08..777.59 rows=365 width=4) (actual time=0.111..1.218 rows=365
loops=1)
 Recheck Cond: (2014 = id_year)
 ->  Bitmap Index Scan on
dwh_dm_abnblk9j5oagorw_id_year_idx  (cost=0.00..10.99 rows=365 width=0)
(actual time=0.068..0.068 rows=365 loops=1)
   Index Cond: (2014 = id_year)
 Total runtime: 1864107.373 ms
(16 rows)


db_kost07e2d9cdmg20b1takpqntobo6ghj=# set work_mem to '1MB';
SET
db_kost07e2d9cdmg20b1takpqntobo6ghj=# explain analyze SELECT
f_users_aax8qg0e23asx5h.firtsclickutmsource_id AS a_5078, COUNT( (
CASE WHEN ( 89388 = f_emails_aaekn159p5aw3t8.read_id ) THEN
f_emails_aaekn159p5aw3t8.id ELSE CAST( NULL AS INT ) END ) ) AS
c_7a2a545bf904a48a, COUNT( f_emails_aaekn159p5aw3t8.id ) AS
c_9e301e13350cc0dc, ( MAX( ( CASE WHEN ( 89388 =
f_emails_aaekn159p5aw3t8.read_id ) THEN 0 END ) ) IS NOT NULL ) AS
def_7a2a545bf904a48a, TRUE AS def_9e301e13350cc0dc FROM ( (
f_emails_aaekn159p5aw3t8 INNER JOIN f_users_aax8qg0e23asx5h ON (
f_emails_aaekn159p5aw3t8.userid_id =  f_users_aax8qg0e23asx5h.id )
) INNER JOIN dwh_dm_abnblk9j5oagorw ON (
f_users_aax8qg0e23asx5h.dt_registrationdatetime_id =
dwh_dm_abnblk9j5oagorw.id ) ) WHERE ( 2014 =
dwh_dm_abnblk9j5oagorw.id_year ) GROUP BY 1;

QUERY
PLAN

--
 HashAggregate  (cost=2275675.45..2275675.95 rows=50 width=12) (actual
time=48438.407..48438.416 rows=31 loops=1)
   ->  Hash Join  (cost=882772.88..2272526.68 rows=209918 width=12) (actual
time=9384.610..45211.963 rows=6988804 loops=1)
 Hash Cond: (f_emails_aaekn159p5aw3t8.userid_id =
f_users_aax8qg0e23asx5h.id)
 ->  Seq Scan on f_emails_aaekn159p5aw3t8  (cost=0.00..839754.64
rows=31719464 width=12) (actual time=0.034..14299.338 rows=31653392 loops=1)
 ->  Hash  (cost=881759.44..881759.44 rows=61715 width=8) (actual
time=9368.371..9368.371 rows=3310768 loops=1)
   Buckets: 4096  Batches: 256 (originally 4)  Memory Usage:
1025kB
   ->  Hash Join  (cost=782.15..881759.44 rows=61715 width=8)
(actual time=3.839..8223.097 rows=3310768 loops=1)
 Hash Cond:
(f_users_aax8qg0e23asx5h.dt_registrationdatetime_id =

Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-06-26 Thread Tomas Vondra

Hi,

On 2014-06-26 14:10, Pavel Stehule wrote:

Hello all,

today I had to work with one slow query - when I checked different
scenarios I found a dependency on work_mem size - but it is atypical -
bigger work_mem increased query execution 31 minutes (600MB work mem)
and 1 minute (1MB).


The problem described in Pavel's emails (illustrated by the awful
explain plans) is in this part:

 ->  Hash  (cost=881801.58..881801.58 rows=61735 width=8) (actual 
time=9076.153..9076.153 rows=3310768 loops=1)


That is, the estimated number of rows is ~60k, but in reality it's ~3.3M.
This then leads to a hash table with a small number of buckets (8192),
each containing a large number of tuples (~400 per bucket in this case)
in a linked list, which significantly slows down the lookups during the
hash join.

This issue is actually quite common - all it takes is a significant
underestimation of the hashed relation, either because it's a complex
join (thus inherently difficult to estimate) or because the WHERE
conditions are not independent (see the example below).

The attached patch (early WIP, after ~30 minutes of hacking) addresses
this by increasing the number of buckets whenever the average number of
tuples per bucket exceeds 1.5x NTUP_PER_BUCKET.


=== Example 

create table table_a (id int, fk_id int);
create table table_b (fk_id int, val_a int, val_b int);

insert into table_a select i, i from generate_series(1,1000) s(i);
insert into table_b select i, mod(i,1000), mod(i,1000) from 
generate_series(1,1000) s(i);


analyze table_a;
analyze table_b;

explain analyze select count(*) from table_a join table_b on (table_a.id 
= table_b.fk_id) where val_a < 10 and val_b < 10;



without the patch:

QUERY PLAN
--
 Aggregate  (cost=385834.56..385834.57 rows=1 width=0) (actual 
time=49543.263..49543.264 rows=1 loops=1)
   ->  Hash Join  (cost=204069.89..385831.16 rows=1359 width=0) (actual 
time=923.751..49531.554 rows=10 loops=1)
 Hash Cond: (table_a.id = table_b.fk_id)
 ->  Seq Scan on table_a  (cost=0.00..144247.77 rows=977 
width=4) (actual time=0.104..967.090 rows=1000 loops=1)
 ->  Hash  (cost=204052.90..204052.90 rows=1359 width=4) (actual 
time=923.615..923.615 rows=10 loops=1)
   Buckets: 1024  Batches: 1  Memory Usage: 3516kB
   ->  Seq Scan on table_b  (cost=0.00..204052.90 rows=1359 
width=4) (actual time=0.086..910.656 rows=10 loops=1)
 Filter: ((val_a < 10) AND (val_b < 10))
 Rows Removed by Filter: 990
 Planning time: 0.271 ms
 Execution time: 49545.053 ms
(11 rows)


with the patch:

QUERY PLAN
--
 Aggregate  (cost=385834.56..385834.57 rows=1 width=0) (actual 
time=9780.346..9780.347 rows=1 loops=1)
   ->  Hash Join  (cost=204069.89..385831.16 rows=1359 width=0) (actual 
time=939.297..9772.256 rows=10 loops=1)
 Hash Cond: (table_a.id = table_b.fk_id)
 ->  Seq Scan on table_a  (cost=0.00..144247.77 rows=977 
width=4) (actual time=0.103..962.446 rows=1000 loops=1)
 ->  Hash  (cost=204052.90..204052.90 rows=1359 width=4) (actual 
time=939.172..939.172 rows=10 loops=1)
   Buckets: 8192  Batches: 1  Memory Usage: 3516kB
   ->  Seq Scan on table_b  (cost=0.00..204052.90 rows=1359 
width=4) (actual time=0.064..909.191 rows=10 loops=1)
 Filter: ((val_a < 10) AND (val_b < 10))
 Rows Removed by Filter: 990
 Planning time: 0.276 ms
 Execution time: 9782.295 ms
(11 rows)

Time: 9784.392 ms

So the duration improved significantly - from ~52 seconds to ~10 
seconds.


The patch is not perfect, it needs a bit more love, as illustrated by 
the FIXME/TODO items. Feel free to comment.


regards
Tomas

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 589b2f1..d71b8b9 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -39,6 +39,7 @@
 
 
 static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
+static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
 static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node,
 	  int mcvsToUse);
 static void ExecHashSkewTableInsert(HashJoinTable hashtable,
@@ -682,6 +683,77 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 }
 
 /*
+ * ExecHashIncreaseNumBuckets
+ *		increase the original number of buckets in order to reduce
+ *		number of tuples per bucket
+ */
+static void
+ExecHashIncreaseNumBuckets(HashJoinTable hashtable)
+{
+	int			oldnbuckets = hashtable->nbuckets;
+	int			i;
+

Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-06-26 Thread Tomas Vondra
Attached is v2 of the patch, with some cleanups / minor improvements:

* improved comments, whitespace fixed / TODOs etc.

* tracking initial # of buckets (similar to initial # of batches)

* adding info about buckets to EXPLAIN ANALYZE, similar to batches - I
didn't want to make it overly complex, so the info about the initial
bucket/batch count is added if at least one of them is modified

* modified threshold triggering the growth, to get NTUP_PER_BUCKET on
average (see the NTUP_GROW_THRESHOLD comment in nodeHash.c)

* there's a single FIXME, related to counting tuples in the

One thing that's important to note is the difference between # of
batches and # of buckets. While the # of batches is global, the # of
buckets is 'within a batch'. So theoretically each batch can use a
different number of buckets.

However the value is reused between batches, so it only grows. That
means this is possible:

  initial: 1024 buckets (before 1st batch)
  batch 1: 1024 buckets
  batch 2: 1024 buckets
  batch 3: 4096 buckets
  batch 4: 8192 buckets

while this is not:

  initial: 1024 buckets (before 1st batch)
  batch 1: 1024 buckets
  batch 2: 4096 buckets
  batch 3: 1024 buckets
  batch 4: 8192 buckets

However in practice I expect the first batch to do all the work, and
the following batches will just reuse the same number of buckets.
This of course assumes the batches have similar tuple sizes etc.

So the first batch will do all the reshuffling of the tuples, and the
following batches will use the 'right' number of buckets from the start.

regards
Tomas
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 0d9663c..db3a953 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1900,18 +1900,20 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		if (es->format != EXPLAIN_FORMAT_TEXT)
 		{
 			ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es);
+			ExplainPropertyLong("Original Hash Buckets",
+								hashtable->nbuckets_original, es);
 			ExplainPropertyLong("Hash Batches", hashtable->nbatch, es);
 			ExplainPropertyLong("Original Hash Batches",
 								hashtable->nbatch_original, es);
 			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 		}
-		else if (hashtable->nbatch_original != hashtable->nbatch)
+		else if ((hashtable->nbatch_original != hashtable->nbatch) || (hashtable->nbuckets_original != hashtable->nbuckets))
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-							 "Buckets: %d  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
-							 hashtable->nbuckets, hashtable->nbatch,
-							 hashtable->nbatch_original, spacePeakKb);
+							 "Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+							 hashtable->nbuckets, hashtable->nbuckets_original,
+							 hashtable->nbatch, hashtable->nbatch_original, spacePeakKb);
 		}
 		else
 		{
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 589b2f1..879b336 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -39,6 +39,7 @@
 
 
 static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
+static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
 static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node,
 	  int mcvsToUse);
 static void ExecHashSkewTableInsert(HashJoinTable hashtable,
@@ -271,6 +272,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	 */
 	hashtable = (HashJoinTable) palloc(sizeof(HashJoinTableData));
 	hashtable->nbuckets = nbuckets;
+	hashtable->nbuckets_original = nbuckets;
 	hashtable->log2_nbuckets = log2_nbuckets;
 	hashtable->buckets = NULL;
 	hashtable->keepNulls = keepNulls;
@@ -386,6 +388,22 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 /* Target bucket loading (tuples per bucket) */
 #define NTUP_PER_BUCKET			10
 
+/* Multiple of NTUP_PER_BUCKET triggering the increase of nbuckets.
+ *
+ * Once we reach the threshold we double the number of buckets, and we
+ * want to get 1.0 on average (to get NTUP_PER_BUCKET on average). That
+ * means these two equations should hold
+ *
+ *   b = 2a         (growth)
+ *   (a + b)/2 = 1  (oscillate around NTUP_PER_BUCKET)
+ *
+ * which means b=1.333 (a = b/2). If we wanted a higher threshold, we
+ * could grow the nbuckets to (4*nbuckets), thus using (b=4a) for
+ * growth, leading to (b=1.6). Or (b=8a) giving 1.777 etc.
+ *
+ * Let's start with doubling the bucket count, i.e. 1.333. */
+#define NTUP_GROW_THRESHOLD 1.333
+
 void
 ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		int *numbuckets,
@@ -682,6 +700,92 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 }
 
 /*
+ * ExecHashIncreaseNumBuckets
+ *		increase the original number of buckets in order to reduce
+ *		number of tuples per bucket
+ */
+static void
+ExecHashIncreaseNumBuckets(HashJoinTable hashtable)
+{
+	int			i;
+	int ntuples = 0;
+	int			

Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-06-26 Thread Tomas Vondra
On 26.6.2014 20:43, Tomas Vondra wrote:
 Attached is v2 of the patch, with some cleanups / minor improvements:
 
 * there's a single FIXME, related to counting tuples in the

Meh, I couldn't resist resolving this FIXME, so attached is v3 of the
patch. This just adds a proper 'batch tuples' counter to the hash table.

All comments, measurements on different queries etc. welcome. We'll
certainly do a lot of testing, because this was a big issue for us.

regards
Tomas
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 0d9663c..db3a953 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1900,18 +1900,20 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		if (es->format != EXPLAIN_FORMAT_TEXT)
 		{
 			ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es);
+			ExplainPropertyLong("Original Hash Buckets",
+								hashtable->nbuckets_original, es);
 			ExplainPropertyLong("Hash Batches", hashtable->nbatch, es);
 			ExplainPropertyLong("Original Hash Batches",
 								hashtable->nbatch_original, es);
 			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 		}
-		else if (hashtable->nbatch_original != hashtable->nbatch)
+		else if ((hashtable->nbatch_original != hashtable->nbatch) || (hashtable->nbuckets_original != hashtable->nbuckets))
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-							 "Buckets: %d  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
-							 hashtable->nbuckets, hashtable->nbatch,
-							 hashtable->nbatch_original, spacePeakKb);
+							 "Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+							 hashtable->nbuckets, hashtable->nbuckets_original,
+							 hashtable->nbatch, hashtable->nbatch_original, spacePeakKb);
 		}
 		else
 		{
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 589b2f1..b32fb6b 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -39,6 +39,7 @@
 
 
 static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
+static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
 static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node,
 	  int mcvsToUse);
 static void ExecHashSkewTableInsert(HashJoinTable hashtable,
@@ -271,6 +272,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	 */
 	hashtable = (HashJoinTable) palloc(sizeof(HashJoinTableData));
 	hashtable->nbuckets = nbuckets;
+	hashtable->nbuckets_original = nbuckets;
 	hashtable->log2_nbuckets = log2_nbuckets;
 	hashtable->buckets = NULL;
 	hashtable->keepNulls = keepNulls;
@@ -285,6 +287,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	hashtable->nbatch_outstart = nbatch;
 	hashtable->growEnabled = true;
 	hashtable->totalTuples = 0;
+	hashtable->batchTuples = 0;
 	hashtable->innerBatchFile = NULL;
 	hashtable->outerBatchFile = NULL;
 	hashtable->spaceUsed = 0;
@@ -386,6 +389,22 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 /* Target bucket loading (tuples per bucket) */
 #define NTUP_PER_BUCKET			10
 
+/* Multiple of NTUP_PER_BUCKET triggering the increase of nbuckets.
+ *
+ * Once we reach the threshold we double the number of buckets, and we
+ * want to get 1.0 on average (to get NTUP_PER_BUCKET on average). That
+ * means these two equations should hold
+ *
+ *   b = 2a         (growth)
+ *   (a + b)/2 = 1  (oscillate around NTUP_PER_BUCKET)
+ *
+ * which means b=1.333 (a = b/2). If we wanted a higher threshold, we
+ * could grow the nbuckets to (4*nbuckets), thus using (b=4a) for
+ * growth, leading to (b=1.6). Or (b=8a) giving 1.777 etc.
+ *
+ * Let's start with doubling the bucket count, i.e. 1.333. */
+#define NTUP_GROW_THRESHOLD 1.333
+
 void
 ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		int *numbuckets,
@@ -651,6 +670,7 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 /* prevtuple doesn't change */
 				hashtable->spaceUsed -=
 					HJTUPLE_OVERHEAD + HJTUPLE_MINTUPLE(tuple)->t_len;
+				hashtable->batchTuples -= 1;
 pfree(tuple);
 nfreed++;
 			}
@@ -682,6 +702,92 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 }
 
 /*
+ * ExecHashIncreaseNumBuckets
+ *		increase the original number of buckets in order to reduce
+ *		number of tuples per bucket
+ */
+static void
+ExecHashIncreaseNumBuckets(HashJoinTable hashtable)
+{
+	int			i;
+	int ntuples = 0;
+	int			oldnbuckets = hashtable->nbuckets;
+	HashJoinTuple  *oldbuckets = hashtable->buckets;
+	MemoryContext   oldcxt;
+
+	/* XXX Not sure if we should update the info about used space here.
+	 * The code seems to ignore the space used for 'buckets' and we're not
+	 * allocating more space for tuples (just shuffling them to the new
+	 * buckets). And the amount of memory used for buckets is quite small
+	 * (just an array of pointers, thus ~8kB per 1k buckets on 64-bit). */
+
+	/* XXX Should we disable growth if (nbuckets