Re: [HACKERS] Explanation for bug #13908: hash joins are badly broken

2016-02-07 Thread Tomas Vondra



On 02/06/2016 10:22 PM, Tom Lane wrote:

Tomas Vondra  writes:

What about using the dense allocation even for the skew buckets,
with one context per bucket rather than a single context for all of
them? Then when we delete a bucket, we simply destroy its context
(and free the chunks, just like we do with the current dense
allocator).


Yeah, I was wondering about that too, but it only works if you have
quite a few tuples per skew bucket, else you waste a lot of space.
And you were right upthread that what we're collecting is keys
expected to be common in the outer table, not the inner table. So
it's entirely likely that the typical case is just one inner tuple
per skew bucket. (Should check that out though ...)


I'd argue this is true for the vast majority of joins, because the joins 
tend to be on foreign keys with the "dimension" as the inner table, thus 
having exactly one row per skew bucket. The upper bound for the number of 
skew buckets is the statistics target (i.e. the max number of MCV items), 
so either 100 (the default) or possibly up to 10000 (the maximum).


For tuples wider than 8kB, we have no problem at all because those 
allocations will be treated as separate chunks and will be freed 
immediately, making the memory reusable for the dense allocator. If the 
tuples are narrower than 8kB, we get a rather limited amount of memory in 
the skew hash: at most 100 x 8kB = 800kB with the default statistics 
target, or 10000 x 8kB = 80MB in the extreme case with the max number of 
MCV items.


So perhaps in this case we don't really need to worry about the 
accounting and memory usage too much.


That of course does not mean we should not try to do better in cases 
when the number of tuples per skew bucket really is high. No doubt we 
can construct such joins. If we could estimate the number of tuples per 
skew bucket, that'd allow us to handle this case differently.


FWIW there was a patch from David some time ago, identifying "unique" 
joins where the join key is unique in the inner relation. That might be 
useful for this, I guess.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Explanation for bug #13908: hash joins are badly broken

2016-02-07 Thread Robert Haas
On Sat, Feb 6, 2016 at 12:47 PM, Tom Lane  wrote:
> So I'm of the opinion that a great deal more work is needed here.
> But it's not something we're going to be able to get done for 9.5.1,
> or realistically 9.5.anything.  Whereas adding additional klugery to
> ExecHashRemoveNextSkewBucket probably is doable over the weekend.

Tom, are you taking care of this, or should I be getting involved here?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Explanation for bug #13908: hash joins are badly broken

2016-02-07 Thread Tom Lane
Robert Haas  writes:
> On Sat, Feb 6, 2016 at 12:47 PM, Tom Lane  wrote:
>> So I'm of the opinion that a great deal more work is needed here.
>> But it's not something we're going to be able to get done for 9.5.1,
>> or realistically 9.5.anything.  Whereas adding additional klugery to
>> ExecHashRemoveNextSkewBucket probably is doable over the weekend.

> Tom, are you taking care of this, or should I be getting involved here?

I'm on it, so far as fixing the stop-ship bug is concerned.  I was not
volunteering to do the larger rewrite.

regards, tom lane




Re: [HACKERS] Explanation for bug #13908: hash joins are badly broken

2016-02-06 Thread Tom Lane
Tomas Vondra  writes:
> I believe the attached patch should fix this by actually copying the 
> tuples into the densely allocated chunks. Haven't tested it though, will 
> do in a few hours.

BTW, I confirmed that this patch fixes the wrong-number-of-output-tuples
issue in the test case from bug #13908.  So that shows that the diagnosis
is correct.  We still need to clean up the patch, but this way does work
to fix the problem.

regards, tom lane




Re: [HACKERS] Explanation for bug #13908: hash joins are badly broken

2016-02-06 Thread Tomas Vondra



On 02/06/2016 08:39 PM, Andres Freund wrote:

On 2016-02-06 20:34:07 +0100, Tomas Vondra wrote:

On 02/06/2016 06:47 PM, Tom Lane wrote:

* It incorporates a bespoke reimplementation of palloc into hash
joins. This is not a maintainable/scalable way to go about reducing
memory consumption. It should have been done with an arm's-length API
to a new type of memory context, IMO (probably one that supports
palloc but not pfree, repalloc, or any chunk-header-dependent
operations).


Hmmm, interesting idea. I've been thinking about doing this using a memory
context when writing the dense allocation, but was stuck in the "must
support all operations" mode, making it impossible. Disallowing some of the
operations would make it a viable approach, I guess.


FWIW, I've done that at some point. Noticeable speedups (that's what
I cared about), but a bit annoying to use. There are many random
pfree()s around, and then there are MemoryContextContains(),
GetMemoryChunkContext(), and GetMemoryChunkSpace(), all of which are
pretty fundamentally incompatible with such an allocator. I ended up
having a full header when assertions are enabled, to be able to
detect usage of these functions and assert out.

I didn't concentrate on improving memory usage, but IIRC it was even
noticeable for some simpler things.


I think the hassle is not that bad when most of the fragments have the 
same life cycle. With hashjoin that's almost exactly the case, except 
when we realize we need to increase the number of batches - in that case 
we need to split the set of accumulated tuples in two.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Explanation for bug #13908: hash joins are badly broken

2016-02-06 Thread Tom Lane
Tomas Vondra  writes:
> On 02/06/2016 08:39 PM, Andres Freund wrote:
>> FWIW, I've done that at some point. Noticeable speedups (that's what
>> I cared about), but a bit annoying to use. There are many random
>> pfree()s around, and then there are MemoryContextContains(),
>> GetMemoryChunkContext(), and GetMemoryChunkSpace(), all of which are
>> pretty fundamentally incompatible with such an allocator. I ended up
>> having a full header when assertions are enabled, to be able to
>> detect usage of these functions and assert out.
>> 
>> I didn't concentrate on improving memory usage, but IIRC it was even
>> noticeable for some simpler things.

> I think the hassle is not that bad when most of the fragments have the 
> same life cycle. With hashjoin that's almost exactly the case, except 
> when we realize we need to increase the number of batches - in that case 
> we need to split the set of accumulated tuples in two.

Yeah, I think that a context type that just admits "we'll crash if you try
to pfree" would only be usable for allocations that are managed by just
a very small amount of code --- but the hashjoin tuple table qualifies,
and I think there would be other use-cases, perhaps tuplesort/tuplestore.

Andres' idea of adding a chunk header only in assert builds isn't a bad
one, perhaps; though I think the near-certainty of a core dump if you try
to use the header for anything might be good enough.  pfree and repalloc
are an ironclad certainty to crash in a pretty obvious way, and we could
likely add some assert checks to MemoryContextContains and friends to make
them 99.99% certain to fail without paying the price of a chunk header.
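
A minimal sketch of the assert-build-only header idea (hypothetical
layout; StandardChunkHeader is the 9.5-era header struct, everything
else here is made up for illustration):

	typedef struct DenseChunk
	{
	#ifdef USE_ASSERT_CHECKING
		/* full chunk header in assert builds only, so that pfree(),
		 * repalloc(), GetMemoryChunkContext() etc. on one of our
		 * chunks can be detected and asserted out */
		StandardChunkHeader header;
	#endif
		char		data[FLEXIBLE_ARRAY_MEMBER];	/* caller's payload */
	} DenseChunk;

In non-assert builds there is no per-chunk overhead at all, and any
header-dependent operation dereferences garbage and crashes promptly.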

regards, tom lane




Re: [HACKERS] Explanation for bug #13908: hash joins are badly broken

2016-02-06 Thread Tom Lane
Tomas Vondra  writes:
> On 02/06/2016 06:47 PM, Tom Lane wrote:
>> I note also that while the idea of ExecHashRemoveNextSkewBucket is
>> to reduce memory consumed by the skew table to make it available to
>> the main hash table, in point of fact it's unlikely that the freed
>> space will be of any use at all, since it will be in tuple-sized
>> chunks far too small for dense_alloc's requests. So the spaceUsed
>> bookkeeping being done there is an academic exercise unconnected to
>> reality, and we need to rethink what the space management plan is.

> I don't follow. Why would these three things (sizes of allocations in 
> skew buckets, chunks in dense allocator and accounting) be related?

Well, what we're trying to do is ensure that the total amount of space
used by the hashjoin table doesn't exceed spaceAllowed.  My point is
that it's kind of cheating to ignore space used-and-then-freed if your
usage pattern is such that that space isn't likely to be reusable.
A freed skew tuple represents space that would be reusable for another
skew tuple, but is probably *not* reusable for the main hash table;
so treating that space as interchangeable is wrong I think.

I'm not entirely sure where to go with that thought, but maybe the
answer is that we should just treat the skew table and main table
storage pools as entirely separate with independent limits.  That's
not what's happening right now, though.

regards, tom lane




Re: [HACKERS] Explanation for bug #13908: hash joins are badly broken

2016-02-06 Thread Tomas Vondra



On 02/06/2016 09:55 PM, Tom Lane wrote:

Tomas Vondra  writes:

On 02/06/2016 06:47 PM, Tom Lane wrote:

I note also that while the idea of ExecHashRemoveNextSkewBucket is
to reduce memory consumed by the skew table to make it available to
the main hash table, in point of fact it's unlikely that the freed
space will be of any use at all, since it will be in tuple-sized
chunks far too small for dense_alloc's requests. So the spaceUsed
bookkeeping being done there is an academic exercise unconnected to
reality, and we need to rethink what the space management plan is.



I don't follow. Why would these three things (sizes of allocations in
skew buckets, chunks in dense allocator and accounting) be related?


Well, what we're trying to do is ensure that the total amount of
space used by the hashjoin table doesn't exceed spaceAllowed. My
point is that it's kind of cheating to ignore space
used-and-then-freed if your usage pattern is such that that space
isn't likely to be reusable. A freed skew tuple represents space that
would be reusable for another skew tuple, but is probably *not*
reusable for the main hash table; so treating that space as
interchangeable is wrong I think.


Ah, I see. And I agree that treating those areas as equal is wrong.


I'm not entirely sure where to go with that thought, but maybe the
answer is that we should just treat the skew table and main table
storage pools as entirely separate with independent limits. That's
not what's happening right now, though.


What about using the dense allocation even for the skew buckets, with 
one context per bucket rather than a single context for all of them? 
Then when we delete a bucket, we simply destroy its context (and free 
the chunks, just like we do with the current dense allocator).
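
A minimal sketch of that approach, assuming HashSkewBucket grows a
"context" field (9.5-era AllocSet parameters; illustrative only, not a
worked-out patch):

	/* at skew bucket creation: one small private context per bucket */
	bucket->context = AllocSetContextCreate(hashtable->batchCxt,
											"SkewBucket",
											ALLOCSET_SMALL_MINSIZE,
											ALLOCSET_SMALL_INITSIZE,
											ALLOCSET_SMALL_MAXSIZE);

	/* skew tuples for this bucket are allocated within its context */
	hashTuple = (HashJoinTuple)
		MemoryContextAlloc(bucket->context, HJTUPLE_OVERHEAD + tuple->t_len);

	/* dropping the whole bucket then frees all its tuples at once */
	MemoryContextDelete(bucket->context);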


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Explanation for bug #13908: hash joins are badly broken

2016-02-06 Thread Tom Lane
Tomas Vondra  writes:
> What about using the dense allocation even for the skew buckets, with 
> one context per bucket rather than a single context for all of them? 
> Then when we delete a bucket, we simply destroy its context (and free 
> the chunks, just like we do with the current dense allocator).

Yeah, I was wondering about that too, but it only works if you have quite
a few tuples per skew bucket, else you waste a lot of space.  And you were
right upthread that what we're collecting is keys expected to be common in
the outer table, not the inner table.  So it's entirely likely that the
typical case is just one inner tuple per skew bucket.  (Should check that
out though ...)

regards, tom lane




Re: [HACKERS] Explanation for bug #13908: hash joins are badly broken

2016-02-06 Thread Tom Lane
Tomas Vondra  writes:
> On 02/06/2016 02:34 AM, Tom Lane wrote:
>> I have sussed what's happening in bug #13908. Basically, commit
>> 45f6240a8fa9d355 ("Pack tuples in a hash join batch densely, to save
>> memory") broke things for the case where a hash join is using a skew
>> table.

> Damn, that's an embarrassing oversight :-/

> I believe the attached patch should fix this by actually copying the 
> tuples into the densely allocated chunks. Haven't tested it though, will 
> do in a few hours.

Yeah, that's one fix approach I was contemplating last night.  (I think
the patch as written leaks memory and doesn't account for space usage
properly either, but certainly this is a direction we could take.)

The other answer I was thinking about was to get rid of the assumption
that iterating over the chunk storage is a valid thing to do, and
instead scan the hashbucket chains when we need to visit all tuples.
I really do not like the patch as designed, for several reasons:

* It incorporates a bespoke reimplementation of palloc into hash joins.
This is not a maintainable/scalable way to go about reducing memory
consumption.  It should have been done with an arm's-length API to a
new type of memory context, IMO (probably one that supports palloc but
not pfree, repalloc, or any chunk-header-dependent operations).

* No arm's-length API would conceivably allow remote callers to iterate
over all allocated chunks in the way this code does, which is why we need
to get rid of that behavior.

* There's no way to delete single tuples from the hash table given this
coding, which no doubt is why you didn't migrate the skew tuples into
this representation; but it doesn't seem like a very future-proof data
structure.

* Not doing anything for the skew tuples doesn't seem very good either,
considering the whole point of that sub-module is that there are likely
to be a lot of them.  I note also that while the idea of
ExecHashRemoveNextSkewBucket is to reduce memory consumed by the skew
table to make it available to the main hash table, in point of fact it's
unlikely that the freed space will be of any use at all, since it will be
in tuple-sized chunks far too small for dense_alloc's requests.  So the
spaceUsed bookkeeping being done there is an academic exercise unconnected
to reality, and we need to rethink what the space management plan is.

So I'm of the opinion that a great deal more work is needed here.
But it's not something we're going to be able to get done for 9.5.1,
or realistically 9.5.anything.  Whereas adding additional klugery to
ExecHashRemoveNextSkewBucket probably is doable over the weekend.
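
For reference, the bucket-chain scan mentioned above would reach every
tuple linked into the main table, wherever it was allocated. Roughly:

	int			i;
	HashJoinTuple hashTuple;

	for (i = 0; i < hashtable->nbuckets; i++)
	{
		for (hashTuple = hashtable->buckets[i];
			 hashTuple != NULL;
			 hashTuple = hashTuple->next)
		{
			/* visit hashTuple, whether it lives in a dense chunk
			 * or was palloc'd individually as a skew tuple */
		}
	}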

regards, tom lane




Re: [HACKERS] Explanation for bug #13908: hash joins are badly broken

2016-02-06 Thread Tomas Vondra

On 02/06/2016 06:47 PM, Tom Lane wrote:

Tomas Vondra  writes:

On 02/06/2016 02:34 AM, Tom Lane wrote:

I have sussed what's happening in bug #13908. Basically, commit
45f6240a8fa9d355 ("Pack tuples in a hash join batch densely, to save
memory") broke things for the case where a hash join is using a skew
table.



Damn, that's an embarrassing oversight :-/



I believe the attached patch should fix this by actually copying
the tuples into the densely allocated chunks. Haven't tested it
though, will do in a few hours.


Yeah, that's one fix approach I was contemplating last night. (I
think the patch as written leaks memory and doesn't account for
space usage properly either, but certainly this is a direction we
could take.)


Yes, it definitely needs more work (to free the original tuple copy 
after moving it into the dense_alloc chunk).
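
Roughly (a sketch only; the real fix also has to unlink the tuple from
the skew bucket chain before freeing it):

			copyTuple = (HashJoinTuple) dense_alloc(hashtable, tupleSize);
			memcpy(copyTuple, hashTuple, tupleSize);
			pfree(hashTuple);	/* skew tuples were palloc'd one at a
								 * time, so the original can simply be
								 * freed */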



The other answer I was thinking about was to get rid of the
assumption that iterating over the chunk storage is a valid thing to
do, and instead scan the hashbucket chains when we need to visit all
tuples. I really do not like the patch as designed, for several
reasons:

* It incorporates a bespoke reimplementation of palloc into hash
joins. This is not a maintainable/scalable way to go about reducing
memory consumption. It should have been done with an arm's-length API
to a new type of memory context, IMO (probably one that supports
palloc but not pfree, repalloc, or any chunk-header-dependent
operations).


Hmmm, interesting idea. I've been thinking about doing this using a 
memory context when writing the dense allocation, but was stuck in the 
"must support all operations" mode, making it impossible. Disallowing 
some of the operations would make it a viable approach, I guess.



* No arm's-length API would conceivably allow remote callers to
iterate over all allocated chunks in the way this code does, which is
why we need to get rid of that behavior.


I'm not convinced we should throw away the idea of walking the chunks. I 
think it's kinda neat and I've been playing with postponing constructing 
the buckets until the very end of Hash build - it didn't work as well as 
expected, but I'm not ready to throw in the towel yet.


But perhaps the new memory context implementation could support some 
sort of iterator ...
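
Something along these lines, perhaps (an entirely hypothetical API,
sketched only to show the shape; none of these functions exist):

	/* hypothetical: walk every chunk of a pfree-less memory context */
	DenseContextIterator iter;
	void	   *ptr;

	DenseContextIterInit(densecxt, &iter);
	while ((ptr = DenseContextIterNext(&iter)) != NULL)
	{
		HashJoinTuple hashTuple = (HashJoinTuple) ptr;

		/* rehash into new buckets, dump to a batch file, etc. */
	}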



* There's no way to delete single tuples from the hash table given
this coding, which no doubt is why you didn't migrate the skew tuples
into this representation; but it doesn't seem like a very
future-proof data structure.


I don't recall; it may be one of the reasons why the skew buckets use 
regular allocation. But I don't see how using a new type of memory 
context could solve this, as it won't support pfree either. Maybe using 
a separate context for each skew bucket?



* Not doing anything for the skew tuples doesn't seem very good
either, considering the whole point of that sub-module is that there
are likely to be a lot of them.


Maybe I'm missing something, but I thought that the values tracked in 
skew buckets are the MCVs from the outer table, in the hope that we'll 
reduce the amount of data that needs to be spilled to disk when batching 
the outer relation.


I don't see why there should be a lot of them in the inner relation 
(well, I can imagine such cases, but in my experience they are rare).



I note also that while the idea of ExecHashRemoveNextSkewBucket is
to reduce memory consumed by the skew table to make it available to
the main hash table, in point of fact it's unlikely that the freed
space will be of any use at all, since it will be in tuple-sized
chunks far too small for dense_alloc's requests. So the spaceUsed
bookkeeping being done there is an academic exercise unconnected to
reality, and we need to rethink what the space management plan is.


I don't follow. Why would these three things (sizes of allocations in 
skew buckets, chunks in dense allocator and accounting) be related?


FWIW the dense allocator actually made the memory accounting way more 
accurate, as it eliminates most of the overhead that was not included 
in spaceUsed before.



So I'm of the opinion that a great deal more work is needed here. But
it's not something we're going to be able to get done for 9.5.1, or
realistically 9.5.anything. Whereas adding additional klugery to
ExecHashRemoveNextSkewBucket probably is doable over the weekend.


Agreed.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Explanation for bug #13908: hash joins are badly broken

2016-02-06 Thread Andres Freund
On 2016-02-06 20:34:07 +0100, Tomas Vondra wrote:
> On 02/06/2016 06:47 PM, Tom Lane wrote:
> >* It incorporates a bespoke reimplementation of palloc into hash
> >joins. This is not a maintainable/scalable way to go about reducing
> >memory consumption. It should have been done with an arm's-length API
> >to a new type of memory context, IMO (probably one that supports
> >palloc but not pfree, repalloc, or any chunk-header-dependent
> >operations).
> 
> Hmmm, interesting idea. I've been thinking about doing this using a memory
> context when writing the dense allocation, but was stuck in the "must
> support all operations" mode, making it impossible. Disallowing some of the
> operations would make it a viable approach, I guess.

FWIW, I've done that at some point. Noticeable speedups (that's what I
cared about), but a bit annoying to use. There are many random pfree()s
around, and then there are MemoryContextContains(),
GetMemoryChunkContext(), and GetMemoryChunkSpace(), all of which are
pretty fundamentally incompatible with such an allocator. I ended up
having a full header when assertions are enabled, to be able to detect
usage of these functions and assert out.

I didn't concentrate on improving memory usage, but IIRC it was even
noticeable for some simpler things.

Greetings,

Andres Freund




Re: [HACKERS] Explanation for bug #13908: hash joins are badly broken

2016-02-06 Thread Tomas Vondra

On 02/06/2016 02:34 AM, Tom Lane wrote:

I have sussed what's happening in bug #13908. Basically, commit
45f6240a8fa9d355 ("Pack tuples in a hash join batch densely, to save
memory") broke things for the case where a hash join is using a skew
table. The reason is that that commit only changed the storage of
tuples going into the main hash table; tuples going into the skew
table are still allocated with a palloc apiece, without being put
into the "chunk" storage. Now, if we're loading the hash table and we
find that we've exceeded the storage allowed for skew tuples,
ExecHashRemoveNextSkewBucket wants to push some skew tuples back into
the main hash table; and it believes that linking such tuples into
the appropriate main hashbucket chain is sufficient for that. Which
it was before the aforesaid commit, and still is in simple cases.
However, if we later try to do ExecHashIncreaseNumBatches, that
function contains new code that assumes that it can find all tuples
in the main hashtable by scanning the "chunk" storage directly. Thus,
the pushed-back tuples are not scanned and are neither re-entered
into the hash table nor dumped into a batch file. So they won't get
joined.
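
Schematically, the failure mode just described (slightly simplified,
using the field names from nodeHash.c):

	/* ExecHashRemoveNextSkewBucket merely relinks the individually
	 * palloc'd tuple into a main hashbucket chain ... */
	hashTuple->next = hashtable->buckets[bucketno];
	hashtable->buckets[bucketno] = hashTuple;

	/* ... while ExecHashIncreaseNumBatches visits only tuples stored
	 * in hashtable->chunks, so the relinked tuple is never re-entered
	 * into the hash table nor dumped to a batch file */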


Damn, that's an embarrassing oversight :-/

I believe the attached patch should fix this by actually copying the 
tuples into the densely allocated chunks. Haven't tested it though, will 
do in a few hours.



It looks like ExecHashIncreaseNumBuckets, if it were to run after
some executions of ExecHashRemoveNextSkewBucket, would break things
in the same way. That's not what's happening in this particular test
case, though.


ExecHashIncreaseNumBuckets assumes all the tuples can be reached by 
simply walking the chunks (from dense_alloc). So if removing a skew 
bucket only updates pointers in the buckets, that assumption gets 
broken. But I don't think that's a bug in ExecHashIncreaseNumBuckets; 
it should be resolved by fixing ExecHashRemoveNextSkewBucket.
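
The chunk walk in question looks roughly like this (simplified from the 
9.5 code; re-bucketing details omitted):

	HashMemoryChunk chunk;

	for (chunk = hashtable->chunks; chunk != NULL; chunk = chunk->next)
	{
		size_t		idx = 0;

		/* tuples are packed back to back within each chunk */
		while (idx < chunk->used)
		{
			HashJoinTuple hashTuple = (HashJoinTuple) (chunk->data + idx);
			size_t		hashTupleSize =
				HJTUPLE_OVERHEAD + HJTUPLE_MINTUPLE(hashTuple)->t_len;

			/* link the tuple into its (possibly new) bucket */

			idx += MAXALIGN(hashTupleSize);
		}
	}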



I'm of the opinion that this is a stop-ship bug for 9.5.1. Barring
somebody producing a fix over the weekend, I will deal with it by
reverting the aforementioned commit.


I think it's not quite possible to revert just the one commit, as the 
other hashjoin improvements in 9.5 were built on top of it. So the 
revert would either be quite invasive (requiring more code changes than 
the fix), or we'd have to revert all the hashjoin goodies.


FWIW I'm willing to put some time into fixing this over the weekend.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 8a8fdf2..653faf5 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -1576,9 +1576,16 @@ ExecHashRemoveNextSkewBucket(HashJoinTable hashtable)
 		/* Decide whether to put the tuple in the hash table or a temp file */
 		if (batchno == hashtable->curbatch)
 		{
+			/* keep tuple in memory - copy it into the new chunk */
+			HashJoinTuple copyTuple;
+
+			copyTuple = (HashJoinTuple) dense_alloc(hashtable, tupleSize);
+			memcpy(copyTuple, hashTuple, tupleSize);
+
 			/* Move the tuple to the main hash table */
-			hashTuple->next = hashtable->buckets[bucketno];
-			hashtable->buckets[bucketno] = hashTuple;
+			copyTuple->next = hashtable->buckets[bucketno];
+			hashtable->buckets[bucketno] = copyTuple;
+
 			/* We have reduced skew space, but overall space doesn't change */
 			hashtable->spaceUsedSkew -= tupleSize;
 		}



Re: [HACKERS] Explanation for bug #13908: hash joins are badly broken

2016-02-05 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
> I'm of the opinion that this is a stop-ship bug for 9.5.1.  Barring
> somebody producing a fix over the weekend, I will deal with it by
> reverting the aforementioned commit.

Agreed.

Thanks!

Stephen

