Re: [HACKERS] Poor memory context performance in large hash joins

2017-03-08 Thread Tom Lane
Andres Freund  writes:
> On 2017-03-07 13:06:39 -0500, Tom Lane wrote:
>> The hashjoin issue is certainly new, and the reorderbuffer issue can't
>> go back further than 9.4.  So my inclination is to patch back to 9.4
>> and call it good.

> That works for me.  If we find further cases later, we can easily enough
> backpatch it then.

Done like that.

regards, tom lane




Re: [HACKERS] Poor memory context performance in large hash joins

2017-03-07 Thread Andres Freund
On 2017-03-07 13:06:39 -0500, Tom Lane wrote:
> Andres Freund  writes:
> >>> On 2017-02-24 18:04:18 -0500, Tom Lane wrote:
> >>>> Concretely, something like the attached.  This passes regression tests
> >>>> but I've not pushed on it any harder than that.
> 
> > I think we should go forward with something like this patch in all
> > branches, and only use Tomas' patch in master, because they're
> > considerably larger.
> 
> While I'm willing to back-patch the proposed patch to some extent,
> I'm a bit hesitant to shove it into all branches, because I don't
> think that usage patterns that create a problem occurred till recently.
> You won't get a whole lot of standalone-block allocations unless you
> have code that allocates many chunks bigger than 8K without applying a
> double-the-request-each-time type of strategy.  And even then, it doesn't
> hurt unless you try to resize those chunks, or delete them in a non-LIFO
> order.
> 
> The hashjoin issue is certainly new, and the reorderbuffer issue can't
> go back further than 9.4.  So my inclination is to patch back to 9.4
> and call it good.

That works for me.  If we find further cases later, we can easily enough
backpatch it then.

Regards,

Andres




Re: [HACKERS] Poor memory context performance in large hash joins

2017-03-07 Thread Tom Lane
Andres Freund  writes:
>>> On 2017-02-24 18:04:18 -0500, Tom Lane wrote:
>>>> Concretely, something like the attached.  This passes regression tests
>>>> but I've not pushed on it any harder than that.

> I think we should go forward with something like this patch in all
> branches, and only use Tomas' patch in master, because they're
> considerably larger.

While I'm willing to back-patch the proposed patch to some extent,
I'm a bit hesitant to shove it into all branches, because I don't
think that usage patterns that create a problem occurred till recently.
You won't get a whole lot of standalone-block allocations unless you
have code that allocates many chunks bigger than 8K without applying a
double-the-request-each-time type of strategy.  And even then, it doesn't
hurt unless you try to resize those chunks, or delete them in a non-LIFO
order.

The hashjoin issue is certainly new, and the reorderbuffer issue can't
go back further than 9.4.  So my inclination is to patch back to 9.4
and call it good.

regards, tom lane




Re: [HACKERS] Poor memory context performance in large hash joins

2017-02-27 Thread Tomas Vondra

On 02/27/2017 12:55 PM, Andres Freund wrote:

> On 2017-02-24 15:18:04 -0800, Andres Freund wrote:
> > On 2017-02-24 15:12:37 -0800, Andres Freund wrote:
> > > On 2017-02-24 18:04:18 -0500, Tom Lane wrote:
> > > > Concretely, something like the attached.  This passes regression tests
> > > > but I've not pushed on it any harder than that.
> > >
> > > Heh, I'd just gotten something that didn't immediately crash anymore ;)
> > >
> > > Running your patch against Jeff's test-case, verified before that I
> > > could easily reproduce the O(N^2) cost.
> >
> > Oh, that didn't take as long as I was afraid (optimized/non-assert build):
> >
> > postgres[26268][1]=# SET work_mem = '13GB';
> > SET
> > Time: 2.591 ms
> > postgres[26268][1]=# select count(*) from foobar2 where not exists (select 1
> > from foobar t where t.titleid=foobar2.titleid);
> > Time: 268043.710 ms (04:28.044)
>
> As another datapoint, I measured this patch against the problem from
> https://www.postgresql.org/message-id/20170227111732.vrx5v72ighehw...@alap3.anarazel.de
> (see top post in thread), and it indeed fixes the runtime issue -
> there's still considerably higher memory usage and some runtime
> overhead, but the quadratic behaviour is gone.
>
> I think we should go forward with something like this patch in all
> branches, and only use Tomas' patch in master, because they're
> considerably larger.



So you've tried to switch hashjoin to the slab allocators? Or what have 
you compared?


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Poor memory context performance in large hash joins

2017-02-27 Thread Andres Freund
On 2017-02-27 19:20:56 +0100, Tomas Vondra wrote:
> On 02/27/2017 12:55 PM, Andres Freund wrote:
> > On 2017-02-24 15:18:04 -0800, Andres Freund wrote:
> > > On 2017-02-24 15:12:37 -0800, Andres Freund wrote:
> > > > On 2017-02-24 18:04:18 -0500, Tom Lane wrote:
> > > > > Concretely, something like the attached.  This passes regression tests
> > > > > but I've not pushed on it any harder than that.
> > > >
> > > > Heh, I'd just gotten something that didn't immediately crash anymore ;)
> > > >
> > > > Running your patch against Jeff's test-case, verified before that I
> > > > could easily reproduce the O(N^2) cost.
> > >
> > > Oh, that didn't take as long as I was afraid (optimized/non-assert build):
> > >
> > > postgres[26268][1]=# SET work_mem = '13GB';
> > > SET
> > > Time: 2.591 ms
> > > postgres[26268][1]=# select count(*) from foobar2 where not exists 
> > > (select 1 from foobar t where t.titleid=foobar2.titleid);
> > > Time: 268043.710 ms (04:28.044)
> >
> > As another datapoint, I measured this patch against the problem from
> > https://www.postgresql.org/message-id/20170227111732.vrx5v72ighehw...@alap3.anarazel.de
> > (see top post in thread), and it indeed fixes the runtime issue -
> > there's still considerably higher memory usage and some runtime
> > overhead, but the quadratic behaviour is gone.
> >
> > I think we should go forward with something like this patch in all
> > branches, and only use Tomas' patch in master, because they're
> > considerably larger.
> >
>
> So you've tried to switch hashjoin to the slab allocators? Or what have you
> compared?

No, sorry for not being more explicit about this.  Meant that Tom's
patch addresses the performance issue in reorderbuffer.c to a good
degree (i.e. gets rid of the quadratic cost, even though constants are
higher than w/ your patches).  As the patch here is a lot smaller, it
seems like a better choice for the back-branches than backporting
slab.c/generation.c.

Andres




Re: [HACKERS] Poor memory context performance in large hash joins

2017-02-27 Thread Andres Freund
On 2017-02-24 15:18:04 -0800, Andres Freund wrote:
> On 2017-02-24 15:12:37 -0800, Andres Freund wrote:
> > On 2017-02-24 18:04:18 -0500, Tom Lane wrote:
> > > Concretely, something like the attached.  This passes regression tests
> > > but I've not pushed on it any harder than that.
> > 
> > Heh, I'd just gotten something that didn't immediately crash anymore ;)
> > 
> > Running your patch against Jeff's test-case, verified before that I
> > could easily reproduce the O(N^2) cost.
> 
> Oh, that didn't take as long as I was afraid (optimized/non-assert build):
> 
> postgres[26268][1]=# SET work_mem = '13GB';
> SET
> Time: 2.591 ms
> postgres[26268][1]=# select count(*) from foobar2 where not exists (select 1 
> from foobar t where t.titleid=foobar2.titleid);
> Time: 268043.710 ms (04:28.044)

As another datapoint, I measured this patch against the problem from
https://www.postgresql.org/message-id/20170227111732.vrx5v72ighehw...@alap3.anarazel.de
(see top post in thread), and it indeed fixes the runtime issue -
there's still considerably higher memory usage and some runtime
overhead, but the quadratic behaviour is gone.

I think we should go forward with something like this patch in all
branches, and only use Tomas' patch in master, because they're
considerably larger.

Regards,

Andres




Re: [HACKERS] Poor memory context performance in large hash joins

2017-02-24 Thread Andres Freund
On 2017-02-24 15:12:37 -0800, Andres Freund wrote:
> On 2017-02-24 18:04:18 -0500, Tom Lane wrote:
> > Concretely, something like the attached.  This passes regression tests
> > but I've not pushed on it any harder than that.
> 
> Heh, I'd just gotten something that didn't immediately crash anymore ;)
> 
> Running your patch against Jeff's test-case, verified before that I
> could easily reproduce the O(N^2) cost.

Oh, that didn't take as long as I was afraid (optimized/non-assert build):

postgres[26268][1]=# SET work_mem = '13GB';
SET
Time: 2.591 ms
postgres[26268][1]=# select count(*) from foobar2 where not exists (select 1 
from foobar t where t.titleid=foobar2.titleid);
┌───┐
│ count │
├───┤
│ 0 │
└───┘
(1 row)

Time: 268043.710 ms (04:28.044)

Profile:
  13.49%  postgres      [.] ExecHashTableInsert
  11.16%  postgres      [.] BufFileRead
   9.20%  postgres      [.] ExecHashIncreaseNumBatches
   9.19%  libc-2.24.so  [.] __memmove_avx_unaligned_erms
   6.74%  postgres      [.] dense_alloc.isra.0
   5.53%  postgres      [.] ExecStoreMinimalTuple
   5.14%  postgres      [.] ExecHashJoinGetSavedTuple.isra.0
   4.68%  postgres      [.] AllocSetAlloc
   4.12%  postgres      [.] AllocSetFree
   2.31%  postgres      [.] palloc
   2.06%  [kernel]      [k] free_pcppages_bulk


I'd previously run the test for 30min, with something like 99% of the
profile being in AllocSetFree().

- Andres




Re: [HACKERS] Poor memory context performance in large hash joins

2017-02-24 Thread Andres Freund
On 2017-02-24 18:04:18 -0500, Tom Lane wrote:
> Concretely, something like the attached.  This passes regression tests
> but I've not pushed on it any harder than that.

Heh, I'd just gotten something that didn't immediately crash anymore ;)

Running your patch against Jeff's test-case, verified before that I
could easily reproduce the O(N^2) cost.

- Andres




Re: [HACKERS] Poor memory context performance in large hash joins

2017-02-24 Thread Tom Lane
Jeff Janes  writes:
> On Fri, Feb 24, 2017 at 10:59 AM, Tom Lane  wrote:
>> Uh, what?  In a doubly-linked list, you can remove an element in O(1)
>> time, you don't need any searching.

> Currently it is walking the chain to identify which block holds the chunk
> in the first place, not just to get the pointer to the previous block.  But
> I see that that could be fixed by pointer arithmetic once there is a reason
> to fix it. Which there isn't a reason to as long as you need to walk the
> chain to get the prev pointer anyway.  Like this?
> targetblock = (AllocBlock) (((char*)chunk) - ALLOC_BLOCKHDRSZ);

Right.  We're just turning around the existing address calculation.  It'd
be a good idea to provide some cross-check that we found a sane-looking
block header, but that's not that hard --- we know what ought to be in
aset, freeptr, and endptr for a single-chunk block's header, and that
seems like enough of a crosscheck to me.
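
(A sketch of the kind of cross-check being described, assuming the
existing ALLOC_BLOCKHDRSZ and ALLOC_CHUNKHDRSZ constants and the chunk
header's size field; the attached patch may spell it differently:)

	/* Recover the block header by pointer arithmetic, then check that
	 * this really is a single-chunk block owned by this set: the chunk
	 * must fill the block, so freeptr == endptr and both point just
	 * past the chunk's space.
	 */
	AllocBlock	block = (AllocBlock) (((char *) chunk) - ALLOC_BLOCKHDRSZ);

	if (block->aset != set ||
		block->freeptr != block->endptr ||
		block->freeptr != ((char *) block) +
			(chunk->size + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ))
		elog(ERROR, "could not find block containing chunk %p", chunk);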

Concretely, something like the attached.  This passes regression tests
but I've not pushed on it any harder than that.

One argument against this is that it adds some nonzero amount of overhead
to block management; but considering that we are calling malloc or realloc
or free alongside any one of these manipulations, that overhead should be
pretty negligible.  I'm a bit more worried about increasing the alloc
block header size by 8 bytes, but that would only really matter with a
very small allocChunkLimit.

regards, tom lane

diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index 4dfc3ec..c250bae 100644
*** a/src/backend/utils/mmgr/aset.c
--- b/src/backend/utils/mmgr/aset.c
*** typedef AllocSetContext *AllocSet;
*** 201,207 
  typedef struct AllocBlockData
  {
  	AllocSet	aset;			/* aset that owns this block */
! 	AllocBlock	next;			/* next block in aset's blocks list */
  	char	   *freeptr;		/* start of free space in this block */
  	char	   *endptr;			/* end of space in this block */
  }	AllocBlockData;
--- 201,208 
  typedef struct AllocBlockData
  {
  	AllocSet	aset;			/* aset that owns this block */
! 	AllocBlock	prev;			/* prev block in aset's blocks list, if any */
! 	AllocBlock	next;			/* next block in aset's blocks list, if any */
  	char	   *freeptr;		/* start of free space in this block */
  	char	   *endptr;			/* end of space in this block */
  }	AllocBlockData;
*** AllocSetContextCreate(MemoryContext pare
*** 522,528 
--- 523,532 
  		block->aset = set;
  		block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
  		block->endptr = ((char *) block) + blksize;
+ 		block->prev = NULL;
  		block->next = set->blocks;
+ 		if (block->next)
+ 			block->next->prev = block;
  		set->blocks = block;
  		/* Mark block as not to be released at reset time */
  		set->keeper = block;
*** AllocSetReset(MemoryContext context)
*** 604,609 
--- 608,614 
  			VALGRIND_MAKE_MEM_NOACCESS(datastart, block->freeptr - datastart);
  #endif
  			block->freeptr = datastart;
+ 			block->prev = NULL;
  			block->next = NULL;
  		}
  		else
*** AllocSetAlloc(MemoryContext context, Siz
*** 710,725 
  #endif
  
  		/*
! 		 * Stick the new block underneath the active allocation block, so that
! 		 * we don't lose the use of the space remaining therein.
  		 */
  		if (set->blocks != NULL)
  		{
  			block->next = set->blocks->next;
  			set->blocks->next = block;
  		}
  		else
  		{
  			block->next = NULL;
  			set->blocks = block;
  		}
--- 715,734 
  #endif
  
  		/*
! 		 * Stick the new block underneath the active allocation block, if any,
! 		 * so that we don't lose the use of the space remaining therein.
  		 */
  		if (set->blocks != NULL)
  		{
+ 			block->prev = set->blocks;
  			block->next = set->blocks->next;
+ 			if (block->next)
+ 				block->next->prev = block;
  			set->blocks->next = block;
  		}
  		else
  		{
+ 			block->prev = NULL;
  			block->next = NULL;
  			set->blocks = block;
  		}
*** AllocSetAlloc(MemoryContext context, Siz
*** 900,906 
--- 909,918 
  		VALGRIND_MAKE_MEM_NOACCESS(block->freeptr,
     blksize - ALLOC_BLOCKHDRSZ);
  
+ 		block->prev = NULL;
  		block->next = set->blocks;
+ 		if (block->next)
+ 			block->next->prev = block;
  		set->blocks = block;
  	}
  
*** AllocSetFree(MemoryContext context, void
*** 960,988 
  	{
  		/*
  		 * Big chunks are certain to have been allocated as single-chunk
! 		 * blocks.  Find the containing block and return it to malloc().
  		 */
! 		AllocBlock	block = set->blocks;
! 		AllocBlock	prevblock = NULL;
  
! 		while (block != NULL)
! 		{
! 			if (chunk == (AllocChunk) (((char *) block) + ALLOC_BLOCKHDRSZ))
! 				break;
! 			prevblock = block;
! 			block = block->next;
! 		}
! 		if (block == NULL)
  			elog(ERROR, "could not find block containing chunk %p", chunk);
- 		/* let's just make sure chunk is the only one in 

Re: [HACKERS] Poor memory context performance in large hash joins

2017-02-24 Thread Jeff Janes
On Fri, Feb 24, 2017 at 10:59 AM, Tom Lane  wrote:

> Jeff Janes  writes:
> > On Thu, Feb 23, 2017 at 2:28 PM, Tom Lane  wrote:
> >> Maybe it's time to convert that to a doubly-linked list.
>
> > I don't think that would help.  You would need some heuristic to guess
> > whether the chunk you are looking for is near the front, or near the end.
>
> Uh, what?  In a doubly-linked list, you can remove an element in O(1)
> time, you don't need any searching.  It basically becomes
>   item->prev->next = item->next;
>   item->next->prev = item->prev;
> modulo possible special cases for the head and tail elements.
>

Currently it is walking the chain to identify which block holds the chunk
in the first place, not just to get the pointer to the previous block.  But
I see that that could be fixed by pointer arithmetic once there is a reason
to fix it. Which there isn't a reason to as long as you need to walk the
chain to get the prev pointer anyway.  Like this?

targetblock = (AllocBlock) (((char*)chunk) - ALLOC_BLOCKHDRSZ);


> >> Although if the
> >> hash code is producing a whole lot of requests that are only a bit bigger
> >> than the separate-block threshold, I'd say It's Doing It Wrong.  It should
> >> learn to aggregate them into larger requests.
>
> > Right now it is using compiled-in 32KB chunks.  Should it use something
> > like max(32kb,work_mem/128) instead?
>
> I'd say it should double the size of the request each time.  That's what
> we do in most places.
>

I thought the design goal there was that the space in old chunks could get
re-used into the new chunks in a reasonably fine-grained way. If the last
chunk contains just over half of all the data, that couldn't happen.

Cheers,

Jeff


Re: [HACKERS] Poor memory context performance in large hash joins

2017-02-24 Thread Tom Lane
Jeff Janes  writes:
> On Thu, Feb 23, 2017 at 2:28 PM, Tom Lane  wrote:
>> Maybe it's time to convert that to a doubly-linked list.

> I don't think that would help.  You would need some heuristic to guess
> whether the chunk you are looking for is near the front, or near the end.

Uh, what?  In a doubly-linked list, you can remove an element in O(1)
time, you don't need any searching.  It basically becomes
  item->prev->next = item->next;
  item->next->prev = item->prev;
modulo possible special cases for the head and tail elements.
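
(Spelled out against aset.c's block list, assuming the prev field
discussed below; only the list head needs special treatment since there
is no tail pointer.  A sketch, not the committed wording:)

	/* Unlink "block" from the set's doubly-linked block list in O(1). */
	if (block->prev)
		block->prev->next = block->next;
	else
		set->blocks = block->next;	/* block was the list head */
	if (block->next)
		block->next->prev = block->prev;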

>> Although if the
>> hash code is producing a whole lot of requests that are only a bit bigger
>> than the separate-block threshold, I'd say It's Doing It Wrong.  It should
>> learn to aggregate them into larger requests.

> Right now it is using compiled-in 32KB chunks.  Should it use something
> like max(32kb,work_mem/128) instead?

I'd say it should double the size of the request each time.  That's what
we do in most places.
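
(Purely as an illustration of that double-the-request-each-time idea,
not code from any patch in this thread; next_chunk_size and the cap are
hypothetical names:)

	/* Grow the dense-allocation request geometrically, with a cap, so the
	 * number of oversized blocks stays logarithmic in total memory used.
	 */
	static Size
	next_chunk_size(Size current, Size cap)
	{
		if (current >= cap / 2)
			return cap;
		return current * 2;
	}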

regards, tom lane




Re: [HACKERS] Poor memory context performance in large hash joins

2017-02-24 Thread Jeff Janes
On Thu, Feb 23, 2017 at 10:47 PM, Andres Freund  wrote:

> On 2017-02-23 17:28:26 -0500, Tom Lane wrote:
> > Jeff Janes  writes:
> > > The number of new chunks can be almost as large as the number of old
> > > chunks, especially if there is a very popular value.  The problem is that
> > > every time an old chunk is freed, the code in aset.c around line 968 has to
> > > walk over all the newly allocated chunks in the linked list before it can
> > > find the old one being freed.  This is an N^2 operation, and I think it has
> > > horrible CPU cache hit rates as well.
> >
> > Maybe it's time to convert that to a doubly-linked list.
>
> Yes, I do think so. Given that we only have that for full blocks, not
> for small chunks, the cost seems negligible.
>
> That would also, partially, address the performance issue
> http://archives.postgresql.org/message-id/d15dff83-0b37-28ed-0809-95a5cc7292ad%402ndquadrant.com
> addresses, in a more realistically backpatchable manner.
>
> Jeff, do you have a handy demonstrator?
>


Not exactly "handy", as it takes about 45 minutes to set up and uses >50GB
of disk and 16GB of RAM, but here is a demonstration.

create table foobar as select CASE when random()<(420.0/540.0) then 1 else
floor(random()*88)::int END as titleid from
generate_series(1,54000);

create table foobar2 as select distinct titleid from foobar ;

analyze;

set work_mem TO "13GB";

select count(*) from foobar2 where not exists (select 1 from foobar t where
t.titleid=foobar2.titleid);

This will run for effectively forever, unless it gets killed/dies due to
OOM first.  If I have other things consuming some of my 16GB of RAM, then
it gets OOM (which is not a complaint: it is as one should expect given
that I told it work_mem was 13GB). If I have no other demands on RAM, then
I can't tell if it would eventually OOM or not because of how long it would
take to get that far.

This is inspired by the thread
https://www.postgresql.org/message-id/flat/CACw4T0p4Lzd6VpwptxgPgoTMh2dEKTQBGu7NTaJ1%2BA0PRx1BGg%40mail.gmail.com#cacw4t0p4lzd6vpwptxgpgotmh2dektqbgu7ntaj1+a0prx1...@mail.gmail.com

I ran into this while evaluating Tom's responding patch, but you don't need
to apply that patch to run this example and see the effect.  I don't have
an example of a case that demonstrates the problem in the absence of a
degenerate hash bucket.  I think there should be non-degenerate cases that
trigger it, but I haven't been able to produce one yet.


> > Although if the
> > hash code is producing a whole lot of requests that are only a bit bigger
> > than the separate-block threshold, I'd say It's Doing It Wrong.  It should
> > learn to aggregate them into larger requests.
>
> That's probably right, but we can't really address that in the
> back-branches.  And to me this sounds like something we should address
> in the branches, not just in master.  Even if we'd also fix the
> hash-aggregation logic, I think such an O(n^2) behaviour in the
> allocator is a bad idea in general, and we should fix it anyway.
>

I don't know how important a back-patch would be.  This is a toy case for
me.  Presumably not for David Hinkle, except he doesn't want a hash join in
the first place, while I want one that finishes in a reasonable time.  It
might depend on what happens to Tom's OOM patch.

It would be great if the allocator was made bullet-proof, but I don't think
adding reverse links (or anything else back-patchable) is going to be
enough to do that.

Cheers,

Jeff


Re: [HACKERS] Poor memory context performance in large hash joins

2017-02-24 Thread Jeff Janes
On Thu, Feb 23, 2017 at 2:28 PM, Tom Lane  wrote:

> Jeff Janes  writes:
> > The number of new chunks can be almost as large as the number of old
> > chunks, especially if there is a very popular value.  The problem is that
> > every time an old chunk is freed, the code in aset.c around line 968 has to
> > walk over all the newly allocated chunks in the linked list before it can
> > find the old one being freed.  This is an N^2 operation, and I think it has
> > horrible CPU cache hit rates as well.
>
> Maybe it's time to convert that to a doubly-linked list.



I don't think that would help.  You would need some heuristic to guess
whether the chunk you are looking for is near the front, or near the end.
And in this case, the desired chunk starts out at the front, and then keeps
moving down the list with each iteration as new things are added to the
front, until near the end of the ExecHashIncreaseNumBatches it is freeing
them from near the end.  But in between, it is freeing them from the
middle, so I don't think a doubly-linked list would change it from N^2,
just lower the constant, even if you always knew which end to start at.  Or
am I misunderstanding what the implications for a doubly-linked-list are?

What would really help here is if it remembered the next pointer of the
just-freed chunk, and started the scan from that location the next time,
cycling around to the head pointer if it doesn't find anything.  But I
don't think that that is a very general solution.

Or, if you could pass a flag when creating the context telling it whether
memory will be freed mostly-LIFO or mostly-FIFO, and have it use a stack or
a queue accordingly.


> Although if the
> hash code is producing a whole lot of requests that are only a bit bigger
> than the separate-block threshold, I'd say It's Doing It Wrong.  It should
> learn to aggregate them into larger requests.
>

Right now it is using compiled-in 32KB chunks.  Should it use something
like max(32kb,work_mem/128) instead?

Cheers,

Jeff


Re: [HACKERS] Poor memory context performance in large hash joins

2017-02-23 Thread Andres Freund
On 2017-02-24 01:59:01 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On 2017-02-23 17:28:26 -0500, Tom Lane wrote:
> >> Maybe it's time to convert that to a doubly-linked list.
>
> > Yes, I do think so. Given that we only have that for full blocks, not
> > for small chunks, the cost seems negligible.
> > That would also, partially, address the performance issue
> > http://archives.postgresql.org/message-id/d15dff83-0b37-28ed-0809-95a5cc7292ad%402ndquadrant.com
> > addresses, in a more realistically backpatchable manner.
>
> Yeah, I was wondering if we could get away with back-patching such a
> change.  In principle, nothing outside aset.c should know what's in the
> header of an AllocBlock, but ...

You'd need to go through a fair amount of intentional pain to be
affected by a change to AllocBlockData's structure.  We could add the
->prev pointer to the end of AllocBlockData's definition to make it less
likely that one would be affected in that unlikely case - but I'm a bit
doubtful it's worth the trouble.
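
(For concreteness, a sketch of that ABI-friendlier layout, reusing the
field names from the struct in the diff upthread; appending prev last
keeps the offsets of the pre-existing fields unchanged:)

	typedef struct AllocBlockData
	{
		AllocSet	aset;			/* aset that owns this block */
		AllocBlock	next;			/* next block in aset's blocks list */
		char	   *freeptr;		/* start of free space in this block */
		char	   *endptr;			/* end of space in this block */
		AllocBlock	prev;			/* appended last so existing offsets stay put */
	} AllocBlockData;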

- Andres




Re: [HACKERS] Poor memory context performance in large hash joins

2017-02-23 Thread Thomas Munro
On Fri, Feb 24, 2017 at 12:17 PM, Andres Freund  wrote:
> Jeff, do you have a handy demonstrator?

If you want to hit ExecHashIncreaseNumBatches a bunch of times, see
the "ugly" example in the attached.

-- 
Thomas Munro
http://www.enterprisedb.com


hj-test-queries.sql
Description: Binary data



Re: [HACKERS] Poor memory context performance in large hash joins

2017-02-23 Thread Tom Lane
Andres Freund  writes:
> On 2017-02-23 17:28:26 -0500, Tom Lane wrote:
>> Maybe it's time to convert that to a doubly-linked list.

> Yes, I do think so. Given that we only have that for full blocks, not
> for small chunks, the cost seems negligible.
> That would also, partially, address the performance issue
> http://archives.postgresql.org/message-id/d15dff83-0b37-28ed-0809-95a5cc7292ad%402ndquadrant.com
> addresses, in a more realistically backpatchable manner.

Yeah, I was wondering if we could get away with back-patching such a
change.  In principle, nothing outside aset.c should know what's in the
header of an AllocBlock, but ...

> Jeff, do you have a handy demonstrator?

A solid test case would definitely help to convince people that it was
worth taking some risk here.

regards, tom lane




Re: [HACKERS] Poor memory context performance in large hash joins

2017-02-23 Thread Andres Freund
On 2017-02-23 17:28:26 -0500, Tom Lane wrote:
> Jeff Janes  writes:
> > The number of new chunks can be almost as large as the number of old
> > chunks, especially if there is a very popular value.  The problem is that
> > every time an old chunk is freed, the code in aset.c around line 968 has to
> > walk over all the newly allocated chunks in the linked list before it can
> > find the old one being freed.  This is an N^2 operation, and I think it has
> > horrible CPU cache hit rates as well.
> 
> Maybe it's time to convert that to a doubly-linked list.

Yes, I do think so. Given that we only have that for full blocks, not
for small chunks, the cost seems negligible.

That would also, partially, address the performance issue
http://archives.postgresql.org/message-id/d15dff83-0b37-28ed-0809-95a5cc7292ad%402ndquadrant.com
addresses, in a more realistically backpatchable manner.

Jeff, do you have a handy demonstrator?


> Although if the
> hash code is producing a whole lot of requests that are only a bit bigger
> than the separate-block threshold, I'd say It's Doing It Wrong.  It should
> learn to aggregate them into larger requests.

That's probably right, but we can't really address that in the
back-branches.  And to me this sounds like something we should address
in the branches, not just in master.  Even if we'd also fix the
hash-aggregation logic, I think such an O(n^2) behaviour in the
allocator is a bad idea in general, and we should fix it anyway.

Regards,

Andres




Re: [HACKERS] Poor memory context performance in large hash joins

2017-02-23 Thread Tom Lane
Jeff Janes  writes:
> The number of new chunks can be almost as large as the number of old
> chunks, especially if there is a very popular value.  The problem is that
> every time an old chunk is freed, the code in aset.c around line 968 has to
> walk over all the newly allocated chunks in the linked list before it can
> find the old one being freed.  This is an N^2 operation, and I think it has
> horrible CPU cache hit rates as well.

Maybe it's time to convert that to a doubly-linked list.  Although if the
hash code is producing a whole lot of requests that are only a bit bigger
than the separate-block threshold, I'd say It's Doing It Wrong.  It should
learn to aggregate them into larger requests.

regards, tom lane




Re: [HACKERS] Poor memory context performance in large hash joins

2017-02-23 Thread Peter Geoghegan
On Thu, Feb 23, 2017 at 2:13 PM, Jeff Janes  wrote:
> Is there a good solution to this?  Could the new chunks be put in a
> different memory context, and then destroy the old context and install the
> new one at the end of ExecHashIncreaseNumBatches? I couldn't find a destroy
> method for memory contexts, it looks like you just reset the parent instead.
> But I don't think that would work here.

Are you aware of the fact that tuplesort.c got a second memory context
for 9.6, entirely on performance grounds?
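
(A hedged sketch of the swap-the-context idea quoted above, assuming
MemoryContextDelete() for dropping the old context and the hashCxt /
batchCxt fields from nodeHash.c; the surrounding logic is illustrative
only:)

	/* Repartition into a fresh context, then drop the old one in a single
	 * operation instead of pfree'ing its chunks one by one.
	 */
	MemoryContext newcxt;

	newcxt = AllocSetContextCreate(hashtable->hashCxt,
								   "HashBatchContext",
								   ALLOCSET_DEFAULT_SIZES);

	/* ... copy the tuples that stay in memory into newcxt ... */

	MemoryContextDelete(hashtable->batchCxt);	/* frees all old chunks at once */
	hashtable->batchCxt = newcxt;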


-- 
Peter Geoghegan

