Re: [HACKERS] Poor memory context performance in large hash joins
Andres Freund writes:
> On 2017-03-07 13:06:39 -0500, Tom Lane wrote:
>> The hashjoin issue is certainly new, and the reorderbuffer issue can't
>> go back further than 9.4.  So my inclination is to patch back to 9.4
>> and call it good.

> That works for me.  If we find further cases later, we can easily enough
> backpatch it then.

Done like that.

			regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Poor memory context performance in large hash joins
On 2017-03-07 13:06:39 -0500, Tom Lane wrote:
> Andres Freund writes:
> >>> On 2017-02-24 18:04:18 -0500, Tom Lane wrote:
> >>>> Concretely, something like the attached.  This passes regression tests
> >>>> but I've not pushed on it any harder than that.
>
> > I think we should go forward with something like this patch in all
> > branches, and only use Tomas' patch in master, because they're
> > considerably larger.
>
> While I'm willing to back-patch the proposed patch to some extent,
> I'm a bit hesitant to shove it into all branches, because I don't
> think that usage patterns that create a problem occurred till recently.
> You won't get a whole lot of standalone-block allocations unless you
> have code that allocates many chunks bigger than 8K without applying a
> double-the-request-each-time type of strategy.  And even then, it doesn't
> hurt unless you try to resize those chunks, or delete them in a non-LIFO
> order.
>
> The hashjoin issue is certainly new, and the reorderbuffer issue can't
> go back further than 9.4.  So my inclination is to patch back to 9.4
> and call it good.

That works for me.  If we find further cases later, we can easily enough
backpatch it then.

Regards,

Andres
Re: [HACKERS] Poor memory context performance in large hash joins
Andres Freund writes:
>>> On 2017-02-24 18:04:18 -0500, Tom Lane wrote:
>>>> Concretely, something like the attached.  This passes regression tests
>>>> but I've not pushed on it any harder than that.

> I think we should go forward with something like this patch in all
> branches, and only use Tomas' patch in master, because they're
> considerably larger.

While I'm willing to back-patch the proposed patch to some extent,
I'm a bit hesitant to shove it into all branches, because I don't
think that usage patterns that create a problem occurred till recently.
You won't get a whole lot of standalone-block allocations unless you
have code that allocates many chunks bigger than 8K without applying a
double-the-request-each-time type of strategy.  And even then, it doesn't
hurt unless you try to resize those chunks, or delete them in a non-LIFO
order.

The hashjoin issue is certainly new, and the reorderbuffer issue can't
go back further than 9.4.  So my inclination is to patch back to 9.4
and call it good.

			regards, tom lane
Re: [HACKERS] Poor memory context performance in large hash joins
On 02/27/2017 12:55 PM, Andres Freund wrote:
> On 2017-02-24 15:18:04 -0800, Andres Freund wrote:
>> On 2017-02-24 15:12:37 -0800, Andres Freund wrote:
>>> On 2017-02-24 18:04:18 -0500, Tom Lane wrote:
>>>> Concretely, something like the attached.  This passes regression tests
>>>> but I've not pushed on it any harder than that.
>>>
>>> Heh, I'd just gotten something that didn't immediately crash anymore ;)
>>>
>>> Running your patch against Jeff's test-case, verified before that I
>>> could easily reproduce the O(N^2) cost.
>>
>> Oh, that didn't take as long as I was afraid (optimized/non-assert build):
>>
>> postgres[26268][1]=# SET work_mem = '13GB';
>> SET
>> Time: 2.591 ms
>> postgres[26268][1]=# select count(*) from foobar2 where not exists
>>   (select 1 from foobar t where t.titleid=foobar2.titleid);
>> Time: 268043.710 ms (04:28.044)
>
> As another datapoint, I measured this patch against the problem from
> https://www.postgresql.org/message-id/20170227111732.vrx5v72ighehw...@alap3.anarazel.de
> (see top post in thread), and it indeed fixes the runtime issue -
> there's still considerably higher memory usage and some runtime
> overhead, but the quadratic behaviour is gone.
>
> I think we should go forward with something like this patch in all
> branches, and only use Tomas' patch in master, because they're
> considerably larger.

So you've tried to switch hashjoin to the slab allocators? Or what have
you compared?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Poor memory context performance in large hash joins
On 2017-02-27 19:20:56 +0100, Tomas Vondra wrote:
> On 02/27/2017 12:55 PM, Andres Freund wrote:
> > On 2017-02-24 15:18:04 -0800, Andres Freund wrote:
> > > On 2017-02-24 15:12:37 -0800, Andres Freund wrote:
> > > > On 2017-02-24 18:04:18 -0500, Tom Lane wrote:
> > > > > Concretely, something like the attached.  This passes regression
> > > > > tests but I've not pushed on it any harder than that.
> > > >
> > > > Heh, I'd just gotten something that didn't immediately crash anymore ;)
> > > >
> > > > Running your patch against Jeff's test-case, verified before that I
> > > > could easily reproduce the O(N^2) cost.
> > >
> > > Oh, that didn't take as long as I was afraid (optimized/non-assert
> > > build):
> > >
> > > postgres[26268][1]=# SET work_mem = '13GB';
> > > SET
> > > Time: 2.591 ms
> > > postgres[26268][1]=# select count(*) from foobar2 where not exists
> > > (select 1 from foobar t where t.titleid=foobar2.titleid);
> > > Time: 268043.710 ms (04:28.044)
> >
> > As another datapoint, I measured this patch against the problem from
> > https://www.postgresql.org/message-id/20170227111732.vrx5v72ighehw...@alap3.anarazel.de
> > (see top post in thread), and it indeed fixes the runtime issue -
> > there's still considerably higher memory usage and some runtime
> > overhead, but the quadratic behaviour is gone.
> >
> > I think we should go forward with something like this patch in all
> > branches, and only use Tomas' patch in master, because they're
> > considerably larger.
>
> So you've tried to switch hashjoin to the slab allocators? Or what have
> you compared?

No, sorry for not being more explicit about this.  Meant that Tom's patch
addresses the performance issue in reorderbuffer.c to a good degree
(i.e. gets rid of the quadratic cost, even though constants are higher
than w/ your patches).  As the patch here is a lot smaller, it seems like
a better choice for the back-branches than backporting slab.c/generation.c.

Andres
Re: [HACKERS] Poor memory context performance in large hash joins
On 2017-02-24 15:18:04 -0800, Andres Freund wrote:
> On 2017-02-24 15:12:37 -0800, Andres Freund wrote:
> > On 2017-02-24 18:04:18 -0500, Tom Lane wrote:
> > > Concretely, something like the attached.  This passes regression tests
> > > but I've not pushed on it any harder than that.
> >
> > Heh, I'd just gotten something that didn't immediately crash anymore ;)
> >
> > Running your patch against Jeff's test-case, verified before that I
> > could easily reproduce the O(N^2) cost.
>
> Oh, that didn't take as long as I was afraid (optimized/non-assert build):
>
> postgres[26268][1]=# SET work_mem = '13GB';
> SET
> Time: 2.591 ms
> postgres[26268][1]=# select count(*) from foobar2 where not exists
> (select 1 from foobar t where t.titleid=foobar2.titleid);
> Time: 268043.710 ms (04:28.044)

As another datapoint, I measured this patch against the problem from
https://www.postgresql.org/message-id/20170227111732.vrx5v72ighehw...@alap3.anarazel.de
(see top post in thread), and it indeed fixes the runtime issue -
there's still considerably higher memory usage and some runtime
overhead, but the quadratic behaviour is gone.

I think we should go forward with something like this patch in all
branches, and only use Tomas' patch in master, because they're
considerably larger.

Regards,

Andres
Re: [HACKERS] Poor memory context performance in large hash joins
On 2017-02-24 15:12:37 -0800, Andres Freund wrote:
> On 2017-02-24 18:04:18 -0500, Tom Lane wrote:
> > Concretely, something like the attached.  This passes regression tests
> > but I've not pushed on it any harder than that.
>
> Heh, I'd just gotten something that didn't immediately crash anymore ;)
>
> Running your patch against Jeff's test-case, verified before that I
> could easily reproduce the O(N^2) cost.

Oh, that didn't take as long as I was afraid (optimized/non-assert build):

postgres[26268][1]=# SET work_mem = '13GB';
SET
Time: 2.591 ms
postgres[26268][1]=# select count(*) from foobar2 where not exists
(select 1 from foobar t where t.titleid=foobar2.titleid);
┌───────┐
│ count │
├───────┤
│     0 │
└───────┘
(1 row)

Time: 268043.710 ms (04:28.044)

Profile:
  13.49%  postgres        [.] ExecHashTableInsert
  11.16%  postgres        [.] BufFileRead
   9.20%  postgres        [.] ExecHashIncreaseNumBatches
   9.19%  libc-2.24.so    [.] __memmove_avx_unaligned_erms
   6.74%  postgres        [.] dense_alloc.isra.0
   5.53%  postgres        [.] ExecStoreMinimalTuple
   5.14%  postgres        [.] ExecHashJoinGetSavedTuple.isra.0
   4.68%  postgres        [.] AllocSetAlloc
   4.12%  postgres        [.] AllocSetFree
   2.31%  postgres        [.] palloc
   2.06%  [kernel]        [k] free_pcppages_bulk

I'd previously run the test for 30min, with something like 99% of the
profile being in AllocSetFree().

- Andres
Re: [HACKERS] Poor memory context performance in large hash joins
On 2017-02-24 18:04:18 -0500, Tom Lane wrote:
> Concretely, something like the attached.  This passes regression tests
> but I've not pushed on it any harder than that.

Heh, I'd just gotten something that didn't immediately crash anymore ;)

Running your patch against Jeff's test-case, verified before that I
could easily reproduce the O(N^2) cost.

- Andres
Re: [HACKERS] Poor memory context performance in large hash joins
Jeff Janes writes:
> On Fri, Feb 24, 2017 at 10:59 AM, Tom Lane wrote:
>> Uh, what?  In a doubly-linked list, you can remove an element in O(1)
>> time, you don't need any searching.

> Currently it is walking the chain to identify which block holds the chunk
> in the first place, not just to get the pointer to the previous block.  But
> I see that that could be fixed by pointer arithmetic once there is a reason
> to fix it.  Which there isn't a reason to as long as you need to walk the
> chain to get the prev pointer anyway.

> Like this?
> targetblock = (AllocBlock) (((char*)chunk) - ALLOC_BLOCKHDRSZ);

Right.  We're just turning around the existing address calculation.
It'd be a good idea to provide some cross-check that we found a
sane-looking block header, but that's not that hard --- we know what
ought to be in aset, freeptr, and endptr for a single-chunk block's
header, and that seems like enough of a crosscheck to me.

Concretely, something like the attached.  This passes regression tests
but I've not pushed on it any harder than that.

One argument against this is that it adds some nonzero amount of overhead
to block management; but considering that we are calling malloc or realloc
or free alongside any one of these manipulations, that overhead should be
pretty negligible.  I'm a bit more worried about increasing the alloc
block header size by 8 bytes, but that would only really matter with a
very small allocChunkLimit.

			regards, tom lane

diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index 4dfc3ec..c250bae 100644
*** a/src/backend/utils/mmgr/aset.c
--- b/src/backend/utils/mmgr/aset.c
*** typedef AllocSetContext *AllocSet;
*** 201,207
  typedef struct AllocBlockData
  {
  	AllocSet	aset;			/* aset that owns this block */
! 	AllocBlock	next;			/* next block in aset's blocks list */
  	char	   *freeptr;		/* start of free space in this block */
  	char	   *endptr;			/* end of space in this block */
  } AllocBlockData;
--- 201,208
  typedef struct AllocBlockData
  {
  	AllocSet	aset;			/* aset that owns this block */
! 	AllocBlock	prev;			/* prev block in aset's blocks list, if any */
! 	AllocBlock	next;			/* next block in aset's blocks list, if any */
  	char	   *freeptr;		/* start of free space in this block */
  	char	   *endptr;			/* end of space in this block */
  } AllocBlockData;
*** AllocSetContextCreate(MemoryContext pare
*** 522,528
--- 523,532
  	block->aset = set;
  	block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
  	block->endptr = ((char *) block) + blksize;
+ 	block->prev = NULL;
  	block->next = set->blocks;
+ 	if (block->next)
+ 		block->next->prev = block;
  	set->blocks = block;
  	/* Mark block as not to be released at reset time */
  	set->keeper = block;
*** AllocSetReset(MemoryContext context)
*** 604,609
--- 608,614
  			VALGRIND_MAKE_MEM_NOACCESS(datastart, block->freeptr - datastart);
  #endif
  			block->freeptr = datastart;
+ 			block->prev = NULL;
  			block->next = NULL;
  		}
  		else
*** AllocSetAlloc(MemoryContext context, Siz
*** 710,725
  #endif

  		/*
! 		 * Stick the new block underneath the active allocation block, so that
! 		 * we don't lose the use of the space remaining therein.
  		 */
  		if (set->blocks != NULL)
  		{
  			block->next = set->blocks->next;
  			set->blocks->next = block;
  		}
  		else
  		{
  			block->next = NULL;
  			set->blocks = block;
  		}
--- 715,734
  #endif

  		/*
! 		 * Stick the new block underneath the active allocation block, if any,
! 		 * so that we don't lose the use of the space remaining therein.
  		 */
  		if (set->blocks != NULL)
  		{
+ 			block->prev = set->blocks;
  			block->next = set->blocks->next;
+ 			if (block->next)
+ 				block->next->prev = block;
  			set->blocks->next = block;
  		}
  		else
  		{
+ 			block->prev = NULL;
  			block->next = NULL;
  			set->blocks = block;
  		}
*** AllocSetAlloc(MemoryContext context, Siz
*** 900,906
--- 909,918
  		VALGRIND_MAKE_MEM_NOACCESS(block->freeptr,
  								   blksize - ALLOC_BLOCKHDRSZ);

+ 		block->prev = NULL;
  		block->next = set->blocks;
+ 		if (block->next)
+ 			block->next->prev = block;
  		set->blocks = block;
  	}
*** AllocSetFree(MemoryContext context, void
*** 960,988
  	{
  		/*
  		 * Big chunks are certain to have been allocated as single-chunk
! 		 * blocks.  Find the containing block and return it to malloc().
  		 */
! 		AllocBlock	block = set->blocks;
! 		AllocBlock	prevblock = NULL;

! 		while (block != NULL)
! 		{
! 			if (chunk == (AllocChunk) (((char *) block) + ALLOC_BLOCKHDRSZ))
! 				break;
! 			prevblock = block;
! 			block = block->next;
! 		}
! 		if (block == NULL)
  			elog(ERROR, "could not find block containing chunk %p", chunk);
- 		/* let's just make sure chunk is the only one in
Re: [HACKERS] Poor memory context performance in large hash joins
On Fri, Feb 24, 2017 at 10:59 AM, Tom Lane wrote:
> Jeff Janes writes:
> > On Thu, Feb 23, 2017 at 2:28 PM, Tom Lane wrote:
> >> Maybe it's time to convert that to a doubly-linked list.
>
> > I don't think that would help.  You would need some heuristic to guess
> > whether the chunk you are looking for is near the front, or near the end.
>
> Uh, what?  In a doubly-linked list, you can remove an element in O(1)
> time, you don't need any searching.  It basically becomes
> 	item->prev->next = item->next;
> 	item->next->prev = item->prev;
> modulo possible special cases for the head and tail elements.

Currently it is walking the chain to identify which block holds the chunk
in the first place, not just to get the pointer to the previous block.  But
I see that that could be fixed by pointer arithmetic once there is a reason
to fix it.  Which there isn't a reason to as long as you need to walk the
chain to get the prev pointer anyway.

Like this?

targetblock = (AllocBlock) (((char*)chunk) - ALLOC_BLOCKHDRSZ);

> >> Although if the hash code is producing a whole lot of requests that are
> >> only a bit bigger than the separate-block threshold, I'd say It's Doing
> >> It Wrong.  It should learn to aggregate them into larger requests.
>
> > Right now it is using compiled-in 32KB chunks.  Should it use something
> > like max(32kb, work_mem/128) instead?
>
> I'd say it should double the size of the request each time.  That's what
> we do in most places.

I thought the design goal there was that the space in old chunks could get
re-used into the new chunks in a reasonably fine-grained way.  If the last
chunk contains just over half of all the data, that couldn't happen.

Cheers,

Jeff
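[Editorial illustration: Jeff's one-liner works because a single-chunk block's
payload sits immediately after the block header, so the header address falls out
of pointer arithmetic.  The following is a standalone sketch with simplified
stand-in structures - not aset.c's actual AllocBlockData/AllocChunk layout -
including a freeptr/endptr sanity cross-check of the kind discussed in the
thread.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-in for aset.c's block header (hypothetical layout). */
typedef struct BlockHdr
{
    struct BlockHdr *next;
    char            *freeptr;
    char            *endptr;
} BlockHdr;

#define BLOCKHDRSZ sizeof(BlockHdr)

/* A single-chunk block: header immediately followed by the chunk payload. */
BlockHdr *
alloc_single_chunk_block(size_t chunk_size, void **chunk_out)
{
    BlockHdr *block = malloc(BLOCKHDRSZ + chunk_size);

    block->next = NULL;
    /* the whole block is consumed by the one chunk */
    block->freeptr = ((char *) block) + BLOCKHDRSZ + chunk_size;
    block->endptr = block->freeptr;
    *chunk_out = ((char *) block) + BLOCKHDRSZ;
    return block;
}

/* Recover the block header from the chunk pointer by pure arithmetic,
 * instead of walking the block list. */
BlockHdr *
chunk_get_block(void *chunk)
{
    return (BlockHdr *) (((char *) chunk) - BLOCKHDRSZ);
}

int
demo(void)
{
    void     *chunk;
    BlockHdr *block = alloc_single_chunk_block(64, &chunk);

    /* cross-check: freeptr == endptr holds for a single-chunk block */
    int ok = (chunk_get_block(chunk) == block &&
              block->freeptr == block->endptr);

    free(block);
    return ok;
}
```
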
Re: [HACKERS] Poor memory context performance in large hash joins
Jeff Janes writes:
> On Thu, Feb 23, 2017 at 2:28 PM, Tom Lane wrote:
>> Maybe it's time to convert that to a doubly-linked list.

> I don't think that would help.  You would need some heuristic to guess
> whether the chunk you are looking for is near the front, or near the end.

Uh, what?  In a doubly-linked list, you can remove an element in O(1)
time, you don't need any searching.  It basically becomes

	item->prev->next = item->next;
	item->next->prev = item->prev;

modulo possible special cases for the head and tail elements.

>> Although if the hash code is producing a whole lot of requests that are
>> only a bit bigger than the separate-block threshold, I'd say It's Doing
>> It Wrong.  It should learn to aggregate them into larger requests.

> Right now it is using compiled-in 32KB chunks.  Should it use something
> like max(32kb, work_mem/128) instead?

I'd say it should double the size of the request each time.  That's what
we do in most places.

			regards, tom lane
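[Editorial illustration: the two assignments above, made runnable with the
head-element special case handled.  This is a minimal sketch over a bare Node
type, not PostgreSQL's block list.]

```c
#include <assert.h>
#include <stddef.h>

/* Minimal doubly-linked list with only a head pointer. */
typedef struct Node
{
    struct Node *prev;
    struct Node *next;
} Node;

void
push_head(Node **head, Node *item)
{
    item->prev = NULL;
    item->next = *head;
    if (*head)
        (*head)->prev = item;
    *head = item;
}

/* O(1) removal: no walk over the list, just relink the neighbours. */
void
unlink_node(Node **head, Node *item)
{
    if (item->prev)
        item->prev->next = item->next;
    else
        *head = item->next;     /* special case: item was the head */
    if (item->next)
        item->next->prev = item->prev;
}

int
demo(void)
{
    Node a = {0}, b = {0}, c = {0};
    Node *head = NULL;

    push_head(&head, &a);
    push_head(&head, &b);
    push_head(&head, &c);       /* list is now: c, b, a */
    unlink_node(&head, &b);     /* remove from the middle, no search */
    return head == &c && c.next == &a && a.prev == &c;
}
```
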
Re: [HACKERS] Poor memory context performance in large hash joins
On Thu, Feb 23, 2017 at 10:47 PM, Andres Freund wrote:
> On 2017-02-23 17:28:26 -0500, Tom Lane wrote:
> > Jeff Janes writes:
> > > The number of new chunks can be almost as large as the number of old
> > > chunks, especially if there is a very popular value.  The problem is
> > > that every time an old chunk is freed, the code in aset.c around line
> > > 968 has to walk over all the newly allocated chunks in the linked list
> > > before it can find the old one being freed.  This is an N^2 operation,
> > > and I think it has horrible CPU cache hit rates as well.
> >
> > Maybe it's time to convert that to a doubly-linked list.
>
> Yes, I do think so.  Given that we only have that for full blocks, not
> for small chunks, the cost seems negligible.
>
> That would also, partially, address the performance issue
> http://archives.postgresql.org/message-id/d15dff83-0b37-28ed-0809-95a5cc7292ad%402ndquadrant.com
> addresses, in a more realistically backpatchable manner.
>
> Jeff, do you have a handy demonstrator?

Not exactly "handy", as it takes about 45 minutes to set up and uses >50GB
of disk and 16GB of RAM, but here is a demonstration.

create table foobar as
  select CASE when random()<(420.0/540.0) then 1
              else floor(random()*88)::int
         END as titleid
  from generate_series(1,54000);
create table foobar2 as select distinct titleid from foobar;
analyze;
set work_mem TO "13GB";
select count(*) from foobar2 where not exists
  (select 1 from foobar t where t.titleid=foobar2.titleid);

This will run for effectively forever, unless it gets killed/dies due to
OOM first.  If I have other things consuming some of my 16GB of RAM, then
it gets OOM (which is not a complaint: it is as one should expect given
that I told it work_mem was 13GB).  If I have no other demands on RAM,
then I can't tell if it would eventually OOM or not because of how long
it would take to get that far.

This is inspired by the thread
https://www.postgresql.org/message-id/flat/CACw4T0p4Lzd6VpwptxgPgoTMh2dEKTQBGu7NTaJ1%2BA0PRx1BGg%40mail.gmail.com#cacw4t0p4lzd6vpwptxgpgotmh2dektqbgu7ntaj1+a0prx1...@mail.gmail.com

I ran into this while evaluating Tom's responding patch, but you don't
need to apply that patch to run this example and see the effect.

I don't have an example of a case that demonstrates the problem in the
absence of a degenerate hash bucket.  I think there should be
non-degenerate cases that trigger it, but I haven't been able to produce
one yet.

> > Although if the hash code is producing a whole lot of requests that are
> > only a bit bigger than the separate-block threshold, I'd say It's Doing
> > It Wrong.  It should learn to aggregate them into larger requests.
>
> That's probably right, but we can't really address that in the
> back-branches.  And to me this sounds like something we should address
> in the branches, not just in master.  Even if we'd also fix the
> hash-aggregation logic, I think such an O(n^2) behaviour in the
> allocator is a bad idea in general, and we should fix it anyway.

I don't know how important a back-patch would be.  This is a toy case for
me.  Presumably not for David Hinkle, except he doesn't want a hash join
in the first place, while I want one that finishes in a reasonable time.
It might depend on what happens to Tom's OOM patch.

It would be great if the allocator was made bullet-proof, but I don't
think adding reverse links (or anything else back-patchable) is going to
be enough to do that.

Cheers,

Jeff
Re: [HACKERS] Poor memory context performance in large hash joins
On Thu, Feb 23, 2017 at 2:28 PM, Tom Lane wrote:
> Jeff Janes writes:
> > The number of new chunks can be almost as large as the number of old
> > chunks, especially if there is a very popular value.  The problem is
> > that every time an old chunk is freed, the code in aset.c around line
> > 968 has to walk over all the newly allocated chunks in the linked list
> > before it can find the old one being freed.  This is an N^2 operation,
> > and I think it has horrible CPU cache hit rates as well.
>
> Maybe it's time to convert that to a doubly-linked list.

I don't think that would help.  You would need some heuristic to guess
whether the chunk you are looking for is near the front, or near the end.
And in this case, the desired chunk starts out at the front, and then
keeps moving down the list with each iteration as new things are added to
the front, until near the end of ExecHashIncreaseNumBatches it is freeing
them from near the end.  But in between, it is freeing them from the
middle, so I don't think a doubly-linked list would change it from N^2,
just lower the constant, even if you always knew which end to start at.
Or am I misunderstanding what the implications for a doubly-linked list
are?

What would really help here is if it remembered the next pointer of the
just-freed chunk, and started the scan from that location the next time,
cycling around to the head pointer if it doesn't find anything.  But I
don't think that that is a very general solution.

Or, if you could pass a flag when creating the context telling it whether
memory will be freed mostly-LIFO or mostly-FIFO, and have it use a stack
or a queue accordingly.

> Although if the hash code is producing a whole lot of requests that are
> only a bit bigger than the separate-block threshold, I'd say It's Doing
> It Wrong.  It should learn to aggregate them into larger requests.

Right now it is using compiled-in 32KB chunks.  Should it use something
like max(32kb, work_mem/128) instead?

Cheers,

Jeff
Re: [HACKERS] Poor memory context performance in large hash joins
On 2017-02-24 01:59:01 -0500, Tom Lane wrote:
> Andres Freund writes:
> > On 2017-02-23 17:28:26 -0500, Tom Lane wrote:
> >> Maybe it's time to convert that to a doubly-linked list.
>
> > Yes, I do think so.  Given that we only have that for full blocks, not
> > for small chunks, the cost seems negligible.
> > That would also, partially, address the performance issue
> > http://archives.postgresql.org/message-id/d15dff83-0b37-28ed-0809-95a5cc7292ad%402ndquadrant.com
> > addresses, in a more realistically backpatchable manner.
>
> Yeah, I was wondering if we could get away with back-patching such a
> change.  In principle, nothing outside aset.c should know what's in the
> header of an AllocBlock, but ...

You'd need to go through a fair amount of intentional pain to be affected
by a change to AllocBlockData's structure.  We could add the ->prev
pointer to the end of AllocBlockData's definition to make it less likely
that one would be affected in that unlikely case - but I'm a bit doubtful
it's worth the trouble.

- Andres
Re: [HACKERS] Poor memory context performance in large hash joins
On Fri, Feb 24, 2017 at 12:17 PM, Andres Freund wrote:
> Jeff, do you have a handy demonstrator?

If you want to hit ExecHashIncreaseNumBatches a bunch of times, see the
"ugly" example in the attached.

--
Thomas Munro
http://www.enterprisedb.com

hj-test-queries.sql
Description: Binary data
Re: [HACKERS] Poor memory context performance in large hash joins
Andres Freund writes:
> On 2017-02-23 17:28:26 -0500, Tom Lane wrote:
>> Maybe it's time to convert that to a doubly-linked list.

> Yes, I do think so.  Given that we only have that for full blocks, not
> for small chunks, the cost seems negligible.
> That would also, partially, address the performance issue
> http://archives.postgresql.org/message-id/d15dff83-0b37-28ed-0809-95a5cc7292ad%402ndquadrant.com
> addresses, in a more realistically backpatchable manner.

Yeah, I was wondering if we could get away with back-patching such a
change.  In principle, nothing outside aset.c should know what's in the
header of an AllocBlock, but ...

> Jeff, do you have a handy demonstrator?

A solid test case would definitely help to convince people that it was
worth taking some risk here.

			regards, tom lane
Re: [HACKERS] Poor memory context performance in large hash joins
On 2017-02-23 17:28:26 -0500, Tom Lane wrote:
> Jeff Janes writes:
> > The number of new chunks can be almost as large as the number of old
> > chunks, especially if there is a very popular value.  The problem is
> > that every time an old chunk is freed, the code in aset.c around line
> > 968 has to walk over all the newly allocated chunks in the linked list
> > before it can find the old one being freed.  This is an N^2 operation,
> > and I think it has horrible CPU cache hit rates as well.
>
> Maybe it's time to convert that to a doubly-linked list.

Yes, I do think so.  Given that we only have that for full blocks, not
for small chunks, the cost seems negligible.

That would also, partially, address the performance issue
http://archives.postgresql.org/message-id/d15dff83-0b37-28ed-0809-95a5cc7292ad%402ndquadrant.com
addresses, in a more realistically backpatchable manner.

Jeff, do you have a handy demonstrator?

> Although if the hash code is producing a whole lot of requests that are
> only a bit bigger than the separate-block threshold, I'd say It's Doing
> It Wrong.  It should learn to aggregate them into larger requests.

That's probably right, but we can't really address that in the
back-branches.  And to me this sounds like something we should address
in the branches, not just in master.  Even if we'd also fix the
hash-aggregation logic, I think such an O(n^2) behaviour in the
allocator is a bad idea in general, and we should fix it anyway.

Regards,

Andres
Re: [HACKERS] Poor memory context performance in large hash joins
Jeff Janes writes:
> The number of new chunks can be almost as large as the number of old
> chunks, especially if there is a very popular value.  The problem is
> that every time an old chunk is freed, the code in aset.c around line
> 968 has to walk over all the newly allocated chunks in the linked list
> before it can find the old one being freed.  This is an N^2 operation,
> and I think it has horrible CPU cache hit rates as well.

Maybe it's time to convert that to a doubly-linked list.  Although if the
hash code is producing a whole lot of requests that are only a bit bigger
than the separate-block threshold, I'd say It's Doing It Wrong.  It should
learn to aggregate them into larger requests.

			regards, tom lane
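[Editorial illustration: one way to "aggregate them into larger requests" is
geometric growth - doubling each successive request, as suggested elsewhere in
the thread - which keeps the number of separate oversized allocations
logarithmic in the total bytes needed.  A sketch, assuming the 32KB starting
chunk size that hashjoin's dense allocator is described as using; the function
names are hypothetical.]

```c
#include <assert.h>

/* Number of allocation requests needed when each request doubles. */
int
chunks_needed_doubling(long total_bytes)
{
    long chunk = 32 * 1024;     /* assumed 32KB starting chunk */
    int  n = 0;

    while (total_bytes > 0)
    {
        total_bytes -= chunk;
        chunk *= 2;             /* geometric growth */
        n++;
    }
    return n;
}

/* Number of requests with a fixed 32KB chunk size, for comparison. */
int
chunks_needed_fixed(long total_bytes)
{
    long chunk = 32 * 1024;
    int  n = 0;

    while (total_bytes > 0)
    {
        total_bytes -= chunk;
        n++;
    }
    return n;
}
```

For 1GB of data this is 16 doubling requests versus 32768 fixed-size ones,
which is why a doubling strategy also keeps the block list short enough that
even a linear walk over it stays cheap.
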
Re: [HACKERS] Poor memory context performance in large hash joins
On Thu, Feb 23, 2017 at 2:13 PM, Jeff Janes wrote:
> Is there a good solution to this?  Could the new chunks be put in a
> different memory context, and then destroy the old context and install
> the new one at the end of ExecHashIncreaseNumBatches?  I couldn't find a
> destroy method for memory contexts; it looks like you just reset the
> parent instead.  But I don't think that would work here.

Are you aware of the fact that tuplesort.c got a second memory context
for 9.6, entirely on performance grounds?

--
Peter Geoghegan