Re: [HACKERS] [PATCH] parallel & isolated makefile for plpython
Hi Tom,

On Saturday, October 1, 2016 12:23:03 PM CEST Tom Lane wrote:
> Hm, actually that's unnecessary because Makefile.global already
> established 'all' as the default target.  I'm inclined to think
> that the comment in Makefile.shlib is wrong and should be removed
> or at least rewritten, because I doubt it would work at all to
> include Makefile.shlib before Makefile.global; certainly that's
> not the intended usage.

Correct analysis; the patch in git fixing this is also correct. Thanks!

> > !	# gram.y=>gram.c.  Run `info make -n "Empty Recipes"` for more info.
>
> I'm okay with pointing people at that part of the gmake docs, but don't
> care for insisting on exactly one way of accessing those docs.  How
> about something like "... See "Empty Recipes" in the GNU Make docs for
> more info."?
>
> [ checks around... ]  Wait, actually there's a bigger problem: this
> advice is apparently gmake-version-specific.  I don't see any
> such section title in RHEL6's version of the gmake docs (make 3.81).
> I do find a section titled "Using Empty Commands", which looks like
> it might be the same thing, but a person looking for "Empty Recipes" is
> not going to find that.  Is it worth the trouble to provide this pointer
> if we need a lot of weasel wording about how the section name is spelled?

Valid points. Let's forget about that patch. I like pointing people to
'info' directly instead of googling for a link, but the PostgreSQL docs
are not the right place for that.

To the point: I was confused by two things. First, (a) the docs claim
that there is no _correct_ (rather than, say, _portable_) way to write
one rule for two targets, and (b) the comment for ';' ("The semicolon is
important, otherwise make will choose the built-in rule for ...") looked
too magical, without actually saying what the semicolon does. I wrongly
concentrated on (b) only before.

First suggestion would be something like:

-# There is no correct way to write a rule that generates two files.
+# This is a way to write a rule (for 'gram.c') that generates two files.

And the second:

-# make, we must chain the dependencies like this. The semicolon is
-# important, otherwise make will choose the built-in rule for
-# gram.y=>gram.c.
+# make, we must chain the dependencies like this. The semicolon (Empty
+# Command/Recipe) is important, otherwise make will choose the built-in
+# rule for gram.y=>gram.c.

FTR, also note that something like this could work:

    all: xxx.c xxx.h
    xxx%c xxx%h: xxx.txt
    	touch xxx.c xxx.h
    # info make -n "Pattern Intro", grep for 'multiple targets'

But note that we don't have to iterate on this at all, because that was
probably just my problem (having a look at the make docs resolved it).

Thanks again for the fix in git!

Pavel

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
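For reference, the idiom under discussion (a lone semicolon as an empty recipe, chaining gram.h to gram.c) looks roughly like this; a sketch with illustrative file names, not the exact rule from the PostgreSQL makefiles:

```make
# gram.c and gram.h are both produced by one bison run.  gram.h depends
# on gram.c with an empty recipe (the lone semicolon), so make considers
# gram.h up to date whenever gram.c is.  Without the semicolon, make
# would fall back to its built-in gram.y => gram.c rule instead.
gram.h: gram.c ;

gram.c: gram.y
	bison -d -o gram.c gram.y
```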
Re: [HACKERS] PATCH: two slab-like memory allocators
On 10/02/2016 12:23 AM, John Gorman wrote:
> I reproduced the quadratic pfree performance problem and verified that
> these patches solved it. The slab.c data structures and functions
> contain no quadratic components.
>
> I noticed the sizing loop in SlabContextCreate() and came up with a
> similar formula to determine chunksPerBlock that you arrived at.
>
>> Firstly, I've realized there's an issue when chunkSize gets too
>> large - once it exceeds blockSize, the SlabContextCreate() fails as
>> it's impossible to place a single chunk into the block. In
>> reorderbuffer, this may happen when the tuples (allocated in
>> tup_context) get larger than 8MB, as the context uses
>> SLAB_LARGE_BLOCK_SIZE (which is 8MB).
>>
>> But maybe there's a simpler solution - we may simply cap the
>> chunkSize (in GenSlab) to ALLOC_CHUNK_LIMIT. That's fine, because
>> AllocSet handles those requests in a special way - for example
>> instead of tracking them in a freelist, those chunks get freed
>> immediately.
>
> I like this approach because it fixes the performance problems with
> smaller allocations and doesn't change how larger allocations are
> handled.

One more comment about GenSlab, particularly about the unpredictability
of repalloc() behavior. It works for "large" chunks allocated in the
AllocSet part, and mostly does not work for "small" chunks allocated in
the SlabContext. Moreover, the chunkSize changes over time, so for two
chunks of the same size, repalloc() may work on one of them and fail on
the other, depending on the time of allocation.

With SlabContext it's perfectly predictable - the repalloc() call fails
unless the chunk size is exactly the same as before (which is perhaps a
bit pointless, but if we decide to fail even in this case it'll work
100% of the time). But with GenSlabContext it's unclear whether the
call fails or not.

I don't like this unpredictability - I'd much rather have consistent
failures (making sure people don't do repalloc() with GenSlab). But I
don't see a nice way to achieve that - the repalloc() call does not go
through GenSlabRealloc() at all, but directly to SlabRealloc() or
AllocSetRealloc(). The best solution I can think of is adding an
alternate version of AllocSetMethods, pointing to a different
AllocSetRealloc implementation.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Macro customizable hashtable / bitmapscan & aggregation perf
On 10/02/2016 02:17 AM, Andres Freund wrote:
> Hi,
>
>>>> I think a crucial part of the benchmarking will be identifying and
>>>> measuring corner cases, e.g. a lot of rows with the same key, etc.
>>>> Although that probably is not a major issue for the two places
>>>> switched to the new implementation (e.g. execGrouping merges the
>>>> duplicates to a single group, by definition).
>>>
>>> Hm. I don't really see a case where that's going to cause issues -
>>> all the execGrouping.c users only store one key, and either ignore
>>> later ones, or add the data to the initial tuple in the group. I
>>> don't really see any case where repeated keys should cause an issue
>>> for a hash table?
>>
>> Yep, that's pretty much what I suggested. I was wondering more about
>> the other places where this hash table might be used - I've been
>> thinking about hashjoins, but now that I think of it - that's fairly
>> specialized and tuned code. In any case, let's not complicate the
>> discussion for now.
>
> My point is that I don't really see any potential usecase where
> that's an issue.

OK, I think we agree the two places modified by the submitted patch
series are fine. Let's leave discussion about places modified by future
patches for the future ;-)

>> A few comments, after quickly skimming through the first two patches:
>>
>> 1) SH_CREATE does this:
>>
>>     /* round up size to the next power of 2, eases lookups */
>>     if (size < 2)
>>         size = 2;
>>     else
>>         size = sh_pow2_int(size);
>>
>> I very much doubt a hash table with 2 buckets is very useful. What
>> about using some reasonable lower boundary - IIRC hashjoins use 1024
>> buckets, which might be a bit too much, but perhaps 256 would work?
>
> For some cases that'll waste a good bit of space. Consider e.g.
> catcache.c. Callers can (and do) specify more appropriate sizes for
> the individual uses.

Hmmm, OK. I have my doubts about those hardcoded constants, but you're
right that 256 is probably overkill.

>> 5) SH_RESIZE
>>
>> Do we expect to shrink hash tables?
>
> Not at the moment. Should be doable, but I'm not sure it's worth
> developing and testing - I don't see any usecases in pg atm.

OK, sure. Then what about adding this to SH_RESIZE?

    Assert(oldsize <= newsize);

I assumed we might shrink the catcache after invalidation, for example
(but maybe I don't really know how that works).

>> It's not entirely clear why it is guaranteed that there's always an
>> element with optimal position (when load factor < 1)? Adding an
>> explanation or a link to a paper would be helpful, I guess.
>
> As the load factor always has to be below 1, there always has to be
> an empty bucket. Every bucket after an empty bucket has to be at its
> optimal position.

Hmmm, thanks to moving the entries after delete? Adding an assert()
enforcing this seems like a good idea.

>> 7) Should hash agg size estimation for the planner consider the
>> fillfactor?
>>
>> I think we should account for fillfactor - we should account for the
>> allocated memory, even if there's free space in the hash table. After
>> all, that's what hashjoins do.
>
> We didn't really for dynahash (as it basically assumed a fillfactor
> of 1 all the time), not sure if changing it won't also cause issues.

That's a case of glass half-full vs. half-empty, I guess. If we assumed
load factor 1.0, then I see it as accounting for load factor (while you
see it as not accounting for it).

I don't see why this should cause issues - of course, the hash table
may be a bit larger, exceed work_mem and make the tidbitmap lossy, or
switch HashAggregate to GroupAggregate. But I don't think that's a
major issue, as those decisions depend on estimates anyway, so we can't
really guarantee them.

Also, with load factor 0.8 the table is only ~20% larger, so even if we
don't include that into work_mem it's a reasonably small error (easily
smaller than errors in the pre-9.5 HashJoin accounting, which did not
include chunk headers etc.).

So I won't fight for this, but I don't see why not to account for it.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] PATCH: two slab-like memory allocators
On 10/02/2016 12:23 AM, John Gorman wrote:
> I reproduced the quadratic pfree performance problem and verified
> that these patches solved it. The slab.c data structures and
> functions contain no quadratic components.
>
> I noticed the sizing loop in SlabContextCreate() and came up with a
> similar formula to determine chunksPerBlock that you arrived at.

;-)

>> Firstly, I've realized there's an issue when chunkSize gets too
>> large - once it exceeds blockSize, the SlabContextCreate() fails as
>> it's impossible to place a single chunk into the block. In
>> reorderbuffer, this may happen when the tuples (allocated in
>> tup_context) get larger than 8MB, as the context uses
>> SLAB_LARGE_BLOCK_SIZE (which is 8MB).
>>
>> But maybe there's a simpler solution - we may simply cap the
>> chunkSize (in GenSlab) to ALLOC_CHUNK_LIMIT. That's fine, because
>> AllocSet handles those requests in a special way - for example
>> instead of tracking them in a freelist, those chunks get freed
>> immediately.
>
> I like this approach because it fixes the performance problems with
> smaller allocations and doesn't change how larger allocations are
> handled.

Right.

> In slab.c it looks like a line in the top comments could be clearer.
> Perhaps this is what is meant.
>
> < * (plus alignment), now wasting memory.
> > * (plus alignment), not wasting memory.
>
> In slab.c some lines are over 80 characters and could be folded.
>
> It would be nice to give each patch version a unique file name.

OK, will fix.

> Nice patch, I enjoyed reading it!

;-)

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] PATCH: two slab-like memory allocators
On 10/02/2016 01:53 AM, Jim Nasby wrote:
> On 9/26/16 9:10 PM, Tomas Vondra wrote:
>> Attached is v2 of the patch, updated based on the review. That means:
>
> + /* make sure the block can store at least one chunk (with 1B for a bitmap)? */
>
> (and the comment below it) I find the question to be confusing... I
> think these would be better as
>
> + /* make sure the block can store at least one chunk (with 1B for a bitmap) */
>
> and
>
> + /* number of chunks we can fit into a block, including header and bitmap */

Thanks, will rephrase the comments a bit.

> I'm also wondering if a 1B freespace bitmap is actually useful. If
> nothing else, there should probably be a #define for the initial
> bitmap size.

That's not the point. The point is that we need to store at least one
entry per block, with one bit in a bitmap. But we can't store just a
single bit - we always allocate at least 1 byte. So it's more an
explanation for the "1" literal in the check, nothing more.

> + /* otherwise add it to the proper freelist bin */
>
> Looks like something went missing... :)

Ummm? The patch contains this:

+    /* otherwise add it to the proper freelist bin */
+    if (set->freelist[block->nfree])
+        set->freelist[block->nfree]->prev = block;
+
+    block->next = set->freelist[block->nfree];
+    set->freelist[block->nfree] = block;

which does exactly the thing it should do. Or what is missing?

> Should zeroing the block in SlabAlloc be optional like it is with
> palloc/palloc0?

Good catch. The memset(,0) should not be in SlabAlloc() at all, as the
zeroing is handled in mctx.c, independently of the implementation.

> +/*
> + * If there are no free chunks in any existing block, create a new block
> + * and put it to the last freelist bucket.
> + */
> +if (set->minFreeCount == 0)
>
> Couldn't there be blocks that have a free count > minFreeCount? (In
> which case you don't want to just alloc a new block, no?) Nevermind,
> after reading further down I understand what's going on.
>
> I got confused by "we've decreased nfree for a block with the minimum
> count" until I got to the part that deals with minFreeCount = 0. I
> think it'd be helpful if the "we've decreased nfree" comment mentioned
> that if nfree is 0 we'll look for the correct value after updating
> the freelists.

Right. I think it'd be good to add an assert ensuring the minFreeCount
value matches the block freelist. Or at least SlabCheck() could check
this.

> + block->bitmapptr[idx/8] |= (0x01 << (idx % 8));
>
> Did you consider using a pre-defined map of values to avoid the
> shift? I know I've seen that somewhere in the code...
>
> Patch 2...
>
> Doesn't GenSlabReset() need to actually free something? If the call
> magically percolates to the other contexts it'd be nice to note that
> in the comment.

No, the other contexts are created as children of GenSlab, so the
memory context infrastructure resets them automatically. I don't think
this needs to be mentioned in the comments - it's a pretty basic part
of the parent-child context relationship.

Thanks!

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Macro customizable hashtable / bitmapscan & aggregation perf
Hi,

On 2016-10-02 01:37:35 +0200, Tomas Vondra wrote:
> On 10/01/2016 09:59 PM, Andres Freund wrote:
>>> What about running a bigger benchmark - say, TPC-DS - and
>>> evaluating the impact?
>>
>> Worthwhile, although obviously the impact will be a lot smaller,
>> since they're not entirely bottlenecked on hash-aggs and bitmap
>> scans.
>
> Sure, the improvement won't be as significant as on the simple
> queries, but it's interesting IMHO.

Agreed.

>>> I think a crucial part of the benchmarking will be identifying and
>>> measuring corner cases, e.g. a lot of rows with the same key, etc.
>>> Although that probably is not a major issue for the two places
>>> switched to the new implementation (e.g. execGrouping merges the
>>> duplicates to a single group, by definition).
>>
>> Hm. I don't really see a case where that's going to cause issues -
>> all the execGrouping.c users only store one key, and either ignore
>> later ones, or add the data to the initial tuple in the group. I
>> don't really see any case where repeated keys should cause an issue
>> for a hash table?
>
> Yep, that's pretty much what I suggested. I was wondering more about
> the other places where this hash table might be used - I've been
> thinking about hashjoins, but now that I think of it - that's fairly
> specialized and tuned code. In any case, let's not complicate the
> discussion for now.

My point is that I don't really see any potential usecase where that's
an issue.

> A few comments, after quickly skimming through the first two patches:
>
> 1) SH_CREATE does this:
>
>     /* round up size to the next power of 2, eases lookups */
>     if (size < 2)
>         size = 2;
>     else
>         size = sh_pow2_int(size);
>
> I very much doubt a hash table with 2 buckets is very useful. What
> about using some reasonable lower boundary - IIRC hashjoins use 1024
> buckets, which might be a bit too much, but perhaps 256 would work?

For some cases that'll waste a good bit of space. Consider e.g.
catcache.c. Callers can (and do) specify more appropriate sizes for the
individual uses.

> 2) tb->resize_above
>
> I'd say 'resize_threshold' is a better name. Also, 0.8 should be
> defined as a constant somewhere (not sure if it makes sense to make
> it part of SH_TYPE, so that hash tables may use different load
> factors).

Fair point.

> 3) 'size' is a rather poor name for a parameter (e.g. in SH_CREATE or
> SH_RESIZE), as it's unclear whether it's 'element size in bytes',
> 'total size in bytes' or 'number of entries'.

Ok.

> 4) SH_CONTAINS sounds more like a function checking whether a hash
> table contains a key, not like a 'type of hash table entry'. What
> about SH_ENTRY_TYPE instead? Also, SH_KEYTYPE should be SH_KEY_TYPE
> I guess.

Hm, ok.

> 5) SH_RESIZE
>
> Do we expect to shrink hash tables?

Not at the moment. Should be doable, but I'm not sure it's worth
developing and testing - I don't see any usecases in pg atm.

> It's not entirely clear why it is guaranteed that there's always an
> element with optimal position (when load factor < 1)? Adding an
> explanation or a link to a paper would be helpful, I guess.

As the load factor always has to be below 1, there always has to be an
empty bucket. Every bucket after an empty bucket has to be at its
optimal position.

> 6) SH_STAT
>
> This bit is a bit suspicious:
>
>     uint32 max_collisions = 0;
>
>     ...
>
>     /* single contained element is not a collision */
>     curcoll--;
>     total_collisions += curcoll;
>     if (curcoll > max_collisions)
>         max_collisions = curcoll - 1;
>
> Shouldn't the last line be just "max_collisions = curcoll"?

Hm. That's probably a refactoring artefact. Nice catch.

> 7) Should hash agg size estimation for the planner consider the
> fillfactor?
>
> I think we should account for fillfactor - we should account for the
> allocated memory, even if there's free space in the hash table. After
> all, that's what hashjoins do.

We didn't really for dynahash (as it basically assumed a fillfactor of
1 all the time), not sure if changing it won't also cause issues.

Thanks!

Greetings,

Andres Freund
Re: [HACKERS] PATCH: two slab-like memory allocators
On 9/26/16 9:10 PM, Tomas Vondra wrote:
> Attached is v2 of the patch, updated based on the review. That means:

+ /* make sure the block can store at least one chunk (with 1B for a bitmap)? */

(and the comment below it) I find the question to be confusing... I
think these would be better as

+ /* make sure the block can store at least one chunk (with 1B for a bitmap) */

and

+ /* number of chunks we can fit into a block, including header and bitmap */

I'm also wondering if a 1B freespace bitmap is actually useful. If
nothing else, there should probably be a #define for the initial bitmap
size.

+ /* otherwise add it to the proper freelist bin */

Looks like something went missing... :)

Should zeroing the block in SlabAlloc be optional like it is with
palloc/palloc0?

+ /*
+  * If there are no free chunks in any existing block, create a new block
+  * and put it to the last freelist bucket.
+  */
+ if (set->minFreeCount == 0)

Couldn't there be blocks that have a free count > minFreeCount? (In
which case you don't want to just alloc a new block, no?) Nevermind,
after reading further down I understand what's going on.

I got confused by "we've decreased nfree for a block with the minimum
count" until I got to the part that deals with minFreeCount = 0. I
think it'd be helpful if the "we've decreased nfree" comment mentioned
that if nfree is 0 we'll look for the correct value after updating the
freelists.

+ block->bitmapptr[idx/8] |= (0x01 << (idx % 8));

Did you consider using a pre-defined map of values to avoid the shift?
I know I've seen that somewhere in the code...

Patch 2...

Doesn't GenSlabReset() need to actually free something? If the call
magically percolates to the other contexts it'd be nice to note that in
the comment.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)  mobile: 512-569-9461
Re: [HACKERS] Macro customizable hashtable / bitmapscan & aggregation perf
On 10/01/2016 09:59 PM, Andres Freund wrote:
> Hi,
>
> On 2016-10-01 20:19:21 +0200, Tomas Vondra wrote:
>> On 10/01/2016 02:44 AM, Andres Freund wrote:
>>> Hi,
>>>
>>> On 2016-07-26 17:43:33 -0700, Andres Freund wrote:
>>>> In the attached patch I've attached simplehash.h, which can be
>>>> customized by a bunch of macros, before being inlined. There's also
>>>> a patch using this for tidbitmap.c and nodeAgg/nodeSubplan/... via
>>>> execGrouping.c.
>>>
>>> Attached is a significantly improved version of this series. The
>>> main changes are:
>>>
>>> - The hash table now uses robin-hood style hash collision handling.
>>>   See the commit message of 0002 (or simplehash.h) for an
>>>   introduction. That significantly reduces the impact of
>>>   "clustering" due to linear addressing.
>>
>> Interesting. Have you looked at cuckoo hashing?
>
> Yes. I don't think it's a good fit for modern CPUs. Because it
> doesn't probe linearly you constantly have random uncached memory
> accesses.
>
> I've played with a few schemes, and they all seemed to be slower than
> linear probing deviations, unless you intentionally used a wrong
> hash-distribution, or intentionally "unbalanced" the hash-space by
> iterating over either end of the keyspace and moved the entries
> around.

OK, understood.

>>> - Significant comment and correctness fixes, both in simplehash.h
>>>   itself, and 0003/0004.
>>> - A lot of other random performance improvements for the hashing
>>>   code.
>>>
>>> Unfortunately I'm running out of battery right now, so I don't want
>>> to re-run the benchmarks posted upthread (loading takes a while).
>>> But the last time I ran them all the results after the patches were
>>> better than before.
>>
>> Well, I have rather bad experience with running benchmarks on
>> laptops anyway - a lot of noise due to power management, etc.
>
> Well, with factor ~2x improvements that's not really a big detractor.
> Using a new CPU makes some forms of analysis easier (better PMUs).

True.

> For longrunning tests I agree.

>> What about running a bigger benchmark - say, TPC-DS - and evaluating
>> the impact?
>
> Worthwhile, although obviously the impact will be a lot smaller,
> since they're not entirely bottlenecked on hash-aggs and bitmap
> scans.

Sure, the improvement won't be as significant as on the simple queries,
but it's interesting IMHO.

>> I think a crucial part of the benchmarking will be identifying and
>> measuring corner cases, e.g. a lot of rows with the same key, etc.
>> Although that probably is not a major issue for the two places
>> switched to the new implementation (e.g. execGrouping merges the
>> duplicates to a single group, by definition).
>
> Hm. I don't really see a case where that's going to cause issues -
> all the execGrouping.c users only store one key, and either ignore
> later ones, or add the data to the initial tuple in the group. I
> don't really see any case where repeated keys should cause an issue
> for a hash table?

Yep, that's pretty much what I suggested. I was wondering more about
the other places where this hash table might be used - I've been
thinking about hashjoins, but now that I think of it - that's fairly
specialized and tuned code. In any case, let's not complicate the
discussion for now.

>>> This patch series currently consists of four patches:
>>>
>>> - 0001 - add unlikely/likely() macros. The use here is not entirely
>>>   mandatory, but they do provide a quite consistent improvement.
>>>   There are other threads where they proved to be beneficial, so I
>>>   see little reason not to introduce them.
>>>
>>> - 0002 - the customizable hashtable itself. It now contains
>>>   documentation.
>>>
>>> - 0003 - convert tidbitmap.c to use the new hashmap. This includes
>>>   a fix for the issue mentioned in [1], to improve performance in
>>>   heavily lossy scenarios (otherwise we could regress in some
>>>   extreme cases)
>>>
>>> - 0004 - convert execGrouping.c, e.g. used by hash aggregates
>>>
>>> While not quite perfect yet, I do think this is at a state where
>>> input is needed. I don't want to go a lot deeper into this rabbit
>>> hole, before we have some agreement that this is a sensible course
>>> of action.
>>
>> So, is it the right time to do some benchmarking?
>
> That's one thing that's required, yes. The other is whether we can
> live with the ugliness of implementing templates via macros. I do
> think we can, but...

Hmmm ... not sure. If one of the points is to get rid of function calls
determined at runtime (which make it impossible for the compiler to
optimize the code), then I can think of three options:

(a) copy-paste the code for each place
(b) use some templating
(c) use JIT

I think (b) is way better than (a), and I don't think we're using JIT
anywhere at this point. So while the macro-based templates look a bit
awkward, I'm not aware of a better C thing.

A few comments, after quickly skimming through the first two patches:

1) SH_CREATE does this:

    /* round up size to the next power of 2, eases lookups */
    if (size < 2)
        size = 2;
    else
        size = sh_pow2_int(size);

I very much doubt a hash table with 2 buckets is very useful. What
about using some reasonable lower boundary - IIRC hashjoins use 1024
buckets, which might be a bit too much, but perhaps 256 would work?
Re: [HACKERS] PATCH: two slab-like memory allocators
I reproduced the quadratic pfree performance problem and verified that
these patches solved it. The slab.c data structures and functions
contain no quadratic components.

I noticed the sizing loop in SlabContextCreate() and came up with a
similar formula to determine chunksPerBlock that you arrived at.

> Firstly, I've realized there's an issue when chunkSize gets too
> large - once it exceeds blockSize, the SlabContextCreate() fails
> as it's impossible to place a single chunk into the block. In
> reorderbuffer, this may happen when the tuples (allocated in
> tup_context) get larger than 8MB, as the context uses
> SLAB_LARGE_BLOCK_SIZE (which is 8MB).
>
> But maybe there's a simpler solution - we may simply cap the
> chunkSize (in GenSlab) to ALLOC_CHUNK_LIMIT. That's fine, because
> AllocSet handles those requests in a special way - for example
> instead of tracking them in a freelist, those chunks get freed
> immediately.

I like this approach because it fixes the performance problems with
smaller allocations and doesn't change how larger allocations are
handled.

In slab.c it looks like a line in the top comments could be clearer.
Perhaps this is what is meant.

< * (plus alignment), now wasting memory.
> * (plus alignment), not wasting memory.

In slab.c some lines are over 80 characters and could be folded.

It would be nice to give each patch version a unique file name.

Nice patch, I enjoyed reading it!

Best,
John Gorman

On Mon, Sep 26, 2016 at 10:10 PM, Tomas Vondra wrote:
> Hi,
>
> Attached is v2 of the patch, updated based on the review. That means:
>
> - Better comment explaining how free chunks are tracked in Slab
>   context.
>
> - Removed the unused SlabPointerIsValid macro.
>
> - Modified the comment before SlabChunkData, explaining how it
>   relates to StandardChunkHeader.
>
> - Replaced the two Assert() calls with elog().
>
> - Implemented SlabCheck(). I've ended up with quite a few checks
>   there, checking pointers between the context, block and chunks,
>   checks due to MEMORY_CONTEXT_CHECKING etc. And of course,
>   cross-checking the number of free chunks (bitmap, freelist vs.
>   chunk header).
>
> - I've also modified SlabContextCreate() to compute chunksPerBlock a
>   bit more efficiently (use a simple formula instead of the loop,
>   which might be a bit too expensive for large blocks / small
>   chunks).
>
> I haven't done any changes to GenSlab, but I do have a few notes:
>
> Firstly, I've realized there's an issue when chunkSize gets too large
> - once it exceeds blockSize, the SlabContextCreate() fails as it's
> impossible to place a single chunk into the block. In reorderbuffer,
> this may happen when the tuples (allocated in tup_context) get larger
> than 8MB, as the context uses SLAB_LARGE_BLOCK_SIZE (which is 8MB).
>
> For Slab the elog(ERROR) is fine as both parameters are controlled by
> the developer directly, but GenSlab computes the chunkSize on the
> fly, so we must not let it fail like that - that'd result in
> unpredictable failures, which is not very nice.
>
> I see two ways to fix this. We may either increase the block size
> automatically - e.g. instead of specifying chunkSize and blockSize
> when creating the Slab, specify chunkSize and chunksPerBlock (and
> then choose the smallest 2^k block large enough). For example with
> chunkSize=96 and chunksPerBlock=1000, we'd get 128kB blocks, as
> that's the closest 2^k block larger than 96000 bytes.
>
> But maybe there's a simpler solution - we may simply cap the
> chunkSize (in GenSlab) to ALLOC_CHUNK_LIMIT. That's fine, because
> AllocSet handles those requests in a special way - for example
> instead of tracking them in a freelist, those chunks get freed
> immediately.
>
> regards
>
> --
> Tomas Vondra                  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] VACUUM's ancillary tasks
On Sun, Oct 2, 2016 at 9:34 AM, Vik Fearing wrote: > On 10/01/2016 09:28 AM, Thomas Munro wrote: >> On Mon, Aug 29, 2016 at 1:26 PM, Vik Fearing wrote: >>> The attached two patches scratch two itches I've been having for a >>> while. I'm attaching them together because the second depends on the first. >>> >>> Both deal with the fact that [auto]vacuum has taken on more roles than >>> its original purpose. >>> >>> >>> Patch One: autovacuum insert-heavy tables >>> >>> If you have a table that mostly receives INSERTs, it will never get >>> vacuumed because there are no (or few) dead rows. I have added an >>> "inserts_since_vacuum" field to PgStat_StatTabEntry which works exactly >>> the same way as "changes_since_analyze" does. >>> >>> The reason such a table needs to be vacuumed is currently twofold: the >>> visibility map is not updated, slowing down index-only scans; and BRIN >>> indexes are not maintained, rendering them basically useless. >> >> I'm aware of those two problems, but not very familiar with the >> details. I don't feel qualified to say whether insert counting is the >> best approach to the problem at this point. I looked into it a little >> bit however, and had the following thoughts: >> >> About BRIN indexes: I couldn't find an explanation of why BRIN >> indexes don't automatically create new summary tuples when you insert >> a new tuple in an unsummarised page range. Is it deferred until >> VACUUM time in order to untangle some otherwise unresolvable >> interlocking or crash safety problem, or could that one day be done? >> Assuming that it must be deferred for some technical reason and there >> is no way around it, then I wonder if there is a more direct and >> accurate way to figure out when it's necessary than counting inserts. >> Counting inserts seems slightly bogus because you can't tell whether >> those were inserts into an existing summarised block which is >> self-maintaining or not. 
At first glance it looks a bit like >> unsummarised ranges can only appear at the end of the table, is that >> right? If so, couldn't you detect the number of unsummarised BRIN >> blocks just by comparing the highest summarised BRIN block and the >> current heap size? >> >> About visibility maps: How crazy would it be to estimate the number >> of not-all-visible pages instead? It would be less work to count that >> since it would only increase when the *first* tuple is inserted into a >> page that is currently all visible (ie when the bit is cleared), not >> for every tuple inserted into any page like your inserts_since_vacuum >> counter. Another difference is that inserts_since_vacuum is reset >> even if vacuum finds that it *can't* set the all-visible bit for a >> given page yet because of some concurrent transaction. In that case >> the bit is still not set but autovacuum has no reason to be triggered >> again. > > Sure, I could handle each case separately, but the goal of this patch > (as hinted at in the Subject) is to generalize all the different tasks > we've been giving to VACUUM. The only missing piece is what the first > patch addresses; which is insert-mostly tables never getting vacuumed. Yeah, that makes sense. I just wanted to discuss what the ideal launch conditions would be for those particular ancillary jobs, and then figure out whether the difference matters. Generally, I think changes to autovacuum heuristics need some consensus-building discussion, especially in light of other related ideas from Jeff Janes, and from people involved with BRIN and visibility map design, including Simon who signed up as a reviewer. Since we're out of time I'm going to move this to the November CF, and let's hear from them. -- Thomas Munro http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
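The per-page heuristic suggested above (advance a counter only when an insert clears a page's all-visible bit, rather than for every inserted tuple) can be contrasted with per-tuple counting in a toy Python simulation. All names here are illustrative; this is a sketch of the idea, not PostgreSQL code:

```python
# Toy simulation contrasting the patch's per-tuple counter with the
# suggested per-page one, which only advances when an insert clears a
# page's all-visible bit.  (Illustrative names; not PostgreSQL code.)
def simulate(inserts, all_visible_pages):
    inserts_since_vacuum = 0        # counts every inserted tuple
    not_all_visible_pages = 0       # counts only newly-dirtied pages
    for page in inserts:
        inserts_since_vacuum += 1
        if page in all_visible_pages:       # first insert clears the bit
            all_visible_pages.discard(page)
            not_all_visible_pages += 1
    return inserts_since_vacuum, not_all_visible_pages

# 1000 tuples all landing on one previously all-visible page:
print(simulate([7] * 1000, {7, 8, 9}))   # (1000, 1)
```

The point of the contrast: a heavy insert stream into a few pages inflates the per-tuple counter by orders of magnitude while only a handful of visibility-map bits actually need resetting.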
[HACKERS] Contains and is contained by operators of inet datatypes
The attached patch adds <@, @>, <<@, and @>> operator symbols for the inet datatypes, to replace <<=, >>=, <<, and >>. The <@ and @> symbols are used for containment by all datatypes except inet, in particular by the geometric types and arrays, and by the cube, hstore, intarray, and ltree extensions. The <@ and @> symbols were standardised in version 8.2 by Tom Lane in 2006 [1]; the previous symbols were left in place but deprecated. The patch does exactly the same for the inet datatypes. << and >> are the standard symbols for the strictly-left-of and strictly-right-of operators, which would also make sense for inet datatypes. If we make this change now, we can remove the old symbols and reuse them for new operators in the distant future. The patch removes the recently committed SP-GiST index support for the existing operator symbols, to give users more reason to switch to the new symbols. This change will also indirectly deprecate the undocumented, non-transparent btree index support that works sometimes for some of the existing operators [2]. The patch includes changes to the regression tests and the documentation. I will add it to the 2016-11 Commitfest. [1] https://www.postgresql.org/message-id/14277.1157304...@sss.pgh.pa.us [2] https://www.postgresql.org/message-id/389.1042992...@sss.pgh.pa.us 0001-inet-contain-op-v1.patch Description: Binary data
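For readers less familiar with these containment semantics, Python's standard ipaddress module implements the same non-strict subnet/supernet relationships, so it can serve as an analogue of what the proposed <@ ("is contained by or equals") and @> ("contains or equals") operators express. This is an illustration only; the patch itself is SQL/C and unrelated to these Python names:

```python
import ipaddress

# Hypothetical illustration using Python's ipaddress module, not the patch.
net = ipaddress.ip_network("192.168.0.0/24")
sub = ipaddress.ip_network("192.168.0.128/25")

# Proposed 'sub <@ net' (today: sub <<= net): sub is contained by net.
print(sub.subnet_of(net))       # True

# Proposed 'net @> sub' (today: net >>= sub): net contains sub.
print(net.supernet_of(sub))     # True

# Like <<= and >>=, these checks are non-strict: a network contains itself.
print(net.subnet_of(net))       # True
```

The strict variants (today's << and >>) would additionally exclude the equal-networks case.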
Re: [HACKERS] VACUUM's ancillary tasks
On 10/01/2016 09:28 AM, Thomas Munro wrote: > On Mon, Aug 29, 2016 at 1:26 PM, Vik Fearing wrote: >> The attached two patches scratch two itches I've been having for a >> while. I'm attaching them together because the second depends on the first. >> >> Both deal with the fact that [auto]vacuum has taken on more roles than >> its original purpose. >> >> >> Patch One: autovacuum insert-heavy tables >> >> If you have a table that mostly receives INSERTs, it will never get >> vacuumed because there are no (or few) dead rows. I have added an >> "inserts_since_vacuum" field to PgStat_StatTabEntry which works exactly >> the same way as "changes_since_analyze" does. >> >> The reason such a table needs to be vacuumed is currently twofold: the >> visibility map is not updated, slowing down index-only scans; and BRIN >> indexes are not maintained, rendering them basically useless. > > I'm aware of those two problems, but not very familiar with the > details. I don't feel qualified to say whether insert counting is the > best approach to the problem at this point. I looked into it a little > bit however, and had the following thoughts: > > About BRIN indexes: I couldn't find an explanation of why BRIN > indexes don't automatically create new summary tuples when you insert > a new tuple in an unsummarised page range. Is it deferred until > VACUUM time in order to untangle some otherwise unresolvable > interlocking or crash safety problem, or could that one day be done? > Assuming that it must be deferred for some technical reason and there > is no way around it, then I wonder if there is a more direct and > accurate way to figure out when it's necessary than counting inserts. > Counting inserts seems slightly bogus because you can't tell whether > those were inserts into an existing summarised block which is > self-maintaining or not. At first glance it looks a bit like > unsummarised ranges can only appear at the end of the table, is that > right? 
If so, couldn't you detect the number of unsummarised BRIN > blocks just by comparing the highest summarised BRIN block and the > current heap size? > > About visibility maps: How crazy would it be to estimate the number > of not-all-visible pages instead? It would be less work to count that > since it would only increase when the *first* tuple is inserted into a > page that is currently all visible (ie when the bit is cleared), not > for every tuple inserted into any page like your inserts_since_vacuum > counter. Another difference is that inserts_since_vacuum is reset > even if vacuum finds that it *can't* set the all-visible bit for a > given page yet because of some concurrent transaction. In that case > the bit is still not set but autovacuum has no reason to be triggered > again. Sure, I could handle each case separately, but the goal of this patch (as hinted at in the Subject) is to generalize all the different tasks we've been giving to VACUUM. The only missing piece is what the first patch addresses, which is insert-mostly tables never getting vacuumed. As for the second patch, I would like to withdraw it and redesign it, based on your comments. The redesign I have in mind will no longer be dependent on the first patch, so I'll submit it separately. -- Vik Fearing +33 6 46 75 15 36 http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
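The detection suggested in the quoted text, comparing the highest summarised BRIN range against the current heap size, amounts to simple arithmetic. A toy Python sketch with illustrative names, assuming BRIN's default of 128 heap pages per range (the real check would of course live in C):

```python
# Toy estimate of unsummarised BRIN ranges from the highest summarised
# range and the current heap size.  Names and API are illustrative only;
# 128 is BRIN's default pages_per_range.
def unsummarised_ranges(heap_pages, highest_summarised_range,
                        pages_per_range=128):
    total_ranges = -(-heap_pages // pages_per_range)   # ceiling division
    return max(0, total_ranges - (highest_summarised_range + 1))

# Heap grew to 1000 pages while only ranges 0..3 are summarised:
print(unsummarised_ranges(1000, 3))   # 4 (ceil(1000/128) = 8 ranges)
```

A count like this could in principle feed a trigger condition directly, instead of inferring the same situation from a raw insert counter.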
Re: [HACKERS] Macro customizable hashtable / bitmapscan & aggregation perf
On 2016-10-01 20:04:18 +0100, Greg Stark wrote: > On Sat, Oct 1, 2016 at 1:44 AM, Andres Freund wrote: > > > > Unfortunately I'm running out battery right now, so I don't want to > > re-run the benchmarks posted upthread (loading takes a while). But the > > last time I ran them all the results after the patches were better than > > before. > I have a machine sitting idle now too if you have specific ideas of > what to benchmark. Last time I just extracted interesting parts of queries from tpc-h (to be able to look at bitmapscans/hash-aggs in isolation) and ran tpc-h as a whole. During development I also tried to come up with adverse cases (e.g. huge bitmapscans, with very low work_mem). I don't really have a lot better ideas than that. Andres
Re: [HACKERS] Macro customizable hashtable / bitmapscan & aggregation perf
Hi, On 2016-10-01 20:19:21 +0200, Tomas Vondra wrote: > On 10/01/2016 02:44 AM, Andres Freund wrote: > > Hi, > > > > On 2016-07-26 17:43:33 -0700, Andres Freund wrote: > > > In the attached patch I've attached simplehash.h, which can be > > > customized by a bunch of macros, before being inlined. There's also a > > > patch using this for tidbitmap.c and nodeAgg/nodeSubplan/... via > > > execGrouping.c. > > > > Attached is a significantly improved version of this series. The main > > changes are: > > > > - The hash table now uses robin-hood style hash collision handling. See the > > commit message of 0002 (or simplehash.h) for an introduction. That > > significantly reduces the impact of "clustering" due to linear > > addressing. > > Interesting. Have you looked at cuckoo hashing? Yes. I don't think it's a good fit for modern CPUs. Because it doesn't probe linearly you constantly have random uncached memory accesses. I've played with a few schemes, and they all seemed to be slower than linear probing deviations, unless you intentionally used a wrong hash-distribution, or intentionally "unbalanced" the hash-space by iterating over either end of the keyspace and moved the entries around. > > - Significant comment and correctness fixes, both in simplehash.h > > - itself, and 0003/0004. > > - a lot of other random performance improvements for the hashing code. > > > > > > Unfortunately I'm running out battery right now, so I don't want to > > re-run the benchmarks posted upthread (loading takes a while). But the > > last time I ran them all the results after the patches were better than > > before. > > > > Well, I have rather bad experience with running benchmarks on laptops anyway > - a lot of noise due to power management, etc. Well, with factor ~2x improvements thats not really a big detractor. Using a new CPU makes some forms of analysis easier (better PMUs). For longrunning tests I agree. 
> What about running a bigger benchmark - say, TPC-DS - and evaluating > the impact? Worthwhile, although obviously the impact will be a lot smaller, since they're not entirely bottlenecked on hash-aggs and bitmap scans. > I think a crucial part of the benchmarking will be identifying and measuring > corner cases, e.g. a lot of rows with the same key, etc. > Although that probably is not a major issue for the two places > switched to the new implementation (e.g. execGrouping merges the > duplicates to a single group, by definition). Hm. I don't really see a case where that's going to cause issues - all the execGrouping.c users only store one key, and either ignore later ones, or add the data to the initial tuple in the group. I don't really see any case where repeated keys should cause an issue for a hash table? > > This patch series currently consists out of four patches: > > - 0001 - add unlikely/likely() macros. The use here is not entirely > > mandatory, but they do provide a quite consistent improvement. There > > are other threads where they proved to be beneficial, so I see little > > reason not to introduce them. > > - 0002 - the customizable hashtable itself. It now contains documentation. > > - 0003 - convert tidbitmap.c to use the new hashmap. This includes a fix > > for the issue mentioned in [1], to improve peformance in heavily lossy > > scenarios (otherwise we could regress in some extreme cases) > > - 0004 - convert execGrouping.c, e.g. used by hash aggregates > > > > > > While not quite perfect yet, I do think this is at a state where input > > is needed. I don't want to go a lot deeper into this rabbit hole, > > before we have some agreement that this is a sensible course of action. > > > > So, is it the right time to do some benchmarking? That's one thing that's required, yes. The other is whether we can live with the uglyness of implementing templates via macros. I do think we can, but... 
Greetings, Andres Freund
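For readers following along, the robin-hood collision handling Andres describes can be modelled in a few lines of Python. This is a toy illustration of the idea only, not the simplehash.h implementation:

```python
# Toy robin-hood hash table: on collision, the incoming entry evicts any
# resident entry that sits closer to its own ideal slot ("richer"),
# which bounds the variance of probe distances and limits clustering.
def rh_insert(table, key, hash_fn=hash):
    size = len(table)
    idx = hash_fn(key) % size
    dist = 0                              # distance from key's ideal slot
    while True:
        if table[idx] is None:
            table[idx] = (key, dist)
            return
        resident_key, resident_dist = table[idx]
        if resident_dist < dist:          # resident is "richer": swap
            table[idx] = (key, dist)
            key, dist = resident_key, resident_dist
        idx = (idx + 1) % size            # linear probing keeps accesses
        dist += 1                         # sequential and cache-friendly

table = [None] * 8
for k in ["a", "b", "c", "d"]:
    rh_insert(table, k)
print(sorted(e[0] for e in table if e is not None))   # ['a', 'b', 'c', 'd']
```

Because displaced entries keep probing linearly, memory access stays sequential, which is the cache-behaviour argument made upthread against non-linear schemes.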
Re: [HACKERS] Macro customizable hashtable / bitmapscan & aggregation perf
On Sat, Oct 1, 2016 at 1:44 AM, Andres Freund wrote: > > Unfortunately I'm running out battery right now, so I don't want to > re-run the benchmarks posted upthread (loading takes a while). But the > last time I ran them all the results after the patches were better than > before. I have a machine sitting idle now too if you have specific ideas of what to benchmark. -- greg
Re: [HACKERS] Hash Indexes
On Fri, Sep 30, 2016 at 2:11 AM, Robert Haas wrote: > For one thing, we can stop shipping a totally broken feature in release after > release For what it's worth, I'm for any patch that can accomplish that. In retrospect I think we should have done the hash-over-btree thing ten years ago, but we didn't, and if Amit's patch makes hash indexes recoverable today then go for it. -- greg
Re: [HACKERS] Macro customizable hashtable / bitmapscan & aggregation perf
On 10/01/2016 02:44 AM, Andres Freund wrote: Hi, On 2016-07-26 17:43:33 -0700, Andres Freund wrote: In the attached patch I've attached simplehash.h, which can be customized by a bunch of macros, before being inlined. There's also a patch using this for tidbitmap.c and nodeAgg/nodeSubplan/... via execGrouping.c. Attached is a significantly improved version of this series. The main changes are: - The hash table now uses robin-hood style hash collision handling. See the commit message of 0002 (or simplehash.h) for an introduction. That significantly reduces the impact of "clustering" due to linear addressing. Interesting. Have you looked at cuckoo hashing? That seems to be the go-to hashing in several database-related papers I've read recently, so I guess there's a reason for that (IIRC it has constant worst case for lookups, constant amortized time for inserts/deletes). Not sure how it compares to robin-hood hashing regarding cache-friendliness though. - Significant comment and correctness fixes, both in simplehash.h - itself, and 0003/0004. - a lot of other random performance improvements for the hashing code. Unfortunately I'm running out battery right now, so I don't want to re-run the benchmarks posted upthread (loading takes a while). But the last time I ran them all the results after the patches were better than before. Well, I have rather bad experience with running benchmarks on laptops anyway - a lot of noise due to power management, etc. What about running a bigger benchmark - say, TPC-DS - and evaluating the impact? I think a crucial part of the benchmarking will be identifying and measuring corner cases, e.g. a lot of rows with the same key, etc. Although that probably is not a major issue for the two places switched to the new implementation (e.g. execGrouping merges the duplicates to a single group, by definition). This patch series currently consists out of four patches: - 0001 - add unlikely/likely() macros. 
The use here is not entirely mandatory, but they do provide a quite consistent improvement. There are other threads where they proved to be beneficial, so I see little reason not to introduce them. - 0002 - the customizable hashtable itself. It now contains documentation. - 0003 - convert tidbitmap.c to use the new hashmap. This includes a fix for the issue mentioned in [1], to improve peformance in heavily lossy scenarios (otherwise we could regress in some extreme cases) - 0004 - convert execGrouping.c, e.g. used by hash aggregates While not quite perfect yet, I do think this is at a state where input is needed. I don't want to go a lot deeper into this rabbit hole, before we have some agreement that this is a sensible course of action. So, is it the right time to do some benchmarking? I think there's several possible additional users of simplehash.h, e.g. catcache.c - which has an open coded, not particularly fast hash table and frequently shows up in profiles - but I think the above two conversions are plenty to start with. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
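As a companion to the cuckoo-hashing question above: the scheme gives every key exactly two candidate slots, one per hash function, which is what yields the constant worst-case lookup, but it also means each probe is a potentially uncached random access. A toy Python sketch, with deliberately simple illustrative hash functions (not any real implementation):

```python
# Toy cuckoo hash table: each key has exactly two candidate slots, one
# per hash function, so a lookup probes at most two locations (constant
# worst case), but those two probes are scattered, hence poorly cached.
def cuckoo_insert(t1, t2, key, h1, h2, max_kicks=32):
    for _ in range(max_kicks):
        i = h1(key) % len(t1)
        if t1[i] is None:
            t1[i] = key
            return True
        t1[i], key = key, t1[i]           # evict resident of table 1
        j = h2(key) % len(t2)
        if t2[j] is None:
            t2[j] = key
            return True
        t2[j], key = key, t2[j]           # evict resident of table 2
    return False                          # a real implementation rehashes

def cuckoo_lookup(t1, t2, key, h1, h2):
    return t1[h1(key) % len(t1)] == key or t2[h2(key) % len(t2)] == key

h1 = lambda k: 7 * k + 3                  # deliberately trivial hashes
h2 = lambda k: 11 * k + 5
t1, t2 = [None] * 8, [None] * 8
for k in range(4):
    cuckoo_insert(t1, t2, k, h1, h2)
print(all(cuckoo_lookup(t1, t2, k, h1, h2) for k in range(4)))   # True
```

The two-table lookup is exactly the pattern Andres's reply criticises: two independent hash positions means two likely cache misses per probe, versus one miss followed by sequential scanning under linear probing.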
Re: [HACKERS] [GENERAL] pg_upgrade from 9.5 to 9.6 fails with "invalid argument"
Andrew Dunstan writes: > On 09/30/2016 12:24 PM, Tom Lane wrote: >> Seems to be some additional prep work needed somewhere ... >> No upgrade_install_root at >> /home/bfarm/bf-scripts/build-farm-4.17/PGBuild/Modules/TestUpgradeXversion.pm >> line 51. > Oh sorry, you also need an entry for that in the config file. Mine has: > upgrade_install_root => '/home/bf/bfr/root/upgrade', Yesterday's runs of prairiedog had this turned on. I'm not sure how to interpret the output (because there isn't any, ahem), but it does not appear that the test detected anything wrong. I was able to demonstrate the problem, and that my patch of yesterday fixed it, by building a table in 9.5 that was all-visible, deleting a few widely scattered tuples, and then pg_upgrading. It wasn't that easy to exhibit a query failure, though --- my first attempt failed because I'd done an index-only scan in 9.5 before upgrading, and that had the side effect of marking the index tuples dead, making the same query in HEAD produce correct answers despite having invalid vismap state. If we wanted a regression test, it seems like we would have to build a custom test specifically designed to detect this type of problem, and I'm not sure it's worth it. regards, tom lane
Re: [HACKERS] Show hash / bitmap sizes in EXPLAIN ANALYZE?
On 10/01/2016 01:37 AM, Andres Freund wrote: Hi, At the moment in-memory sort and hash nodes show their memory usage in explain:
│ -> Sort (cost=59.83..62.33 rows=1000 width=4) (actual time=0.512..0.632 rows=1000 loops=1)│
│ Sort Key: a.a │
│ Sort Method: quicksort Memory: 71kB │
│ -> Function Scan on generate_series a (cost=0.00..10.00 rows=1000 width=4) (actual time=0.165..0.305 rows=1000 loops=1) │
and
│ -> Hash (cost=10.00..10.00 rows=1000 width=4) (actual time=0.581..0.581 rows=1000 loops=1)│
│ Buckets: 1024 Batches: 1 Memory Usage: 44kB │
I think we should show something similar for bitmap scans, and for some execGrouping.c users (at least hash aggregates, subplans and setop seem good candidates too). +1 to improve this. For both categories it's useful to see how close to work_mem a scan ended up being (to understand how high to set it, and how much the data can grow till work_mem is exceeded), and for execGrouping.c users it's also very interesting to see the actual memory usage because the limit is only a very soft one. Does anybody see a reason not to add that? Well, the obvious problem with execGrouping.c is that we don't have information about memory usage - we don't know how large the aggregate state is. It's trivial to compute it for aggregates that use fixed-length data types, but for aggregates that use varlena/internal state that's not going to work. This is actually the same problem Jeff Davis ran into when trying to implement memory-bounded HashAgg ~2 years ago, which also needs this information. Back then there was a lot of discussion about whether the ~1% penalty measured is an acceptable price for the accounting, which kinda killed the whole patch. I plan to revisit that hashagg patch, or rather a new patch with the same goal - now that we have serial/deserial functions for aggregates, we should be able to implement a much nicer spill-to-disk method. But that'll need the memory accounting, so if you want to look into it, you're welcome.
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Hash Indexes
Andres Freund : On 2016-09-30 17:39:04 +0100, Peter Geoghegan wrote: On Fri, Sep 30, 2016 at 4:46 PM, Robert Haas wrote: > I would just be very disappointed if, after the amount of work that > Amit and others have put into this project, the code gets rejected > because somebody thinks a different project would have been more worth > doing. I wouldn't presume to tell anyone else how to spend their time, and am not concerned about this making the hash index code any less useful from the user's perspective. Me neither. I'm concerned that this is a heck of a lot of work, and I don't think we've reached the end of it by a good bit. I think it would have been, and probably still is, a more efficient use of time to go for the hash-via-btree method, and rip out the current hash indexes. But that's just me. I find it more than a bit odd to be accused of trying to waste others' time by saying this, and that this is too late because time has already been invested. Especially the latter never has been a standard in the community, and while excruciatingly painful when one is the person(s) having invested the time, it probably shouldn't be. > The fact that we have hash indexes already and cannot remove them > because too much other code depends on hash opclasses is also, in my > opinion, a sufficiently good reason to pursue improving them. I think that Andres was suggesting that hash index opclasses would be usable with hash-over-btree, so you might still not end up with the wart of having hash opclasses without hash indexes (an idea that has been proposed and rejected at least once before now). Andres? Yes, that was what I was pretty much thinking. I was kind of guessing that this might be easiest implemented as a separate AM ("hash2" ;)) that's just a layer on top of nbtree. Greetings, Andres Freund Hi, There have been benchmarks posted over the years where even the non-WAL-logged hash outperformed the btree variant. You cannot argue against O(1) algorithm behavior.
We need to have a usable hash index so that others can help improve it. My 2 cents. Regards, Ken
Re: [HACKERS] [PATCH] parallel & isolated makefile for plpython
Pavel Raiskup writes: > Hi, we observed issues with parallel make during RPM build in plpython, > seems like the attached patch 0002 should help. Feel free to reject 0001, > but comment like that would save some time to me as a "newcomer" into that > Makefile. Hi Pavel! > The submake-generated-headers can't be used as 'all' prerequisite > at the same level with with 'all-lib', because > 'submake-generated-headers' is actually prerequisite for > 'all-libs' in this makefile. So use submake-generated-headers as > $(OBJS) prerequisite, as several object files depend on it. D'oh. This is my fault, IIRC. Will fix. > Move the 'all' target before include statement, according to > documentation in Makefile.shlib. Hm, actually that's unnecessary because Makefile.global already established 'all' as the default target. I'm inclined to think that the comment in Makefile.shlib is wrong and should be removed or at least rewritten, because I doubt it would work at all to include Makefile.shlib before Makefile.global; certainly that's not the intended usage. > ! # gram.y=>gram.c. Run `info make -n "Empty Recipes"` for more info. I'm okay with pointing people at that part of the gmake docs, but don't care for insisting on exactly one way of accessing those docs. How about something like "... See "Empty Recipes" in the GNU Make docs for more info."? [ checks around... ] Wait, actually there's a bigger problem: this advice is apparently gmake-version-specific. I don't see any such section title in RHEL6's version of the gmake docs (make 3.81). I do find a section titled "Using Empty Commands", which looks like it might be the same thing, but a person looking for "Empty Recipes" is not going to find that. Is it worth the trouble to provide this pointer if we need a lot of weasel wording about how the section name is spelled? 
regards, tom lane
Re: [HACKERS] pgbench more operators & functions
Hello Stephen, bitwise: <<, >>, &, |, ^/#, ~ comparisons: =/==, <>/!=, <, <=, >, >= logical: and/&&, or/||, xor/^^, not, ! I'm not sure that we want to introduce operators '&&', '||' as logical 'and' and 'or' when those have specific meaning in PG which is different (array overlaps and concatenation, specifically). I can certainly see how it could be useful to be able to perform string concatenation in a \set command in pgbench, for example.. Also, are we sure that we want to have both '=' and '==' for comparison? The reason I added C operators is that Pg SQL has '!=' and a few others, so once some are there I do not see why others should not be there as well for consistency. Now I agree that having operators which behave differently from psql to pgbench is a bad idea. In the attached patched I only included pg operators, plus "xor" which I feel is missing and does not seem to harm. [...] I certainly understand how the if() function could be useful Indeed, some kind of "if" is needed, for instance to implement "tpc-b" correctly. , but I'm not entirely sure that the if(expression, true-result, false-result) is the best approach. In particular, I recall cases with other languages where that gets to be quite ugly when there are multiple levels of if/else using that approach. I think that pgbench is not a programming language, but an expression language, which means that you have to provide a result, so you can only have balanced if/then/else, which simplifies things a bit compared to "full" language. The SQL syntax for CASE is on the very heavy side and would be quite complicated to implement in pgbench, so I rejected that and selected the simplest possible function for the job. Maybe the "if" function could be named something fancier to avoid possible future clash on an eventual "if" keyword? Note that excel has an IF function. Maybe "conditional"... However, as I do not envision pgbench growing to a full language, I would be fine keeping "if" as it is. 
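The balanced if/then/else Fabien describes is a conditional expression: every branch must yield a value, unlike a statement-level if. A toy Python sketch of the proposed if(cond, then, else) semantics (all names here are illustrative, not pgbench code; note that as a plain function call both branches would be evaluated eagerly, which is one of the design points such a feature has to settle):

```python
def pg_if(cond, then_value, else_value):
    """Toy model of the proposed pgbench if(c, t, f) function: an
    expression, so it must always produce a value, and the then/else
    branches are necessarily balanced.  (As a plain function, both
    branches are evaluated before the choice is made.)"""
    return then_value if cond else else_value

# e.g. a hypothetical tpc-b-ish \set computation: derive a branch number,
# falling back to branch 0 for low teller ids.
tellers_per_branch = 10
teller = 57
branch = pg_if(teller >= tellers_per_branch, teller // tellers_per_branch, 0)
print(branch)   # 5
```

Nesting such calls stays well-defined but gets visually heavy quickly, which is the readability concern raised against function-style conditionals in other languages.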
The table should really have a row per operator, which is what we do in pretty much all of the core documentation. [...] Ok for one line per operator. The question which was brought up about having a line-continuation (eg: '\') character would be useful, :-) I submitted a patch for that, which got rejected and resulted in Tom making pgbench share psql's lexer. This is a good thing, although the continuation extension was lost in the process. Also this means that now such logic would probably be shared with psql, not necessarily a bad thing either, but clearly material for another patch. which really makes me wonder if we shouldn't try to come up with a way to create functions in a pgbench script that can later be used in a \set command. I do not think that a pgbench script is a good target to define functions, because the script is implicitly in a very large loop which is executing it over and over again. I do not think that it would make much sense to redefine functions on each transaction... So my opinion is that without a compelling use case, I would avoid such a feature, which would need some careful thinking on the design side, and the implementation of which is non-trivial. With such an approach, we could have proper control structures for conditionals (if/else), loops (while/for), etc, and not complicate the single-statement set of operators with those constructs. If you want control, you can already use a DO & PL/pgsql script, although it is interpreted server-side. Yet again, I'm not convinced that pgbench is material for such features, and it would take a compelling use case for me to add such a thing. Moreover, the currently simple internal data structures and executor would have to be profoundly reworked and extended to handle a full language and functions. Lastly, we should really add some regression tests to pgbench. We already have the start of a TAP test script which it looks like it wouldn't be too hard to add on to.
I agree that pgbench should be tested systematically. I think that this is not limited to these functions and operators but is a more general and desirable feature, thus really material for another patch. Attached version changes: - removes C operators not present in psql - documents operators one per line -- Fabien.

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 285608d..c29966f 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -821,9 +821,8 @@ pgbench options dbname
  The expression may contain integer constants such as 5432,
  double constants such as 3.14159,
  references to variables :variablename,
- unary operators (+, -) and binary operators
- (+, -, *, /,
- %) with their usual precedence and associativity,
+ operators
+ with their usual precedence and associativity,
  function calls, and parentheses.
@@ -909,6 +908,
Re: [HACKERS] PoC: Make it possible to disallow WHERE-less UPDATE and DELETE
On Sat, Oct 1, 2016 at 5:08 AM, Thomas Munro wrote: > I guess you need something involving query_tree_walker or some other > kind of recursive traversal if you want to find DELETE/UPDATE lurking > in there. > > One option would be to document it as working for top level DELETE or > UPDATE, and leave the recursive version as an improvement for a later > patch. That's the most interesting kind to catch because that's what > people are most likely to type directly into a command line. That would be a half-baked feature then, and the patch would end up being refactored anyway if we support more cases in the future, because those will need a tree walker (think CTAS, INSERT SELECT, or WITH queries that contain DMLs)... Personally, I think that it would be surprising if subqueries were not restricted. So I am -1 for the patch as-is; let's come up with the most generic approach. Having more regression tests would be a good idea as well. I am marking the patch as returned with feedback. This CF has, in principle, already ended. -- Michael
Re: [HACKERS] COPY command with RLS bug
On Sat, Oct 1, 2016 at 3:11 AM, Stephen Frost wrote: > Comments and testing welcome, of course, though it's looking pretty good > to me at this point and I'll likely commit it in another day or two > unless issues are found. +* nodes/value.h) that correspond to the column name "correspondS"? Except that, it looks good to me. -- Michael
Re: [HACKERS] On conflict update & hint bits
On Fri, Sep 30, 2016 at 5:33 PM, Konstantin Knizhnik wrote: > So the question is whether it is correct that ExecOnConflictUpdate tries to > access and update tuple without holding lock on the buffer? You're right -- this is a bug in Postgres. I'm travelling from Ireland to the USA this weekend, but will work on this early next week. I don't think it's a particularly tricky fix -- as you say, it is necessary to have at least a shared buffer lock to call HeapTupleSatisfiesVisibility(), and we quite simply fail to ensure that. We must have a shared buffer lock in the visibility-check path for ON CONFLICT DO UPDATE where isolation level > READ COMMITTED -- a buffer pin is not enough. It also looks like the DO NOTHING variant is similarly affected, even when the isolation level is READ COMMITTED. :-( Thanks for the detailed report. -- Peter Geoghegan
Re: [HACKERS] PoC: Make it possible to disallow WHERE-less UPDATE and DELETE
On Sat, Oct 1, 2016 at 5:08 AM, Thomas Munro wrote:
> Right.  These cases work because they show up as CMD_DELETE/CMD_UPDATE:
>
> postgres=# set require_where.delete = on;
> SET
> postgres=# with answer as (select 42) delete from foo;
> ERROR:  DELETE requires a WHERE clause when require_where.delete is set to on
> HINT:  To delete all rows, use "WHERE true" or similar.
> postgres=# prepare x as delete from foo;
> ERROR:  DELETE requires a WHERE clause when require_where.delete is set to on
> HINT:  To delete all rows, use "WHERE true" or similar.

Is this patch able to handle the case of DML queries using RETURNING in
COPY? Those are authorized since 9.6, and it has not been mentioned yet
on this thread. Going through the patch quickly, I guess that would not
work.
--
Michael
Re: [HACKERS] VACUUM's ancillary tasks
On Mon, Aug 29, 2016 at 1:26 PM, Vik Fearing wrote:
> The attached two patches scratch two itches I've been having for a
> while.  I'm attaching them together because the second depends on the
> first.
>
> Both deal with the fact that [auto]vacuum has taken on more roles than
> its original purpose.
>
> Patch One: autovacuum insert-heavy tables
>
> If you have a table that mostly receives INSERTs, it will never get
> vacuumed because there are no (or few) dead rows.  I have added an
> "inserts_since_vacuum" field to PgStat_StatTabEntry which works exactly
> the same way as "changes_since_analyze" does.
>
> The reason such a table needs to be vacuumed is currently twofold: the
> visibility map is not updated, slowing down index-only scans; and BRIN
> indexes are not maintained, rendering them basically useless.

I'm aware of those two problems, but not very familiar with the
details. I don't feel qualified to say whether insert counting is the
best approach at this point. I looked into it a little, however, and
had the following thoughts.

About BRIN indexes: I couldn't find an explanation of why BRIN indexes
don't automatically create new summary tuples when you insert a new
tuple into an unsummarised page range. Is it deferred until VACUUM time
in order to untangle some otherwise unresolvable interlocking or crash
safety problem, or could that one day be done? Assuming that it must be
deferred for some technical reason and there is no way around it, I
wonder if there is a more direct and accurate way to figure out when it
is necessary than counting inserts. Counting inserts seems slightly
bogus because you can't tell whether those inserts went into an
existing summarised block range, which is self-maintaining, or not. At
first glance it looks as if unsummarised ranges can only appear at the
end of the table; is that right?
If so, couldn't you detect the number of unsummarised BRIN blocks just
by comparing the highest summarised BRIN block with the current heap
size?

About visibility maps: how crazy would it be to estimate the number of
not-all-visible pages instead? It would be less work to count that,
since the counter would only increase when the *first* tuple is
inserted into a page that is currently all visible (i.e. when the bit
is cleared), not for every tuple inserted into any page like the
inserts_since_vacuum counter. Another difference is that
inserts_since_vacuum is reset even if vacuum finds that it *can't* set
the all-visible bit for a given page yet because of some concurrent
transaction. In that case the bit is still not set, but autovacuum has
no reason to be triggered again.
--
Thomas Munro
http://www.enterprisedb.com