Re: Speeding up GIST index creation for tsvectors
On Sat, 20 Mar 2021 at 02:19, John Naylor wrote:
> On Fri, Mar 19, 2021 at 8:57 AM Amit Khandekar wrote:
> > Regarding the alignment changes... I have removed the code that
> > handled the leading identically unaligned bytes, for lack of evidence
> > that the percentage of such cases is significant. Like I noted earlier,
> > for the tsearch data I used, identically unaligned cases were only 6%.
> > If I find scenarios where these cases can be significant after all, and
> > if we cannot do anything in the gist index code, then we might have to
> > bring back the unaligned byte handling. I didn't get a chance to dig
> > deeper into the gist index implementation to see why they are not
> > always 8-byte aligned.
>
> I find it stranger that something equivalent to char* is not randomly
> misaligned, but rather only seems to land on 4-byte boundaries.
>
> [thinks] I'm guessing it's because of VARHDRSZ, but I'm not positive.
>
> FWIW, I anticipate some push back from the community because of the fact
> that the optimization relies on statistical phenomena.

I dug into this issue for the tsvector type, and found that it's the way
the sign array elements are arranged that causes the pointers to be
misaligned:

Datum
gtsvector_picksplit(PG_FUNCTION_ARGS)
{
	...
	cache = (CACHESIGN *) palloc(sizeof(CACHESIGN) * (maxoff + 2));
	cache_sign = palloc(siglen * (maxoff + 2));

	for (j = 0; j < maxoff + 2; j++)
		cache[j].sign = &cache_sign[siglen * j];
	...
}

If siglen is not a multiple of 8 (say 700), cache[j].sign will in some
cases point to non-8-byte-aligned addresses, as you can see in the above
code snippet. Replacing siglen with MAXALIGN64(siglen) in the above
snippet gets rid of the misalignment. This change applied over the
0001-v3 patch gives an additional ~15% benefit. MAXALIGN64(siglen) will
use a bit more space, but for not-so-small siglens this looks worth
doing. Haven't yet checked types other than tsvector.

Will get back on your other review comments.
Meanwhile, I thought I could post the above update first.
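To illustrate the arithmetic behind the fix above: with siglen = 700, successive sign pointers alternate between 8-byte and 4-byte boundaries, and rounding the stride up with MAXALIGN64 restores alignment. Here is a small self-contained sketch; MAXALIGN64 is a local stand-in for PostgreSQL's macro, and sign_offset() is a hypothetical helper, not code from the patch:

```c
#include <stdint.h>
#include <stddef.h>

/* Stand-in for PostgreSQL's MAXALIGN64: round up to an 8-byte boundary */
#define MAXALIGN64(LEN) (((uint64_t) (LEN) + 7) & ~(uint64_t) 7)

/*
 * With siglen = 700, element j starts at byte offset siglen * j, so every
 * odd j lands on a 4-byte (not 8-byte) boundary.  Rounding the stride up
 * to a multiple of 8 keeps every element 8-byte aligned.
 */
static size_t
sign_offset(size_t siglen, size_t j, int use_maxalign)
{
	size_t		stride = use_maxalign ? MAXALIGN64(siglen) : siglen;

	return stride * j;
}
```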
Re: speed up verifying UTF-8
On Sat, 17 Jul 2021 at 04:48, John Naylor wrote:
> v17-0001 is the same as v14. 0002 is a stripped-down implementation of
> Amit's chunk idea for multibyte, and it's pretty good on x86. On Power8,
> not so much. 0003 and 0004 are shot-in-the-dark guesses to improve it on
> Power8, with some success, but they end up making x86 weirdly slow, so
> I'm afraid that could happen on other platforms as well.

Thanks for trying the chunk approach. I tested your v17 versions on
Arm64. For the Chinese characters, v17-0002 gave some improvement over
v14. But for all the other character sets, there was around 10%
degradation w.r.t. v14.

I thought maybe the hton64 call and memcpy() for each multibyte
character might be the culprit, so I tried iterating over all the
characters in the chunk within the same pg_utf8_verify_one() function by
left-shifting the bits. But that worsened the figures, so I gave up on
that idea.

Here are the numbers on Arm64:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1781 |  1095 |   628 |     944 |   1151

v14:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     852 |   484 |   144 |     584 |    971

v17-0001+2:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     731 |   520 |   152 |     645 |   1118

Haven't looked at your v18 patch set yet.
Re: speed up verifying UTF-8
On Tue, 13 Jul 2021 at 01:15, John Naylor wrote:
> > It seems like it would be easy to have pg_utf8_verify_one in my
> > proposed pg_utf8.h header and replace the body of pg_utf8_verifychar
> > with it.
>
> 0001: I went ahead and tried this for v15, and also attempted some
> clean-up:
>
> - Rename pg_utf8_verify_one to pg_utf8_verifychar_internal.
> - Have pg_utf8_verifychar_internal return -1 for invalid input to match
>   other functions in the file. We could also do this for check_ascii,
>   but it's not quite the same thing, because the string could still have
>   valid bytes in it, just not enough to advance the pointer by the
>   stride length.
> - Remove hard-coded numbers (not wedded to this).
> - Use a call to pg_utf8_verifychar in the slow path.
> - Reduce pg_utf8_verifychar to a thin wrapper around
>   pg_utf8_verifychar_internal.

- check_ascii() seems to be used only for 64-bit chunks. So why not
remove the len argument and the len <= sizeof(int64) checks inside the
function? We could rename it to check_ascii64() for clarity.

- I was thinking, why not have a pg_utf8_verify64() that processes
64-bit chunks (or a 32-bit version)? In check_ascii(), we anyway extract
a 64-bit chunk from the string. We can use the same chunk to extract the
required bits from a two-byte char or a four-byte char. This way we can
avoid extracting separate bytes like b1 = *s; b2 = s[1]; etc. More
importantly, we can avoid the separate continuation-char checks for each
individual byte. Additionally, we can try to simplify the subsequent
overlong and surrogate char checks. Something like this:

int
pg_utf8_verifychar_32(uint32 chunk)
{
	int			len, l;

	for (len = sizeof(chunk); len > 0; (len -= l), (chunk = chunk << (l * 8)))
	{
		/* Is 2-byte lead */
		if ((chunk & 0xE0000000) == 0xC0000000)
		{
			l = 2;
			/* ... */
		}
		/* Is 3-byte lead */
		else if ((chunk & 0xF0000000) == 0xE0000000)
		{
			l = 3;
			if (len < l)
				break;

			/* b2 and b3 should be continuation bytes */
			if ((chunk & 0x00C0C000) != 0x00808000)
				return sizeof(chunk) - len;

			switch (chunk & 0xFF200000)
			{
				/*
				 * Check 3-byte overlong: 1110.0000 100x.xxxx 10xx.xxxx,
				 * i.e. (b1 == 0xE0 && b2 < 0xA0). We already know b2 is
				 * of the form 10xx.xxxx since it's a continuation char.
				 * The additional condition b2 <= 0x9F means it is of the
				 * form 100x.xxxx, i.e. either 1000.xxxx or 1001.xxxx. So
				 * just verify that it is xx0x.xxxx.
				 */
				case 0xE0000000:
					return sizeof(chunk) - len;

				/*
				 * Check surrogate: 1110.1101 101x.xxxx 10xx.xxxx,
				 * i.e. (b1 == 0xED && b2 > 0x9F). Here, > 0x9F means
				 * either 1010.xxxx, 1011.xxxx, 1100.xxxx or 1110.xxxx.
				 * The last two are not possible because b2 is a
				 * continuation char, so it has to be one of the first
				 * two. So just verify that it is xx1x.xxxx.
				 */
				case 0xED200000:
					return sizeof(chunk) - len;

				default:
					;
			}
		}
		/* Is 4-byte lead */
		else if ((chunk & 0xF8000000) == 0xF0000000)
		{
			/* ... */
			l = 4;
		}
		else
			return sizeof(chunk) - len;
	}
	return sizeof(chunk) - len;
}
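The first suggestion above, a check_ascii64() that works on whole 64-bit chunks with no len argument, could look something like this minimal sketch. The name and the memcpy-based load are illustrative; the real check_ascii in the patch also tracks zero bytes and stride details that this sketch ignores:

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical check_ascii64(): returns 1 iff all 8 bytes of the chunk
 * are ASCII, i.e. the high bit is clear in every byte.  memcpy() gives an
 * alignment-safe load, as the string pointer need not be 8-byte aligned.
 */
static int
check_ascii64(const unsigned char *s)
{
	uint64_t	chunk;

	memcpy(&chunk, s, sizeof(chunk));
	return (chunk & UINT64_C(0x8080808080808080)) == 0;
}
```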
Relaxing the sub-transaction restriction in parallel query
Hi,

Currently we don't allow a sub-transaction to be spawned from inside a
parallel worker (or from a leader that is in parallel mode). This
imposes a restriction that pl/pgsql functions that use an exception
block can't be marked parallel safe, even when the exception block is
there only to catch trivial errors such as divide-by-zero. I tried to
look at the implications of removing this sub-transaction restriction,
and came up with the attached WIP patch after considering the points
below. I may have missed other points, or may have assumed something
wrong, so comments are welcome.

- TBLOCK_PARALLEL_INPROGRESS

Now that there can be an in-progress sub-transaction in a parallel
worker, the sub-transaction states need to be accounted for. Rather than
adding new transaction states such as TBLOCK_PARALLEL_SUBINPROGRESS, I
removed the existing TBLOCK_PARALLEL_INPROGRESS from the code. At a
couple of places is_parallel_worker is set if the state is
TBLOCK_PARALLEL_INPROGRESS. Instead, for now, I have used
(ParallelCurrentXids != NULL) to identify whether it's a worker in a
valid transaction state. Maybe we can improve on this.

In EndTransactionBlock(), a fatal error is thrown if it's a parallel
worker with a transaction block in progress. This seems to be a
cannot-happen case, so I have removed the check. I need to think further
about how we can retain it.

- IsInParallelMode()

On HEAD, the parallel worker cannot have any sub-transactions, so
CurrentTransactionState always points to the TopTransactionStateData.
And when ParallelWorkerMain() calls EnterParallelMode(), from that point
onwards IsInParallelMode() always returns true. But with the patch,
CurrentTransactionState can point to some nesting level below, and in
that TransactionState, parallelModeLevel would be 0, so
IsInParallelMode() will return false in a sub-transaction, unless some
other function happens to explicitly call EnterParallelMode().
One option for making IsInParallelMode() always return true for a worker
is to just check whether the worker is in a transaction
(ParallelCurrentXids != NULL), or else check only
TopTransactionData->parallelModeLevel. Still another option is for the
new TransactionState to inherit parallelModeLevel from its parent. I
chose this option. It avoids additional conditions in IsInParallelMode()
specifically for workers.

Does this inherit-parent-parallelModeLevel option affect the leader
code? The functions calling EnterParallelMode() are:
begin_parallel_vacuum, _bt_begin_parallel, ParallelWorkerMain,
CommitTransaction, and ExecutePlan(). After entering parallel mode, it
does not look like a subtransaction will be spawned at these places. If
at all it does, on HEAD, IsInParallelMode() will return false, which
does not sound right: for all the child transactions, this function
should return true. So w.r.t. this, inheriting the parent's
parallelModeLevel in fact looks better.

Operations that are not allowed to run in a worker would continue to be
disallowed in a worker sub-transaction as well, e.g. assigning a new
xid, heap_insert, etc. These places already use IsInParallelMode(),
which takes care of guarding against such operations.

Just for archival ... ExecutePlan() is called with
use_parallel_mode=true when a Gather plan was created, in which case it
enters parallel mode. From here, if the backend happens to start a new
subtransaction for some reason, it does sound right for parallel mode to
be true for this sub-transaction, although I am not sure whether there
can be such a case. In a worker, as far as I understand, ExecutePlan()
always gets called with use_parallel_mode=false, because there is no
Gather plan in the worker, so it does not enter parallel mode. But
because the worker is already in parallel mode, that does not matter.

- List of ParallelContexts (pcxt_list):

A ParallelContext is created only in backends.
So there are no implications of the pcxt_list w.r.t. parallel workers
spawning a subtransaction, because pcxt_list will always be empty in
workers.

- ParallelCurrentXids:

A parallel worker always maintains a global flat sorted list of xids
representing all the xids that are considered current (i.e. the ones
that are returned by TransactionIdIsCurrentTransactionId() in the
leader). So this global list should continue to work no matter what the
sub-transaction nest level is, since no new xids will be created in the
worker.

- Savepoints:

Haven't considered savepoints; the restriction is retained for them.

Thanks
-Amit Khandekar
Huawei Technologies

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 441445927e..c46e32ba37 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -160,7 +160,6 @@ typedef enum TBlockState
 	TBLOCK_BEGIN,				/* starting transaction block */
 	TBLOCK_INPROGRESS,			/* live transaction */
 	TBLOCK_IMPLICIT_INPROGRESS, /* live transaction after implicit BEGIN */
-	TBLOCK_PARALLEL_INPROGR
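The inherit-parent-parallelModeLevel option described above can be modelled in miniature like this. The struct and functions are simplified stand-ins for TransactionStateData and PushTransaction() in src/backend/access/transam/xact.c, not the real code:

```c
#include <stdlib.h>

/* Simplified stand-in for xact.c's TransactionStateData */
typedef struct TransactionStateData
{
	int			nestingLevel;
	int			parallelModeLevel;
	struct TransactionStateData *parent;
} TransactionStateData;

/*
 * On subtransaction start, copy the parent's parallelModeLevel so that
 * IsInParallelMode() stays true inside a worker's subtransaction.
 */
static TransactionStateData *
push_transaction(TransactionStateData *parent)
{
	TransactionStateData *s = calloc(1, sizeof(*s));

	s->parent = parent;
	s->nestingLevel = parent->nestingLevel + 1;
	s->parallelModeLevel = parent->parallelModeLevel;	/* the proposed change */
	return s;
}

static int
is_in_parallel_mode(const TransactionStateData *cur)
{
	return cur->parallelModeLevel != 0;
}
```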
Re: Result Cache node shows per-worker info even for workers not launched
On Wed, 28 Apr 2021 at 13:54, David Rowley wrote:
> So, given that I removed the parallel test in partition_prune.sql, and
> don't have any EXPLAIN ANALYZE output for parallel tests in
> resultcache.sql, it should be safe enough to put that cache_misses ==
> 0 test back into explain.c
>
> I've attached a patch to do this. The explain.c part is pretty similar
> to your patch; I just took my original code and comment.

Sounds good. And thanks for the cleanup patch, and the brief history.
The patch looks OK to me.
Re: Result Cache node shows per-worker info even for workers not launched
On Wed, 28 Apr 2021 at 16:14, David Rowley wrote:
> However, I did add 1 test that sets work_mem down to 64kB to ensure
> the eviction code does get some exercise. You'll notice that I pass
> "true" to explain_resultcache() to hide the hits and misses there. We
> can't test the exact number of hits/misses/evictions there, but we can
> at least tell apart zero and non-zero by how I coded
> explain_resultcache() to replace the value with Zero or N depending on
> whether the number was zero or above zero.

Thanks for the explanation. I did realize, after replying to Bharath
upthread, that I was wrong in assuming that the cache misses and cache
hits are always stable for non-parallel scans.
Re: Result Cache node shows per-worker info even for workers not launched
On Wed, 28 Apr 2021 at 15:08, Bharath Rupireddy wrote:
> On Wed, Apr 28, 2021 at 1:54 PM David Rowley wrote:
> > I plan to push this in the next 24 hours or so.
>
> I happen to see explain_resultcache in resultcache.sql; it seems like
> two of the tests still have numbers for cache hits and misses - Hits:
> 980 Misses: 20. Won't these make the tests unstable? Will these numbers
> be the same across machines? Or is it that no buildfarm member has
> caught these? The comment below says that the hits and misses are not
> the same across machines:
>
> -- Ensure we get some evictions. We're unable to validate the hits and
> -- misses here as the number of entries that fit in the cache at once
> -- will vary between different machines.
>
> Should we remove the hide_hitmiss parameter in explain_resultcache and
> always print N for non-zero and Zero for 0 hits and misses? This
> clearly shows whether we have zero or non-zero hits or misses.
>
> Am I missing something?

I believe the assumption here is that with no workers involved, the
query is guaranteed to produce exactly the same cache misses and hits
everywhere.
Result Cache node shows per-worker info even for workers not launched
Hi,

If planned parallel workers do not get launched, the Result Cache plan
node shows all-zero stats for each of those workers:

tpch=# set max_parallel_workers TO 0;
SET
tpch=# explain analyze select avg(l_discount) from orders, lineitem
       where l_orderkey = o_orderkey and o_orderdate < date '1995-03-09'
       and l_shipdate > date '1995-03-09';
                                    QUERY PLAN
-----------------------------------------------------------------------------------
 Finalize Aggregate  (cost=315012.87..315012.88 rows=1 width=32) (actual time=27533.482..27533.598 rows=1 loops=1)
   ->  Gather  (cost=315012.44..315012.85 rows=4 width=32) (actual time=27533.471..27533.587 rows=1 loops=1)
         Workers Planned: 4
         Workers Launched: 0
         ->  Partial Aggregate  (cost=314012.44..314012.45 rows=1 width=32) (actual time=27533.177..27533.178 rows=1 loops=1)
               ->  Nested Loop  (cost=0.44..309046.68 rows=1986303 width=4) (actual time=0.400..27390.835 rows=748912 loops=1)
                     ->  Parallel Seq Scan on lineitem  (cost=0.00..154513.66 rows=4120499 width=12) (actual time=0.044..7910.399 rows=16243662 loops=1)
                           Filter: (l_shipdate > '1995-03-09'::date)
                           Rows Removed by Filter: 13756133
                     ->  Result Cache  (cost=0.44..0.53 rows=1 width=4) (actual time=0.001..0.001 rows=0 loops=16243662)
                           Cache Key: lineitem.l_orderkey
                           Hits: 12085749  Misses: 4157913  Evictions: 3256424  Overflows: 0  Memory Usage: 65537kB
                           Worker 0:  Hits: 0  Misses: 0  Evictions: 0  Overflows: 0  Memory Usage: 0kB
                           Worker 1:  Hits: 0  Misses: 0  Evictions: 0  Overflows: 0  Memory Usage: 0kB
                           Worker 2:  Hits: 0  Misses: 0  Evictions: 0  Overflows: 0  Memory Usage: 0kB
                           Worker 3:  Hits: 0  Misses: 0  Evictions: 0  Overflows: 0  Memory Usage: 0kB
                           ->  Index Scan using orders_pkey on orders  (cost=0.43..0.52 rows=1 width=4) (actual time=0.002..0.002 rows=0 loops=4157913)
                                 Index Cond: (o_orderkey = lineitem.l_orderkey)
                                 Filter: (o_orderdate < '1995-03-09'::date)
                                 Rows Removed by Filter: 1
 Planning Time: 0.211 ms
 Execution Time: 27553.477 ms
(22 rows)

By looking at the other cases like show_sort_info() or the printing of
per-worker JIT info, I could see that the general policy is that we skip
printing info for workers that are not launched. Attached is a patch to
do the same for Result Cache.

I was earlier thinking about using (instrument[n].nloops == 0) to check
for not-launched workers. But we are already using
"if (rcstate->stats.cache_misses == 0)" for the leader process, so for
consistency I used the same method for workers.

--
Thanks,
-Amit Khandekar
Huawei Technologies

skip_notlaunched_workers.patch
Description: Binary data
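The worker-skipping check described above can be sketched in miniature like this. The struct and function names are illustrative stand-ins for the Result Cache instrumentation handled in explain.c, not the real PostgreSQL definitions:

```c
#include <stdio.h>
#include <stdint.h>

/* Illustrative stand-in for the per-worker Result Cache instrumentation */
typedef struct ResultCacheInstrumentation
{
	uint64_t	cache_hits;
	uint64_t	cache_misses;
	uint64_t	cache_evictions;
} ResultCacheInstrumentation;

/*
 * Print one stats line per worker, but skip workers whose cache_misses
 * count is zero: as in the patch description, a worker that never ran the
 * node cannot have populated the cache, so its stats are all-zero noise.
 * Returns the number of worker lines actually printed.
 */
static int
show_worker_stats(const ResultCacheInstrumentation *si, int num_workers)
{
	int			shown = 0;

	for (int n = 0; n < num_workers; n++)
	{
		if (si[n].cache_misses == 0)
			continue;			/* worker not launched: skip its line */
		printf("  Worker %d:  Hits: %llu  Misses: %llu  Evictions: %llu\n",
			   n,
			   (unsigned long long) si[n].cache_hits,
			   (unsigned long long) si[n].cache_misses,
			   (unsigned long long) si[n].cache_evictions);
		shown++;
	}
	return shown;
}
```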
Re: Speeding up GIST index creation for tsvectors
On Thu, 11 Mar 2021 at 04:25, John Naylor wrote:
> Okay, so it's hard-coded in various functions in contrib modules. In
> that case, let's just keep the existing coding for those. In fact, the
> comments that got removed by your patch specifically say: /* Using the
> popcount functions here isn't likely to win */ ...which your testing
> confirmed. The new function should be used only for Gist and possibly
> Intarray, whose default is 124. That means we can't get rid of
> hemdistsign(), but that's fine.

The comment is there for all types. Since I get better performance on
all the types, I have kept the pg_xorcount() call for all these contrib
modules. I understand that since the default siglen is small for some
types, we won't get a benefit there; but I think we should call
pg_xorcount() for the benefit of the non-default siglen case.

Have replaced hemdistsign() with pg_xorcount() everywhere; but with
that, the call looks a bit clumsy because of having to cast the
parameters to const unsigned char *. I noticed that the cast to
"unsigned char *" is required only when we use the value in the
pg_number_of_ones array; elsewhere we can safely use 'a' and 'b' as
"char *". So I changed the pg_xorcount() parameters from unsigned
char * to char *.

> I think the way to go is a simplified version of your 0001 (not 0002),
> with only a single function, for gist and intarray only, and a style
> that better matches the surrounding code. If you look at my xor
> functions in the attached text file, you'll get an idea of what it
> should look like. Note that it got the above performance without ever
> trying to massage the pointer alignment. I'm a bit uncomfortable with
> the fact that we can't rely on alignment, but maybe there's a simple
> fix somewhere in the gist code.

In the attached 0001-v3 patch, I have updated the code to match the
surrounding code; specifically the usage of *a++ rather than a[i].

Regarding the alignment changes...
I have removed the code that handled the leading identically unaligned
bytes, for lack of evidence that the percentage of such cases is
significant. Like I noted earlier, for the tsearch data I used,
identically unaligned cases were only 6%. If I find scenarios where
these cases can be significant after all, and if we cannot do anything
in the gist index code, then we might have to bring back the unaligned
byte handling. I didn't get a chance to dig deeper into the gist index
implementation to see why they are not always 8-byte aligned.

I have kept the 0002 patch as-is. Due to the significant *additional*
speedup over and above the 0001 improvement, I think the code
re-arrangement done there is worth it for non-x86 platforms.

-Amit Khandekar

From c6ae575dd483b85cec6748c2e014d0f32565b4eb Mon Sep 17 00:00:00 2001
From: Amit Khandekar
Date: Fri, 19 Mar 2021 20:22:44 +0800
Subject: [PATCH 1/2] Speed up xor'ing of two gist index signatures for
 tsvectors

In hemdistsign(), rather than using the xor operator on char values, use
it in 64-bit chunks. And since the chunks are 64-bit, use popcount64()
on each of the chunks. I have checked that the two bitvector pointer
arguments of hemdistsign() are not always 64-bit aligned, so do the
64-bit chunks only if both pointers are 8-byte aligned.

This results in a speed-up in Gist index creation for tsvectors. With
the default siglen (124), the speed-up is 12-20%. With siglen=700, it is
30-50%. So with longer signature lengths, we get a higher percentage
speed-up. Similar results are seen for other types using a gist index,
such as intarray, hstore, and ltree, that are available in contrib.

With smaller siglens such as 10 or 20, there is a reduction in speed of
1-7% if we use this optimization. It's probably because of the extra
function call for pg_xorcount(), and might also be due to the extra
logic in pg_xorcount(). So for siglen less than 32, keep the existing
method using byte-by-byte traversal.
---
 contrib/hstore/hstore_gist.c      | 17 ++--
 contrib/intarray/_intbig_gist.c   | 18 +---
 contrib/ltree/_ltree_gist.c       | 19 ++---
 src/backend/utils/adt/tsgistidx.c | 26 +--
 src/include/port/pg_bitutils.h    | 17 ----
 src/port/pg_bitutils.c            | 34 +++
 6 files changed, 61 insertions(+), 70 deletions(-)

diff --git a/contrib/hstore/hstore_gist.c b/contrib/hstore/hstore_gist.c
index 102c9cea72..4970a0a2f0 100644
--- a/contrib/hstore/hstore_gist.c
+++ b/contrib/hstore/hstore_gist.c
@@ -8,6 +8,7 @@
 #include "access/stratnum.h"
 #include "catalog/pg_type.h"
 #include "hstore.h"
+#include "port/pg_bitutils.h"
 #include "utils/pg_crc.h"
 
 /* gist_hstore_ops opclass options */
@@ -256,20 +257,6 @@ sizebitvec(BITVECP sign, int siglen)
 	return size;
 }
 
-static int
-hemdistsign(BITVECP a, BITVECP b, int siglen)
-{
-	int			i,
-				dist = 0;
-
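The chunked XOR-plus-popcount loop that the commit message above describes can be sketched as follows. This is an approximation, not the patch's code: it uses memcpy-based loads and __builtin_popcountll in place of the patch's alignment check and pg_popcount64, and a plain byte loop for the tail:

```c
#include <stdint.h>
#include <string.h>

/*
 * Count the 1-bits in (a XOR b) over 'bytes' bytes, processing 64-bit
 * chunks where possible.  memcpy() keeps the loads alignment-safe, since
 * the signature pointers are not guaranteed to be 8-byte aligned.
 */
static uint64_t
xorcount_sketch(const char *a, const char *b, int bytes)
{
	uint64_t	count = 0;

	while (bytes >= (int) sizeof(uint64_t))
	{
		uint64_t	ca,
					cb;

		memcpy(&ca, a, sizeof(ca));
		memcpy(&cb, b, sizeof(cb));
		count += __builtin_popcountll(ca ^ cb);
		a += sizeof(uint64_t);
		b += sizeof(uint64_t);
		bytes -= sizeof(uint64_t);
	}

	/* Remaining tail, byte by byte */
	while (bytes-- > 0)
		count += __builtin_popcount((unsigned char) (*a++ ^ *b++));

	return count;
}
```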
Re: [POC] verifying UTF-8 using SIMD instructions
On Tue, 9 Mar 2021 at 17:14, John Naylor wrote:
> On Tue, Mar 9, 2021 at 5:00 AM Amit Khandekar wrote:
> > Just a quick question before I move on to review the patch ... The
> > improvement looks like it is only meant for x86 platforms.
>
> Actually it's meant to be faster for all platforms, since the C
> fallback is quite a bit different from HEAD. I've found it to be faster
> on ppc64le. An earlier version of the patch was a loser on 32-bit Arm
> because of alignment issues, but if you could run the test script
> attached to [1] on 64-bit Arm, I'd be curious to see how it does on
> 0002, and whether 0003 and 0004 make things better or worse. If there
> is trouble building on non-x86 platforms, I'd want to fix that also.

On my Arm64 VM:

HEAD:
 mixed | ascii
-------+-------
  1091 |   628
(1 row)

PATCHED:
 mixed | ascii
-------+-------
   681 |   119

So the fallback function does show improvements on Arm64. I guess that
if we use the equivalent Arm NEON intrinsics, the "mixed" figures will
be close to the "ascii" figures, going by your figures on x86.

> > Can this be done in a portable way by arranging for
> > auto-vectorization? Something like commit 88709176236caf. This way it
> > would benefit other platforms as well.
>
> I'm fairly certain that the author of a compiler capable of doing that
> in this case would be eligible for some kind of AI prize. :-)

:)

I was not thinking about auto-vectorizing the code in
pg_validate_utf8_sse42(). Rather, I was considering auto-vectorization
inside the individual helper functions that you wrote, such as
_mm_setr_epi8(), shift_right(), bitwise_and(), prev1(), splat(),
saturating_sub(), etc. I myself am not sure whether it is feasible to
write code that auto-vectorizes all these function definitions.
saturating_sub() seems hard, but I could see the gcc docs mentioning
support for generating such instructions for a particular code loop. But
for the index lookup function it seems impossible to generate the needed
index lookup intrinsics.
We could have platform-specific function definitions for such
exceptional cases. I am considering this only because it would make the
same code work on other platforms like arm64 and ppc, and we won't have
to have platform-specific files. But I understand that this is easier
said than done. We would have to process the loop in
pg_validate_utf8_sse42() in 128-bit chunks and pass each chunk to the
individual functions, which could mean extra work and extra copies in
extracting the chunk data and passing it around, which may make things
drastically slow. You are passing around the chunks using the __m128i
type, so perhaps that means passing around just a reference to the SIMD
registers. I am not sure.

--
Thanks,
-Amit Khandekar
Huawei Technologies
Re: [POC] verifying UTF-8 using SIMD instructions
Hi,

Just a quick question before I move on to review the patch ... The
improvement looks like it is only meant for x86 platforms. Can this be
done in a portable way by arranging for auto-vectorization? Something
like commit 88709176236caf. This way it would benefit other platforms as
well.

I tried to compile the following code using -O3, and the assembly does
have vectorized instructions:

#include <stdio.h>

int main()
{
	int			i;
	char		s1[200] = "abcdewhruerhetr";
	char		s2[200] = "oweurietiureuhtrethre";
	char		s3[200] = {0};

	for (i = 0; i < sizeof(s1); i++)
		s3[i] = s1[i] ^ s2[i];

	printf("%s\n", s3);
}
Re: Speeding up GIST index creation for tsvectors
On Wed, 3 Mar 2021 at 23:32, John Naylor wrote:
> Your performance numbers look like this is a fruitful area to improve.
> I have not yet tested performance, but I will do so at a later date.

Thanks for reviewing the patch!

> I did some microbenchmarking of our popcount implementation, since I
> wasn't quite sure it's optimal, and indeed, there is room for
> improvement there [1]. I'd be curious to hear your thoughts on those
> findings. I think it'd be worth it to test a version of this patch
> using those idioms here as well, so at some point I plan to post
> something.

I am not yet clear about the implications of that work on this patch set
here, but I am still going over it, and will reply to that.

> Now for the patches:
>
> 0001:
>
> + /*
> +  * We can process 64-bit chunks only if both are mis-aligned by the
> +  * same number of bytes.
> +  */
> + if (b_aligned - b == a_aligned - a)
>
> The obvious question here is: how often are they identically
> misaligned? You don't indicate that your measurements differ in a
> bimodal fashion, so does that mean they happen to be always
> (mis)aligned?

I ran CREATE INDEX on tsvector columns using the tsearch.data that I had
attached upthread, with some instrumentation. Here are the proportions:

1. In 15% of the cases, only one among a and b was aligned. The other
   was offset from the 8-byte boundary by 4 bytes.
2. In 6% of the cases, both were offset by 4 bytes, i.e. identically
   misaligned.
3. In the rest of the cases, both were aligned.

With other types, and with different sets of data, I believe I could get
totally different proportions.

> Is that an accident of the gist coding and could change unexpectedly?
> And how hard would it be to allocate those values upstream so that the
> pointers are always aligned on 8-byte boundaries? (I imagine pretty
> hard, since there are multiple callers, and such tight coupling is not
> good style)

That I am not sure about; I haven't clearly understood the structure of
gist indexes yet.
I believe it would depend on the individual gist implementation, and we
can't assume anything about that?

> + /* For smaller lengths, do simple byte-by-byte traversal */
> + if (bytes <= 32)
>
> You noted upthread:
>
> > Since for the gist index creation of some of these types the default
> > value for siglen is small (8-20), I tested with small siglens. For
> > siglens <= 20, particularly for values that are not multiples of 8
> > (e.g. 10, 13, etc), I see a 1-7 % reduction in speed of index
> > creation. It's probably because of an extra function call for
> > pg_xorcount(); and also might be due to the extra logic in
> > pg_xorcount() which becomes prominent for shorter traversals. So for
> > siglen less than 32, I kept the existing method using byte-by-byte
> > traversal.
>
> I wonder if that can be addressed directly, while cleaning up the loops
> and alignment checks in pg_xorcount_long() a little. For example, have
> a look at pg_crc32c_armv8.c -- it might be worth testing a similar
> approach.

Yeah, we could put the bytes <= 32 condition inside pg_xorcount_long().
I avoided that so as not to hamper the <= 32 scenarios; details are
explained below, under why the inline pg_xorcount() calls a global
function.

> Also, pardon my ignorance, but where can I find the default siglen for
> various types?

Check SIGLEN_DEFAULT.

> +/* Count the number of 1-bits in the result of xor operation */
> +extern uint64 pg_xorcount_long(const unsigned char *a,
> +                               const unsigned char *b, int bytes);
> +static inline uint64 pg_xorcount(const unsigned char *a,
> +                                 const unsigned char *b, int bytes)
>
> I don't think it makes sense to have a static inline function call a
> global function.
The idea was: use the function call only for cases where we know that the function call overhead will be masked by the popcount() optimization. > > -static int > +static inline int > hemdistsign(BITVECP a, BITVECP b, int siglen) > > Not sure what the reason is for inlining all these callers. > Come to think of it, I don't see a reason to have hemdistsign() > at all anymore. All it does is call pg_xorcount(). I suspect that's > what Andrey Borodin meant when he said upthread: > > > > > Meanwhile there are at least 4 incarnation of hemdistsign() > > > > functions that are quite similar. I'd propose to refactor them > > > > somehow... I had something in mind when I finally decided to not remove hemdistsign(). Now I think you are right, we can remove hemdistsign() altogether. Let me check again. > 0002: > > I'm not really happy with this patch. I like the idea of keeping indirect > calls out of non-x86 platforms, but I believe it could be done more simply. I am open for
Re: Speeding up GIST index creation for tsvectors
I have added this one in the March commitfest. https://commitfest.postgresql.org/32/3023/
Re: Speeding up GIST index creation for tsvectors
On Tue, 15 Dec 2020 at 20:34, Amit Khandekar wrote:
> On Sun, 13 Dec 2020 at 9:28 PM, Andrey Borodin wrote:
> > +1
> > This will make all INSERTs and UPDATEs for tsvector's GiSTs faster.
>
> Oh, I didn't realize that this code is getting used in GIST index
> insertion and creation too. Will check there.

I ran some insert and update tests; they show only marginal improvement.
So it looks like the patch mainly improves index builds.

> > Meanwhile there are at least 4 incarnations of hemdistsign()
> > functions that are quite similar. I'd propose to refactor them
> > somehow...
>
> Yes, I hope we get the benefit there also. Before that, I thought I
> should post the first use-case to get some early comments. Thanks for
> your encouraging comments :)

The attached v2 version of the 0001 patch extends the hemdistsign()
changes to the other use cases like intarray, ltree and hstore. I see
the same index build improvement for all these types.

Since for the gist index creation of some of these types the default
value for siglen is small (8-20), I tested with small siglens. For
siglens <= 20, particularly for values that are not multiples of 8
(e.g. 10, 13, etc.), I see a 1-7% reduction in speed of index creation.
It's probably because of the extra function call for pg_xorcount(), and
might also be due to the extra logic in pg_xorcount(), which becomes
prominent for shorter traversals. So for siglen less than 32, I kept the
existing method using byte-by-byte traversal.

--
Thanks,
-Amit Khandekar
Huawei Technologies

From b8905b47f7e0af8d4558fdac73cc2283c263cbcc Mon Sep 17 00:00:00 2001
From: Amit Khandekar
Date: Thu, 10 Dec 2020 21:45:49 +0800
Subject: [PATCH 2/2] Avoid function pointer dereferencing for
 pg_popcount32/64() call.

With USE_POPCNT_ASM set (which is true only on x86), we decide with the
help of __get_cpuid() at runtime whether the CPU supports the popcnt
instruction, and hence we dynamically choose the corresponding function;
thus pg_popcount32/64 is a function pointer.
But for platforms with USE_POPCNT_ASM not set, we don't have to use
dynamic assignment of functions, yet we were still declaring
pg_popcount64 as a function pointer. So arrange for a direct function
call, so as to get rid of the function pointer dereference each time
pg_popcount32/64 is called.

To do this, define pg_popcount64 to another function name
(pg_popcount64_nonasm) rather than a function pointer, whenever
USE_POPCNT_ASM is not defined. And let pg_popcount64_nonasm() be a
static inline function, so that whenever pg_popcount64() is called,
__builtin_popcount() gets called directly. For platforms not supporting
__builtin_popcount(), continue using the slow version, as is the current
behaviour. pg_popcount64_slow() was actually a misnomer, because it was
calling __builtin_popcount(), which is not a slow function. Now with
this commit, pg_popcount64_slow() indeed has only slow code.

Tested this on ARM64: the gist index creation for tsvectors speeds up by
~6% for the default siglen (=124), and by 12% with siglen=700.
---
 src/include/port/pg_bitutils.h | 59 ++++++++++++++++++
 src/port/pg_bitutils.c         | 47 +++++++++++---
 2 files changed, 67 insertions(+), 39 deletions(-)

diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 174df28e66..b039f94ee5 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -23,6 +23,19 @@ extern const uint8 pg_rightmost_one_pos[256];
 extern const uint8 pg_number_of_ones[256];
 #endif
 
+/*
+ * On x86_64, we can use the hardware popcount instruction, but only if
+ * we can verify that the CPU supports it via the cpuid instruction.
+ *
+ * Otherwise, we fall back to __builtin_popcount if the compiler has
+ * that, or a hand-rolled implementation if not.
+ */ +#ifdef HAVE_X86_64_POPCNTQ +#if defined(HAVE__GET_CPUID) || defined(HAVE__CPUID) +#define USE_POPCNT_ASM 1 +#endif +#endif + /* * pg_leftmost_one_pos32 * Returns the position of the most significant set bit in "word", @@ -208,8 +221,54 @@ pg_ceil_log2_64(uint64 num) } /* Count the number of one-bits in a uint32 or uint64 */ + +/* + * With USE_POPCNT_ASM set (which is true only on x86), we decide at runtime + * whether the CPU supports popcnt instruction, and hence we dynamically choose + * the corresponding function; thus pg_popcount32/64 is a function pointer. But + * for platforms with USE_POPCNT_ASM not set, we don't have to use dynamic + * assignment of functions, so we arrange for direct function call so as to get + * rid of function pointer dereferencing each time pg_popcount32/64 is called. + */ +#ifdef USE_POPCNT_ASM extern int (*pg_popcount32) (uint32 word); extern int (*pg_popcount64) (uint64 word); +#else +#define pg_popcount32 pg_popcount32_nonasm +#define pg_popcount64 pg_popcount64_nonasm +#endif + +/* Slow versions of pg_popcount */ +#ifndef HAVE__BUILTIN_P
Re: Speeding up GIST index creation for tsvectors
On Sun, 13 Dec 2020 at 9:28 PM, Andrey Borodin wrote: > +1 > This will make all INSERTs and UPDATES for tsvector's GiSTs. Oh, I didn't realize that this code is getting used in GIST index insertion and creation too. Will check there. > Also I really like idea of taking advantage of hardware capabilities like > __builtin_* etc wherever possible. Yes. Also, __builtin_popcount() uses SIMD vectorization (on arm64 : "cnt v0.8b, v0.8b"), hence there's all the more reason to use it. Over and above that, I had thought that if we could auto-vectorize the byte-by-byte xor operation and the popcount() call using compiler optimizations, we would benefit from this, but I didn't see any more improvement. I hoped for the benefit because that would have allowed us to process in 128-bit or 256-bit chunks, since the vector registers are at least that long. Maybe gcc is not smart enough to translate __builtin_popcount() into 128/256-bit vectorized instructions. But for the XOR operator, it does translate to 128-bit vectorized instructions (on arm64 : "eor v2.16b, v2.16b, v18.16b") > Meanwhile there are at least 4 incarnation of hemdistsign() functions that > are quite similar. I'd propose to refactor them somehow... Yes, I hope we get the benefit there also. Before that, I thought I should post the first use-case to get some early comments. Thanks for your encouraging comments :)
Re: Speeding up GIST index creation for tsvectors
On Thu, 10 Dec 2020 at 20:43, Pavel Borisov wrote: > > Hi, Amit! > It's really cool to hear about another GiST improvement proposal. I'd like to > connect recently committed GiST ordered build discussion here [1] and further > improvement proposed [2] > > I've tested feature [1] and got 2.5-3 times speed improvement which is much > better I believe. Yeah, I am completely new to the GIST stuff, but I had taken a quick look at the sortsupport feature for GIST, and found it very interesting. I believe it's an additional option for making the gist index builds much faster. But then I thought that my small patch would still be worthwhile because for tsvector types the non-sort method for index build would continue to be used by users, and in general we can extend this small optimization for other gist types also. > There is an ongoing activity [2] to build support for different data types > for GiST. Maybe you will consider it interesting to join. > > BTW you may have heard about Gin and Rum [3] indexes which suit text search > much, much better (and faster) than GiST. The idea to process data in bigger > chunks is good. Still optimize index structure, minimizing disc pages access, > etc. seems better in many cases. Sure. Thanks for the pointers. -- Thanks, -Amit Khandekar Huawei Technologies
Re: Improving spin-lock implementation on ARM.
On Sat, 5 Dec 2020 at 02:55, Alexander Korotkov wrote: > > On Wed, Dec 2, 2020 at 6:58 AM Krunal Bauskar wrote: > > Let me know what do you think about this analysis and any specific > > direction that we should consider to help move forward. > > BTW, it would be also nice to benchmark my lwlock patch on the > Kunpeng. I'm very optimistic about this patch, but it wouldn't be > fair to completely throw it away. It still might be useful for > LSE-disabled builds. Coincidentally, I was also looking at some hotspot locations around LWLockAcquire() and LWLockAttempt() for read-only workloads on both arm64 and x86, and had narrowed them down to the place where pg_atomic_compare_exchange_u32() is called. So it's likely we are working on the same hotspot area. When I get a chance, I do plan to look at your patch while I try some optimizations myself. This is unrelated to the optimization of this mail thread, though, so it will need a different mail thread. -- Thanks, -Amit Khandekar Huawei Technologies
Re: Improving spin-lock implementation on ARM.
On Tue, 1 Dec 2020 at 15:33, Krunal Bauskar wrote: > What I meant was outline-atomics support was added in GCC-9.4 and was made > default in gcc-10. > LSE support is present for quite some time. FWIW, here is an earlier discussion on the same (also added the proposal author here) : https://www.postgresql.org/message-id/flat/099F69EE-51D3-4214-934A-1F28C0A1A7A7%40amazon.com
Re: Improving spin-lock implementation on ARM.
On Thu, 26 Nov 2020 at 10:55, Krunal Bauskar wrote:
> Hardware: ARM Kunpeng 920 BareMetal Server 2.6 GHz. 64 cores (56 cores for
> server and 8 for client) [2 numa nodes]
> Storage: 3.2 TB NVMe SSD
> OS: CentOS Linux release 7.6
> PGSQL: baseline = Release Tag 13.1
> Invocation suite:
> https://github.com/mysqlonarm/benchmark-suites/tree/master/pgsql-pbench (Uses
> pgbench)

Using the same hardware, attached are my improvement figures, which are pretty much in line with your figures. Except that I did not run for more than 400 clients. And I am getting some improvement even for select-only workloads, in the case of 200-400 clients. For the read-write load, I had seen that the s_lock() contention was caused when XLogFlush() uses the spinlock. But for the read-only case, I have not analyzed where the improvement occurred. The .png files in the attached tar have the graphs for head versus patch.

The GUCs that I changed :
work_mem = 64MB
shared_buffers = 128GB
maintenance_work_mem = 1GB
min_wal_size = 20GB
max_wal_size = 100GB
checkpoint_timeout = 60min
checkpoint_completion_target = 0.9
full_page_writes = on
synchronous_commit = on
effective_io_concurrency = 200
log_checkpoints = on

For backends, 64 CPUs were allotted (covering 2 NUMA nodes), and for pgbench clients a separate set of 28 CPUs was allotted on a different socket. The server was pre-warmed.
Re: Redundant tuple copy in tqueueReceiveSlot()
On Thu, 17 Sep 2020 at 08:55, Andres Freund wrote:
>
> Hi,
>
> On 2020-09-17 14:20:50 +1200, Thomas Munro wrote:
> > I wonder if there is a way we could make "Minimal Tuples but with the
> > length travelling separately (and perhaps chopped off?)" into a
> > first-class concept... It's also a shame to be schlepping a bunch of
> > padding bytes around.

Yeah, I think we can pass the "length" data separately, but since the receiver end already assumes that the received data is a minimal tuple, I thought why not skip passing this redundant component. But anyway, if you and Andres are suggesting that being able to skip the copy is important for virtual tuples as well, then I think the approach you suggested (supplying an allocated memory to the tuple API for conversion) would be one of the better options with us, if not the only good option. Maybe I will try looking into how shm_mq works to see if we can come up with a good solution.

> > There really is no justification for having MinimalTuples, as we have
> > them today at least, anymore. We used to rely on being able to construct
> > pointers to MinimalTuples that are mostly compatible with HeapTuple. But
> > I think there's none of those left since PG 12.

Ah ok.

> > I think it'd make a bit more sense to do some steps towards having a
> > more suitable "minimal" tuple representation, rather than doing this
> > local, pretty ugly, hacks. A good way would be to just starting to
> > remove the padding, unnecessary fields etc from MinimalTuple.

So there are two things we wish to do :

1. Prevent an extra tuple-forming step before sending minimal tuple data. Possibly devise a shm_mq API to get memory to write a tuple of a given length, and devise something like FormMinimalTupleDataInHere(memory_allocated_by_shm_mq) which will write minimal tuple data.

2.
Shrink the MinimalTupleData structure, because it no longer needs the current padding etc., and substitute this new MinimalTuple structure for the current one all over the code wherever it is currently used.

If we remove the unnecessary fields from the tuple data being sent to the Gather node, then we need to again form a MinimalTuple at the receiving end, which again adds an extra tuple-forming step. So I understand that's the reason why you are saying we should shrink the MinimalTupleData structure itself, in which case we will continue to use the received new MinimalTupleData as an already-formed tuple, as we do now.

Now, the above two things (1. and 2.) look independent to me. Suppose we first do 1., i.e. we come up with a good way to form an in-place MinimalTuple at the sender's end, without any change to the MinimalTupleData. Then when we do 2., i.e. shrink the MinimalTupleData, we won't require any change in the in-place-tuple-forming API we wrote in 1. Just the existing underlying function heap_form_minimal_tuple() or something similar might need to be changed. At least that's what I feel right now.

> I also think that it'd be good to look at a few of the other places that
> are heavily bottlenecked by MinimalTuple overhead before designing new
> API around this. IIRC it's e.g. very easy to see hash joins spending a
> lot of time doing MinimalTuple copies & conversions.

Yeah, makes sense. The above FormMinimalTupleDataInHere() should be able to be used for these other places as well. Will keep that in mind.

> > Thomas, I guess you had a different approach in mind when you said
> > about "returning either success or
> > hey-that-buffer's-too-small-I-need-N-bytes". But what looks clear to
>
> Yeah I tried some things like that, but I wasn't satisfied with any of
> them; basically the extra work involved in negotiating the size was a
> bit too high.

Hmm, ok. Let me see if there is any way around this.
>> On the other hand, because my interface was "please > > write a MinimalTuple here!", it had the option to *form* a > > MinimalTuple directly in place, whereas your approach can only avoid > > creating and destroying a temporary tuple when the source is a heap > > tuple. True. > > There's a lot of cases where the source is a virtual slot (since we'll > often project stuff below Gather). So it'd be quite advantageous to > avoid building an unnecessary HeapTuple in those cases. Yeah right.
Re: calling procedures is slow and consumes extra much memory against calling function
On Thu, 17 Sep 2020 at 11:07, Michael Paquier wrote: > > On Thu, Jul 16, 2020 at 09:08:09PM +0200, Pavel Stehule wrote: > > I am sending another patch that tries to allow CachedPlans for CALL > > statements. I think this patch is very accurate, but it is not nice, > > because it is smudging very precious reference counting for CachedPlans. > > Amit, you are registered as a reviewer of this patch for two months > now. Are you planning to look at it more? If you are not planning to > do so, that's fine, but it may be better to remove your name as > reviewer then. Thanks, Michael, for the reminder. I *had* actually planned to do some more review, but I think I might not get bandwidth for this one during this month, so I have removed my name. But I have kept my name as a reviewer for bitmaps and correlation: https://commitfest.postgresql.org/29/2310/ since I do plan to do some review on that one. Thanks, -Amit Khandekar Huawei Technologies
Redundant tuple copy in tqueueReceiveSlot()
Hi, I am starting a new thread that continues with the following point that was discussed in [1] On Fri, 17 Jul 2020 at 09:05, Thomas Munro wrote: > > On Sun, Jul 12, 2020 at 7:25 AM Soumyadeep Chakraborty > wrote: > > Do you mean that we should have an implementation for > > get_minimal_tuple() for the heap AM and have it return a pointer to the > > minimal tuple from the MINIMAL_TUPLE_OFFSET? And then a caller such as > > tqueueReceiveSlot() will ensure that the heap tuple from which it wants > > to extract the minimal tuple was allocated in the tuple queue in the > > first place? If we consider that the node directly below a gather is a > > SeqScan, then we could possibly, in ExecInitSeqScan() set-up the > > ss_ScanTupleSlot to point to memory in the shared tuple queue? > > Similarly, other ExecInit*() methods can do the same for other executor > > nodes that involve parallelism? Of course, things would be slightly > > different for > > the other use cases you mentioned (such as hash table population) > > What I mean is that where ExecHashTableInsert() and > tqueueReceiveSlot() do ExecFetchSlotMinimalTuple(), you usually get a > freshly allocated copy, and then you copy that again, and free it. > There may be something similar going on in tuplestore and sort code. > Perhaps we could have something like > ExecFetchSlotMinimalTupleInPlace(slot, output_buffer, > output_buffer_size) that returns a value that indicates either success > or hey-that-buffer's-too-small-I-need-N-bytes, or something like that. > That silly extra copy is something Andres pointed out to me in some > perf results involving TPCH hash joins, a couple of years ago. I went ahead and tried doing this. I chose an approach where we can return the pointer to the in-place minimal tuple data if it's a heap/buffer/minimal tuple slot. A new function ExecFetchSlotMinimalTupleData() returns in-place minimal tuple data. If it's neither heap, buffer or minimal tuple, it returns a copy as usual. 
The receiver should not assume the data is directly taken from MinimalTupleData, so it should set its t_len to the number of bytes returned. Patch attached (0001-Avoid-redundant-tuple-copy-while-sending-tuples-to-G.patch)

Thomas, I guess you had a different approach in mind when you talked about "returning either success or hey-that-buffer's-too-small-I-need-N-bytes". But what looks clear to me is that avoiding the copy shows a consistent improvement of 4 to 10% for simple parallel table scans. I tried my patch on both x86_64 and arm64, and observed this speedup on both.

create table tab as select generate_series(1, 2000) id, 'abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz' v;
select pg_prewarm('tab'::regclass);
explain analyze select * from tab where id % 2 = 0;

Times in milli-secs :
HEAD    : 1833.119 1816.522 1875.648 1872.153 1834.383
Patched : 1763.786 1721.160 1714.665 1719.738 1704.478

This was with the default 2 parallel workers. With 3 or 4 workers, for the above testcase I didn't see a noticeable difference. I think, if I double the number of rows, the difference will be noticeable. In any case, the gain would keep reducing with the number of workers, because the tuple copy also gets parallelized. In some scenarios, parallel_leader_participation=off causes the difference to amplify. Haven't had a chance to see if this helps any of the TPC-H queries.

Also attached is a patch guc_for_testing.patch that I used for testing the gain. This patch is only for testing. Without it, comparing the performance figures requires a server restart, and the figures anyway shift back and forth by 5-15 percent after each restart, which creates a lot of noise when comparing figures with and without the fix. Use the GUC enable_fix to enable/disable the fix.
[1] https://www.postgresql.org/message-id/CA%2BhUKGLrN2M18-hACEJbNoj2sn_WoUj9rkkBeoPK7SY427pAnA%40mail.gmail.com -- Thanks, -Amit Khandekar Huawei Technologies From 5d19626e35e50a5630e5f1a042f7ecee6acb7c70 Mon Sep 17 00:00:00 2001 From: Amit Khandekar Date: Wed, 9 Sep 2020 12:03:01 +0800 Subject: [PATCH 2/2] Add guc for ease of testing speedup. This is only for testing the performance gain. Otherwise, to compare the performance figures, it requires server restart, and the figures anyway shift back and forth by 5-15 percent after each restart, which creates lot of noise when comparing figures with and without fix. With this, we can easily see at least 4-10% difference in execution times by setting/unsetting the GUC enable_fix. --- src/backend/executor/tqueue.c | 12 src/backend/utils/misc/guc.c | 12 2 files changed, 24 insertions(+) diff --git a/src/backend/executor/tqueue.c b/src/backend/executor/tqueue.c index c70ee0f39a..7dd60b5c7e 100644 --- a/src/backend/executor/tqueue.c +++ b/src/backend/executor/tqueue.c @@
Re: Auto-vectorization speeds up multiplication of large-precision numerics
On Tue, 8 Sep 2020 at 02:19, Tom Lane wrote: > > I wrote: > > I experimented with a few different ideas such as adding restrict > > decoration to the pointers, and eventually found that what works > > is to write the loop termination condition as "i2 < limit" > > rather than "i2 <= limit". It took me a long time to think of > > trying that, because it seemed ridiculously stupid. But it works. Ah ok. I checked the "Auto-Vectorization in LLVM" link that you shared. All the examples use "< n" or "> n". None of them use "<= n". Looks like a hidden restriction. > > I've done more testing and confirmed that both gcc and clang can > vectorize the improved loop on aarch64 as well as x86_64. (clang's > results can be confusing because -ftree-vectorize doesn't seem to > have any effect: its vectorizer is on by default. But if you use > -fno-vectorize it'll go back to the old, slower code.) > > The only buildfarm effect I've noticed is that locust and > prairiedog, which are using nearly the same ancient gcc version, > complain > > c1: warning: -ftree-vectorize enables strict aliasing. -fno-strict-aliasing > is ignored when Auto Vectorization is used. > > which is expected (they say the same for checksum.c), but then > there are a bunch of > > warning: dereferencing type-punned pointer will break strict-aliasing rules > > which seems worrisome. (This sort of thing is the reason I'm > hesitant to apply higher optimization levels across the board.) > Both animals pass the regression tests anyway, but if any other > compilers treat -ftree-vectorize as an excuse to apply stricter > optimization assumptions, we could be in for trouble. > > I looked closer and saw that all of those warnings are about > init_var(), and this change makes them go away: > > -#define init_var(v)MemSetAligned(v, 0, sizeof(NumericVar)) > +#define init_var(v)memset(v, 0, sizeof(NumericVar)) > > I'm a little inclined to commit that as future-proofing. 
> It's essentially reversing out a micro-optimization I made in d72f6c750.
> I doubt I had hard evidence that it made any noticeable difference;
> and even if it did back then, modern compilers probably prefer the
> memset approach.

Thanks. I must admit it did not occur to me that I could very well have installed clang on my Linux machine and tried compiling this file, or tested with some older gcc versions. I think I was using gcc 8. Do you know which gcc compiler version gave these warnings?

-- Thanks, -Amit Khandekar Huawei Technologies
Re: Auto-vectorization speeds up multiplication of large-precision numerics
On Mon, 7 Sep 2020 at 11:23, Tom Lane wrote: > > I wrote: > > I made some cosmetic changes to this and committed it. Thanks! > > BTW, poking at this further, it seems that the patch only really > works for gcc. clang accepts the -ftree-vectorize switch, but > looking at the generated asm shows that it does nothing useful. > Which is odd, because clang does do loop vectorization. > > I tried adding -Rpass-analysis=loop-vectorize and got > > numeric.c:8341:3: remark: loop not vectorized: could not determine number of > loop iterations [-Rpass-analysis=loop-vectorize] > for (i2 = 0; i2 <= i; i2++) Hmm, yeah that's unfortunate. My guess is that the compiler would do vectorization only if 'i' is a constant, which is not true for our case. -- Thanks, -Amit Khandekar Huawei Technologies
Re: Re: [HACKERS] Custom compression methods
On Thu, 13 Aug 2020 at 17:18, Dilip Kumar wrote:
> I have rebased the patch on the latest head and currently, broken into 3
> parts.
>
> v1-0001: As suggested by Robert, it provides the syntax support for
> setting the compression method for a column while creating a table and
> adding columns. However, we don't support changing the compression
> method for the existing column. As part of this patch, there is only
> one built-in compression method that can be set (pglz). In this, we
> have one in-build am (pglz) and the compressed attributes will directly
> store the oid of the AM. In this patch, I have removed the
> pg_attr_compresion as we don't support changing the compression
> for the existing column so we don't need to preserve the old
> compressions.
> v1-0002: Add another built-in compression method (zlib)
> v1:0003: Remaining patch set (nothing is changed except rebase on the
> current head, stabilizing check-world and 0001 and 0002 are pulled
> out of this)
>
> Next, I will be working on separating out the remaining patches as per
> the suggestion by Robert.

Thanks for this new feature. It looks promising and very useful, with so many good compression libraries already available.

I see that with the patch set, I would be able to create an extension that defines a PostgreSQL C handler function which assigns all the required hook function implementations for compressing, decompressing and validating, etc. In short, I would be able to use a completely different compression algorithm to compress toast data if I write such an extension. Correct me if I am wrong with my interpretation.

Just a quick superficial set of review comments:

- A minor rebase is required due to a conflict in a regression test.

- In heap_toast_insert_or_update() and in other places, the comments for the new parameter preserved_am_info are missing.
-
+toast_compress_datum(Datum value, Oid acoid)
 {
 	struct varlena *tmp = NULL;
 	int32		valsize;
-	CompressionAmOptions cmoptions;
+	CompressionAmOptions *cmoptions = NULL;

I think tmp and cmoptions need not be initialized to NULL

-
-	TOAST_COMPRESS_SET_RAWSIZE(tmp, valsize);
-	SET_VARSIZE_COMPRESSED(tmp, len + TOAST_COMPRESS_HDRSZ);
 	/* successful compression */
+	toast_set_compressed_datum_info(tmp, amoid, valsize);
 	return PointerGetDatum(tmp);

Any particular reason why this code is put in a new extern function? Is there a plan to re-use it? Otherwise, it's not necessary to do this.

Also, not sure why "HTAB *amoptions_cache" and "MemoryContext amoptions_cache_mcxt" aren't static declarations. They are being used only in toast_internals.c

---

The tab-completion doesn't show COMPRESSION :

postgres=# create access method my_method TYPE
INDEX   TABLE
postgres=# create access method my_method TYPE

Also, the below syntax would better be tab-completed so as to display all the installed compression methods, in line with how we show all the storage methods like plain, extended, etc.:

postgres=# ALTER TABLE lztab ALTER COLUMN t SET COMPRESSION

I could see the differences in compression ratio, and the compression and decompression speed, when I use pglz versus zlib :

CREATE TABLE zlibtab(t TEXT COMPRESSION zlib WITH (level '4'));
create table lztab(t text);
ALTER TABLE lztab ALTER COLUMN t SET COMPRESSION pglz;

pgg:s2:pg$ time psql -c "\copy zlibtab from text.data"
COPY 13050
real    0m1.344s
user    0m0.031s
sys     0m0.026s

pgg:s2:pg$ time psql -c "\copy lztab from text.data"
COPY 13050
real    0m2.088s
user    0m0.008s
sys     0m0.050s

pgg:s2:pg$ time psql -c "select pg_table_size('zlibtab'::regclass), pg_table_size('lztab'::regclass)"
 pg_table_size | pg_table_size
---------------+---------------
       1261568 |       1687552

pgg:s2:pg$ time psql -c "select NULL from zlibtab where t like ''" > /dev/null
real    0m0.127s
user    0m0.000s
sys     0m0.002s

pgg:s2:pg$ time psql -c "select NULL from lztab where t like ''" > /dev/null
real    0m0.050s
user    0m0.002s
sys     0m0.000s

-- Thanks, -Amit Khandekar Huawei Technologies
Re: Auto-vectorization speeds up multiplication of large-precision numerics
On Mon, 13 Jul 2020 at 14:27, Amit Khandekar wrote:
> I tried this in utils/adt/Makefile :
> +
> +numeric.o: CFLAGS += ${CFLAGS_VECTOR}
> +
> and it works.
>
> CFLAGS_VECTOR also includes the -funroll-loops option, which I
> believe had shown improvements in the checksum.c runs ( [1] ). This
> option makes the object file a bit bigger. For numeric.o, its size
> increased by 15K; from 116672 to 131360 bytes. I ran the
> multiplication test, and didn't see any additional speed-up with this
> option. Also, it does not seem to be related to vectorization. So I
> was thinking of splitting the CFLAGS_VECTOR into CFLAGS_VECTOR and
> CFLAGS_UNROLL_LOOPS. Checksum.c can use both these flags, and
> numeric.c can use only CFLAGS_VECTOR.

I did as above. Attached is the v2 patch.

In the case of an existing CFLAGS_VECTOR, an env variable could also be set by that name when running configure. I did the same for CFLAGS_UNROLL_LOOPS. Now, developers who are already using the CFLAGS_VECTOR env while configuring might be using it because their compilers don't have these compiler options, so they must be using some equivalent compiler options. numeric.c will now be compiled with CFLAGS_VECTOR, so for them it will now be compiled with their equivalent of the vectorize and unroll-loops options, which is ok, I think. Just that the numeric.o size will be increased, that's it.

> [1] https://www.postgresql.org/message-id/flat/CA%2BU5nML8JYeGqM-k4eEwNJi5H%3DU57oPLBsBDoZUv4cfcmdnpUA%40mail.gmail.com

-- Thanks, -Amit Khandekar Huawei Technologies

From 029373ab2e9d2e6802326871f248ba2f21ffb204 Mon Sep 17 00:00:00 2001
From: Amit Khandekar
Date: Tue, 21 Jul 2020 16:42:51 +0800
Subject: [PATCH] Auto-vectorize loop to speedup large-precision numeric product

A 'for' loop in mul_var() runs backwards by decrementing two variables. This prevents the gcc compiler from auto-vectorizing the for loop. So make it a forward loop with a single variable.
This gives performance benefits for products of numeric types with large precision, with speedups becoming noticeable for values with precisions starting from 20-40. The typical pattern of benefit is: precision 50: 5%; precision 60: 11%; 120: 50%; 240: 2.2x; and so on. On some CPU architectures, the speedup starts from precision 20 onwards. With the precisions used in the numeric_big regression test, the multiplication speeds up by 2.5 to 2.7 times.

Auto-vectorization happens with the gcc -O3 flag or -ftree-loop-vectorize. So arrange for the -ftree-loop-vectorize flag specifically when compiling numeric.c. CFLAGS_VECTOR was already present for similar functionality in checksum.c, but CFLAGS_VECTOR also includes -funroll-loops, which unnecessarily makes numeric.o larger. So split CFLAGS_VECTOR into CFLAGS_UNROLL_LOOPS and CFLAGS_VECTOR.
---
 configure                         | 17 +++--
 configure.in                      |  9 +++--
 src/Makefile.global.in            |  1 +
 src/backend/storage/page/Makefile |  2 +-
 src/backend/utils/adt/Makefile    |  3 +++
 src/backend/utils/adt/numeric.c   | 11 ---
 6 files changed, 31 insertions(+), 12 deletions(-)

diff --git a/configure b/configure
index cb8fbe1051..36ca734ad5 100755
--- a/configure
+++ b/configure
@@ -734,6 +734,7 @@ CPP
 CFLAGS_SL
 BITCODE_CXXFLAGS
 BITCODE_CFLAGS
+CFLAGS_UNROLL_LOOPS
 CFLAGS_VECTOR
 PERMIT_DECLARATION_AFTER_STATEMENT
 LLVM_BINPATH
@@ -5266,10 +5267,13 @@ BITCODE_CFLAGS=""
 user_BITCODE_CXXFLAGS=$BITCODE_CXXFLAGS
 BITCODE_CXXFLAGS=""

-# set CFLAGS_VECTOR from the environment, if available
+# set CFLAGS_VECTOR and CFLAGS_UNROLL_LOOPS from the environment, if available
 if test "$ac_env_CFLAGS_VECTOR_set" = set; then
   CFLAGS_VECTOR=$ac_env_CFLAGS_VECTOR_value
 fi
+if test "$ac_env_CFLAGS_UNROLL_LOOPS_set" = set; then
+  CFLAGS_UNROLL_LOOPS=$ac_env_CFLAGS_UNROLL_LOOPS_value
+fi

 # Some versions of GCC support some additional useful warning flags.
 # Check whether they are supported, and add them to CFLAGS if so.
@@ -6102,16 +6106,16 @@ if test x"$pgac_cv_prog_CXX_cxxflags__fexcess_precision_standard" = x"yes"; then fi - # Optimization flags for specific files that benefit from vectorization - { $as_echo "$as_me:${as_lineno-$LINENO}: checking whether ${CC} supports -funroll-loops, for CFLAGS_VECTOR" >&5 -$as_echo_n "checking whether ${CC} supports -funroll-loops, for CFLAGS_VECTOR... " >&6; } + # Optimization flags for specific files that benefit from loop unrolling + { $as_echo "$as_me:${as_lineno-$LINENO}: checking whether ${CC} supports -funroll-loops, for CFLAGS_UNROLL_LOOPS" >&5 +$as_echo_n "checking whether ${CC} supports -funroll-loops, for CFLAGS_UNROLL_LOOPS... " >&6; } if ${pgac_cv_prog_CC_cflags__funroll_loops+:} false; then : $as_echo_n "(cached) " >
Re: Auto-vectorization speeds up multiplication of large-precision numerics
On Fri, 10 Jul 2020 at 19:02, Tom Lane wrote:
>
> Peter Eisentraut writes:
> > We normally don't compile with -O3, so very few users would get the
> > benefit of this.
>
> Yeah. I don't think changing that baseline globally would be a wise move.
>
> > We have CFLAGS_VECTOR for the checksum code. I
> > suppose if we are making the numeric code vectorizable as well, we
> > should apply this there also.
>
> > I think we need a bit of a policy decision from the group here.
>
> I'd vote in favor of applying CFLAGS_VECTOR to specific source files
> that can benefit. We already have experience with that and we've not
> detected any destabilization potential.

I tried this in utils/adt/Makefile :

+
+numeric.o: CFLAGS += ${CFLAGS_VECTOR}
+

and it works.

CFLAGS_VECTOR also includes the -funroll-loops option, which I believe had shown improvements in the checksum.c runs ( [1] ). This option makes the object file a bit bigger. For numeric.o, its size increased by 15K; from 116672 to 131360 bytes. I ran the multiplication test, and didn't see any additional speed-up with this option. Also, it does not seem to be related to vectorization. So I was thinking of splitting CFLAGS_VECTOR into CFLAGS_VECTOR and CFLAGS_UNROLL_LOOPS. checksum.c can use both these flags, and numeric.c can use only CFLAGS_VECTOR.

I was also wondering if it's worth extracting only the code that we think can be optimized and keeping it in a separate file (say numeric_vectorize.c or adt_vectorize.c, which can have mul_var() to start with), and using this file as a collection of all such code in the adt module; then we can add such files into other modules as and when needed. For numeric.c, there can already be some scope for auto-vectorization in other similar regions in that file, so we don't require a separate numeric_vectorize.c and can just pass the CFLAGS_VECTOR flag for this file itself.
[1] https://www.postgresql.org/message-id/flat/CA%2BU5nML8JYeGqM-k4eEwNJi5H%3DU57oPLBsBDoZUv4cfcmdnpUA%40mail.gmail.com#2ec419817ff429588dd1229fb663080e -- Thanks, -Amit Khandekar Huawei Technologies
Re: calling procedures is slow and consumes extra much memory against calling function
On Wed, 17 Jun 2020 at 13:54, Pavel Stehule wrote:
>
> On Wed, 17 Jun 2020 at 7:52, Amit Khandekar wrote:
>>
>> On Wed, 10 Jun 2020 at 17:12, Pavel Stehule wrote:
>> > On Wed, 10 Jun 2020 at 12:26, Amit Khandekar wrote:
>> >> Could you show an example testcase that tests this recursive scenario,
>> >> with which your earlier patch fails the test, and this v2 patch passes
>> >> it ? I am trying to understand the recursive scenario and the re-use
>> >> of expr->plan.
>> >
>> > it hangs on plpgsql tests. So you can apply first version of patch
>> > and "make check"
>>
>> I could not reproduce the make check hang with the v1 patch. But I
>> could see a crash with the below testcase. So I understand the purpose
>> of the plan_owner variable that you introduced in v2.
>>
>> Consider this recursive test :
>>
>> create or replace procedure p1(in r int) as $$
>> begin
>>    RAISE INFO 'r : % ', r;
>>    if r < 3 then
>>      call p1(r+1);
>>    end if;
>> end
>> $$ language plpgsql;
>>
>> do $$
>> declare r int default 1;
>> begin
>> call p1(r);
>> end;
>> $$;
>>
>> In p1() with r=2, when the stmt "call p1(r+1)" is being executed,
>> consider this code of exec_stmt_call() with your v2 patch applied:
>>
>> if (expr->plan && !expr->plan->saved)
>> {
>>    if (plan_owner)
>>      SPI_freeplan(expr->plan);
>>    expr->plan = NULL;
>> }
>>
>> Here, plan_owner is false. So SPI_freeplan() will not be called, and
>> expr->plan is set to NULL. Now I have observed that the stmt pointer
>> and expr pointer are shared between the p1() execution at this r=2
>> level and the p1() execution at the r=1 level. So after the above code
>> is executed at r=2, when the upper level (r=1) exec_stmt_call() lands
>> at the same code snippet, it gets the same expr pointer, but its
>> expr->plan is already set to NULL without being freed.
From this >> logic, it looks like the plan won't get freed whenever the expr/stmt >> pointers are shared across recursive levels, since expr->plan is set >> to NULL at the lowermost level ? Basically, the handle to the plan is >> lost so no one at the upper recursion level can explicitly free it >> using SPI_freeplan(), right ? This looks the same as the main issue >> where the plan does not get freed for non-recursive calls. I haven't >> got a chance to check if we can develop a testcase for this, similar >> to your testcase where the memory keeps on increasing. > > > This is a good consideration. > > I am sending updated patch Checked the latest patch. Looks like using a local plan rather than the expr->plan pointer for doing the checks does seem to resolve the issue I raised. That made me think of another scenario : now we are checking for the plan value and then null'ifying the expr->plan value. What if expr->plan is different from plan ? Is it possible ? I was thinking of such scenarios, but couldn't find one. As long as a plan is always created with saved=true for all levels, or with saved=false for all levels, we are ok. If we could have a mix of saved and unsaved plans at different recursion levels, then expr->plan could be different from the outer local plan, because then the expr->plan would not be set to NULL in the inner level, while the outer level may have created its own plan. But I think a mix of saved and unsaved plans is not possible. If you agree, then I think we should at least have an assert that looks like :

if (plan && !plan->saved)
{
    if (plan_owner)
        SPI_freeplan(plan);
    /* If expr->plan is present, it must be the same plan that we allocated */
    Assert(!expr->plan || plan == expr->plan);
    expr->plan = NULL;
}

Other than this, I have no other issues.
I understand that we have to do this special handling only for this exec_stmt_call() because it is only here that we call exec_prepare_plan() with keep_plan = false, so doing special handling for freeing the plan seems to make sense.
Re: Auto-vectorization speeds up multiplication of large-precision numerics
FYI : this one is present in the July commitfest.
Re: Inlining of couple of functions in pl_exec.c improves performance
On Sat, 4 Jul 2020 at 01:19, Tom Lane wrote: > > I did some performance testing on 0001+0002 here, and found that > for me, there's basically no change on x86_64 but a win of 2 to 3 > percent on aarch64, using Pavel's pi_est_1() as a representative > case for simple plpgsql statements. That squares with your original > results I believe. It's not clear to me whether any of the later > tests in this thread measured these changes in isolation, or only > with 0003 added. Yeah I had the same observation. 0001+0002 seems to benefit mostly on aarch64. And 0003 (exec_case_value) benefited both on amd64 and aarch64. > > Anyway, that's good enough for me, so I pushed 0001+0002 after a > little bit of additional cosmetic tweaking. > > I attach your original 0003 here (it still applies, with some line > offsets). That's just so the cfbot doesn't get confused about what > it's supposed to test now. Thanks for pushing all the three ! Thanks, -Amit Khandekar Huawei Technologies
Re: Inlining of couple of functions in pl_exec.c improves performance
On Thu, 2 Jul 2020 at 03:47, Tom Lane wrote: > > Amit Khandekar writes: > > There are a couple of function call overheads I observed in pl/pgsql > > code : exec_stmt() and exec_cast_value(). Removing these overheads > > resulted in some performance gains. > > I took a look at the 0001/0002 patches (not 0003 as yet). I do not > like 0001 too much. The most concrete problem with it is that > you broke translatability of the error messages, cf the first > translatability guideline at [1]. Yeah, I thought we can safely use %s for proper nouns such as "trigger procedure" or "function" as those would not be translated. But looks like even if they won't be translated, the difference in word order among languages might create problems with this. > While that could be fixed by passing > the entire message not just part of it, I don't see anything that we're > gaining by moving that stuff into exec_toplevel_block in the first place. > Certainly, passing a string that describes what will happen *after* > exec_toplevel_block is just weird. I think what you've got here is > a very arbitrary chopping-up of the existing code based on chance > similarities of the existing callers. I think we're better off to make > exec_toplevel_block be as nearly as possible a match for exec_stmts' > semantics. I thought some of those things that I kept in exec_toplevel_block() do look like they belong to a top level function. But what you are saying also makes sense : better to keep it similar to exec_stmts. > > Hence, I propose the attached 0001 to replace 0001/0002. This should > be basically indistinguishable performance-wise, though I have not > tried to benchmark. Thanks for the patches. Yeah, performance-wise it does look similar; but anyways I tried running, and got similar performance numbers. > Note that for reviewability's sake, I did not > reindent the former body of exec_stmt, though we'd want to do that > before committing. Right. 
> > Also, 0002 is a small patch on top of that to avoid redundant saves > and restores of estate->err_stmt within the loop in exec_stmts. This > may well not be a measurable improvement, but it's a pretty obvious > inefficiency in exec_stmts now that it's refactored this way. 0002 also makes sense.
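For illustration, the shape of the 0002 change can be modeled roughly like this (a toy sketch with made-up types; not the actual pl_exec.c code): the save/restore of estate->err_stmt is hoisted out of the per-statement loop instead of being repeated for every statement.

```c
#include <assert.h>

/* Toy model of the refactoring: Stmt and ExecState are stand-ins for the
 * real PLpgSQL_stmt and PLpgSQL_execstate structures. */
typedef struct Stmt { int id; } Stmt;
typedef struct ExecState { Stmt *err_stmt; } ExecState;

static void exec_stmts(ExecState *estate, Stmt *stmts, int nstmts)
{
    /* Save once before the loop, not once per statement ... */
    Stmt *save_estmt = estate->err_stmt;

    for (int i = 0; i < nstmts; i++)
        estate->err_stmt = &stmts[i];   /* error reports would point here */

    /* ... and restore once after the loop. */
    estate->err_stmt = save_estmt;
}
```

The observable behavior is unchanged: after the loop, err_stmt again points at the enclosing statement.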
Re: calling procedures is slow and consumes extra much memory against calling function
On Wed, 10 Jun 2020 at 17:12, Pavel Stehule wrote: > st 10. 6. 2020 v 12:26 odesílatel Amit Khandekar > napsal: >> Could you show an example testcase that tests this recursive scenario, >> with which your earlier patch fails the test, and this v2 patch passes >> it ? I am trying to understand the recursive scenario and the re-use >> of expr->plan. > > > it hangs on plpgsql tests. So you can apply first version of patch > > and "make check" I could not reproduce the make check hang with the v1 patch. But I could see a crash with the below testcase. So I understand the purpose of the plan_owner variable that you introduced in v2. Consider this recursive test :

create or replace procedure p1(in r int) as $$
begin
   RAISE INFO 'r : % ', r;
   if r < 3 then
      call p1(r+1);
   end if;
end
$$ language plpgsql;

do $$
declare r int default 1;
begin
  call p1(r);
end;
$$;

In p1() with r=2, when the stmt "call p1(r+1)" is being executed, consider this code of exec_stmt_call() with your v2 patch applied:

if (expr->plan && !expr->plan->saved)
{
    if (plan_owner)
        SPI_freeplan(expr->plan);
    expr->plan = NULL;
}

Here, plan_owner is false. So SPI_freeplan() will not be called, and expr->plan is set to NULL. Now I have observed that the stmt pointer and expr pointer is shared between the p1() execution at this r=2 level and the p1() execution at r=1 level. So after the above code is executed at r=2, when the upper level (r=1) exec_stmt_call() lands to the same above code snippet, it gets the same expr pointer, but its expr->plan is already set to NULL without being freed.
I haven't got a chance to check if we can develop a testcase for this, similar to your testcase where the memory keeps on increasing. -Amit
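The lost-handle pattern described above can be modeled with a small standalone C sketch (a toy, not the real plpgsql code: Expr, stmt_call, and the level/plan_owner arguments are stand-ins). The inner recursion level does not own the plan, so it only NULLs the shared expr->plan, after which the owning outer level has nothing left to free:

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Expr { void *plan; } Expr;

static void *leaked;   /* records what the non-owning level orphaned */

static void stmt_call(Expr *expr, int level, int plan_owner)
{
    if (expr->plan == NULL)
        expr->plan = malloc(64);        /* stand-in for plan creation */

    if (level < 2)
        stmt_call(expr, level + 1, 0);  /* recursive CALL: not the owner */

    /* mimics: if (expr->plan && !expr->plan->saved) { ... } */
    if (expr->plan)
    {
        if (plan_owner)
            free(expr->plan);
        else
            leaked = expr->plan;        /* handle lost: nobody frees this */
        expr->plan = NULL;
    }
}
```

Running stmt_call(&expr, 1, 1) leaves expr.plan NULL at the outer level while the allocation was never freed, which is exactly the leak scenario.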
Re: Auto-vectorization speeds up multiplication of large-precision numerics
On Wed, 10 Jun 2020 at 04:20, Peter Eisentraut wrote: > > On 2020-06-09 13:50, Amit Khandekar wrote: > > Also, the regress/sql/numeric_big test itself speeds up by 80% > > That's nice. I can confirm the speedup: > > -O3 without the patch: > > numeric ... ok 737 ms > test numeric_big ... ok 1014 ms > > -O3 with the patch: > > numeric ... ok 680 ms > test numeric_big ... ok 580 ms > > Also: > > -O2 without the patch: > > numeric ... ok 693 ms > test numeric_big ... ok 1160 ms > > -O2 with the patch: > > numeric ... ok 677 ms > test numeric_big ... ok 917 ms > > So the patch helps either way. Oh, I hadn't observed that the patch helps numeric_big.sql to speed up to some extent even with -O2. For me, it takes 805 ms on head and 732 ms with the patch. One possible reason that I can think of is : because of the forward loop traversal, pre-fetching might be helping. But this is just a wild guess.

-O3 : HEAD
test numeric ... ok 102 ms
test numeric_big ... ok 803 ms

-O3 : patched
test numeric ... ok 100 ms
test numeric_big ... ok 450 ms

-O2 : HEAD
test numeric ... ok 100 ms
test numeric_big ... ok 805 ms

-O2 : patched
test numeric ... ok 103 ms
test numeric_big ... ok 732 ms

> But it also seems that without the patch, -O3 might > be a bit slower in some cases. This might need more testing. For me, there is no observed change in the times with -O2 versus -O3, on head. Are you consistently getting slower numeric*.sql tests with -O3 on the current code ? Not sure what the reason might be, but that would not be related to the patch. Or is it in the context of the patch that you are suggesting it might need more testing ? There might be existing cases that run a bit slower with -O3, but that's strange actually. Probably the optimizations in those cases don't work out the way the compiler expected, and in fact work negatively. Probably that's one of the reasons -O2 is the default choice.
> > > For the for loop to be auto-vectorized, I had to simplify it to > > something like this : > > Well, how do we make sure we keep it that way? How do we prevent some > random rearranging of the code or some random compiler change to break > this again? I believe the compiler rearranges code segments w.r.t. one another only when they are independent of each other, and the compiler is able to identify that. In our case, it's the for loop. There are some variables used inside it, and those would prevent it from moving the for loop. Even if the compiler finds it safe to move the loop relative to surrounding code, it would not split the for loop contents themselves, so the for loop will remain intact, and so would the vectorization, although it seems to keep an unrolled version of the loop in case it is called with smaller iteration counts. But yes, if someone in the future tries to change the for loop, it would possibly break the auto-vectorization. So, we should have appropriate comments (the patch has those). Let me know if you find any possible reasons due to which the compiler would in the future stop the vectorization even when there is no change in the for loop. It might be safer to take the for loop out into an inline function, just to protect it ?
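To make the discussion concrete, here is a standalone toy sketch (not the PostgreSQL source; function names, array sizes, and values are made up) of the two loop shapes being discussed: the backward loop with two induction variables, which gcc declines to vectorize, and the equivalent single-variable forward loop that it can vectorize at -O3:

```c
#include <assert.h>

/* Backward loop, two induction variables (i2 and i) -- the shape that
 * defeats gcc's loop vectorizer. */
static void mul_backward(int *dig, const int *src, int m, int start, int n)
{
    int i2, i;

    for (i2 = n - 1, i = start + i2; i2 >= 0; i2--)
        dig[i--] += m * src[i2];
}

/* Equivalent forward loop with a single induction variable -- the shape
 * the vectorizer can handle. */
static void mul_forward(int *dig, const int *src, int m, int start, int n)
{
    int *digptr = &dig[start];

    for (int i = 0; i < n; i++)
        digptr[i] += m * src[i];
}
```

Both produce identical results (dig[start + k] += m * src[k] for each k), so the rewrite is purely a form change for the optimizer's benefit.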
Re: Inlining of couple of functions in pl_exec.c improves performance
On Tue, 9 Jun 2020 at 21:49, Pavel Stehule wrote: > Is your patch in commitfest in commitfest application? Thanks for reminding me. Just added. https://commitfest.postgresql.org/28/2590/
Re: calling procedures is slow and consumes extra much memory against calling function
On Sat, 16 May 2020 at 00:07, Pavel Stehule wrote: > > Hi > >>> >>> The problem is in plpgsql implementation of CALL statement >>> >>> In non atomic case - case of using procedures from DO block, the >>> expression plan is not cached, and plan is generating any time. This is >>> reason why it is slow. >>> >>> Unfortunately, generated plans are not released until SPI_finish. Attached >>> patch fixed this issue. >> >> >> But now, recursive calling doesn't work :-(. So this patch is not enough > > > Attached patch is working - all tests passed Could you show an example testcase that tests this recursive scenario, with which your earlier patch fails the test, and this v2 patch passes it ? I am trying to understand the recursive scenario and the re-use of expr->plan. > > It doesn't solve performance, and doesn't solve all memory problems, but > significantly reduce memory requirements from 5007 bytes to 439 bytes per one > CALL So now this patch's intention is to reduce memory consumption, and it doesn't target slowness improvement, right ? -- Thanks, -Amit Khandekar Huawei Technologies
Auto-vectorization speeds up multiplication of large-precision numerics
Auto-vectorization is enabled by -ftree-loop-vectorize; with -O3 it is always enabled. What we can do in the future is to have a separate file that has such optimized code that is proven to work with such optimization flags, and enable the required compiler flags only for such files, if the build is done with -O2. [1] https://gcc.gnu.org/projects/tree-ssa/vectorization.html#using -- Thanks, -Amit Khandekar Huawei Technologies

create_schema.sql Description: application/sql

From 91aa5ef1034b763e279b4a8970101355e4a79600 Mon Sep 17 00:00:00 2001
From: Amit Khandekar
Date: Tue, 9 Jun 2020 16:35:23 +0530
Subject: [PATCH] Auto-vectorize loop to speedup large-precision numeric product

A 'for' loop in mul_var() runs backwards by decrementing two variables. This prevents the gcc compiler from auto-vectorizing the for loop. So make it a forward loop with a single variable. This gives performance benefits for products of numeric types with large precision, with speedups becoming noticeable at precisions starting from 20-40. The typical pattern of benefit is : precision 50: 5%; precision 60: 11%; 120: 50%; 240: 2.2x; and so on. On some CPU architectures, the speedup starts from 20 precision onwards. With the precisions used in the numeric_big regression test, the multiplication speeds up by 2.5 to 2.7 times. Auto-vectorization happens with the -O3 flag or -ftree-loop-vectorize. So this benefit will be seen when built with gcc -O3.
---
 src/backend/utils/adt/numeric.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index f3a725271e..4243242ad9 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -7226,6 +7226,7 @@ mul_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
 	int			res_weight;
 	int			maxdigits;
 	int		   *dig;
+	int		   *digptr;
 	int			carry;
 	int			maxdig;
 	int			newdig;
@@ -7362,10 +7363,14 @@ mul_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
 	 *
 	 * As above, digits of var2 can be ignored if they don't contribute,
 	 * so we only include digits for which i1+i2+2 <= res_ndigits - 1.
+	 *
+	 * For large precisions, this can become a bottleneck; so keep this for
+	 * loop simple so that it can be auto-vectorized.
 	 */
-	for (i2 = Min(var2ndigits - 1, res_ndigits - i1 - 3), i = i1 + i2 + 2;
-		 i2 >= 0; i2--)
-		dig[i--] += var1digit * var2digits[i2];
+	i2 = Min(var2ndigits - 1, res_ndigits - i1 - 3);
+	digptr = &dig[i1 + 2];
+	for (i = 0; i <= i2; i++)
+		digptr[i] += var1digit * var2digits[i];
 	}

 	/*
-- 
2.17.1
Re: Inlining of couple of functions in pl_exec.c improves performance
On Mon, 1 Jun 2020 at 12:27, Pavel Stehule wrote: > po 1. 6. 2020 v 8:15 odesílatel Amit Khandekar > napsal: >> >> On Sat, 30 May 2020 at 11:11, Pavel Stehule wrote: >> > I think so the effect of these patches strongly depends on CPU and compile >> >> I quickly tried pi() with gcc 10 as well, and saw more or less the >> same benefit. I think, we are bound to see some differences in the >> benefits across architectures, kernels and compilers; but looks like >> some benefit is always there. >> >> > but it is micro optimization, and when I look to profiler, the bottle neck >> > is elsewhere. >> >> Please check the perf numbers in my reply to Michael. I suppose you >> meant CachedPlanIsSimplyValid() when you say the bottle neck is >> elsewhere ? Yeah, this function is always the hottest spot, which I >> recall is being discussed elsewhere. But I always see exec_stmt(), >> exec_assign_value as the next functions. > > > It is hard to read the profile result, because these functions are nested > together. For your example > > 18.22% postgres postgres [.] CachedPlanIsSimplyValid > > Is little bit strange, and probably this is real bottleneck in your simple > example, and maybe some work can be done there, because you assign just > constant. I had earlier had a quick look at this one. CachedPlanIsSimplyValid() was, I recall, hitting a hotspot when it tries to access plansource->search_path (possibly a cacheline miss). But I didn't get a chance to dig further into that. For now, I am focusing on these other functions for which the patches were submitted. > > On second hand, your example is pretty unrealistic - and against any > developer's best practices for writing cycles. > > I think so we can look on PostGIS, where is some computing heavy routines in > PLpgSQL, and we can look on real profiles. > > Probably the most people will have benefit from these optimization. I understand it's not a real world example.
For generating perf figures, I had to use an example which amplifies the benefits, so that the effect of the patches on the perf figures also becomes visible. Hence, I used that example. I had shown the benefits up-thread using a practical function avg_space(). But the perf figures for that example were varying a lot. So below, what I did was : run avg_space() ~150 times, and take a perf report. This stabilized the results a bit :

HEAD :
+ 16.10% 17.29% 16.82%  postgres  postgres    [.] ExecInterpExpr
+ 13.80% 13.56% 14.49%  postgres  plpgsql.so  [.] exec_assign_value
+ 12.64% 12.10% 12.09%  postgres  plpgsql.so  [.] plpgsql_param_eval_var
+ 12.15% 11.28% 11.05%  postgres  plpgsql.so  [.] exec_stmt
+ 10.81% 10.24% 10.55%  postgres  plpgsql.so  [.] exec_eval_expr
+  9.50%  9.35%  9.37%  postgres  plpgsql.so  [.] exec_cast_value
...
+  1.19%  1.06%  1.21%  postgres  plpgsql.so  [.] exec_stmts

0001+0002 patches applied (i.e. inline exec_stmt) :
+ 16.90% 17.20% 16.54%  postgres  postgres    [.] ExecInterpExpr
+ 16.42% 15.37% 15.28%  postgres  plpgsql.so  [.] exec_assign_value
+ 11.34% 11.92% 11.93%  postgres  plpgsql.so  [.] plpgsql_param_eval_var
+ 11.18% 11.86% 10.99%  postgres  plpgsql.so  [.] exec_stmts.part.0
+ 10.51%  9.52% 10.61%  postgres  plpgsql.so  [.] exec_eval_expr
+  9.39%  9.48%  9.30%  postgres  plpgsql.so  [.] exec_cast_value

HEAD : exec_stmts + exec_stmt = ~12.7 %
Patched (0001+0002) : exec_stmts = 11.3 %

Just the 0003 patch applied (i.e. inline exec_cast_value) :
+ 17.00% 16.77% 17.09%  postgres  postgres    [.] ExecInterpExpr
+ 15.21% 15.64% 15.09%  postgres  plpgsql.so  [.] exec_assign_value
+ 14.48% 14.06% 13.94%  postgres  plpgsql.so  [.] exec_stmt
+ 13.26% 13.30% 13.14%  postgres  plpgsql.so  [.] plpgsql_param_eval_var
+ 11.48% 11.64% 12.66%  postgres  plpgsql.so  [.] exec_eval_expr
+  1.03%  0.85%  0.87%  postgres  plpgsql.so  [.] exec_stmts

HEAD : exec_assign_value + exec_cast_value = ~23.4 %
Patched (0003) : exec_assign_value = 15.3 %

Time in milliseconds after calling avg_space() 150 times :
HEAD : 7210
Patch 0001+0002 : 6925
Patch 0003 : 6670
Patch 0001+0002+0003 : 6346
Re: Inlining of couple of functions in pl_exec.c improves performance
On Sat, 30 May 2020 at 11:11, Pavel Stehule wrote: > I think so the effect of these patches strongly depends on CPU and compile I quickly tried pi() with gcc 10 as well, and saw more or less the same benefit. I think, we are bound to see some differences in the benefits across architectures, kernels and compilers; but looks like some benefit is always there. > but it is micro optimization, and when I look to profiler, the bottle neck is > elsewhere. Please check the perf numbers in my reply to Michael. I suppose you meant CachedPlanIsSimplyValid() when you say the bottle neck is elsewhere ? Yeah, this function is always the hottest spot, which I recall is being discussed elsewhere. But I always see exec_stmt(), exec_assign_value as the next functions. >> > patch 0002 it is little bit against simplicity, but for PLpgSQL with >> > blocks structure it is correct. >> >> Here, I moved the exec_stmt code into exec_stmts() function because >> exec_stmts() was the only caller, and that function is not that big. I >> am assuming you were referring to this point when you said it is a bit >> against simplicity. But I didn't get what you implied by "but for >> PLpgSQL with blocks structure it is correct" > > > Nested statement in PLpgSQL is always a list of statements. It is not single > statement ever. So is not too strange don't have a function execute_stmt. Right.
Re: Inlining of couple of functions in pl_exec.c improves performance
On Sun, 31 May 2020 at 08:04, Michael Paquier wrote: > This stuff is interesting. Do you have some perf profiles to share? > I am wondering what's the effect of the inlining with your test > cases. Below are the perf numbers for assignmany.sql :

HEAD :
+ 16.88%  postgres  postgres           [.] CachedPlanIsSimplyValid
+ 16.64%  postgres  plpgsql.so         [.] exec_stmt
+ 15.56%  postgres  plpgsql.so         [.] exec_eval_expr
+ 13.58%  postgres  plpgsql.so         [.] exec_assign_value
+  7.49%  postgres  plpgsql.so         [.] exec_cast_value
+  7.17%  postgres  plpgsql.so         [.] exec_assign_expr
+  5.39%  postgres  postgres           [.] MemoryContextReset
+  3.91%  postgres  postgres           [.] ExecJustConst
+  3.33%  postgres  postgres           [.] recomputeNamespacePath
+  2.88%  postgres  postgres           [.] OverrideSearchPathMatchesCurrent
+  2.18%  postgres  plpgsql.so         [.] exec_eval_cleanup.isra.17
+  2.15%  postgres  plpgsql.so         [.] exec_stmts
+  1.32%  postgres  plpgsql.so         [.] MemoryContextReset@plt
+  0.57%  postgres  plpgsql.so         [.] CachedPlanIsSimplyValid@plt
+  0.57%  postgres  postgres           [.] GetUserId
   0.30%  postgres  plpgsql.so         [.] assign_simple_var.isra.13
   0.05%  postgres  [kernel.kallsyms]  [k] unmap_page_range

Patched :
+ 18.22%  postgres  postgres           [.] CachedPlanIsSimplyValid
+ 17.25%  postgres  plpgsql.so         [.] exec_eval_expr
+ 16.31%  postgres  plpgsql.so         [.] exec_stmts
+ 15.00%  postgres  plpgsql.so         [.] exec_assign_value
+  7.56%  postgres  plpgsql.so         [.] exec_assign_expr
+  5.64%  postgres  postgres           [.] MemoryContextReset
+  5.16%  postgres  postgres           [.] ExecJustConst
+  4.86%  postgres  postgres           [.] recomputeNamespacePath
+  4.54%  postgres  postgres           [.] OverrideSearchPathMatchesCurrent
+  2.33%  postgres  plpgsql.so         [.] exec_eval_cleanup.isra.17
+  1.26%  postgres  plpgsql.so         [.] MemoryContextReset@plt
+  0.81%  postgres  postgres           [.] GetUserId
+  0.71%  postgres  plpgsql.so         [.] CachedPlanIsSimplyValid@plt
   0.26%  postgres  plpgsql.so         [.] assign_simple_var.isra.13
   0.03%  postgres  [kernel.kallsyms]  [k] unmap_page_range
   0.02%  postgres  [kernel.kallsyms]  [k] mark_page_accessed

Notice the reduction in percentages :
HEAD    : exec_stmts + exec_stmt = 18.79
Patched : exec_stmts = 16.31
HEAD    : exec_assign_value + exec_cast_value = 21.07
Patched : exec_assign_value = 15.00

As expected, the reduction in these two functions caused other functions like CachedPlanIsSimplyValid() and exec_eval_expr() to show a rise in their percentages.
Re: Inlining of couple of functions in pl_exec.c improves performance
On Thu, 28 May 2020 at 14:39, Pavel Stehule wrote: > I don't see 17% anywhere, but 3-5% is not bad. Did you see 3-5% only for the pi function, or did you see the same improvement also for the functions that I wrote ? I was getting a consistent result of 14-18 % on both of the VMs. Also, is your test machine running on Windows ? All the machines I tested were on Linux kernel (Ubuntu) Below are my results for your pi_est_1() function. For this function, I am consistently getting 5-9 % improvement. I tested on 3 machines : gcc : 8.4.0. -O2 option OS : Ubuntu Bionic explain analyze select pi_est_1(1000) 1. x86_64 laptop VM (Intel Core i7-8665U) HEAD :2666 2617 2600 2630 Patched : 2502 2409 2460 2444 2. x86_64 VM (Xeon Gold 6151) HEAD :1664 1662 1721 1660 Patched : 1541 1548 1537 1526 3. ARM64 VM (Kunpeng) HEAD :2873 2864 2860 2861 Patched : 2568 2513 2501 2538 > > patch 0001 has sense and can help with code structure > patch 0002 it is little bit against simplicity, but for PLpgSQL with blocks > structure it is correct. Here, I moved the exec_stmt code into exec_stmts() function because exec_stmts() was the only caller, and that function is not that big. I am assuming you were referring to this point when you said it is a bit against simplicity. But I didn't get what you implied by "but for PLpgSQL with blocks structure it is correct" -- Thanks, -Amit Khandekar Huawei Technologies
Re: Inlining of couple of functions in pl_exec.c improves performance
On Tue, 26 May 2020 at 09:06, Amit Khandekar wrote: > > On Sat, 23 May 2020 at 23:24, Pavel Stehule wrote: > > > >FOR counter IN 1..180 LOOP > > id = 0; id = 0; id1 = 0; > > id2 = 0; id3 = 0; id1 = 0; id2 = 0; > > id3 = 0; id = 0; id = 0; id1 = 0; > > id2 = 0; id3 = 0; id1 = 0; id2 = 0; > > id3 = 0; > >END LOOP; > > > > This is not too much typical PLpgSQL code. All expressions are not > > parametrized - so this test is little bit obscure. > > > > Last strange performance plpgsql benchmark did calculation of pi value. It > > does something real > > Yeah, basically I wanted to have many statements, and that too with > many assignments where casts are not required. Let me check if I can > come up with a real-enough testcase. Thanks. create table tab (id int[]); insert into tab select array((select ((random() * 10)::bigint) id from generate_series(1, 3) order by 1)); insert into tab select array((select ((random() * 60)::bigint) id from generate_series(1, 3) order by 1)); insert into tab select array((select ((random() * 100)::bigint) id from generate_series(1, 3) order by 1)); insert into tab select array((select ((random() * 10)::bigint) id from generate_series(1, 3) order by 1)); insert into tab select array((select ((random() * 60)::bigint) id from generate_series(1, 3) order by 1)); insert into tab select array((select ((random() * 100)::bigint) id from generate_series(1, 3) order by 1)); insert into tab select array((select ((random() * 10)::bigint) id from generate_series(1, 3) order by 1)); insert into tab select array((select ((random() * 60)::bigint) id from generate_series(1, 3) order by 1)); insert into tab select array((select ((random() * 100)::bigint) id from generate_series(1, 3) order by 1)); -- Return how much two consecutive array elements are apart from each other, on average; i.e. how much the numbers are spaced out. -- Input is an ordered array of integers. 
CREATE OR REPLACE FUNCTION avg_space(int[]) RETURNS bigint AS $$
DECLARE
  diff int = 0;
  num int;
  prevnum int = 1;
BEGIN
  FOREACH num IN ARRAY $1 LOOP
    diff = diff + num - prevnum;
    prevnum = num;
  END LOOP;
  RETURN diff/array_length($1, 1);
END;
$$ LANGUAGE plpgsql;

explain analyze select avg_space(id) from tab;

Like earlier figures, these are execution times in milliseconds, taken from explain-analyze.

ARM VM:
HEAD                 : 49.8
patch 0001+0002      : 47.8 => 4.2%
patch 0001+0002+0003 : 42.9 => 16.1%

x86 VM:
HEAD                 : 32.8
patch 0001+0002      : 32.7 => 0%
patch 0001+0002+0003 : 28.0 => 17.1%
Re: Inlining of couple of functions in pl_exec.c improves performance
On Sat, 23 May 2020 at 23:24, Pavel Stehule wrote: > >FOR counter IN 1..180 LOOP > id = 0; id = 0; id1 = 0; > id2 = 0; id3 = 0; id1 = 0; id2 = 0; > id3 = 0; id = 0; id = 0; id1 = 0; > id2 = 0; id3 = 0; id1 = 0; id2 = 0; > id3 = 0; >END LOOP; > > This is not too much typical PLpgSQL code. All expressions are not > parametrized - so this test is little bit obscure. > > Last strange performance plpgsql benchmark did calculation of pi value. It > does something real Yeah, basically I wanted to have many statements, and that too with many assignments where casts are not required. Let me check if I can come up with a real-enough testcase. Thanks.
Inlining of couple of functions in pl_exec.c improves performance
There are a couple of function call overheads I observed in pl/pgsql code : exec_stmt() and exec_cast_value(). Removing these overheads resulted in some performance gains. exec_stmt() : plpgsql_exec_function() and other toplevel block executors currently call exec_stmt(). But actually they don't need to do everything that exec_stmt() does. So they can call a new function instead of exec_stmt(), and all the exec_stmt() code can be moved to exec_stmts(). The things that exec_stmt() do, but are not necessary for a top level block stmt, are : 1. save_estmt = estate->err_stmt; estate->err_stmt = stmt; For top level blocks, saving the estate->err_stmt is not necessary, because there is no statement after this block statement. Anyways, plpgsql_exec_function() assigns estate.err_stmt just before calling exec_stmt so there is really no point in exec_stmt() setting it again. 2. CHECK_FOR_INTERRUPTS() This is not necessary for toplevel block callers. 3. exec_stmt_block() can be directly called rather than exec_stmt() because func->action is a block statement. So the switch statement is not necessary. But this one might be necessary for toplevel block statement: if (*plpgsql_plugin_ptr && (*plpgsql_plugin_ptr)->stmt_beg) ((*plpgsql_plugin_ptr)->stmt_beg) (estate, stmt); There was already a repetitive code in plpgsql_exec_function() and other functions around the exec_stmt() call. So in a separate patch 0001*.patch, I moved that code into a common function exec_toplevel_block(). In the main patch 0002-Get-rid-of-exec_stmt-function-call.patch, I additionally called plpgsql_plugin_ptr->stmt_beg() inside exec_toplevel_block(). And moved exec_stmt() code into exec_stmts(). exec_cast_value() : This function does not do the casting if not required. So moved the code that actually does the cast into a separate function, so as to reduce the exec_cast_value() code and make it inline. 
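A rough sketch of that exec_cast_value() restructuring, with simplified types rather than the actual pl_exec.c signatures: the cheap "is a cast needed?" test stays in a small wrapper that the compiler can inline at every call site, while the expensive conversion lives in a separate, rarely taken helper.

```c
#include <assert.h>

/* Placeholder for the real conversion work (the cold path that was split
 * out so the wrapper stays small enough to inline). */
static int do_cast_value(int value, int totype)
{
    return value * totype;   /* made-up "conversion" for illustration */
}

/* Small inlinable wrapper: the common no-cast-needed case returns
 * immediately without a function call. */
static inline int cast_value(int value, int fromtype, int totype)
{
    if (fromtype == totype)
        return value;                    /* fast path: no cast required */
    return do_cast_value(value, totype); /* cold path, out of line */
}
```

The design point is that only the branch gets duplicated at each call site, not the conversion machinery, so the instruction-cache cost of inlining stays low.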
Attached is the 0003-Inline-exec_cast_value.patch Testing -- I used two available VMs (one x86_64 and the other arm64), and the benefit showed up on both of these machines. Attached patches 0001, 0002, 0003 are to be applied in that order. 0001 is just a preparatory patch. First I tried with a simple for loop with a single assignment (attached forcounter.sql) By inlining of the two functions, found noticeable reduction in execution time as shown (figures are in milliseconds, averaged over multiple runs; taken from 'explain analyze' execution times) : ARM VM : HEAD : 100 ; Patched : 88 => 13.6% improvement x86 VM : HEAD : 71 ; Patched : 66 => 7.63% improvement. Then I included many assignment statements as shown in attachment assignmany.sql. This showed further benefit : ARM VM : HEAD : 1820 ; Patched : 1549 => 17.5% improvement x86 VM : HEAD : 1020 ; Patched : 869 => 17.4% improvement Inlining just exec_stmt() showed the improvement mainly on the arm64 VM (7.4%). For x86, it was 2.7% But inlining exec_stmt() and exec_cast_value() together showed benefits on both machines, as can be seen above. -- Thanks, -Amit Khandekar Huawei Technologies From 66c607ef6f0b7b655819b4b19383e024c5f8788c Mon Sep 17 00:00:00 2001 From: Amit Khandekar Date: Sat, 23 May 2020 21:39:41 +0800 Subject: [PATCH 1/3] Modularize code in toplevel pl/pgsql block callers Functions that call exec_stmt() for executing the toplevel block have a repetitive code that is now moved into a common function exec_toplevel_block(). This in preparation for further changes in this part of code. 
---
 src/pl/plpgsql/src/pl_exec.c | 86
 1 file changed, 38 insertions(+), 48 deletions(-)

diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c
index 9a87cd70f1..0a70ceddbb 100644
--- a/src/pl/plpgsql/src/pl_exec.c
+++ b/src/pl/plpgsql/src/pl_exec.c
@@ -267,6 +267,9 @@ static MemoryContext get_stmt_mcontext(PLpgSQL_execstate *estate);
 static void push_stmt_mcontext(PLpgSQL_execstate *estate);
 static void pop_stmt_mcontext(PLpgSQL_execstate *estate);
+static void exec_toplevel_block(PLpgSQL_execstate *estate,
+								PLpgSQL_stmt_block *block,
+								char *object_type, char *err_text);
 static int exec_stmt_block(PLpgSQL_execstate *estate,
 						   PLpgSQL_stmt_block *block);
 static int exec_stmts(PLpgSQL_execstate *estate,
@@ -475,7 +478,6 @@ plpgsql_exec_function(PLpgSQL_function *func, FunctionCallInfo fcinfo,
 	PLpgSQL_execstate estate;
 	ErrorContextCallback plerrcontext;
 	int			i;
-	int			rc;

 	/*
 	 * Setup the execution state
@@ -605,23 +607,8 @@ plpgsql_exec_function(PLpgSQL_function *func, FunctionCallInfo fcinfo,
 	/*
 	 * Now call the toplevel block of statements
 	 */
-	estate.err_text = NULL;
-	estate.err_stmt = (PLpgSQL_stmt *) (func->action);
-	rc = exec_stmt(&estate, (PLpgSQL_stmt *) func->action);
-	if (rc != PLPGSQL_RC_RETURN)
-	{
-		estate.err_stmt = NULL;
-		estate.err_text = NULL;
-		ereport(ERROR,
-				(errcode(ERRC
pgbench testing with contention scenarios
Subject changed. Earlier it was: spin_delay() for ARM

On Fri, 17 Apr 2020 at 22:54, Robert Haas wrote:
>
> On Thu, Apr 16, 2020 at 3:18 AM Amit Khandekar wrote:
> > Not relevant to the PAUSE stuff: Note that when the parallel
> > clients reach from 24 to 32 (which equals the machine CPUs), the TPS
> > shoots from 454189 to 1097592, which is more than double the speed gain
> > with just a 30% increase in parallel sessions.

For reference, the TPS figures can be seen here:
https://www.postgresql.org/message-id/CAJ3gD9e86GY%3DQfyfZQkb11Z%2BCVWowDiGgGThzKKwHDGU9uA2yA%40mail.gmail.com

> I've seen stuff like this too. For instance, check out the graph from
> this 2012 blog post:
>
> http://rhaas.blogspot.com/2012/04/did-i-say-32-cores-how-about-64.html
>
> You can see that the performance growth is basically on a straight
> line up to about 16 cores, but then it kinks downward until about 28,
> after which it kinks sharply upward until about 36 cores.
>
> I think this has something to do with the process scheduling behavior
> of Linux, because I vaguely recall some discussion where somebody did
> benchmarking on the same hardware on both Linux and one of the BSD
> systems, and the effect didn't appear on BSD. They had other problems,
> like a huge drop-off at higher core counts, but they didn't have that
> effect.

Ah, I see. By the way, I have observed this behaviour on both x86 and ARM, regardless of whether the number of CPUs is as low as 8 or as high as 32. But in my case, I suspect it's a combination of the Linux scheduler and interactions between the backends and the pgbench clients. So I used a custom script: the same point query that is used with the -S option, but run again and again on the server side without the client having to send it repeatedly, so the pgbench clients are idle most of the time.
Query used : select foo(30);
where foo(int) is defined as :

create or replace function foo(iterations int) returns int as $$
declare
   id int;
   ret int;
   counter int = 0;
begin
   WHILE counter < iterations LOOP
      counter = counter + 1;
      id = random() * 300;
      select into ret aid from pgbench_accounts where aid = id;
   END LOOP;
   return ret;
end
$$ language plpgsql;

Below are results for 30 scale factor, with 8 CPUs :

Clients   TPS
2         1.255327
4         2.414139
6         3.532937
8         4.586583
10        4.557575
12        4.517226
14        4.551455
18        4.593271

You can see that the TPS rise is almost linearly proportional to the increase in clients, with no deviation in between, up to 8 clients, beyond which it does not rise because the CPUs are fully utilized. In this custom case as well, the behaviour is the same on both x86 and ARM, regardless of 8 CPUs or 32 CPUs.

--
Thanks,
-Amit Khandekar
Huawei Technologies
Re: spin_delay() for ARM
On Sat, 18 Apr 2020 at 03:30, Thomas Munro wrote: > > On Sat, Apr 18, 2020 at 2:00 AM Ants Aasma wrote: > > On Thu, 16 Apr 2020 at 10:33, Pavel Stehule wrote: > > > what I know, pgbench cannot be used for testing spinlocks problems. > > > > > > Maybe you can see this issue when a) use higher number clients - > > > hundreds, thousands. Decrease share memory, so there will be press on > > > related spin lock. > > > > There really aren't many spinlocks left that could be tickled by a > > normal workload. I looked for a way to trigger spinlock contention > > when I prototyped a patch to replace spinlocks with futexes. The only > > one that I could figure out a way to make contended was the lock > > protecting parallel btree scan. A highly parallel index only scan on a > > fully cached index should create at least some spinlock contention. > > I suspect the snapshot-too-old "mutex_threshold" spinlock can become > contended under workloads that generate a high rate of > heap_page_prune_opt() calls with old_snapshot_threshold enabled. One > way to do that is with a bunch of concurrent index scans that hit the > heap in random order. Some notes about that: > > https://www.postgresql.org/message-id/flat/CA%2BhUKGKT8oTkp5jw_U4p0S-7UG9zsvtw_M47Y285bER6a2gD%2Bg%40mail.gmail.com Thanks all for the inputs. Will keep these two particular scenarios in mind, and try to get some bandwidth on this soon. -- Thanks, -Amit Khandekar Huawei Technologies
Re: spin_delay() for ARM
On Mon, 13 Apr 2020 at 20:16, Amit Khandekar wrote:
> On Sat, 11 Apr 2020 at 04:18, Tom Lane wrote:
> >
> > I wrote:
> > > A more useful test would be to directly experiment with contended
> > > spinlocks. As I recall, we had some test cases laying about when
> > > we were fooling with the spin delay stuff on Intel --- maybe
> > > resurrecting one of those would be useful?
> >
> > The last really significant performance testing we did in this area
> > seems to have been in this thread:
> >
> > https://www.postgresql.org/message-id/flat/CA%2BTgmoZvATZV%2BeLh3U35jaNnwwzLL5ewUU_-t0X%3DT0Qwas%2BZdA%40mail.gmail.com
> >
> > A relevant point from that is Haas' comment
> >
> > I think optimizing spinlocks for machines with only a few CPUs is
> > probably pointless. Based on what I've seen so far, spinlock
> > contention even at 16 CPUs is negligible pretty much no matter what
> > you do. Whether your implementation is fast or slow isn't going to
> > matter, because even an inefficient implementation will account for
> > only a negligible percentage of the total CPU time - much less than 1%
> > - as opposed to a 64-core machine, where it's not that hard to find
> > cases where spin-waits consume the *majority* of available CPU time
> > (recall previous discussion of lseek).
>
> Yeah, will check if I find some machines with large cores.

I got hold of a 32-CPU VM (actually it was 16-core, but being hyperthreaded, there were 32 CPUs). It was an Intel Xeon, 3GHz CPU, with 15G available memory. Hypervisor: KVM. Single NUMA node.
PG parameters changed : shared_buffers: 8G ; max_connections : 1000

I compared pgbench results on HEAD versus PAUSE removed like this :

perform_spin_delay(SpinDelayStatus *status)
{
-	/* CPU-specific delay each time through the loop */
-	SPIN_DELAY();

Ran with increasing number of parallel clients :
pgbench -S -c $num -j $num -T 60 -M prepared

But couldn't find any significant change in the TPS numbers with or without PAUSE :

Clients   HEAD      Without_PAUSE
8         26 247264
16        399939    399549
24        454189    453244
32        1097592   1098844
40        1090424   1087984
48        1068645   1075173
64        1035035   1039973
96        976578    970699

Maybe it will indeed show some difference only with around 64 cores, or perhaps a bare-metal machine will help; but as of now I didn't get such a machine. Anyway, I thought why not archive the results with whatever I have.

Not relevant to the PAUSE stuff: Note that when the parallel clients reach from 24 to 32 (which equals the machine CPUs), the TPS shoots from 454189 to 1097592, which is more than double the speed gain with just a 30% increase in parallel sessions. I was not expecting this much speed gain, because in the contended scenario the pgbench processes are already taking around 20% of the total CPU time of the pgbench run. Maybe later on I will get a chance to run with a customized pgbench script that runs a server function which keeps on running an index scan on pgbench_accounts, so as to make the pgbench clients almost idle.

Thanks
-Amit Khandekar
Re: spin_delay() for ARM
On Sat, 11 Apr 2020 at 04:18, Tom Lane wrote: > > I wrote: > > A more useful test would be to directly experiment with contended > > spinlocks. As I recall, we had some test cases laying about when > > we were fooling with the spin delay stuff on Intel --- maybe > > resurrecting one of those would be useful? > > The last really significant performance testing we did in this area > seems to have been in this thread: > > https://www.postgresql.org/message-id/flat/CA%2BTgmoZvATZV%2BeLh3U35jaNnwwzLL5ewUU_-t0X%3DT0Qwas%2BZdA%40mail.gmail.com > > A relevant point from that is Haas' comment > > I think optimizing spinlocks for machines with only a few CPUs is > probably pointless. Based on what I've seen so far, spinlock > contention even at 16 CPUs is negligible pretty much no matter what > you do. Whether your implementation is fast or slow isn't going to > matter, because even an inefficient implementation will account for > only a negligible percentage of the total CPU time - much less than 1% > - as opposed to a 64-core machine, where it's not that hard to find > cases where spin-waits consume the *majority* of available CPU time > (recall previous discussion of lseek). Yeah, will check if I find some machines with large cores. > So I wonder whether this patch is getting ahead of the game. It does > seem that ARM systems with a couple dozen cores exist, but are they > common enough to optimize for yet? Can we even find *one* to test on > and verify that this is a win and not a loss? (Also, seeing that > there are so many different ARM vendors, results from just one > chipset might not be too trustworthy ...) Ok. Yes, it would be worth waiting to see if there are others in the community with ARM systems that have implemented YIELD. May be after that we might gain some confidence. I myself also hope that I will get one soon to test, but right now I have one that does not support it, so it will be just a no-op. 
--
Thanks,
-Amit Khandekar
Huawei Technologies
Re: spin_delay() for ARM
On Sat, 11 Apr 2020 at 00:47, Andres Freund wrote:
>
> Hi,
>
> On 2020-04-10 13:09:13 +0530, Amit Khandekar wrote:
> > On my Intel Xeon machine with 8 cores, I tried to test PAUSE also
> > using a sample C program (attached spin.c). Here, many child processes
> > (much more than CPUs) wait in a tight loop for a shared variable to
> > become 0, while the parent process continuously increments a sequence
> > number for a fixed amount of time, after which, it sets the shared
> > variable to 0. The child's tight loop calls PAUSE in each iteration.
> > What I hoped was that because of PAUSE in children, the parent process
> > would get more share of the CPU, due to which, in a given time, the
> > sequence number will reach a higher value. Also, I expected the CPU
> > cycles spent by child processes to drop down, thanks to PAUSE. None of
> > these happened. There was no change.
>
> > Possibly, this testcase is not right. Probably the process preemption
> > occurs only within the set of hyperthreads attached to a single core.
> > And in my testcase, the parent process is the only one who is ready to
> > run. Still, I have anyway attached the program (spin.c) for archival;
> > in case somebody with a YIELD-supporting ARM machine wants to use it
> > to test YIELD.
>
> PAUSE doesn't operate on the level of the CPU scheduler. So the OS won't
> just schedule another process - you won't see different CPU usage if you
> measure it purely as the time running.

Yeah, I thought that the OS scheduling would be an *indirect* consequence of the PAUSE because of its slowing down the CPU, but it looks like that does not happen.

> You should be able to see a
> difference if you measure with a profiler that shows you data from the
> CPU's performance monitoring unit.

Hmm, I tried with perf and could see the pause itself consuming 5% CPU. But I haven't yet played with per-process figures.

--
Thanks,
-Amit Khandekar
Huawei Technologies
spin_delay() for ARM
Hi,

We use (an equivalent of) the PAUSE instruction in spin_delay() for Intel architectures. The goal is to slow down the spinlock tight loop and thus prevent it from eating CPU and causing CPU starvation, so that other processes get their fair share of the CPU time. Intel documentation [1] clearly mentions this, along with other benefits of PAUSE, like low power consumption and avoidance of memory order violations when exiting the loop.

Similar to PAUSE, the ARM architecture has the YIELD instruction, which is also clearly documented [2]. It explicitly says that it is a way to hint to the CPU that it is being called in a spinlock loop and that this process can be preempted. But for ARM, we are not using any kind of spin delay. For PG spinlocks, the goal of both of these instructions is the same, and both architectures recommend using them in spinlock loops. Also, I found multiple places where YIELD is already used in the same situations: the Linux kernel [3]; OpenJDK [4],[5].

On ARM implementations that don't implement YIELD, it runs as a no-op. Unfortunately, the ARM machine I have does not implement YIELD. But recently there have been some ARM implementations that are hyperthreaded, so they are expected to actually do the YIELD, although the docs do not explicitly say that YIELD has to be implemented only by hyperthreaded implementations.

I ran some pgbench tests to test PAUSE/YIELD on the respective architectures, once with the instruction present and once with it removed. I didn't see a change in the TPS numbers; they were more or less the same. For ARM, this was expected because my ARM machine does not implement it.

On my Intel Xeon machine with 8 cores, I tried to test PAUSE also using a sample C program (attached spin.c).
Here, many child processes (many more than CPUs) wait in a tight loop for a shared variable to become 0, while the parent process continuously increments a sequence number for a fixed amount of time, after which it sets the shared variable to 0. The child's tight loop calls PAUSE in each iteration. What I hoped was that because of the PAUSE in the children, the parent process would get a larger share of the CPU, due to which, in a given time, the sequence number would reach a higher value. Also, I expected the CPU cycles spent by the child processes to drop, thanks to PAUSE. Neither of these happened. There was no change.

Possibly this testcase is not right. Probably the process preemption occurs only within the set of hyperthreads attached to a single core, and in my testcase the parent process is the only one that is ready to run. Still, I have attached the program (spin.c) for archival, in case somebody with a YIELD-supporting ARM machine wants to use it to test YIELD.

Nevertheless, I think that because we have clear documentation that strongly recommends using it, and because it has been used in other use-cases such as the Linux kernel and the JDK, we should start using YIELD for spin_delay() on ARM. Attached is the trivial patch (spin_delay_for_arm.patch). To start with, it contains changes only for aarch64. I haven't yet added changes in configure[.in] for making sure yield compiles successfully (YIELD is present in manuals from ARMv6 onwards). Before that, I thought of getting some comments, so I didn't do the configure changes yet.
[1] https://c9x.me/x86/html/file_module_x86_id_232.html
[2] https://developer.arm.com/docs/100076/0100/instruction-set-reference/a64-general-instructions/yield
[3] https://elixir.bootlin.com/linux/latest/source/arch/arm64/include/asm/processor.h#L259
[4] http://cr.openjdk.java.net/~dchuyko/8186670/yield/spinwait.html
[5] http://mail.openjdk.java.net/pipermail/aarch64-port-dev/2017-August/004880.html

--
Thanks,
-Amit Khandekar
Huawei Technologies

/*
 * Sample program to test the effect of PAUSE/YIELD instruction in a highly
 * contended scenario. The Intel and ARM docs recommend the use of PAUSE and
 * YIELD respectively, in spinlock tight loops.
 *
 * This program can be run with :
 * gcc -O3 -o spin spin.c -lrt ; ./spin [number_of_processes]
 * By default, 4 processes are spawned.
 *
 * Child processes wait in a tight loop for a shared variable to become 0,
 * while the parent process continuously increments a sequence number for a
 * fixed amount of time, after which, it sets the shared variable to 0. The
 * child tight loop calls YIELD/PAUSE in each iteration.
 *
 * The intention is to create a number of processes much larger than the
 * available CPUs, so that the scheduler hopefully pre-empts the processes
 * because of the PAUSE, and the main process gets more CPU share because of
 * which it will increment its sequence number more number of times. So the
 * expectation is that with PAUSE, the program will end up with a much higher
 * sequence number than without PAUSE. Similarly, the child processes should
 * have lesser CPU cycles with PAUSE than without PAUSE.
 *
 * Author: Amit Khandekar
 */

#include
#include
#include
#include
Re: Minimal logical decoding on standbys
On Fri, 17 Jan 2020 at 13:20, Andreas Joseph Krogh wrote:
>
> På torsdag 16. januar 2020 kl. 05:42:24, skrev Amit Khandekar:
>
> On Fri, 10 Jan 2020 at 17:50, Rahila Syed wrote:
> >
> > Hi Amit,
> >
> > Can you please rebase the patches as they don't apply on latest master?
>
> Thanks for notifying. Attached is the rebased version.
>
> Will this patch enable logical replication from a standby-server?

Sorry for the late reply. This patch only supports logical *decoding* on a standby; it is just the infrastructure for supporting logical replication from a standby. We don't support creating a publication on a standby, but a publication created on the master is replicated to the standby, so we might be able to create subscription nodes that connect to those existing publications via the standby; basically, though, we haven't tested whether the publication/subscription model works with a publication on a physical standby. This patch is focussed on providing a way to continue logical replication *after* the standby is promoted to master.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Re: Minimal logical decoding on standbys
On Fri, 10 Jan 2020 at 17:50, Rahila Syed wrote: > > Hi Amit, > > Can you please rebase the patches as they don't apply on latest master? Thanks for notifying. Attached is the rebased version. logicaldecodng_standby_v5_rebased.tar.gz Description: GNU Zip compressed data
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Sun, 5 Jan 2020 at 00:21, Noah Misch wrote:
> The buildfarm client can capture stack traces, but it currently doesn't do so
> for TAP test suites (search the client code for get_stack_trace). If someone
> feels like writing a fix for that, it would be a nice improvement. Perhaps,
> rather than having the client code know all the locations where core files
> might appear, failed runs should walk the test directory tree for core files?

I think this might end up with the same directory-walking code spread across multiple files. Instead, I think that in the build script, in get_stack_trace(), we can do an equivalent of find -name "*core*", as against the current way in which it looks for core files only in the specific data directory. So get_stack_trace(bindir, datadir) would change to get_stack_trace(bindir, input_dir), where input_dir can be any directory that may contain multiple data directories. E.g. a recovery test can create multiple instances, so there would be multiple data directories inside the test directory.

Noah, is it possible to run a patched build script once I submit a patch, so that we can quickly get the stack trace? I mean, can we do this before getting the patch committed? I guess we can run the build script with a single branch specified, right?

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Fri, 3 Jan 2020 at 10:19, Amit Kapila wrote: > > On Fri, Jan 3, 2020 at 8:29 AM Amit Kapila wrote: >> >> On Thu, Jan 2, 2020 at 5:44 PM Amit Kapila wrote: >>> >>> On Tue, Dec 24, 2019 at 2:31 PM Amit Khandekar >>> wrote: >>> > >>> > >>> > Ok. I tested pg94_95_use_vfd_for_logrep.patch for 9.6 branch, and it >>> > works there. So please use this patch for all the three branches. >>> > >>> >>> Pushed! >> >> >> >> I see one failure in REL_10_STABLE [1] which seems to be due to this commit: >> > > I tried this test on my CentOs and Power8 machine more than 50 times, but > couldn't reproduce it. So, adding Noah to see if he can try this test [1] on > his machine (tern) and get stack track or some other information? > > [1] - make -C src/test/recovery/ check PROVE_TESTS=t/006_logical_decoding.pl I also tested multiple times using PG 10 branch; also tried to inject an error so that PG_CATCH related code also gets covered, but unfortunately didn't get the crash on my machine. I guess, we will have to somehow get the stacktrace. > > -- > With Regards, > Amit Kapila. > EnterpriseDB: http://www.enterprisedb.com -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: Minimal logical decoding on standbys
On Tue, 24 Dec 2019 at 14:02, Amit Khandekar wrote:
>
> On Thu, 19 Dec 2019 at 01:02, Rahila Syed wrote:
> >
> > Hi,
> >
> >> Hi, do you consistently get this failure on your machine ? I am not
> >> able to get this failure, but I am going to analyze when/how this can
> >> fail. Thanks
> >
> > Yes, I am getting it each time I run make -C src/test/recovery/ check
> > PROVE_TESTS=t/018_standby_logical_decoding_xmins.pl
> > Also, there aren't any errors in logs indicating the cause.
>
> Thanks for the reproduction. Finally I could reproduce the behaviour.
> It occurs once in 7-8 runs of the test on my machine. The issue is :
> on master, the catalog_xmin does not immediately get updated. It
> happens only after the hot standby feedback reaches the master. And I
> haven't used wait_for_xmins() for these failing cases. I should use
> that. Working on the same ...

As mentioned above, I have used wait_for_xmins() so that we can wait for the xmins to be updated after the hot standby feedback is processed. In one of the 3 scenarios where it failed for you, I removed the check at the second place because it was redundant. At the 3rd place, I made some appropriate changes with detailed comments; please check. Basically, we are checking that the master's physical-slot catalog_xmin has advanced, but not beyond the standby's logical-slot catalog_xmin. And to make sure the master's xmins are updated, I call txid_current() and then wait for the master's xmin to advance after hot standby feedback; this way I make sure the xmin/catalog_xmins are up-to-date because of hot standby feedback, so that we can check whether the master's physical slot catalog_xmin has reached the value of the standby's catalog_xmin but not gone past it.

I have also moved the "wal_receiver_status_interval = 1" setting from the master to the standby; it was wrongly kept on the master. This now reduces the test time by half on my machine.

Attached patch set v5 has only the test changes. Please check whether the test still fails for you.
> > -- > Thanks, > -Amit Khandekar > EnterpriseDB Corporation > The Postgres Database Company -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company logicaldecodng_standby_v5.tar.gz Description: GNU Zip compressed data
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Tue, 24 Dec 2019 at 10:41, Amit Kapila wrote: > > On Fri, Dec 20, 2019 at 9:31 AM Amit Khandekar wrote: > > Attached are the patches from master back up to 94 branch. > > > > PG 9.4 and 9.5 have a common patch to be applied : > > pg94_95_use_vfd_for_logrep.patch > > From PG 9.6 onwards, each version has a separate patch. > > > > For PG 9.6, there is no logical decoding perl test file. So I have > > made a new file 006_logical_decoding_spill.pl that has only the > > specific testcase. Also, for building the test_decoding.so, I had to > > add the EXTRA_INSTALL=contrib/test_decoding line in the > > src/test/recovery/Makefile, because this is the first time we are > > using the plugin in the 9.6 tap test. > > > > I am not sure if we need to go that far for 9.6 branch. If the other > tests for logical decoding are not present, then I don't see why we > need to create a new test file for this test only. Also, I think this > will make the patch the same for 9.4,9.5 and 9.6. Ok. I tested pg94_95_use_vfd_for_logrep.patch for 9.6 branch, and it works there. So please use this patch for all the three branches. > > > From PG 10 onwards, pgstat_report_*() calls around read() are removed > > in the patch, because FileRead() itself reports the wait events. > > > > Why there are different patches for 10 and 11? For PG10, OpenTransientFile() and PathNameOpenFile() each have an extra parameter for specifying file creation modes such as S_IRUSR or S_IWUSR. For 11, these functions don't accept the flags, rather the file is always opened with PG_FILE_MODE_OWNER. Because of these differences in the calls, the PG 10 patch does not apply on 11. > > We should try to minimize the difference between patches in different > branches. Ok. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: Minimal logical decoding on standbys
On Thu, 19 Dec 2019 at 01:02, Rahila Syed wrote: > > Hi, > >> Hi, do you consistently get this failure on your machine ? I am not >> able to get this failure, but I am going to analyze when/how this can >> fail. Thanks >> > Yes, I am getting it each time I run make -C src/test/recovery/ check > PROVE_TESTS=t/018_standby_logical_decoding_xmins.pl > Also, there aren't any errors in logs indicating the cause. Thanks for the reproduction. Finally I could reproduce the behaviour. It occurs once in 7-8 runs of the test on my machine. The issue is : on master, the catalog_xmin does not immediately get updated. It happens only after the hot standby feedback reaches on master. And I haven't used wait_for_xmins() for these failing cases. I should use that. Working on the same ... -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Thu, 19 Dec 2019 at 11:59, Amit Kapila wrote:
>
> On Wed, Dec 18, 2019 at 12:34 PM Amit Khandekar wrote:
> >
> > On Tue, 17 Dec 2019 at 17:40, Amit Khandekar wrote:
> > > By the way, the backport patch is turning out to be simpler. It's
> > > because in pre-12 versions, the file offset is part of the Vfd
> > > structure, so all the offset handling is not required.
> >
> > Please have a look at the attached backport patch for the PG 11 branch.
> > Once you are ok with the patch, I will port it to the other branches.
> > Note that in the patch, wherever applicable I have renamed the fd
> > variable to vfd to signify that it is a vfd, and not the kernel fd. If
> > we don't do the renaming, the patch would be still smaller, but I
> > think the renaming makes sense.
>
> The other usage of PathNameOpenFile in md.c is already using 'fd' as a
> variable name (also, if you see the example in fd.h, that also uses fd as
> the variable name), so I don't see any problem with using fd, especially if
> that leads to lesser changes.

Ok. I have retained the fd name.

> Apart from that, your patch LGTM.

Attached are the patches from master back up to the 9.4 branch.

PG 9.4 and 9.5 have a common patch to be applied :
pg94_95_use_vfd_for_logrep.patch

From PG 9.6 onwards, each version has a separate patch.

For PG 9.6, there is no logical decoding perl test file. So I have made a new file 006_logical_decoding_spill.pl that has only the specific testcase. Also, for building the test_decoding.so, I had to add the EXTRA_INSTALL=contrib/test_decoding line in src/test/recovery/Makefile, because this is the first time we are using the plugin in the 9.6 tap tests.

From PG 10 onwards, the pgstat_report_*() calls around read() are removed in the patch, because FileRead() itself reports the wait events.

From PG 12 onwards, the vfd offset handling had to be added, because the offset is not present in the Vfd structure. In master, logical_decoding_work_mem is used in the test file.
-- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company use_vfd_for_logrep_patches.tar.gz Description: GNU Zip compressed data
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Tue, 17 Dec 2019 at 17:40, Amit Khandekar wrote:
> By the way, the backport patch is turning out to be simpler. It's
> because in pre-12 versions, the file offset is part of the Vfd
> structure, so all the offset handling is not required.

Please have a look at the attached backport patch for the PG 11 branch. Once you are ok with the patch, I will port it to the other branches. Note that in the patch, wherever applicable, I have renamed the fd variable to vfd to signify that it is a vfd and not the kernel fd. If we don't do the renaming, the patch would be still smaller, but I think the renaming makes sense.

The recovery TAP tests don't seem to be there on the 9.4 and 9.5 branches, so I think it's ok to not have any tests with the patches on those branches.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

pg11_use_vfd_for_logrep.patch
Description: Binary data
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Mon, 16 Dec 2019 at 16:52, Amit Kapila wrote: > > On Mon, Dec 16, 2019 at 3:26 PM Amit Khandekar wrote: > > > > On Sat, 14 Dec 2019 at 11:59, Amit Kapila wrote: > > > > > > I have also made minor changes related to below code in patch: > > > - else if (readBytes != sizeof(ReorderBufferDiskChange)) > > > + > > > + file->curOffset += readBytes; > > > + > > > + if (readBytes != > > > sizeof(ReorderBufferDiskChange)) > > > > > > Why the size is added before the error check? > > The logic was : even though it's an error that the readBytes does not > > match the expected size, the file read is successful so update the vfd > > offset as early as possible. In our case, this might not matter much, > > but who knows, in the future, in the exception block (say, in > > ReorderBufferIterTXNFinish, someone assumes that the file offset is > > correct and does something with that, then we will get in trouble, > > although I agree that it's very unlikely. > > > > I am not sure if there is any such need, but even if it is there, I > think updating after a *short* read (read less than expected) doesn't > seem like a good idea because there is clearly some problem with the > read call. Also, in the case below that case where we read the actual > change data, the offset is updated after the check of *short* read. I > don't see any advantage in such an inconsistency. I still feel it is > better to update the offset after all error checks. Ok, no problem; I don't see any harm in doing the updates after the size checks. By the way, the backport patch is turning out to be simpler. It's because in pre-12 versions, the file offset is part of the Vfd structure, so all the offset handling is not required. > > > > > > and see if you can run perltidy for the test file. > > Hmm, I tried perltidy, and it seems to mostly add a space after ( and > > a space before ) if there's already; so "('postgres'," is replaced by > > "( 'postgres',". 
And this is going to be inconsistent with > > other places. And it replaces tab with spaces. Do you think we should > > try perltidy, or have we before been using this tool for the tap tests > > ? > > > > See text in src/test/perl/README (Note that all tests and test tools > should have perltidy run on them before patches are submitted, using > perltidy - profile=src/tools/pgindent/perltidyrc). It is recommended > to use perltidy. > > Now, if it is making the added code inconsistent with nearby code, > then I suggest to leave it. In many places, it is becoming inconsistent, but will see if there are some places where it does make sense and does not break consistency. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Sat, 14 Dec 2019 at 11:59, Amit Kapila wrote: > > On Thu, Dec 12, 2019 at 9:50 PM Amit Khandekar wrote: > > > > Attached is a v4 patch that also addresses your code comments so far. > > I have included the test case in 006_logical_decoding.pl. I observed > > that the test case just adds only about 0.5 to 1 sec time. Please > > verify on your env also, and also whether the test reproduces the > > issue without the code changes. > > > > It takes roughly the same time on my machine as well. I have checked > on Windows as well, it increases the time from 14 to 16 (17) seconds > for this test. I don't think this is any big increase considering the > timing of other tests and it would be good to have a test for such > boundary conditions. I have slightly change the comments in the patch > and ran pgindent. Attached, find the patch with a proposed commit > message. > > I have also made minor changes related to below code in patch: > - else if (readBytes != sizeof(ReorderBufferDiskChange)) > + > + file->curOffset += readBytes; > + > + if (readBytes != > sizeof(ReorderBufferDiskChange)) > > Why the size is added before the error check? The logic was : even though it's an error that the readBytes does not match the expected size, the file read is successful so update the vfd offset as early as possible. In our case, this might not matter much, but who knows, in the future, in the exception block (say, in ReorderBufferIterTXNFinish, someone assumes that the file offset is correct and does something with that, then we will get in trouble, although I agree that it's very unlikely. But IMO, because we want to simulate the file offset support in vfd, we should update the file offset immediately after a file read is known to have succeeded. > I think it should be > after that check, so changed accordingly. Similarly, I don't see why > we need to change 'else if' to 'if' in this code, so changed back. 
Since adding the size before the error check required removing the else-if, I removed the else-if at the surrounding places as well, for consistency. > > I think we need to change/tweak the test for back branches as there we > don't have logical_decoding_work_mem. Can you please look into that Yeah, I believe we need to backport up to PG 9.4 where logical decoding was introduced, so I am first trying out with the 9.4 branch. > and see if you can run perltidy for the test file. Hmm, I tried perltidy, and it seems to mostly add a space after ( and a space before ) if there's already one; so "('postgres'," is replaced by "( 'postgres',". And this is going to be inconsistent with other places. And it replaces tabs with spaces. Do you think we should try perltidy, or have we been using this tool for the TAP tests before? -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
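The ordering argument in the message above — advance the tracked offset as soon as the read itself succeeds, and only then validate the size of what was read — can be sketched in miniature. This is illustrative C, not the actual PostgreSQL code; the struct and function names here (SpillFile, spill_read, read_record) are made up for the example:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for a spill file plus a userspace-tracked offset,
 * the way the patch's TXNEntryFile tracks curOffset for a vfd. */
typedef struct SpillFile
{
    const char *data;       /* stand-in for on-disk contents */
    size_t      len;
    size_t      curOffset;  /* tracked offset, advanced after each read */
} SpillFile;

/* Read up to nbytes from the file; returns the number of bytes read. */
static size_t
spill_read(SpillFile *file, char *buf, size_t nbytes)
{
    size_t  avail = file->len - file->curOffset;
    size_t  readBytes = nbytes < avail ? nbytes : avail;

    memcpy(buf, file->data + file->curOffset, readBytes);

    /*
     * Advance the offset as soon as the read itself is known to have
     * succeeded.  Whether the caller later decides readBytes is the
     * "wrong" size is a separate question; the offset should reflect
     * what was actually consumed.
     */
    file->curOffset += readBytes;
    return readBytes;
}

/* Caller-side check, mirroring the short-read validation discussed above. */
static bool
read_record(SpillFile *file, char *buf, size_t expected)
{
    size_t  readBytes = spill_read(file, buf, expected);

    /* The offset is already up to date even if this check fails. */
    return readBytes == expected;
}
```

Even on the failing (short-read) path, the tracked offset matches what was actually consumed, which is the property the email argues an exception handler might one day rely on.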
Re: Minimal logical decoding on standbys
On Thu, 12 Dec 2019 at 15:28, Rahila Syed wrote: > > Hi Amit, > > > 2. Following test fails with error. > make -C src/test/recovery/ check > PROVE_TESTS=t/018_standby_logical_decoding_xmins.pl > # Failed test 'physical catalog_xmin not null' > # at t/018_standby_logical_decoding_xmins.pl line 120. > # got: '' > # expected: anything else > > # Failed test 'physical catalog_xmin not null' > # at t/018_standby_logical_decoding_xmins.pl line 141. > # got: '' > # expected: anything else > > # Failed test 'physical catalog_xmin not null' > # at t/018_standby_logical_decoding_xmins.pl line 159. > # got: '' > # expected: anything else > t/018_standby_logical_decoding_xmins.pl .. 20/27 # poll_query_until timed out > executing this query: > # > > Physical catalog_xmin is NULL on master after logical slot creation on > standby. Hi, do you consistently get this failure on your machine? I am not able to get this failure, but I am going to analyze when/how this can fail. Thanks -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Thu, 12 Dec 2019 at 14:18, Amit Kapila wrote: > > On Thu, Dec 12, 2019 at 11:53 AM Amit Khandekar > wrote: > > > > On Thu, 12 Dec 2019 at 11:34, Amit Khandekar wrote: > > > > > So max_changes_in_memory is the one > > > that allows us to reduce the number of transactions required, so we > > > can cut down on the outer loop iterations and make the test finish > > > much earlier. > > > > > > > > But also note that, we can't use the test suite in > > > contrib/test_decoding, because max_changes_in_memory needs server > > > restart. So we need to shift this test to src/test/recovery. And > > > there, I guess it is not that critical for the testcase to be very > > > quick because the tests in general are much slower than the ones in > > > contrib/test_decoding, although it would be nice to make it fast. What > > > I propose is to modify max_changes_in_memory, do a server restart > > > (which takes hardly a sec), run the testcase (3.5 sec) and then > > > restart after resetting the guc. So totally it will be around 4-5 > > > seconds. > > > > Sorry I meant max_files_per_process. > > > > Okay, how much time do other individual tests take in that directory on your > machine? For the src/test/recovery directory, a test takes about 4-5 seconds on average. > How about providing a separate test patch for this case so > that I can also test it? Attached is a v4 patch that also addresses your code comments so far. I have included the test case in 006_logical_decoding.pl. I observed that the test case adds only about 0.5 to 1 second. Please verify on your env too, and also check whether the test reproduces the issue without the code changes. > I think we can take the opinion of others as > well if they are fine with adding this test, otherwise, we can go > ahead with the main patch. Sure. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company use_vfd_for_logrep_v4.patch Description: Binary data
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Thu, 12 Dec 2019 at 11:34, Amit Khandekar wrote: > So max_changes_in_memory is the one > that allows us to reduce the number of transactions required, so we > can cut down on the outer loop iterations and make the test finish > much earlier. > > But also note that, we can't use the test suite in > contrib/test_decoding, because max_changes_in_memory needs server > restart. So we need to shift this test to src/test/recovery. And > there, I guess it is not that critical for the testcase to be very > quick because the tests in general are much slower than the ones in > contrib/test_decoding, although it would be nice to make it fast. What > I propose is to modify max_changes_in_memory, do a server restart > (which takes hardly a sec), run the testcase (3.5 sec) and then > restart after resetting the guc. So totally it will be around 4-5 > seconds. Sorry I meant max_files_per_process. We need to reduce max_files_per_process, so that it causes max_safe_fds to be reduced, and so only a few transactions are sufficient to reproduce the problem, because the reserveAllocatedDesc() will return false much sooner due to low max_safe_fds. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
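The mechanism exploited in the message above — lowering max_files_per_process so that max_safe_fds shrinks and the descriptor reservation fails after only a handful of spill files — can be illustrated with a toy version of the reservation check. The constants and the exact rule here are simplified stand-ins; PostgreSQL's real logic in fd.c differs in detail:

```c
#include <assert.h>
#include <stdbool.h>

/* Pretend max_files_per_process was lowered, shrinking max_safe_fds. */
static int max_safe_fds = 26;
static int numAllocatedDescs = 0;

/* Roughly the decision reserveAllocatedDesc() makes: is there room for
 * one more allocated descriptor, keeping some headroom for normal
 * backend use?  (The real headroom rule in fd.c is different.) */
static bool
reserve_allocated_desc(void)
{
    if (numAllocatedDescs >= max_safe_fds - 10)
        return false;           /* budget exhausted: open would fail */
    numAllocatedDescs++;
    return true;
}
```

With a budget this small, only a few open spill files are needed before the reservation starts failing, which is why lowering the GUC lets the test reproduce the "exceeded maxAllocatedDescs" symptom with ~20 sub-transactions instead of 600.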
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Thu, 12 Dec 2019 at 09:49, Amit Kapila wrote: > > On Wed, Dec 11, 2019 at 4:17 PM Amit Khandekar wrote: > > > > On Sat, 7 Dec 2019 at 11:37, Amit Kapila wrote: > > > > > > On Fri, Dec 6, 2019 at 5:00 PM Amit Khandekar > > > wrote: > > > > > > > > On Fri, 6 Dec 2019 at 15:40, Amit Kapila > > > > wrote: > > > > > > > > > > 1. > > > > > + /* Now that the state fields are initialized, it is safe to return > > > > > it. */ > > > > > + *iter_state = state; > > > > > + > > > > > /* allocate heap */ > > > > > state->heap = > > > > > binaryheap_allocate(state->nr_txns, > > > > > ReorderBufferIterCompare, > > > > > > > > > > Is there a reason for not initializing iter_state after > > > > > binaryheap_allocate? If we do so, then we don't need the additional check > > > > > you have added in ReorderBufferIterTXNFinish. > > > > > > > > If iter_state is initialized *after* binaryheap_allocate, then we > > > > won't be able to close the vfds if binaryheap_allocate() ereports(). > > > > > > > > > > Is it possible to have vfds opened before binaryheap_allocate(), if so > > > how? > > No, it does not look possible for the vfds to be opened before > > binaryheap_allocate(). But actually, the idea behind placing the > > iter_state at the place where I put it is that we should return the > > iter_state at the *earliest* place in the code where it is safe to > > return. > > > > Sure, I get that point, but it seems it is equally good to do this > after binaryheap_allocate(). It will be slightly better because if > there is any error in binaryheap_allocate, then we don't need to even > call ReorderBufferIterTXNFinish(). All right. Will do that. > > > > > >> 4. I think we should also check how much time increase will happen for > > >> test_decoding regression test after the test added by this patch? > > Amit Khandekar wrote: > > > Yeah, it currently takes noticeably longer compared to the others. 
> > > Let's see if setting logical_decoding_work_mem to a min value allows > > > us to reproduce the test with much lesser number of inserts. > > > > The test in the patch takes around 20 seconds, as compared to the max > > time of 2 seconds any of the other tests take in that test suite. > > > > But if we set the max_files_per_process to a very low value (say 26) > > as against the default 1000, we can reproduce the issue with as low as > > 20 sub-transactions as against the 600 that I used in spill.sql test. > > And with this, the test runs in around 4 seconds, so this is good. > > > > Do you get 4s even after setting the minimum value of > logical_decoding_work_mem? I think 4s is also too much for this test > which is going to test one extreme scenario. Can you try with some > bigger rows or something else to reduce this time? I think if we can > get it close to 2s or whatever maximum time taken by any other logical > decoding tests, then good, otherwise, it doesn't seem like a good idea > to add this test. > > > But > > the problem is : max_files_per_process needs server restart. So either > > we have to shift this test to src/test/recovery in one of the > > logical_decoding test, or retain it in contrib/test_decoding and let > > it run for 20 seconds. Let me know if you figure out any other > > approach. > > > > I don't think 20s for one test is acceptable. So, we should just give > up that path, you can try what I suggested above, if we can reduce the > test time, then good, otherwise, I suggest to drop the test from the > patch. Having said that, we should try our best to reduce the test > time as it will be good if we can have such a test. I tried; it is actually roughly 3.4 seconds. Note that reducing logical_decoding_work_mem does not reduce the test time. It causes the serialization to start early, and so increases the chance of reproducing the problem. During restore of the serialized data, we still use max_changes_in_memory. 
So max_changes_in_memory is the one that allows us to reduce the number of transactions required, so we can cut down on the outer loop iterations and make the test finish much earlier. But also note that, we can't use the test suite in contrib/test_decoding, because max_changes_in_memory needs server restart. So we need to shift this test to src/test/recovery. And there, I guess it is not that critical for the testcase to be very quick because the tests in general are much slower than the ones in contrib/test_decoding, although it would be nice to make it fast. What I propose is to modify max_changes_in_memory, do a server restart (which takes hardly a sec), run the testcase (3.5 sec) and then restart after resetting the guc. So totally it will be around 4-5 seconds. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Sat, 7 Dec 2019 at 11:37, Amit Kapila wrote: > > On Fri, Dec 6, 2019 at 5:00 PM Amit Khandekar wrote: > > > > On Fri, 6 Dec 2019 at 15:40, Amit Kapila wrote: > > > > > > > > > Few comments: > > > -- > > > > > > 1. > > > + /* Now that the state fields are initialized, it is safe to return it. > > > */ > > > + *iter_state = state; > > > + > > > /* allocate heap */ > > > state->heap = > > > binaryheap_allocate(state->nr_txns, > > > ReorderBufferIterCompare, > > > > > > Is there a reason for not initializing iter_state after > > > binaryheap_allocate? If we do so, then we don't need additional check > > > you have added in ReorderBufferIterTXNFinish. > > > > If iter_state is initialized *after* binaryheap_allocate, then we > > won't be able to close the vfds if binaryheap_allocate() ereports(). > > > > Is it possible to have vfds opened before binaryheap_allocate(), if so how? No, it does not look possible for the vfds to be opened before binaryheap_allocate(). But actually, the idea behind placing the iter_state at the place where I put it is that we should return the iter_state at the *earliest* place in the code where it is safe to return. > > I couldn't reproduce the original problem (on HEAD) reported with the > > test case in the patch. So, I can't verify the fix. I think it is > > because of recent commits cec2edfa7859279f36d2374770ca920c59c73dd8 and > > 9290ad198b15d6b986b855d2a58d087a54777e87. It seems you need to either > > change the value of logical_decoding_work_mem or change the test in > > some way so that the original problem can be reproduced. Amit Khandekar wrote: > Yeah, it does seem like the commit cec2edfa78592 must have caused the > test to not reproduce on your env, although the test does fail for me > still. Setting logical_decoding_work_mem to a low value does sound > like a good idea. Will work on it. I checked that setting logical_decoding_work_mem to its min value (64KB) causes early serialization. 
So I think if you set this, you should be able to reproduce with the spill.sql test that has the new testcase. I will anyway set this value in the test. Also check below ... >> 4. I think we should also check how much time increase will happen for >> test_decoding regression test after the test added by this patch? Amit Khandekar wrote: > Yeah, it currently takes noticeably longer compared to the others. > Let's see if setting logical_decoding_work_mem to a min value allows > us to reproduce the test with much lesser number of inserts. The test in the patch takes around 20 seconds, as compared to the max time of 2 seconds any of the other tests take in that test suite. But if we set the max_files_per_process to a very low value (say 26) as against the default 1000, we can reproduce the issue with as low as 20 sub-transactions as against the 600 that I used in spill.sql test. And with this, the test runs in around 4 seconds, so this is good. But the problem is : max_files_per_process needs server restart. So either we have to shift this test to src/test/recovery in one of the logical_decoding test, or retain it in contrib/test_decoding and let it run for 20 seconds. Let me know if you figure out any other approach. > > > > > > > 5. One naive question about the usage of PathNameOpenFile(). When it > > > reaches the max limit, it will automatically close one of the files, > > > but how will that be reflected in the data structure (TXNEntryFile) > > > you are managing. Basically, when PathNameOpenFile closes some file, > > > how will the corresponding vfd in TXNEntryFile be changed. Because if > > > it is not changed, then won't it start pointing to some wrong > > > filehandle? > > > > In PathNameOpenFile(), excess kernel fds could be closed > > (ReleaseLruFiles). But with that, the vfds themselves don't get > > invalidated. Only the underlying kernel fd gets closed, and the > > vfd->fd is marked VFD_CLOSED. 
The vfd array element remains valid (a > > non-null vfd->fileName means the vfd slot is valid; check > > FileIsValid). So later, when FileRead(vfd1) is called and that vfd1 > > happens to be the one that had got its kernel fd closed, it gets > > opened again through FileAccess(). > > > > I was under the impression that once the fd is closed due to excess kernel > fds that are opened, the slot in the VfdCache array could be reused by > someone else, but on closer inspection that is not true. It will be > only available for reuse after we explicitly call FileClose, right? Yes, that's right. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Fri, 6 Dec 2019 at 15:40, Amit Kapila wrote: > > On Thu, Dec 5, 2019 at 4:20 PM Amit Kapila wrote: > > > > On Tue, Dec 3, 2019 at 11:10 AM Amit Khandekar > > wrote: > > > > > > > > > Done as stated above; attached v3 patch. I have verified that the file > > > handles do get closed in PG_CATCH block via > > > ReorderBufferIterTXNFinish(). > > > > > > > I couldn't reproduce the original problem (on HEAD) reported with the > > test case in the patch. So, I can't verify the fix. I think it is > > because of recent commits cec2edfa7859279f36d2374770ca920c59c73dd8 and > > 9290ad198b15d6b986b855d2a58d087a54777e87. It seems you need to either > > change the value of logical_decoding_work_mem or change the test in > > some way so that the original problem can be reproduced. Yeah, it does seem like the commit cec2edfa78592 must have caused the test to not reproduce on your env, although the test does fail for me still. Setting logical_decoding_work_mem to a low value does sound like a good idea. Will work on it. > > > > Few comments: > -- > > 1. > + /* Now that the state fields are initialized, it is safe to return it. */ > + *iter_state = state; > + > /* allocate heap */ > state->heap = > binaryheap_allocate(state->nr_txns, > ReorderBufferIterCompare, > > Is there a reason for not initializing iter_state after > binaryheap_allocate? If we do so, then we don't need additional check > you have added in ReorderBufferIterTXNFinish. If iter_state is initialized *after* binaryheap_allocate, then we won't be able to close the vfds if binaryheap_allocate() ereports(). > > 2. > /* No harm in resetting the offset even in case of failure */ > file->curOffset = 0; > > The above comment is not clear because you are not setting it in case > of error rather this is a success path. I meant, even if PathNameOpenFile() failed, it is ok to set file->curOffset to 0, so need not set it only in case of *fd >= 0. In most of the cases, fd would be valid, so just set file->curOffset to 0 always. 
> > 3. > + * > + * Note: The iterator state is returned through iter_state parameter rather > + * than the function's return value. This is because the state gets cleaned > up > + * in a PG_CATCH block, so we want to make sure the caller gets back the > state > + * even if this function throws an exception, so that the state resources can > + * be cleaned up. > > How about changing it slightly as follows to make it more clear. > "Note: The iterator state is returned through iter_state parameter > rather than the function's return value. This is because the state > gets cleaned up in a PG_CATCH block in the caller, so we want to make > sure the caller gets back the state even if this function throws an > exception." Agreed. Will do that in the next patch version. > > 4. I think we should also check how much time increase will happen for > test_decoding regression test after the test added by this patch? Yeah, it currently takes noticeably longer compared to the others. Let's see if setting logical_decoding_work_mem to a min value allows us to reproduce the test with much lesser number of inserts. > > 5. One naive question about the usage of PathNameOpenFile(). When it > reaches the max limit, it will automatically close one of the files, > but how will that be reflected in the data structure (TXNEntryFile) > you are managing. Basically, when PathNameOpenFile closes some file, > how will the corresponding vfd in TXNEntryFile be changed. Because if > it is not changed, then won't it start pointing to some wrong > filehandle? In PathNameOpenFile(), excess kernel fds could be closed (ReleaseLruFiles). But with that, the vfds themselves don't get invalidated. Only the underlying kernel fd gets closed, and the vfd->fd is marked VFD_CLOSED. The vfd array element remains valid (a non-null vfd->fileName means the vfd slot is valid; check FileIsValid). 
So later, when FileRead(vfd1) is called and that vfd1 happens to be the one that had got its kernel fd closed, it gets opened again through FileAccess(). -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
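The vfd lifecycle described in the exchange above — a slot stays valid as long as its fileName is set, even after the LRU mechanism closes its kernel fd, and access reopens it on demand — can be modeled in a few lines. This is a toy model with invented names mirroring fd.c's FileIsValid / ReleaseLruFiles / FileAccess roles, not the real implementation:

```c
#include <assert.h>
#include <stddef.h>

#define VFD_CLOSED (-1)

/* Toy vfd: validity depends on fileName, not on the kernel fd. */
typedef struct Vfd
{
    int         fd;         /* kernel fd, or VFD_CLOSED */
    const char *fileName;   /* non-NULL means the slot is in use */
} Vfd;

static int next_kernel_fd = 200;    /* fake kernel fd allocator */

/* Like FileIsValid: the slot is valid while fileName is set. */
static int
file_is_valid(const Vfd *v)
{
    return v->fileName != NULL;
}

/* Like ReleaseLruFiles closing an excess kernel fd: the vfd survives,
 * only the underlying kernel fd is released. */
static void
lru_close(Vfd *v)
{
    v->fd = VFD_CLOSED;
}

/* Like FileAccess: reopen the kernel fd on demand before any I/O. */
static int
file_access(Vfd *v)
{
    if (v->fd == VFD_CLOSED)
        v->fd = next_kernel_fd++;   /* pretend open(v->fileName) */
    return v->fd;
}
```

The point the emails make falls out of the model: a caller holding a vfd never sees the LRU close, because the next FileRead-style operation transparently reopens the kernel fd; only an explicit FileClose frees the slot for reuse.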
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Wed, 27 Nov 2019 at 14:16, Amit Khandekar wrote: > What I found was: We do attempt to close the opened vfds in the > PG_CATCH block. In ReorderBufferCommit(), ReorderBufferIterTXNFinish > is called both in PG_TRY and PG_CATCH. This closes all the opened > vfds. But the issue is: if the ereport() occurs inside > ReorderBufferIterTXNInit(), then iterstate is still NULL. So in > PG_CATCH, ReorderBufferIterTXNFinish() is not called, so the vfds in > state->entries[] remain open. > > We can have iterstate passed to ReorderBufferIterTXNInit() as another > argument, and initialize it first thing inside the function. This way, > it will never be NULL. But need to be careful about the possibility of > having an iterstate in a half-cooked state, so cleanup might use some > uninitialized handles. Will work on it. At least, we can make sure the > iterstate->entries handle doesn't have junk values. Done as stated above; attached v3 patch. I have verified that the file handles do get closed in the PG_CATCH block via ReorderBufferIterTXNFinish(). -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company use_vfd_for_logrep_v3.patch Description: Binary data
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Tue, 26 Nov 2019 at 12:10, Amit Kapila wrote: > > On Tue, Nov 26, 2019 at 11:19 AM Amit Khandekar > wrote: > > > > On Tue, 26 Nov 2019 at 10:49, Amit Kapila wrote: > > > > > > > > > So, what is the next step here? How about if we somehow check whether > > > the file exists before doing unlink, say by using stat? > > But the thing is, the behaviour is so much in a grey area, that we > > cannot reliably say for instance that when stat() says there is no > > such file, there is indeed no such file, > > > > Why so? This was just an example. What I meant was, we are really not sure of the behaviour of file operations when the file is in this state. unlink() returning "Permission denied" when called twice is itself weird enough. > > > and if we re-create the same > > file when it is still open, it is always going to open a new file, > > etc. > > > > Yeah, or maybe even if we don't create with the same name, there is > always be some dangling file which again doesn't sound like a good > thing. Right. > > > > If that doesn't work, I think we might need to go in the direction of > > > tracking > > > file handles in some way, so that they can be closed during an abort. > > Yeah, that is one way. I am still working on different approaches. > > WIll get back with proposals. > > > > Fair enough. See, if you can also consider an approach that is local > to ReorderBuffer module wherein we can track those handles in > ReorderBufferTxn or some other place local to that module. What I found was : We do attempt to close the opened vfds in the PG_CATCH block. In ReorderBufferCommit(), ReorderBufferIterTXNFinish is called both in PG_TRY and PG_CATCH. This closes all the opened vfds. But the issue is : if the ereport() occurs inside ReorderBufferIterTXNInit(), then iterstate is still NULL. So in PG_CATCH, ReorderBufferIterTXNFinish() is not called, so the vfds in state->entries[] remain open. 
We can have iterstate passed to ReorderBufferIterTXNInit() as another argument, and initialize it first thing inside the function. This way, it will never be NULL. But need to be careful about the possibility of having an iterstate in a half-cooked state, so cleanup might use some uninitialized handles. Will work on it. At least, we can make sure the iterstate->entries handle doesn't have junk values. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
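The fix proposed above — hand the state back through an out-parameter *before* any step that can throw, so the catch block always has something to clean up — can be sketched outside PostgreSQL using setjmp/longjmp in place of PG_TRY/PG_CATCH/ereport. All names here (IterState, iter_init, iter_finish) are illustrative stand-ins, not the real ReorderBuffer API:

```c
#include <assert.h>
#include <setjmp.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct IterState
{
    int *entries;       /* resource that must be released on error */
} IterState;

static jmp_buf catch_buf;   /* stands in for the PG_CATCH target */

static void
iter_init(IterState **iter_state, bool fail_later)
{
    IterState *state = calloc(1, sizeof(IterState));

    state->entries = calloc(4, sizeof(int));

    /* Now that the fields are initialized, it is safe to return it:
     * even if a later step throws, the caller holds the state. */
    *iter_state = state;

    if (fail_later)
        longjmp(catch_buf, 1);  /* emulate ereport() after this point */
}

/* Stand-in for ReorderBufferIterTXNFinish: release the resources. */
static void
iter_finish(IterState *state)
{
    free(state->entries);
    free(state);
}
```

Had the assignment come after the failing step instead, the "catch" side would see a NULL pointer and the entries allocation would leak, which is exactly the open-vfds symptom the email describes.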
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Tue, 26 Nov 2019 at 10:49, Amit Kapila wrote: > > On Fri, Nov 22, 2019 at 7:38 PM Amit Khandekar wrote: > > > > On Fri, 22 Nov 2019 at 4:26 PM, Amit Kapila wrote: > >> > >> I think this is exactly the reason for the problem. In my test [1], > >> the error "permission denied" occurred when I executed > >> pg_logical_slot_get_changes() the second time, which means on first execution the > >> unlink would have been successful but the files are still not removed > >> as they were not closed. Then on second execution, it gets an error > >> "Permission denied" when it again tries to unlink files via > >> ReorderBufferCleanupSerializedTXNs(). > >> > >> > >> . > >> > But what you are seeing is "Permission denied" errors. Not sure why > >> > unlink() is failing. > >> > > >> > >> In your test program, if you try to unlink the file a second time, you > >> should see the error "Permission denied". > > > > I tested using the sample program and indeed I got the error 5 (access > > denied) when I called unlink the second time. > > > > So, what is the next step here? How about if we somehow check whether > the file exists before doing unlink, say by using stat? But the thing is, the behaviour is so much in a grey area that we cannot reliably say, for instance, that when stat() says there is no such file, there is indeed no such file, and if we re-create the same file when it is still open, it is always going to open a new file, etc. > If that doesn't work, I think we might need to go in the direction of tracking > file handles in some way, so that they can be closed during an abort. Yeah, that is one way. I am still working on different approaches. Will get back with proposals. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Fri, 22 Nov 2019 at 4:26 PM, Amit Kapila wrote: > On Fri, Nov 22, 2019 at 11:00 AM Amit Khandekar > wrote: > > > > On Fri, 22 Nov 2019 at 09:08, Amit Kapila > wrote: > > > Have you tried before that fix , if not, can you once try by > > > temporarily reverting that fix in your environment and share the > > > output of each step? After you get the error due to EOF, check that > > > you have .spill files in pg_replslot// and then again try > > > to get changes by pg_logical_slot_get_changes(). If you want, you > > > can use the test provided in Amit Khandekar's patch. > > > > On my Linux machine, I added elog() in ReorderBufferRestoreChanges(), > > just after FileRead() returns 0. This results in error. But the thing > is, in > > ReorderBufferCommit(), the error is already handled using PG_CATCH : > > > > PG_CATCH(); > > { > > . > >AbortCurrentTransaction(); > > ... > >if (using_subtxn) > > RollbackAndReleaseCurrentSubTransaction(); > > > > > >/* remove potential on-disk data, and deallocate */ > >ReorderBufferCleanupTXN(rb, txn); > > } > > > > So ReorderBufferCleanupTXN() removes all the .spill files using unlink(). > > > > And on Windows, what should happen is : unlink() should succeed > > because the file is opened using FILE_SHARE_DELETE. But the files > > should still remain there because these are still open. It is just > > marked for deletion until there is no one having opened the file. That > > is what is my conclusion from running a sample attached program test.c > > > > I think this is exactly the reason for the problem. In my test [1], > the error "permission denied" occurred when I second time executed > pg_logical_slot_get_changes() which means on first execution the > unlink would have been successful but the files are still not removed > as they were not closed. Then on second execution, it gets an error > "Permission denied" when it again tries to unlink files via > ReorderBufferCleanupSerializedTXNs(). > > > . 
> > But what you are seeing is "Permission denied" errors. Not sure why > > unlink() is failing. > > > > In your test program, if you try to unlink the file second time, you > should see the error "Permission denied". I tested using the sample program and indeed I got the error 5 (access denied) when I called unlink the second time. > > > [1] - > https://www.postgresql.org/message-id/CAA4eK1%2Bcey6i6a0zD9kk_eaDXb4RPNZqu4UwXO9LbHAgMpMBkg%40mail.gmail.com > > > > -- > With Regards, > Amit Kapila. > EnterpriseDB: http://www.enterprisedb.com > -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Fri, 22 Nov 2019 at 09:08, Amit Kapila wrote: > Have you tried before that fix; if not, can you once try by > temporarily reverting that fix in your environment and share the > output of each step? After you get the error due to EOF, check that > you have .spill files in pg_replslot// and then again try > to get changes by pg_logical_slot_get_changes(). If you want, you > can use the test provided in Amit Khandekar's patch. On my Linux machine, I added elog() in ReorderBufferRestoreChanges(), just after FileRead() returns 0. This results in an error. But the thing is, in ReorderBufferCommit(), the error is already handled using PG_CATCH:

PG_CATCH();
{
	...
	AbortCurrentTransaction();
	...
	if (using_subtxn)
		RollbackAndReleaseCurrentSubTransaction();
	...
	/* remove potential on-disk data, and deallocate */
	ReorderBufferCleanupTXN(rb, txn);
}

So ReorderBufferCleanupTXN() removes all the .spill files using unlink(). And on Windows, what should happen is: unlink() should succeed because the file is opened using FILE_SHARE_DELETE. But the files should still remain there because they are still open. The file is just marked for deletion until no one has it open. That is my conclusion from running the attached sample program, test.c. But what you are seeing is "Permission denied" errors. Not sure why unlink() is failing. The thing that is still a problem is: On Windows, if the file remains open, even when the unlink() later succeeds, the file will be left there until it is closed. So subsequent operations will open the same old file. Not sure what happens if we open a file that is marked for deletion. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company

#include <windows.h>
#include <io.h>
#include <stdio.h>
#include <stdlib.h>

void pgwin32_open(const char *fileName, int fileFlags,...)
{
	HANDLE h;
	SECURITY_ATTRIBUTES sa;

	sa.nLength = sizeof(sa);
	sa.bInheritHandle = TRUE;
	sa.lpSecurityDescriptor = NULL;

	h = CreateFile(fileName, GENERIC_READ,
				   /* These flags allow concurrent rename/unlink */
				   (FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE),
				   &sa, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
	if (h == INVALID_HANDLE_VALUE)
	{
		printf("File could not be opened. Error %d\n", GetLastError());
		exit(1);
	}

	/* CloseHandle(h); */

	if (_unlink(fileName) != 0)
	{
		printf("File could not be deleted. Error %d\n", GetLastError());
		exit(1);
	}
}

int main(void)
{
	pgwin32_open("c:/temp/amit/file", 0 /* O_RDONLY | PG_BINARY */);
	return 0;
}
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Wed, 20 Nov 2019 at 19:24, Alvaro Herrera wrote: > > On 2019-Nov-20, Juan José Santamaría Flecha wrote: > > > I was not able to reproduce the Permission denied error with current HEAD, > > until I opened another CMD inside the "pg_replslot/regression_slot" folder. > > This will be problematic, is the deletion of the folder actually needed? > > Yes :-( The code assumes that if the directory is there, then it's > valid. Trying to remove that assumption is probably a more invasive > fix. > > I think ReplicationSlotDropAcquired is too pessimistic (no recourse if > the rename fails) and too optimistic (this will almost never happen). > We could change it so that the rename is retried a few times, and avoid > the failure. (Naturally, the rmtree should also be retried.) The code > seems written with the POSIX semantics in mind, but it seems easy to > improve. Just to be clear, there are two issues being discussed here : 1. Issue with the patch, where pg_replslot/slotname/xid-*.spill files can't be removed because the same backend process has left these files opened because of an abort. This is happening despite the file being opened using FILE_SHARE_DELETE flag. I am going to investigate (possibly the flag is not applicable in case a single process is involved) 2. This existing issue where pg_replslot/slotname directory removal will fail if someone else is accessing this directory. This has nothing to do with the patch. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Wed, 20 Nov 2019 at 13:10, Amit Kapila wrote: > See comment in pgunlink() "We need to loop because even though > PostgreSQL uses flags that allow unlink while the file is open, other > applications might have the file > open without those flags.". Can you once see if there is any flag > that you have missed to pass to allow this? > If there is nothing we > can do about it, then we might need to use some different API or maybe > define a new API that can handle this. There were objections against modifying the vfd API only for this replication-related use case. A new API would require all the changes needed to get the virtual-FD behaviour that we want from vfd. If nothing works out from the FILE_SHARE_DELETE thing, I am thinking we can use vfds, keep track of per-subtransaction vfd handles, and do something similar to AtEOSubXact_Files(). -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
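The per-subtransaction tracking floated in the message above — remember which vfds each subtransaction opened, so that ending (or aborting) that subtransaction closes exactly those, the way AtEOSubXact_Files() cleans up allocated descriptors — can be sketched with a simple per-level table. The names and the fixed-size arrays here are hypothetical, chosen only to show the bookkeeping shape:

```c
#include <assert.h>

#define MAX_SUBXACT_LEVELS 8
#define MAX_TRACKED_FDS    16

/* vfds opened at each subtransaction nesting level */
static int tracked[MAX_SUBXACT_LEVELS][MAX_TRACKED_FDS];
static int ntracked[MAX_SUBXACT_LEVELS];

/* Record that this subtransaction level opened a vfd. */
static void
track_open(int subxact_level, int vfd)
{
    tracked[subxact_level][ntracked[subxact_level]++] = vfd;
}

/* On subtransaction end (commit or abort), close what it opened;
 * returns how many vfds were closed.  A real version would call
 * FileClose() on each tracked vfd here. */
static int
at_eosubxact_close(int subxact_level)
{
    int closed = ntracked[subxact_level];

    ntracked[subxact_level] = 0;
    return closed;
}
```

The payoff, in the context of this thread, is that an aborted subtransaction could no longer leave .spill file handles open: the abort path would walk its own level's list and close them, independent of what outer levels hold.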
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Wed, 20 Nov 2019 at 13:10, Amit Kapila wrote: > > On Tue, Nov 19, 2019 at 4:58 PM Amit Khandekar wrote: > > > > On Tue, 19 Nov 2019 at 14:07, Amit Kapila wrote: > > > > > > > > > Have you tried by injecting some error? After getting the error > > > mentioned above in email, when I retried the same query, I got the > > > below message. > > > > > > postgres=# SELECT 1 from > > > pg_logical_slot_get_changes('regression_slot', NULL,NULL) LIMIT 1; > > > ERROR: could not remove file > > > "pg_replslot/regression_slot/xid-1693-lsn-0-1800.spill" during > > > removal of pg_replslot/regression_slot/xid*: Permission denied > > > > > > And, then I tried to drop the replication slot and I got below error. > > > postgres=# SELECT * FROM pg_drop_replication_slot('regression_slot'); > > > ERROR: could not rename file "pg_replslot/regression_slot" to > > > "pg_replslot/regression_slot.tmp": Permission denied > > > > > > It might be something related to Windows > > > > Oh ok, I missed the fact that on Windows we can't delete the files > > that are already open, unlike Linux/Unix. > > > > See comment in pgunlink() "We need to loop because even though > PostgreSQL uses flags that allow unlink while the file is open, other > applications might have the file > open without those flags.". Can you once see if there is any flag > that you have missed to pass to allow this? If there is nothing we > can do about it, then we might need to use some different API or maybe > define a new API that can handle this. Hmm, looks like there is one such flag: FILE_SHARE_DELETE. When file is opened with this flag, other processes can delete as well as rename the file. But it turns out that in pgwin32_open(), we already use FILE_SHARE_DELETE. So, this is again confusing. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Wed, 20 Nov 2019 at 01:05, Thomas Munro wrote: > > On Wed, Nov 20, 2019 at 7:58 AM Thomas Munro wrote: > > On Wed, Nov 20, 2019 at 1:14 AM Juan José Santamaría Flecha > > > https://devblogs.microsoft.com/oldnewthing/20150121-00/?p=44863 > > > > !?! Thanks Juan and Thomas for pointing to these links where this was already discussed. > > One thing I don't understand (besides, apparently, the documentation): > how did this problem escape detection by check-world for such a long > time? Surely we expect to hit the end of various temporary files in > various tests. Is it intermittent, or dependent on Windows version, > or something like that? Possibly there aren't any callers who try to pread() at end-of-file using FileRead/pg_pread : - mdread() seems to read from an offset that it knows is within the end of the file, including the whole BLCKSZ. - BufFileLoadBuffer() seems to deliberately ignore FileRead()'s return value if it is -1: if (file->nbytes < 0) file->nbytes = 0; - XLogPageRead() also seems to know that the offset is a valid offset. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Tue, 19 Nov 2019 at 14:07, Amit Kapila wrote: > > On Mon, Nov 18, 2019 at 5:50 PM Amit Khandekar wrote: > > > > On Mon, 18 Nov 2019 at 17:20, Amit Kapila wrote: > > > I see that you have made changes in ReorderBufferRestoreChanges to use > > > PathNameOpenFile, but not in ReorderBufferSerializeTXN. Is there a > > > reason for the same? In my test environment, with the test provided > > > by you, I got the error (reported in this thread) via > > > ReorderBufferSerializeTXN. > > > > You didn't get this error with the patch applied, did you ? > > > > No, I got this before applying the patch. However, after applying the > patch, I got below error in the same test: > > postgres=# SELECT 1 from > pg_logical_slot_get_changes('regression_slot', NULL,NULL) LIMIT 1; > ERROR: could not read from reorderbuffer spill file: Invalid argument > > It seems to me that FileRead API used in the patch can return value < > 0 on EOF. See the API usage in BufFileLoadBuffer. I got this error > on a windows machine and in the server log the message was "LOG: > unrecognized win32 error code: 38" which indicates "Reached the end of > the file." On Windows, it is documented that ReadFile() (which is called by pg_pread) will return false on EOF but only when the file is open for asynchronous reads/writes. But here we are just dealing with usual synchronous reads. So pg_pread() code should indeed return 0 on EOF on Windows. Not yet able to figure out how FileRead() managed to return this error on Windows. But from your symptoms, it does look like pg_pread()=>ReadFile() returned false (despite doing synchronous reads), and so _dosmaperr() gets called, and then it does not find the eof error in doserrors[], so the "unrecognized win32 error code" message is printed. May have to dig up more on this. 
> > > If you were debugging this without the patch applied, I suspect that > > the reason why ReorderBufferSerializeTXN() => OpenTransientFile() is > > generating this error is because the max limit must be already crossed > > because of earlier calls to ReorderBufferRestoreChanges(). > > > > Note that in ReorderBufferSerializeTXN(), OpenTransientFile() is > > sufficient because the code in that function has made sure the fd gets > > closed there itself. > > > > Okay, then we might not need it there, but we should at least add a > comment in ReorderBufferRestoreChanges to explain why we have used a > different function to operate on the file at that place. Yeah, that might make sense. > > > > > For the API's that use VFDs (like PathNameOpenFile), the files opened > > are always recorded in the VfdCache array. So it is not required to do > > the cleanup at (sub)transaction end, because the kernel fds get closed > > dynamically in ReleaseLruFiles() whenever they reach max_safe_fds > > limit. So if a transaction aborts, the fds might remain open, but > > those will get cleaned up whenever we require more fds, through > > ReleaseLruFiles(). Whereas, for files opened through > > OpenTransientFile(), VfdCache is not involved, so this needs > > transaction end cleanup. > > > > Have you tried by injecting some error? After getting the error > mentioned above in email, when I retried the same query, I got the > below message. > > postgres=# SELECT 1 from > pg_logical_slot_get_changes('regression_slot', NULL,NULL) LIMIT 1; > ERROR: could not remove file > "pg_replslot/regression_slot/xid-1693-lsn-0-1800.spill" during > removal of pg_replslot/regression_slot/xid*: Permission denied > > And, then I tried to drop the replication slot and I got below error. 
> postgres=# SELECT * FROM pg_drop_replication_slot('regression_slot'); > ERROR: could not rename file "pg_replslot/regression_slot" to > "pg_replslot/regression_slot.tmp": Permission denied > > It might be something related to Windows Oh ok, I missed the fact that on Windows we can't delete the files that are already open, unlike Linux/Unix. I guess, I may have to use FD_CLOSE_AT_EOXACT flags; or simply use OpenTemporaryFile(). I wonder though if this same issue might come up for the other use-case of PathNameOpenFile() : logical_rewrite_log_mapping(). > but you can once try by > injecting some error after reading a few files in the code path and > see the behavior. Yeah, will check the behaviour, although on Linux, I think I won't get this error. But yes, like I mentioned above, I think we might have to arrange for something. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Mon, 18 Nov 2019 at 17:52, Amit Kapila wrote: > I have one more question regarding this patch. It seems to me that > the files opened via OpenTransientFile or OpenTemporaryFile are > automatically closed at transaction end(abort), but that doesn't seem > to be the case for files opened with PathNameOpenFile. See > AtEOXact_Files and AtEOSubXact_Files. So, now with the change > proposed by this patch, don't we need to deal it in some other way? For the API's that use VFDs (like PathNameOpenFile), the files opened are always recorded in the VfdCache array. So it is not required to do the cleanup at (sub)transaction end, because the kernel fds get closed dynamically in ReleaseLruFiles() whenever they reach max_safe_fds limit. So if a transaction aborts, the fds might remain open, but those will get cleaned up whenever we require more fds, through ReleaseLruFiles(). Whereas, for files opened through OpenTransientFile(), VfdCache is not involved, so this needs transaction end cleanup. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Mon, 18 Nov 2019 at 17:20, Amit Kapila wrote: > I see that you have made changes in ReorderBufferRestoreChanges to use > PathNameOpenFile, but not in ReorderBufferSerializeTXN. Is there a > reason for the same? In my test environment, with the test provided > by you, I got the error (reported in this thread) via > ReorderBufferSerializeTXN. You didn't get this error with the patch applied, did you ? If you were debugging this without the patch applied, I suspect that the reason why ReorderBufferSerializeTXN() => OpenTransientFile() is generating this error is because the max limit must be already crossed because of earlier calls to ReorderBufferRestoreChanges(). Note that in ReorderBufferSerializeTXN(), OpenTransientFile() is sufficient because the code in that function has made sure the fd gets closed there itself. If you are getting this error even with the patch applied, then this needs investigation. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: Minimal logical decoding on standbys
On Thu, 7 Nov 2019 at 14:02, Rahila Syed wrote: > > Hi Amit, > > I am reading about this feature and reviewing it. > To start with, I reviewed the patch: > 0005-Doc-changes-describing-details-about-logical-decodin.patch. Thanks for picking up the patch review. Your reply somehow spawned a new mail thread, so I reverted back to this thread for replying. > > >prevent VACUUM from removing required rows from the system catalogs, > >hot_standby_feedback should be set on the standby. In spite of that, > >if any required rows get removed on standby, the slot gets dropped. > IIUC, you mean `if any required rows get removed on *the master* the slot gets > dropped`, right? Yes, you are right. In fact, I think it is not necessary to explicitly mention where the rows get removed. So I have just omitted "on standby". Will include this change in the next patch versions. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: Minimal logical decoding on standbys
On Thu, 10 Oct 2019 at 05:49, Craig Ringer wrote: > > On Tue, 1 Oct 2019 at 02:08, Robert Haas wrote: >> >> >> Why does create_logical_slot_on_standby include sleep(1)? > > > Yeah, we really need to avoid sleeps in regression tests. Yeah, have already got rid of the sleeps from the patch-series version 4 onwards. By the way, a couple of patches in the patch series had bitrotted. Attached is the rebased version. Thanks -Amit Khandekar logicaldecodng_standby_v4_rebased.tar.gz Description: application/gzip
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Wed, 18 Sep 2019 at 12:24, Amit Khandekar wrote: > Probably, for now at least, what everyone seems to agree is to take my > earlier attached patch forward. > > I am going to see if I can add a TAP test for the patch, and will add > the patch into the commitfest soon. Attached is an updated patch v2. Has a new test scenario added in contrib/test_decoding/sql/spill test, and some minor code cleanup. Going to add this into Nov commitfest. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company diff --git a/contrib/test_decoding/expected/spill.out b/contrib/test_decoding/expected/spill.out index 10734bd..e1d150b6 100644 --- a/contrib/test_decoding/expected/spill.out +++ b/contrib/test_decoding/expected/spill.out @@ -247,6 +247,25 @@ GROUP BY 1 ORDER BY 1; 'serialize-nested-subbig-subbigabort-subbig-3 | 5000 | table public.spill_test: INSERT: data[text]:'serialize-nested-subbig-subbigabort-subbig-3:5001' | table public.spill_test: INSERT: data[text]:'serialize-nested-subbig-subbigabort-subbig-3:1' (2 rows) +-- large number of spilling subxacts. Should not exceed maxAllocatedDescs +BEGIN; +DO $$ +BEGIN +FOR i IN 1..600 LOOP +BEGIN +INSERT INTO spill_test select generate_series(1, 5000) ; +EXCEPTION +when division_by_zero then perform 'dummy'; +END; +END LOOP; +END $$; +COMMIT; +SELECT 1 from pg_logical_slot_get_changes('regression_slot', NULL,NULL) LIMIT 1; + ?column? +-- +1 +(1 row) + DROP TABLE spill_test; SELECT pg_drop_replication_slot('regression_slot'); pg_drop_replication_slot diff --git a/contrib/test_decoding/sql/spill.sql b/contrib/test_decoding/sql/spill.sql index e638cac..715d1f9 100644 --- a/contrib/test_decoding/sql/spill.sql +++ b/contrib/test_decoding/sql/spill.sql @@ -174,6 +174,21 @@ SELECT (regexp_split_to_array(data, ':'))[4] COLLATE "C", COUNT(*), (array_agg(d FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL) WHERE data ~ 'INSERT' GROUP BY 1 ORDER BY 1; +-- large number of spilling subxacts. 
Should not exceed maxAllocatedDescs +BEGIN; +DO $$ +BEGIN +FOR i IN 1..600 LOOP +BEGIN +INSERT INTO spill_test select generate_series(1, 5000) ; +EXCEPTION +when division_by_zero then perform 'dummy'; +END; +END LOOP; +END $$; +COMMIT; +SELECT 1 from pg_logical_slot_get_changes('regression_slot', NULL,NULL) LIMIT 1; + DROP TABLE spill_test; SELECT pg_drop_replication_slot('regression_slot'); diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c index 8ce28ad..1e5a725 100644 --- a/src/backend/replication/logical/reorderbuffer.c +++ b/src/backend/replication/logical/reorderbuffer.c @@ -103,13 +103,21 @@ typedef struct ReorderBufferTupleCidEnt CommandId combocid; /* just for debugging */ } ReorderBufferTupleCidEnt; +/* Virtual file descriptor with file offset tracking */ +typedef struct TXNEntryFile +{ + File vfd; /* -1 when the file is closed */ + off_t curOffset; /* offset for next write or read. Reset to 0 + * when vfd is opened. */ +} TXNEntryFile; + /* k-way in-order change iteration support structures */ typedef struct ReorderBufferIterTXNEntry { XLogRecPtr lsn; ReorderBufferChange *change; ReorderBufferTXN *txn; - int fd; + TXNEntryFile file; XLogSegNo segno; } ReorderBufferIterTXNEntry; @@ -194,7 +202,7 @@ static void ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn); static void ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn, int fd, ReorderBufferChange *change); static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn, - int *fd, XLogSegNo *segno); + TXNEntryFile *file, XLogSegNo *segno); static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn, char *change); static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn); @@ -988,7 +996,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn) for (off = 0; off < state->nr_txns; off++) { - state->entries[off].fd = -1; + 
state->entries[off].file.vfd = -1; state->entries[off].segno = 0; } @@ -1013,7 +1021,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn) { /* serialize remaining changes */ ReorderBufferSerializeTXN(rb, txn); - ReorderBufferRestoreChanges(rb, txn, &state->entries[off].fd, + ReorderBufferRestoreChanges(rb, txn, &state->entries[off].file, &state->entries[off].segno); } @@ -1043,7 +1051,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn) /* serialize remaining changes */ ReorderBufferSerializeTXN(rb, cur_txn); ReorderBufferRestoreChanges(rb, cur_txn, - &state->entries[off].fd, + &state->entries[off].file, &state->entr
Re: Minimal logical decoding on standbys
On Fri, 27 Sep 2019 at 23:21, Robert Haas wrote: > > On Fri, Sep 27, 2019 at 12:41 PM Amit Khandekar > wrote: > > Preferably I want wait_for_xmins() to get rid of the $node parameter, > > because we can deduce it using slot name. But that requires having > > get_node_from_slotname(). Your suggestion was to remove > > get_node_from_slotname() and add back the $node param so as to reduce > > duplicate code. Instead, how about keeping wait_for_xmins() in the > > PostgresNode.pm() ? This way, we won't have duplication, and also we > > can get rid of param $node. This is just my preference; if you are > > quite inclined to not have get_node_from_slotname(), I will go with > > your suggestion. > > I'd be inclined not to have it. I think having a lookup function to > go from slot name -> node is strange; it doesn't really simplify > things that much for the caller, and it makes the logic harder to > follow. It would break outright if you had the same slot name on > multiple nodes, which is a perfectly reasonable scenario. Alright. Attached is the updated patch that splits the file into two files, one that does only xmin related testing, and the other test file that tests conflict recovery scenarios, and also one scenario where drop-database drops the slots on the database on standby. Removed get_slot_xmins() and get_node_from_slotname(). Renamed 'replica' to 'standby'. Used node->backup() function instead of pg_basebackup command. Renamed $master_slot to $master_slotname, similarly for $standby_slot. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company logicaldecodng_standby_v3.tar.gz Description: GNU Zip compressed data
Re: Minimal logical decoding on standbys
On Fri, 27 Sep 2019 at 01:57, Robert Haas wrote: > > On Thu, Sep 26, 2019 at 5:14 AM Amit Khandekar wrote: > > Actually in some of the conflict-recovery testcases, I am still using > > wait_for_xmins() so that we could test the xmin values back after we > > drop the slots. So xmin-related testing is embedded in these recovery > > tests as well. We can move the wait_for_xmins() function to some > > common file and then do the split of this file, but then effectively > > some of the xmin-testing would go into the recovery-related test file, > > which did not sound sensible to me. What do you say ? > > I agree we don't want code duplication, but I think we could reduce > the code duplication to a pretty small amount with a few cleanups. > > I don't think wait_for_xmins() looks very well-designed. It goes to > trouble of returning a value, but only 2 of the 6 call sites pay > attention to the returned value. I think we should change the > function so that it doesn't return anything and have the callers that > want a return value call get_slot_xmins() after wait_for_xmins(). Yeah, that can be done. > > And then I think we should turn around and get rid of get_slot_xmins() > altogether. Instead of: > > my ($xmin, $catalog_xmin) = get_slot_xmins($master_slot); > is($xmin, '', "xmin null"); > is($catalog_xmin, '', "catalog_xmin null"); > > We can write: > > my $slot = $node_master->slot($master_slot); > is($slot->{'xmin'}, '', "xmin null"); > is($slot->{'catalog_xmin'}, '', "catalog xmin null"); > > ...which is not really any longer or harder to read, but does > eliminate the need for one function definition. Agreed. > > Then I think we should change wait_for_xmins so that it takes three > arguments rather than two: $node, $slotname, $check_expr. With that > and the previous change, we can get rid of get_node_from_slotname(). 
> > At that point, the body of wait_for_xmins() would consist of a single > call to $node->poll_query_until() or die(), which doesn't seem like > too much code to duplicate into a new file. Earlier it used to have 3 params, the same ones you mentioned. I removed $node for caller convenience. > > Looking at it at a bit more, though, I wonder why the recovery > conflict scenario is even using wait_for_xmins(). It's hard-coded to > check the state of the master_physical slot, which isn't otherwise > manipulated by the recovery conflict tests. What's the point of > testing that a slot which had xmin and catalog_xmin NULL before the > test started (line 414) and which we haven't changed since still has > those values at two different points during the test (lines 432, 452)? > Perhaps I'm missing something here, but it seems like this is just an > inadvertent entangling of these scenarios with the previous scenarios, > rather than anything that necessarily needs to be connected together. In the "Drop slot" test scenario, we are testing that after we manually drop the slot on standby, the master catalog_xmin should be back to NULL. Hence, the call to wait_for_xmins(). And in the "Scenario 1 : hot_standby_feedback off", wait_for_xmins() is called the first time only as a mechanism to ensure that "hot_standby_feedback = off" has taken effect. At the end of this test, wait_for_xmins() again is called only to ensure that hot_standby_feedback = on has taken effect. Preferably I want wait_for_xmins() to get rid of the $node parameter, because we can deduce it using slot name. But that requires having get_node_from_slotname(). Your suggestion was to remove get_node_from_slotname() and add back the $node param so as to reduce duplicate code. Instead, how about keeping wait_for_xmins() in the PostgresNode.pm() ? This way, we won't have duplication, and also we can get rid of param $node. 
This is just my preference; if you are quite inclined to not have get_node_from_slotname(), I will go with your suggestion. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: Minimal logical decoding on standbys
do the split of this file, but then effectively some of the xmin-testing would go into the recovery-related test file, which did not sound sensible to me. What do you say ? Attached patch series has the test changes addressed. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company logicaldecodng_standby_v2.tar.gz Description: GNU Zip compressed data
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Tue, 17 Sep 2019 at 21:19, Andres Freund wrote: > On 2019-09-14 14:34:21 -0400, Tom Lane wrote: > > Amit Khandekar writes: > > > Yeah, something like the attached patch. I think this tracking of > > > offsets would have been cleaner if we add in-built support in VFD. But > > > yeah, for back branches at least, we need to handle it outside of VFD. > > > Or maybe we would add it if we find one more use-case. > > > > Again, we had that and removed it, for what seem to me to be solid > > reasons. It adds cycles when we're forced to close/reopen a file, > > and it also adds failure modes that we could do without (ie, failure > > of either the ftell or the lseek, which are particularly nasty because > > they shouldn't happen according to the VFD abstraction). Ok. So you mean, when the caller would call FileRead() for sequential reading, underneath VFD would do a pread(), but if pread() returns an error, the errno can belong to read() or it might as well belong to lseek(). If it's due to lseek(), it's not expected by the caller because for the caller it's just a sequential read. Yeah, makes sense. >> I do not > > think there is going to be any argument strong enough to make us > > put it back, especially not for non-mainstream callers like logical > > decoding. Ok. Also, more about putting back is in the below comments ... > > Yea, I think that's the right call. Avoiding kernel seeks is quite > worthwhile, and we shouldn't undo it just because of this usecase. And > that'll become more and more important performance-wise (and has already > done so, with all the intel fixes making syscalls much slower). By the way, I was not thinking about adding back the read() and lseek() calls. I was saying we continue to use the pread() call, so it's just a single system call. FileReadAt(..., offset) would do pread() with a user-supplied offset, and FileRead() would do pread() using an internally tracked offset. 
So for the user, FileReadAt() is like pread(), and FileRead() would be like read(). But I agree with Tom's objection about having to unnecessarily handle lseek error codes. > > I could see an argument for adding a separate generic layer providing > position tracking ontop of the VFD abstraction however. Seems quite > possible that there's some other parts of the system that could benefit > from using VFDs rather than plain fds. And they'd probably also need the > positional tracking. Yeah, that also could be done. Probably, for now at least, what everyone seems to agree is to take my earlier attached patch forward. I am going to see if I can add a TAP test for the patch, and will add the patch into the commitfest soon. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Fri, 13 Sep 2019 at 22:01, Robert Haas wrote: > On Fri, Sep 13, 2019 at 12:14 PM Tom Lane wrote: > > Again, though, the advice that's been given here is that we should > > fix logical decoding to use the VFD API as it stands, not change > > that API. I concur with that. > > A reasonable position. So I guess logical decoding has to track the > file position itself, but perhaps use the VFD layer for managing FD > pooling. Yeah, something like the attached patch. I think this tracking of offsets would have been cleaner if we add in-built support in VFD. But yeah, for back branches at least, we need to handle it outside of VFD. Or maybe we would add it if we find one more use-case. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Thu, 12 Sep 2019 at 19:11, Robert Haas wrote: > > On Thu, Sep 12, 2019 at 5:31 AM Tomas Vondra > wrote: > > I don't see how the current API could do that transparently - it does > > track the files, but the user only gets a file descriptor. With just a > > file descriptor, how could the code know to do reopen/seek when it's going > > just through the regular fopen/fclose? > > > > Anyway, I agree we need to do something, to fix this corner case (many > > serialized in-progress transactions). ISTM we have two options - either do > > something in the context of reorderbuffer.c, or extend the transient file > > API somehow. I'd say the second option is the right thing going forward, > > because it does allow doing it transparently and without leaking details > > about maxAllocatedDescs etc. There are two issues, though - it does > > require changes / extensions to the API, and it's not backpatchable. > > > > So maybe we should start with the localized fix in reorderbuffer, and I > > agree tracking offset seems reasonable. > > We've already got code that knows how to track this sort of thing. You mean tracking excess kernel fds right ? Yeah, we can use VFDs so that excess fds are automatically closed. But Alvaro seems to be talking in context of tracking of file seek position. VFD does not have a mechanism to track file offsets if one of the vfd cached file is closed and reopened. Robert, are you suggesting to add this capability to VFD ? I agree that we could do it, but for back-patching, offhand I couldn't think of a simpler way. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: Minimal logical decoding on standbys
On Mon, 9 Sep 2019 at 16:06, Amit Khandekar wrote: > > On Tue, 3 Sep 2019 at 23:10, Alvaro Herrera wrote: > > > > On 2019-Jul-19, Amit Khandekar wrote: > > > > > Attached are the split patches. Included is an additional patch that > > > has doc changes. Here is what I have added in the docs. Pasting it > > > here so that all can easily spot how it is supposed to behave, and to > > > confirm that we are all on the same page : > > > > ... Apparently, this patch was not added to the commitfest for some > > reason; and another patch that *is* in the commitfest has been said to > > depend on this one (Petr's https://commitfest.postgresql.org/24/1961/ > > which hasn't been updated in quite a while and has received no feedback > > at all, not even from the listed reviewer Shaun Thomas). To make > > matters worse, Amit's patchset no longer applies. > > > > What I would like to do is add a link to this thread to CF's 1961 entry > > and then update all these patches, in order to get things moving. > > Hi Alvaro, > > Thanks for notifying about this. Will work this week on rebasing this > patchset and putting it into the 2019-11 commit fest. Rebased patch set attached. Added in the Nov commitfest : https://commitfest.postgresql.org/25/2283/ -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company logicaldecodng_standby_v1_rebased.tar.gz Description: GNU Zip compressed data
logical decoding : exceeded maxAllocatedDescs for .spill files
Hi, I reproduced the error "exceeded maxAllocatedDescs (492) while trying to open file ...", which was also discussed about in the thread [1]. This issue is similar but not exactly the same as [1]. In [1], the file for which this error used to show up was "pg_logical/mappings/map" , while here it's the .spill file. And here the issue , in short, seems to be : The .spill file does not get closed there and then, unlike in [1] where there was a file descriptor leak. I could reproduce it using a transaction containing a long series of sub-transactions (possibly could be reproduced without sub-transactions, but looking at the code I could come up with this script using sub-transactions easily) : create table tab(id int); -- Function that does huge changes in a single transaction create or replace function f(id int) returns int as $$ begin -- Iterate this more than 492 times (max transient file descriptors PG would typically allow) -- This will create that many sub-transactions due to presence of exception block. FOR i IN 1..600 LOOP BEGIN -- Iterate more than 4096 times (so that changes spill to disk: max_changes_in_memory) FOR j IN 1..5000 LOOP insert into tab values (1); END LOOP; EXCEPTION when division_by_zero then perform 'dummy'; END; END LOOP; return id; end $$ language plpgsql; SELECT * FROM pg_create_logical_replication_slot('logical', 'test_decoding'); begin; select f(1); -- Do huge changes in a single transaction commit; \! 
pg_recvlogical -d postgres --slot=logical --verbose --start -f - pg_recvlogical: starting log streaming at 0/0 (slot logical) pg_recvlogical: streaming initiated pg_recvlogical: confirming write up to 0/0, flush to 0/0 (slot logical) BEGIN 1869 pg_recvlogical: confirming write up to 0/1B6D6E38, flush to 0/1B6D6E38 (slot logical) pg_recvlogical: error: unexpected termination of replication stream: ERROR: exceeded maxAllocatedDescs (492) while trying to open file "pg_replslot/logical/xid-2362-lsn-0-2400.spill" pg_recvlogical: disconnected; waiting 5 seconds to try again Looking at the code, what might be happening is, ReorderBufferIterTXNInit()=>ReorderBufferRestoreChanges() opens the files, but leaves them open if end of file is not reached. Eventually if end of file is reached, it gets closed. The function returns back without closing the file descriptor if reorder buffer changes being restored are more than max_changes_in_memory. Probably later on, the rest of the changes get restored in another ReorderBufferRestoreChanges() call. But meanwhile, if there are a lot of such files opened, we can run out of the max files that PG decides to keep open (it has some logic that takes into account system files allowed to be open, and already opened). Offhand, what I am thinking is, we need to close the file descriptor before returning from ReorderBufferRestoreChanges(), and keep track of the file offset and file path, so that next time we can resume reading from there. Comments ? [1] https://www.postgresql.org/message-id/flat/738a590a-2ce5-9394-2bef-7b1caad89b37%402ndquadrant.com -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: Minimal logical decoding on standbys
On Tue, 3 Sep 2019 at 23:10, Alvaro Herrera wrote: > > On 2019-Jul-19, Amit Khandekar wrote: > > > Attached are the split patches. Included is an additional patch that > > has doc changes. Here is what I have added in the docs. Pasting it > > here so that all can easily spot how it is supposed to behave, and to > > confirm that we are all on the same page : > > ... Apparently, this patch was not added to the commitfest for some > reason; and another patch that *is* in the commitfest has been said to > depend on this one (Petr's https://commitfest.postgresql.org/24/1961/ > which hasn't been updated in quite a while and has received no feedback > at all, not even from the listed reviewer Shaun Thomas). To make > matters worse, Amit's patchset no longer applies. > > What I would like to do is add a link to this thread to CF's 1961 entry > and then update all these patches, in order to get things moving. Hi Alvaro, Thanks for notifying about this. Will work this week on rebasing this patchset and putting it into the 2019-11 commit fest.
Re: POC: Cleaning up orphaned files using undo logs
ssCompleted(uur->uur_txn->urec_progress))
{
    if (dbid_exists(uur->uur_txn->urec_dbid))
    {
        (void) RegisterRollbackReq(InvalidUndoRecPtr, undo_recptr,
                                   uur->uur_txn->urec_dbid, uur->uur_fxid);
        pending_abort = true;
    }
}
-
+UndoRecordRelease(uur);
+uur = NULL;
+}
.
.
+Assert(uur == NULL);
+
+/* If we reach here, this means there is something to discard. */
+need_discard = true;
+} while (true);

It looks like it is neither necessary to set uur to NULL, nor necessary to have the Assert(uur == NULL): at the start of each iteration, uur is assigned a fresh value anyway, which may or may not be NULL.

-
+ * over undo logs is complete, new undo can is allowed to be written in the

new undo can is allowed => new undo is allowed

+ * hash table size. So before start allowing any new transaction to write the

before start allowing => before allowing any new transactions to start writing the

-
+/* Get the smallest of 'xid having pending undo' and 'oldestXmin' */
+oldestXidHavingUndo = RollbackHTGetOldestFullXid(oldestXidHavingUndo);
+
+
+if (FullTransactionIdIsValid(oldestXidHavingUndo))
+pg_atomic_write_u64(>oldestFullXidHavingUnappliedUndo,
+U64FromFullTransactionId(oldestXidHavingUndo));

Is it possible that the FullTransactionId returned by RollbackHTGetOldestFullXid() could be invalid ? If not, then the if condition above can be changed to an Assert().

-
+ * If the log is already discarded, then we are done. It is important
+ * to first check this to ensure that tablespace containing this log
+ * doesn't get dropped concurrently.
+ */
+LWLockAcquire(>mutex, LW_SHARED);
+/*
+ * We don't have to worry about slot recycling and check the logno
+ * here, since we don't care about the identity of this slot, we're
+ * visiting all of them.

I guess it's accidental that the LWLockAcquire() call is *between* the two comments ?
---
+if (UndoRecPtrGetCategory(undo_recptr) == UNDO_SHARED)
+{
+/*
+ * For the "shared" category, we only discard when the
+ * rm_undo_status callback tells us we can.
+ */
+status = RmgrTable[uur->uur_rmid].rm_undo_status(uur, _xid);

The status variable could be declared in this block itself.

-
Some variable declaration alignments and comment spacing need changes as per pgindent.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Re: POC: Cleaning up orphaned files using undo logs
On Tue, 23 Jul 2019 at 08:48, Amit Kapila wrote: > > On Mon, Jul 22, 2019 at 8:39 PM Amit Khandekar wrote: > > > > On Mon, 22 Jul 2019 at 14:21, Amit Kapila wrote: > > > > - > > > > +UndoWorkerGetSlotInfo(int slot, UndoRequestInfo *urinfo) > > +{ > > + /* Block concurrent access. */ > > + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE); > > + > > + MyUndoWorker = >workers[slot]; > > Not sure why MyUndoWorker is used here. Can't we use a local variable > > ? Or do we intentionally attach to the slot as a side-operation ? > > > > - > > > > I think here, we can use a local variable as well. Do you see any > problem with the current code or do you think it is better to use a > local variable here? I think, even though there might not be a correctness issue with the current code as it stands, we should still use a local variable. Updating MyUndoWorker is a big side-effect, which the caller is not supposed to be aware of, because all that function should do is just get the slot info. > > > -- > > > > + if (!InsertRequestIntoErrorUndoQueue(urinfo)) > > I was thinking what happens if for some reason > > InsertRequestIntoErrorUndoQueue() itself errors out. In that case, the > > entry will not be marked invalid, and so there will be no undo action > > carried out because I think the undo worker will exit. What happens > > next with this entry ? > > The same entry is present in two queues xid and size, so next time it > will be executed from the second queue based on it's priority in that > queue. However, if it fails again a second time in the same way, then > we will be in trouble because now the hash table has entry, but none > of the queues has entry, so none of the workers will attempt to > execute again. Also, when discard worker again tries to register it, > we won't allow adding the entry to queue thinking either some backend > is executing the same or it must be part of some queue. 
> > The one possibility to deal with this could be that we somehow allow > discard worker to register it again in the queue or we can do this in > critical section so that it allows system restart on error. However, > the main thing is it possible that InsertRequestIntoErrorUndoQueue > will fail unless there is some bug in the code? If so, we might want > to have an Assert for this rather than handling this condition.

Yes, I also think that the function would error out only because of can't-happen cases, like "too many locks taken" or "out of binary heap slots" or "out of memory" (although this last one is not really a can't-happen case). These cases would probably be caused by some bug, I suppose. But I was wondering: generally, when the code errors out with such can't-happen elog() calls, the worst thing that happens is that the transaction gets aborted. Whereas in this case, the worst thing that could happen is that the undo action would never get executed, which means selects for this tuple will keep on accessing the undo log. That does not sound like a data consistency issue, so we should be fine after all ?

Some further review comments for undoworker.c :

+/* Sets the worker's lingering status. */
+static void
+UndoWorkerIsLingering(bool sleep)

The function name sounds like "is the worker lingering ?". Can we rename it to something like "UndoWorkerSetLingering" ?

-
+ errmsg("undo worker slot %d is empty, cannot attach",
+ slot)));
+ }
+
+ if (MyUndoWorker->proc)
+ {
+ LWLockRelease(UndoWorkerLock);
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("undo worker slot %d is already used by "
+ "another worker, cannot attach", slot)));

These two error messages can share a common error message "could not attach to worker slot", with a separate errdetail for each of them :
slot %d is empty.
slot %d is already used by another worker.
--
+static int
+IsUndoWorkerAvailable(void)
+{
+ int i;
+ int alive_workers = 0;
+
+ LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE);

The return value should be bool. Also, why does it keep track of the number of alive workers ? It sounds like it used to return the number of alive workers earlier. If it indeed only needs to return true/false, we can do away with alive_workers. Also, the *exclusive* lock is unnecessary.

--
+if (UndoGetWork(false, false, , NULL) &&
+IsUndoWorkerAvailable())
+UndoWorkerLaunch(urinfo);

There is no lock acquired between IsUndoWorkerAvailable() and UndoWorkerLaunch(); that means even though IsUndoWorkerAvailable() returns true, there is a small window where UndoWorkerLaunch() does not find any worker slot with in_use false
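To illustrate the race pointed out above: the availability check and the slot claim need to happen inside a single lock acquisition, otherwise another launcher can grab the slot in the window between them. Below is a minimal standalone sketch of that pattern; all names (claim_worker_slot, slot_in_use, N_SLOTS) are invented for illustration, and a pthread mutex stands in for what would be UndoWorkerLock in the patch.

```c
#include <pthread.h>
#include <stdbool.h>

#define N_SLOTS 1

/* Toy worker-slot pool guarded by a single lock. */
static bool slot_in_use[N_SLOTS];
static pthread_mutex_t slot_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Check availability and claim a slot in ONE critical section, so no
 * other launcher can take the slot between the check and the claim.
 * Returns the claimed slot index, or -1 if none is free.
 */
int
claim_worker_slot(void)
{
    int result = -1;

    pthread_mutex_lock(&slot_lock);
    for (int i = 0; i < N_SLOTS; i++)
    {
        if (!slot_in_use[i])
        {
            slot_in_use[i] = true;  /* claim while still holding the lock */
            result = i;
            break;
        }
    }
    pthread_mutex_unlock(&slot_lock);
    return result;
}
```

With this shape, the caller never sees "available" and then fails to launch: either it gets a slot index it already owns, or it gets -1 and can back off.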
Re: POC: Cleaning up orphaned files using undo logs
On Fri, 19 Jul 2019 at 17:24, Amit Khandekar wrote:
>
> On Thu, 9 May 2019 at 12:04, Dilip Kumar wrote:
>
> > Patches can be applied on top of undo branch [1] commit:
> > (cb777466d008e656f03771cf16ec7ef9d6f2778b)
> >
> > [1] https://github.com/EnterpriseDB/zheap/tree/undo
>
> Below are some review points for 0009-undo-page-consistency-checker.patch :

Another point that I missed :

+* Process the undo record of the page and mask their cid filed.
+*/
+ while (next_record < page_end)
+ {
+ UndoRecordHeader *header = (UndoRecordHeader *) next_record;
+
+ /* If this undo record has cid present, then mask it */
+ if ((header->urec_info & UREC_INFO_CID) != 0)

Here, even though the next record starts in the current page, the urec_info itself may or may not lie on this page. I hope this possibility is also considered when populating the partial-record-specific details in the page header.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
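To make the concern concrete, here is a toy version of the guard the masking loop would need. DemoRecordHeader and urec_info_on_page are invented names and the real UndoRecordHeader layout differs; the point is only that before dereferencing header->urec_info, the code should verify that those bytes actually lie on the current page, since a record can start near page_end with the rest of its header continuing on the next page.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for UndoRecordHeader; the field layout is invented. */
typedef struct DemoRecordHeader
{
    uint16_t    urec_info;
    uint16_t    other;
} DemoRecordHeader;

/*
 * Copy out urec_info only if every byte of that field is on this page.
 * Returns false when the field continues on the next page.
 */
bool
urec_info_on_page(const char *next_record, const char *page_end,
                  uint16_t *info_out)
{
    size_t  avail = (size_t) (page_end - next_record);

    if (avail < offsetof(DemoRecordHeader, urec_info) + sizeof(uint16_t))
        return false;           /* field spills onto the next page */
    memcpy(info_out, next_record + offsetof(DemoRecordHeader, urec_info),
           sizeof(uint16_t));
    return true;
}
```

The memcpy also sidesteps any alignment hazard from casting an arbitrary page offset to a struct pointer, which the patch's direct cast would not.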
Re: POC: Cleaning up orphaned files using undo logs
On Mon, 22 Jul 2019 at 14:21, Amit Kapila wrote:

I have started review of 0014-Allow-execution-and-discard-of-undo-by-background-wo.patch. Below are some quick comments to start with:

+++ b/src/backend/access/undo/undoworker.c

+#include "access/xact.h"
+#include "access/undorequest.h"

Order is not alphabetical.

+ * Each undo worker then start reading from one of the queue the requests for

start => starts
queue => queues

-
+ rc = WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+ 10L, WAIT_EVENT_BGWORKER_STARTUP);
+
+ /* emergency bailout if postmaster has died */
+ if (rc & WL_POSTMASTER_DEATH)
+ proc_exit(1);

I think now, thanks to commit cfdf4dc4fc9635a, you don't have to explicitly handle postmaster death; instead you can use WL_EXIT_ON_PM_DEATH. Please check all such places where this is done in this patch.

-
+UndoWorkerGetSlotInfo(int slot, UndoRequestInfo *urinfo)
+{
+ /* Block concurrent access. */
+ LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE);
+
+ MyUndoWorker = >workers[slot];

Not sure why MyUndoWorker is used here. Can't we use a local variable ? Or do we intentionally attach to the slot as a side-operation ?

-
+ * Get the dbid where the wroker should connect to and get the worker

wroker => worker

-
+ BackgroundWorkerInitializeConnectionByOid(urinfo.dbid, 0, 0);

0, 0 => InvalidOid, 0

+ * Set the undo worker request queue from which the undo worker start
+ * looking for a work.

start => should start
a work => work

--
+ if (!InsertRequestIntoErrorUndoQueue(urinfo))

I was wondering what happens if for some reason InsertRequestIntoErrorUndoQueue() itself errors out. In that case, the entry will not be marked invalid, and so there will be no undo action carried out, because I think the undo worker will exit. What happens next with this entry ?
Re: POC: Cleaning up orphaned files using undo logs
On Thu, 9 May 2019 at 12:04, Dilip Kumar wrote:

> Patches can be applied on top of undo branch [1] commit:
> (cb777466d008e656f03771cf16ec7ef9d6f2778b)
>
> [1] https://github.com/EnterpriseDB/zheap/tree/undo

Below are some review points for 0009-undo-page-consistency-checker.patch :

+ /* Calculate the size of the partial record. */
+ partial_rec_size = UndoRecordHeaderSize(phdr->uur_info) +
+ phdr->tuple_len + phdr->payload_len -
+ phdr->record_offset;

There is already an UndoPagePartialRecSize() function which calculates the size of a partial record, and which seems to do the same as the above. If it is indeed the same, you can omit the above code, and instead, down below where you increment next_record, you can do "next_record += UndoPagePartialRecSize()". Also, I see an extra sizeof(uint16) added in UndoPagePartialRecSize(). Not sure which one is correct and which one wrong, unless I am wrong in assuming that the above calculation and the function definition do the same thing.

--
+ * We just want to mask the cid in the undo record header. So
+ * only if the partial record in the current page include the undo
+ * record header then we need to mask the cid bytes in this page.
+ * Otherwise, directly jump to the next record.

Here, I think you mean: "So only if the partial record in the current page includes the *cid* bytes", rather than "includes the undo record header". Maybe we can say: "We just want to mask the cid. So do the partial record masking only if the current page includes the cid bytes from the partial record header."

+ if (phdr->record_offset < (cid_offset + sizeof(CommandId)))
+ {
+ char *cid_data;
+ Size mask_size;
+
+ mask_size = Min(cid_offset - phdr->record_offset,
+ sizeof(CommandId));
+
+ cid_data = next_record + cid_offset - phdr->record_offset;
+ memset(_data, MASK_MARKER, mask_size);
+

Here, if record_offset lies *between* the cid start and the cid end, then cid_offset - phdr->record_offset will be negative, and so will be mask_size. Probably abs() should do the work.
Also, an Assert(cid_data + mask_size <= page_end) would be nice. I know the cid position of a partial record cannot go beyond the page boundary, but it's better to have this Assert as a sanity check.

+ * Process the undo record of the page and mask their cid filed.

filed => field

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
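For what it's worth, computing the mask as the intersection of the cid byte range with the byte range actually present on the page avoids the negative-size case entirely, rather than patching it up with abs(). A standalone sketch with invented names (mask_cid_on_page, DemoCommandId); the real code would work in terms of next_record/phdr rather than these parameters:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MASK_MARKER 0xFF

typedef uint32_t DemoCommandId;     /* stand-in for CommandId */

/*
 * Mask whichever bytes of the cid field fall on the current page.
 * record_offset: bytes of the record already emitted on earlier pages.
 * cid_offset: offset of the cid field from the start of the record.
 * page_start points at the first record byte on this page; on_page is
 * how many record bytes this page holds. The mask is the intersection
 * of [cid_offset, cid_offset + sizeof(cid)) with
 * [record_offset, record_offset + on_page), so it can never go
 * negative. Returns the number of bytes masked.
 */
size_t
mask_cid_on_page(char *page_start, size_t on_page,
                 size_t record_offset, size_t cid_offset)
{
    size_t  cid_start = cid_offset;
    size_t  cid_end = cid_offset + sizeof(DemoCommandId);
    size_t  page_lo = record_offset;
    size_t  page_hi = record_offset + on_page;
    size_t  lo = cid_start > page_lo ? cid_start : page_lo;
    size_t  hi = cid_end < page_hi ? cid_end : page_hi;

    if (lo >= hi)
        return 0;               /* no cid bytes on this page */
    memset(page_start + (lo - record_offset), MASK_MARKER, hi - lo);
    return hi - lo;
}
```

With the cid at record offset 8 and the page starting at record offset 10 (the "between cid start and cid end" case from the review), only the two trailing cid bytes get masked; when record_offset is past the cid end, nothing is masked.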
Re: Minimal logical decoding on standbys
On Wed, 10 Jul 2019 at 17:12, Amit Khandekar wrote:
>
> On Wed, 10 Jul 2019 at 08:44, Andres Freund wrote:
> >
> > Hi,
> >
> > Thanks for the new version! Looks like we're making progress towards
> > something committable here.
> >
> > I think it'd be good to split the patch into a few pieces. I'd maybe do
> > that like:
> > 1) WAL format changes (plus required other changes)
> > 2) Recovery conflicts with slots
> > 3) logical decoding on standby
> > 4) tests
>
> All right. Will do that in the next patch set. For now, I have quickly
> done the below changes in a single patch again (attached), in order to
> get early comments if any.

Attached are the split patches. Included is an additional patch with doc changes. Here is what I have added in the docs; pasting it here so that everyone can easily spot how it is supposed to behave, and to confirm that we are all on the same page :

"A logical replication slot can also be created on a hot standby. To prevent VACUUM from removing required rows from the system catalogs, hot_standby_feedback should be set on the standby. In spite of that, if any required rows get removed on the standby, the slot gets dropped. Existing logical slots on the standby also get dropped if wal_level on the primary is reduced to less than 'logical'. To create a logical slot, a historic snapshot has to be built, for which information about all currently running transactions is essential. On the primary, this information is readily available, but on a standby it has to be obtained from the primary. So slot creation may wait for some activity to happen on the primary. If the primary is idle, creating a logical slot on the standby may take a noticeable amount of time."

logicaldecodng_standby.tar.gz
Description: GNU Zip compressed data
Re: Minimal logical decoding on standbys
On Tue, 16 Jul 2019 at 22:56, Andres Freund wrote: > > Hi, > > On 2019-07-12 14:53:21 +0530, tushar wrote: > > On 07/10/2019 05:12 PM, Amit Khandekar wrote: > > > All right. Will do that in the next patch set. For now, I have quickly > > > done the below changes in a single patch again (attached), in order to > > > get early comments if any. > > Thanks Amit for your patch. i am able to see 1 issues on Standby server - > > (where logical replication slot created ) , > > a)size of pg_wal folder is NOT decreasing even after firing get_changes > > function > > Even after calling pg_logical_slot_get_changes() multiple times? What > does > SELECT * FROM pg_replication_slots; before and after multiple calls return? > > Does manually forcing a checkpoint with CHECKPOINT; first on the primary > and then the standby "fix" the issue? I independently tried to reproduce this issue on my machine yesterday. I observed that : sometimes, the files get cleaned up after two or more pg_logical_slot_get_changes(). Sometimes, I have to restart the server to see the pg_wal files cleaned up. This happens more or less the same even for logical slot on *primary*. Will investigate further with Tushar. > > > > b)pg_wal files are not recycling and every time it is creating new files > > after firing get_changes function > > I'm not sure what you mean by this. Are you saying that > pg_logical_slot_get_changes() causes WAL to be written? > > Greetings, > > Andres Freund -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: Minimal logical decoding on standbys
;will". > > > > > @@ -2879,6 +2882,25 @@ RecoveryConflictInterrupt(ProcSignalReason reason) > > case PROCSIG_RECOVERY_CONFLICT_LOCK: > > case PROCSIG_RECOVERY_CONFLICT_TABLESPACE: > > case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT: > > + case PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT: > > + /* > > + * For conflicts that require a logical slot > > to be dropped, the > > + * requirement is for the signal receiver to > > release the slot, > > + * so that it could be dropped by the signal > > sender. So for > > + * normal backends, the transaction should be > > aborted, just > > + * like for other recovery conflicts. But if > > it's walsender on > > + * standby, then it has to be killed so as to > > release an > > + * acquired logical slot. > > + */ > > + if (am_cascading_walsender && > > + reason == > > PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT && > > + MyReplicationSlot && > > SlotIsLogical(MyReplicationSlot)) > > + { > > + RecoveryConflictPending = true; > > + QueryCancelPending = true; > > + InterruptPending = true; > > + break; > > + } > > Huh, I'm not following as to why that's needed for walsenders? For normal backends, we ignore this signal if we aren't in a transaction (block). But for walsender, there is no transaction, but we cannot ignore the signal. This is because walsender can keep a logical slot acquired when it was spawned by "pg_recvlogical --start". So we can't ignore the signal. So the only way that we can make it release the acquired slot is to kill it. > > > > @@ -1499,6 +1499,7 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS) > > > > dbentry->n_conflict_tablespace + > > dbentry->n_conflict_lock + > > > > dbentry->n_conflict_snapshot + > > + > > dbentry->n_conflict_logicalslot + > > > > dbentry->n_conflict_bufferpin + > > > > dbentry->n_conflict_startup_deadlock); > > I think this probably needs adjustments in a few more places, > e.g. monitoring.sgml... Oops, yeah, to search for similar additions, I had looked for "conflict_snapshot" using cscope. 
I should have done the same using "git grep". Done now. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company logical-decoding-on-standby_v12.patch Description: Binary data
Re: Minimal logical decoding on standbys
On Thu, 4 Jul 2019 at 17:21, Amit Khandekar wrote: > > On Thu, 4 Jul 2019 at 15:52, tushar wrote: > > > > On 07/01/2019 11:04 AM, Amit Khandekar wrote: > > > > Also, in the updated patch (v11), I have added some scenarios that > > verify that slot is dropped when either master wal_level is > > insufficient, or when slot is conflicting. Also organized the test > > file a bit. > > > > One scenario where replication slot removed even after fixing the problem > > (which Error message suggested to do) > > Which specific problem are you referring to ? Removing a conflicting > slot, itself is the part of the fix for the conflicting slot problem. > > > > > Please refer this below scenario > > > > Master cluster- > > postgresql,conf file > > wal_level=logical > > hot_standby_feedback = on > > port=5432 > > > > Standby cluster- > > postgresql,conf file > > wal_level=logical > > hot_standby_feedback = on > > port=5433 > > > > both Master/Slave cluster are up and running and are in SYNC with each other > > Create a logical replication slot on SLAVE ( SELECT * from > > pg_create_logical_replication_slot('m', 'test_decoding'); ) > > > > change wal_level='hot_standby' on Master postgresql.conf file / restart the > > server > > Run get_changes function on Standby - > > postgres=# select * from pg_logical_slot_get_changes('m',null,null); > > ERROR: logical decoding on standby requires wal_level >= logical on master > > > > Correct it on Master postgresql.conf file ,i.e set wal_level='logical' > > again / restart the server > > and again fire get_changes function on Standby - > > postgres=# select * from pg_logical_slot_get_changes('m',null,null); > > ERROR: replication slot "m" does not exist > > > > This looks little weird as slot got dropped/removed internally . i guess it > > should get invalid rather than removed automatically. > > Lets user's delete the slot themself rather than automatically removed as > > a surprise. 
> > It was earlier discussed about what action should be taken when we > find conflicting slots. Out of the options, one was to drop the slot, > and we chose that because that was simple. See this : > https://www.postgresql.org/message-id/flat/20181212204154.nsxf3gzqv3gesl32%40alap3.anarazel.de Sorry, the above link is not the one I wanted to refer to. Correct one is this : https://www.postgresql.org/message-id/20181214005521.jaty2d24lz4nroil%40alap3.anarazel.de -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: Minimal logical decoding on standbys
On Thu, 4 Jul 2019 at 15:52, tushar wrote: > > On 07/01/2019 11:04 AM, Amit Khandekar wrote: > > Also, in the updated patch (v11), I have added some scenarios that > verify that slot is dropped when either master wal_level is > insufficient, or when slot is conflicting. Also organized the test > file a bit. > > One scenario where replication slot removed even after fixing the problem > (which Error message suggested to do) Which specific problem are you referring to ? Removing a conflicting slot, itself is the part of the fix for the conflicting slot problem. > > Please refer this below scenario > > Master cluster- > postgresql,conf file > wal_level=logical > hot_standby_feedback = on > port=5432 > > Standby cluster- > postgresql,conf file > wal_level=logical > hot_standby_feedback = on > port=5433 > > both Master/Slave cluster are up and running and are in SYNC with each other > Create a logical replication slot on SLAVE ( SELECT * from > pg_create_logical_replication_slot('m', 'test_decoding'); ) > > change wal_level='hot_standby' on Master postgresql.conf file / restart the > server > Run get_changes function on Standby - > postgres=# select * from pg_logical_slot_get_changes('m',null,null); > ERROR: logical decoding on standby requires wal_level >= logical on master > > Correct it on Master postgresql.conf file ,i.e set wal_level='logical' > again / restart the server > and again fire get_changes function on Standby - > postgres=# select * from pg_logical_slot_get_changes('m',null,null); > ERROR: replication slot "m" does not exist > > This looks little weird as slot got dropped/removed internally . i guess it > should get invalid rather than removed automatically. > Lets user's delete the slot themself rather than automatically removed as a > surprise. It was earlier discussed about what action should be taken when we find conflicting slots. Out of the options, one was to drop the slot, and we chose that because that was simple. 
See this : https://www.postgresql.org/message-id/flat/20181212204154.nsxf3gzqv3gesl32%40alap3.anarazel.de By the way, you are getting the "logical decoding on standby requires wal_level >= logical on master" error while using the slot, which is because we reject the command even before checking the existence of the slot. Actually the slot is already dropped due to master wal_level. Then when you correct the master wal_level, the command is not rejected, and proceeds to use the slot and then it is found that the slot does not exist. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company