Re: [dynahash] do not refill the hashkey after hash_search
On Wed, Oct 25, 2023 at 12:21 PM Tom Lane wrote: > > John Naylor writes: > > I'd prefer just adding "Assert(hentry->event == oldn);" and declaring > > hentry PG_USED_FOR_ASSERTS_ONLY. > > I'm not aware of any other places where we have Asserts checking > that hash_search() honored its contract. Why do we need one here? [removing old CC] The author pointed out here that we're not consistent in this regard: https://www.postgresql.org/message-id/CAEG8a3KEO_Kdt2Y5hFNWMEX3DpCXi9jtZOJY-GFUEE9QLgF%2Bbw%40mail.gmail.com ...but I didn't try seeing where the balance lay. We can certainly just remove redundant assignments.
Re: Guiding principle for dropping LLVM versions?
Here are some systematic rules I'd like to propose to anchor this stuff to reality and avoid future doubt and litigation: 1. Build farm animals testing LLVM determine the set of OSes and LLVM versions we consider. 2. We exclude OSes that will be out of full vendor support when a release ships. 3. We exclude OSes that don't bless an LLVM release (eg macOS running an arbitrarily picked version), and animals running only to cover ancient LLVM compiled from source for coverage (Andres's sid menagerie). By these rules we can't require LLVM 14 for another year, because Ubuntu and Amazon Linux are standing in the way*: animal | arch | llvm_version | os | os_release | end_of_support ---+-+--+++ branta| s390x | 10.0.0 | Ubuntu | 20.04 | 2025-04-01 splitfin | aarch64 | 10.0.0 | Ubuntu | 20.04 | 2025-04-01 urutau| s390x | 10.0.0 | Ubuntu | 20.04 | 2025-04-01 massasauga| aarch64 | 11.1.0 | Amazon | 2 | 2025-06-30 snakefly | aarch64 | 11.1.0 | Amazon | 2 | 2025-06-30 sarus | s390x | 14.0.0 | Ubuntu | 22.04 | 2027-06-01 shiner| aarch64 | 14.0.0 | Ubuntu | 22.04 | 2027-06-01 turbot| aarch64 | 14.0.0 | Ubuntu | 22.04 | 2027-06-01 lora | s390x | 15.0.7 | RHEL | 9 | 2027-05-31 mamushi | s390x | 15.0.7 | RHEL | 9 | 2027-05-31 nicator | ppc64le | 15.0.7 | Alma | 9 | 2027-05-31 oystercatcher | aarch64 | 15.0.7 | Alma | 9 | 2027-05-31 Ideally more distros would be present in this vacuum-horizon decision table, but I don't think it'd change the conclusion: 10 is the trailing edge. Therefore the attached patch scales back its ambition to that release. Tested on LLVM 10-18. If I pushed this we'd need to disable or upgrade the following to avoid failure in configure on master: animal|arch| llvm_version | os | os_release | end_of_support -++--+++ dragonet| x86_64 | 3.9.1| Debian | sid| phycodurus | x86_64 | 3.9.1| Debian | sid| desmoxytes | x86_64 | 4.0.1| Debian | sid| petalura| x86_64 | 4.0.1| Debian | sid| mantid | x86_64 | 5.0.1| CentOS | 7 | 2019-08-06 idiacanthus | x86_64 | 5.0.2| Debian | sid| pogona | x86_64 | 5.0.2| Debian | sid| cotinga | s390x | 6.0.0| Ubuntu | 18.04 | 2023-06-01 vimba | aarch64| 6.0.0| Ubuntu | 18.04 | 2023-06-01 komodoensis | x86_64 | 6.0.1| Debian | sid| topminnow | mips64el; -mabi=32 | 6.0.1| Debian | 8 | 2018-06-17 xenodermus | x86_64 | 6.0.1| Debian | sid| alimoche| aarch64| 7.0.1| Debian | 10 | 2022-09-10 blackneck | aarch64| 7.0.1| Debian | 10 | 2022-09-10 bonito | ppc64le| 7.0.1| Fedora | 29 | 2019-11-26 *Some distros announce EOL date by month without saying which day, so in my data collecting operation I just punched in the first of the month, *shrug* From 581b88879316065dab113d1512aeac7e9932b0af Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Thu, 19 Oct 2023 04:45:46 +1300 Subject: [PATCH v4] jit: Require at least LLVM 10. Remove support for older LLVM versions. The default on common software distributions will be at least LLVM 10 when PostgreSQL 17 ships. Discussion: https://postgr.es/m/CA%2BhUKGLhNs5geZaVNj2EJ79Dx9W8fyWUU3HxcpZy55sMGcY%3DiA%40mail.gmail.com --- config/llvm.m4 | 10 ++-- configure | 43 ++--- doc/src/sgml/installation.sgml | 4 +- meson.build | 2 +- src/backend/jit/llvm/llvmjit.c | 53 ++--- src/backend/jit/llvm/llvmjit_error.cpp | 10 src/backend/jit/llvm/llvmjit_expr.c | 6 +-- src/backend/jit/llvm/llvmjit_inline.cpp | 38 +-- src/backend/jit/llvm/llvmjit_wrap.cpp | 61 - src/include/jit/llvmjit.h | 17 --- src/include/pg_config.h.in | 12 - src/tools/msvc/Solution.pm | 3 -- 12 files changed, 17 insertions(+), 242 deletions(-) diff --git a/config/llvm.m4 b/config/llvm.m4 index 21d8cd4f90..44769d819a 100644 --- a/config/llvm.m4 +++ b/config/llvm.m4 @@ -13,7 +13,7 @@ AC_DEFUN([PGAC_LLVM_SUPPORT], AC_REQUIRE([AC_PROG_AWK]) AC_ARG_VAR(LLVM_CONFIG, [path to llvm-config command]) - PGAC_PATH_PROGS(LLVM_CONFIG, llvm-config llvm-config-7 llvm-config-6.0 llvm-config-5.0
Re: [dynahash] do not refill the hashkey after hash_search
John Naylor writes: > I'd prefer just adding "Assert(hentry->event == oldn);" and declaring > hentry PG_USED_FOR_ASSERTS_ONLY. I'm not aware of any other places where we have Asserts checking that hash_search() honored its contract. Why do we need one here? regards, tom lane
Re: [dynahash] do not refill the hashkey after hash_search
On Fri, Oct 13, 2023 at 10:07 AM Nathan Bossart wrote: > > On Thu, Sep 14, 2023 at 04:28:26PM +0800, Junwang Zhao wrote: > > Add a v2 with some change to fix warnings about unused-parameter. > > > > I will add this to Commit Fest. > > This looks reasonable to me. I've marked the commitfest entry as > ready-for-committer. I will plan on committing it in a couple of days > unless John has additional feedback or would like to do the honors. (I've been offline for a few weeks, and have been catching up this week.) I agree it's reasonable, but there are a couple small loose ends I'd like to see addressed. - strlcpy(hentry->name, name, sizeof(hentry->name)); This might do with a comment stating we already set the value, (we've seen in this thread that some other code does this), but I don't feel strongly about it. do { - hash_search(uncommitted_enums, serialized++, HASH_ENTER, NULL); + (void) hash_search(uncommitted_enums, serialized++, HASH_ENTER, NULL); } while (OidIsValid(*serialized)); I still consider this an unrelated and unnecessary cosmetic change. - NotificationHash *hentry; bool found; - hentry = (NotificationHash *) hash_search(pendingNotifies->hashtab, - , - HASH_ENTER, - ); + (void) hash_search(pendingNotifies->hashtab, +, +HASH_ENTER, +); Assert(!found); - hentry->event = oldn; I'd prefer just adding "Assert(hentry->event == oldn);" and declaring hentry PG_USED_FOR_ASSERTS_ONLY.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Tue, Oct 24, 2023 at 9:34 PM Alvaro Herrera wrote: > > On 2023-Oct-11, Dilip Kumar wrote: > > > In my last email, I forgot to give the link from where I have taken > > the base path for dividing the buffer pool in banks so giving the same > > here[1]. And looking at this again it seems that the idea of that > > patch was from Andrey M. Borodin and the idea of the SLRU scale factor > > were introduced by Yura Sokolov and Ivan Lazarev. Apologies for > > missing that in the first email. > > You mean [1]. > [1] https://postgr.es/m/452d01f7e331458f56ad79bef537c31b%40postgrespro.ru > I don't like this idea very much, because of the magic numbers that act > as ratios for numbers of buffers on each SLRU compared to other SLRUs. > These values, which I took from the documentation part of the patch, > appear to have been selected by throwing darts at the wall: > > NUM_CLOG_BUFFERS= Min(128 << slru_buffers_size_scale, > shared_buffers/256) > NUM_COMMIT_TS_BUFFERS = Min(128 << slru_buffers_size_scale, > shared_buffers/256) > NUM_SUBTRANS_BUFFERS= Min(64 << slru_buffers_size_scale, > shared_buffers/256) > NUM_NOTIFY_BUFFERS = Min(32 << slru_buffers_size_scale, > shared_buffers/256) > NUM_SERIAL_BUFFERS = Min(32 << slru_buffers_size_scale, > shared_buffers/256) > NUM_MULTIXACTOFFSET_BUFFERS = Min(32 << slru_buffers_size_scale, > shared_buffers/256) > NUM_MULTIXACTMEMBER_BUFFERS = Min(64 << slru_buffers_size_scale, > shared_buffers/256) > > ... which look pretty random already, if similar enough to the current > hardcoded values. In reality, the code implements different values than > what the documentation says. > > I don't see why would CLOG have the same number as COMMIT_TS, when the > size for elements of the latter is like 32 times bigger -- however, the > frequency of reads for COMMIT_TS is like 1000x smaller than for CLOG. > SUBTRANS is half of CLOG, yet it is 16 times larger, and it covers the > same range. The MULTIXACT ones appear to keep the current ratio among > them (8/16 gets changed to 32/64). > > ... and this whole mess is scaled exponentially without regard to the > size that each SLRU requires. This is just betting that enough memory > can be wasted across all SLRUs up to the point where the one that is > actually contended has sufficient memory. This doesn't sound sensible > to me. > > Like everybody else, I like having less GUCs to configure, but going > this far to avoid them looks rather disastrous to me. IMO we should > just use Munro's older patches that gave one GUC per SLRU, and users > only need to increase the one that shows up in pg_wait_event sampling. > Someday we will get the (much more complicated) patches to move these > buffers to steal memory from shared buffers, and that'll hopefully let > use get rid of all this complexity. Overall I agree with your comments, actually, I haven't put that much thought into the GUC part and how it scales the SLRU buffers w.r.t. this single configurable parameter. Yeah, so I think it is better that we take the older patch version as our base patch where we have separate GUC per SLRU. > I'm inclined to use Borodin's patch last posted here [2] instead of your > proposed 0001. > [2] https://postgr.es/m/93236d36-b91c-4dfa-af03-99c083840...@yandex-team.ru I will rebase my patches on top of this. > I did skim patches 0002 and 0003 without going into too much detail; > they look reasonable ideas. I have not tried to reproduce the claimed > performance benefits. I think measuring this patch set with the tests > posted by Shawn Debnath in [3] is important, too. > [3] https://postgr.es/m/yemddpmrsojfq...@f01898859afd.ant.amazon.com Thanks for taking a look. > > On the other hand, here's a somewhat crazy idea. What if, instead of > stealing buffers from shared_buffers (which causes a lot of complexity), Currently, we do not steal buffers from shared_buffers, computation is dependent upon Nbuffers though. I mean for each SLRU we are computing separate memory which is additional than the shared_buffers no? > we allocate a common pool for all SLRUs to use? We provide a single > knob -- say, non_relational_buffers=32MB as default -- and we use a LRU > algorithm (or something) to distribute that memory across all the SLRUs. > So the ratio to use for this SLRU or that one would depend on the nature > of the workload: maybe more for multixact in this server here, but more > for subtrans in that server there; it's just the total amount that the > user would have to configure, side by side with shared_buffers (and > perhaps scale with it like wal_buffers), and the LRU would handle the > rest. The "only" problem here is finding a distribution algorithm that > doesn't further degrade performance, of course ... Yeah, this could be an idea, but are you talking about that all the SLRUs will share the single buffer pool and based on the LRU algorithm it will be decided which page will stay in the
Re: Synchronizing slots from primary to standby
On Tue, Oct 24, 2023 at 3:35 PM Drouvot, Bertrand wrote: > > On 10/24/23 7:44 AM, Ajin Cherian wrote: > > On Mon, Oct 23, 2023 at 11:22 PM Drouvot, Bertrand > > wrote: > > 2) When we create a subscription, another slot is created during the > subscription synchronization, namely > like "pg_16397_sync_16388_7293447291374081805" (coming from > ReplicationSlotNameForTablesync()). > > This extra slot appears to have failover also set to true. > > So, If the standby refresh the list of slot to sync when the subscription is > still synchronizing we'd see things like > on the standby: > > LOG: waiting for remote slot "mysub" LSN (0/C0034808) and catalog xmin (756) > to pass local slot LSN (0/C0034840) and and catalog xmin (756) > LOG: wait over for remote slot "mysub" as its LSN (0/C00368B0) and catalog > xmin (756) has now passed local slot LSN (0/C0034840) and catalog xmin (756) > LOG: waiting for remote slot "pg_16397_sync_16388_7293447291374081805" LSN > (0/C0034808) and catalog xmin (756) to pass local slot LSN (0/C00368E8) and > and catalog xmin (756) > WARNING: slot "pg_16397_sync_16388_7293447291374081805" disappeared from the > primary, aborting slot creation > > I'm not sure this "pg_16397_sync_16388_7293447291374081805" should have > failover set to true. If there is a failover > during the subscription creation, better to re-launch the subscription > instead? > But note that the subscription doesn't wait for the completion of tablesync. So, how will we deal with that? Also, this situation is the same for non-tablesync slots as well. I have given another option in the email [1] which is to enable failover even for the main slot after all tables are in ready state, something similar to what we do for two_phase. [1] - https://www.postgresql.org/message-id/CAA4eK1J6BqO5%3DueFAQO%2BaYyHLaU-oCHrrVFJqHS-i0Ce9aPY2w%40mail.gmail.com -- With Regards, Amit Kapila.
Re: Adding facility for injection points (or probe points?) for more advanced tests
On Wed, Oct 25, 2023 at 10:06:17AM +0530, Amul Sul wrote: > +1 for the feature. > > TWIMW, here[1] is an interesting talk from pgconf.in 2020 on the similar > topic. > > 1] https://pgconf.in/conferences/pgconfin2020/program/proposals/101 Right, this uses a shared hash table. There is a patch from 2019 that summarizes this presentation as well: https://www.postgresql.org/message-id/CANXE4TdxdESX1jKw48xet-5GvBFVSq%3D4cgNeioTQff372KO45A%40mail.gmail.com A different idea is that this patch could leverage a bgworker instead of having a footprint in the postmaster. FWIW, I think that my patch is more flexible than the modes added by faultinjector.h (see 0001), because the actions that can be taken should not be limited by the core code: the point registered could just use what it wants as callback, so an extension could register a custom thing as well. -- Michael signature.asc Description: PGP signature
Re: Adding facility for injection points (or probe points?) for more advanced tests
On Wed, Oct 25, 2023 at 9:43 AM Michael Paquier wrote: > Hi all, > > I don't remember how many times in the last few years when I've had to > hack the backend to produce a test case that involves a weird race > condition across multiple processes running in the backend, to be able > to prove a point or just test a fix (one recent case: 2b8e5273e949). > Usually, I come to hardcoding stuff for the following situations: > - Trigger a PANIC, to force recovery. > - A FATAL, to take down a session, or just an ERROR. > - palloc() failure injection. > - Sleep to slow down a code path. > - Pause and release with condition variable. +1 for the feature. TWIMW, here[1] is an interesting talk from pgconf.in 2020 on the similar topic. 1] https://pgconf.in/conferences/pgconfin2020/program/proposals/101 Regards, Amul Sul
Adding facility for injection points (or probe points?) for more advanced tests
Hi all, I don't remember how many times in the last few years when I've had to hack the backend to produce a test case that involves a weird race condition across multiple processes running in the backend, to be able to prove a point or just test a fix (one recent case: 2b8e5273e949). Usually, I come to hardcoding stuff for the following situations: - Trigger a PANIC, to force recovery. - A FATAL, to take down a session, or just an ERROR. - palloc() failure injection. - Sleep to slow down a code path. - Pause and release with condition variable. And, while that's helpful to prove a point on a thread, nothing comes out of it in terms of regression test coverage in the tree because these tests are usually too slow and expensive, as they usually rely on hardcoded timeouts. So that's pretty much attempting to emulate what one would do with a debugger in a predictable way, without the manual steps because human hands don't scale well. The reason behind that is of course more advanced testing, to be able to expand coverage when we have weird and non-deterministic race issues to deal with, and the code becoming more complex every year makes that even harder. Fault and failure injection in specific paths comes into mind, additionally, particularly if you manage complex projects based on Postgres. So, please find attached a patch set that introduces an in-core facility to be able to set what I'm calling here an "injection point", that consists of being able to register in shared memory a callback that can be run within a defined location of the code. It means that it is not possible to trigger a callback before shared memory is set, but I've faced far more the case where I wanted to trigger something after shmem is set anyway. Persisting an injection point across restarts is also possible by adding some through an extension's shmem hook, as we do for custom LWLocks for example, as long as a library is loaded. This will remind a bit of what Alexander Korotkov has proposed here: https://www.postgresql.org/message-id/CAPpHfdtSEOHX8dSk9Qp%2BZ%2B%2Bi4BGQoffKip6JDWngEA%2Bg7Z-XmQ%40mail.gmail.com Also, this is much closee to what Craig Ringer is mentioning here, where it is named probe points, but I am using a minimal design that allows to achieve the same: https://www.postgresql.org/message-id/CAPpHfdsn-hzneYNbX4qcY5rnwr-BA1ogOCZ4TQCKQAw9qa48kA%40mail.gmail.com A difference is that I don't really see a point in passing to the callback triggered an area of data coming from the hash table itself, as at the end a callback could just refer to an area in shared memory or a static set of variables depending on what it wants, with one or more injection points (say a location to set a state, and a second to check it). So, at the end, the problem comes down in my opinion to two things: - Possibility to trigger a condition defined by some custom code, in the backend (core code or even out-of-core). - Possibility to define a location in the code where a named point would be checked. 0001 introduces three APIs to create, run, and drop injection points: +extern void InjectionPointCreate(const char *name, +InjectionPointCallback callback); +extern void InjectionPointRun(const char *name); +extern void InjectionPointDrop(const char *name); Then one just needs to add a macro like that to trigger the callback registered in the code to test: INJECTION_POINT_RUN("String"); So the footprint in the core tree is not zero, but it is as minimal as it can be. I have added some documentation to explain that, as well. I am not wedded to the name proposed in the patch, so if you feel there is better, feel free to propose ideas. This facility is hidden behind a specific configure/Meson switch, making it a no-op by default: --enable-injection-points -Dinjection_points={ true | false } 0002 is a test module to test these routines, that I have kept a maximum simple to ease review of the basics proposed here. This could be extended further to propose more default modes with TAP tests on its own, as I don't see a real point in having the SQL bits or some common callbacks (like for the PANIC or the FATAL cases) in core. Thoughts and comments are welcome. -- Michael From 4f7a0627eb0735874b53c90576243f8003a64b4a Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Wed, 25 Oct 2023 10:24:28 +0900 Subject: [PATCH 1/2] Base facility for injection points Extension and backend developers can register custom callbacks that would be run on a user-defined fashion. This is hidden behind a ./configure and a meson switch, disabled by default. --- src/include/pg_config.h.in| 3 + src/include/utils/injection_point.h | 35 src/backend/storage/ipc/ipci.c| 3 + src/backend/storage/lmgr/lwlocknames.txt | 1 + .../utils/activity/wait_event_names.txt | 1 + src/backend/utils/misc/Makefile | 1 + src/backend/utils/misc/injection_point.c | 158
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
On Tue, Oct 17, 2023 at 9:39 PM Robert Haas wrote: > > Cool. Seems we are all in agreement, so committed these. Thanks! Thank you for getting this across the finish line!
Re: RFC: Logging plan of the running query
On Wed, Oct 18, 2023 at 10:04 PM James Coleman wrote: > > While that decision as regards auto_explain has long since been made > (and there would be work to undo it), I don't believe that we should > repeat that choice here. I'm -10 on moving this into auto_explain. > Right. > However perhaps there is still an opportunity for moving some of the > auto_explain code into core so as to enable deduplicating the code. > Right. That's what I also think. -- Best Wishes, Ashutosh Bapat
RE: CRC32C Parallel Computation Optimization on ARM
Thanks for your suggestion, this is the modified patch and two test files. -Original Message- From: Michael Paquier Sent: Friday, October 20, 2023 4:19 PM To: Xiang Gao Cc: pgsql-hackers@lists.postgresql.org Subject: Re: CRC32C Parallel Computation Optimization on ARM On Fri, Oct 20, 2023 at 07:08:58AM +, Xiang Gao wrote: > This patch uses a parallel computing optimization algorithm to improve > crc32c computing performance on ARM. The algorithm comes from Intel > whitepaper: > crc-iscsi-polynomial-crc32-instruction-paper. Input data is divided > into three equal-sized blocks.Three parallel blocks (crc0, crc1, > crc2) for 1024 Bytes.One Block: 42(BLK_LENGTH) * 8(step length: > crc32c_u64) bytes > > Crc32c unitest: > https://gist.github.com/gaoxyt/138fd53ca1eead8102eeb9204067f7e4 > Crc32c benchmark: > https://gist.github.com/gaoxyt/4506c10fc06b3501445e32c4257113e9 > It gets ~2x speedup compared to linear Arm crc32c instructions. Interesting. Could you attached to this thread the test files you used and the results obtained please? If this data gets deleted from github, then it would not be possible to refer back to what you did at the related benchmark results. Note that your patch is forgetting about meson; it just patches ./configure. -- Michael IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you. 0002-crc32c-parallel-computation-optimization-on-arm.patch Description: 0002-crc32c-parallel-computation-optimization-on-arm.patch /* * compile postgres first with different crc32c implementation(use arm vmull_p64 or not) * we should comment out some codes about elog in pg_crc32c_armv8_choose.c to compile correctly and simply. * $ gcc -I ../postgres/_install/include -I ../postgres/_install/include/server main.c \ * -L ../postgres/build/src/port -l pgport_srv -O2 -o main * this test was run on Neoverse-N1 * $ ./main.no_vmull * data size is 512 bytes, and compute crc cost 139 us totally, 0.135742 us per loop * data size is 4096 bytes, and compute crc cost 1061 us totally, 1.036133 us per loop * $ ./main.use_vmull * data size is 512 bytes, and compute crc cost 101 us totally, 0.098633 us per loop * data size is 4096 bytes, and compute crc cost 540 us totally, 0.527344 us per loop * We can see that the cost of computing crc32c without vmull_p64 is about two times than * the cost that using vmull_p64 when data size is large. and the cost is almost same when * data size is small. */ #include #include #include #include #include #include "c.h" #include "port/pg_crc32c.h" uint64_t GetTickCount() { struct timeval tv; gettimeofday(, NULL); return tv.tv_sec * 100 + tv.tv_usec; } int main() { #define CASE_CNT 2 uint32_ttest_size[CASE_CNT] = {512, 1024 * 4}; for (int case_cnt = 0; case_cnt < CASE_CNT; case_cnt++) { uint8_t*buf = (uint8_t *) malloc(test_size[case_cnt] * sizeof(uint8_t)); srand(0); for (int i = 0; i < test_size[case_cnt]; i++) { *(buf + i) = (uint8_t) (rand() % 256u); } static const uint32_t kLoop = 1024; uint32_tcrc = 0; uint64_tstart = GetTickCount(); INIT_CRC32C(crc); for (int i = 0; i < kLoop; i++) { COMP_CRC32C(crc, buf, test_size[case_cnt]); } FIN_CRC32C(crc); uint64_tstop = GetTickCount(); printf("data size is %d bytes, and compute crc cost %ld us totally, %f us per loop\n", test_size[case_cnt], stop - start, (double) (stop - start) / kLoop); free(buf); } #undef CASE_CNT return 0; } /*** * We use libcheck(https://github.com/libcheck/check) as unit testing framework. * compile postgres first with different crc32c implementation(use arm crc32c * and vmull intrisics or not). we should comment out some codes about elog in * pg_crc32c_armv8_choose.c to compile correctly and simply. * $ gcc -I ../postgres/_install/include -I ../postgres/_install/include/server \ crc32c_unittest.c -L ../postgres/build/src/port -l pgport_srv -L /usr/local/lib \ -lcheck -o crc32c_unittest * this test was run on Neoverse-N1 * $ ./crc32c_unittest * Running suite(s): CRC32C * 100%: Checks: 3, Failures: 0, Errors: 0
Re: post-recovery amcheck expectations
On Tue, Oct 24, 2023 at 07:03:34PM -0700, Peter Geoghegan wrote: > On Mon, Oct 23, 2023 at 7:28 PM Noah Misch wrote: > > > That makes sense to me. I believe that it's not possible to have a > > > string of consecutive sibling pages that are all half-dead (regardless > > > of the BlockNumber order of sibling pages, even). But I'd probably > > > have written the fix in roughly the same way. Although...maybe you > > > should try to detect a string of half-dead pages? Hard to say if it's > > > worth the trouble. > > > > I imagined a string of half-dead siblings could arise in structure like > > this: > > > > * 1 > > * /|\ > > * 4 <-> 2 <-> 3 > > > > With events like this: > > > > - DELETE renders blk 4 deletable. > > - Crash with concurrent VACUUM, leaving 4 half-dead after having visited > > 1-4. > > - DELETE renders blk 2 deletable. > > - Crash with concurrent VACUUM, leaving 2 half-dead after having visited > > 1-2. > > > > I didn't try to reproduce that, and something may well prevent it. > > FWIW a couple of factors prevent it (in the absence of corruption). These are: > > 1. Only VACUUM can delete pages, and in general the only possible > source of half-dead pages is an unfortunately timed crash/error within > VACUUM. Each interrupted VACUUM can leave behind at most one half-dead > page. Agreed. > 2. One thing that makes VACUUM back out of deleting an empty page is > the presence of a half-dead right sibling leaf page left behind by > some VACUUM that was interrupted at some point in the past -- see > _bt_rightsib_halfdeadflag() for details. > > Obviously, factors 1 and 2 together make three consecutive half-dead > sibling pages impossible. Can't it still happen if the sequence of unfortunately timed crashes causes deletions from left to right? Take this example, expanding the one above. Half-kill 4, crash, half-kill 3, crash, half-kill 2 in: * 1 * // | \\ * 4 <-> 3 <-> 2 <-> 1 (That's not to say it has ever happened outside of a test.)
Re: Synchronizing slots from primary to standby
On Tue, Oct 24, 2023 at 11:54 AM Drouvot, Bertrand wrote: > > Hi, > > On 10/23/23 2:56 PM, shveta malik wrote: > > On Mon, Oct 23, 2023 at 5:52 PM Drouvot, Bertrand > > wrote: > > >> We are waiting for DEFAULT_NAPTIME_PER_CYCLE (3 minutes) before checking > >> if there > >> is new synced slot(s) to be created on the standby. Do we want to keep > >> this behavior > >> for V1? > >> > > > > I think for the slotsync workers case, we should reduce the naptime in > > the launcher to say 30sec and retain the default one of 3mins for > > subscription apply workers. Thoughts? > > > > Another option could be to keep DEFAULT_NAPTIME_PER_CYCLE and create a new > API on the standby that would refresh the list of sync slot at wish, thoughts? > Do you mean API to refresh list of DBIDs rather than sync-slots? As per current design, launcher gets DBID lists for all the failover slots from the primary at intervals of DEFAULT_NAPTIME_PER_CYCLE. These dbids are then distributed among max slot-sync workers and then they fetch slots for the concerned DBIDs at regular intervals of 10ms (WORKER_DEFAULT_NAPTIME_MS) and create/update those locally. thanks Shveta
Re: Add new for_each macros for iterating over a List that do not require ListCell pointer
On Wed, 25 Oct 2023 at 06:00, Jelte Fennema wrote: > Attached is a fixed version. With foreach(), we commonly do "if (lc == NULL)" at the end of loops as a way of checking if we did "break" to terminate the loop early. Doing the equivalent with the new macros won't be safe as the list element's value we broke on may be set to NULL. I think it might be a good idea to document the fact that this wouldn't be safe with the new macros, or better yet, document the correct way to determine if we broke out the loop early. I imagine someone will want to do some conversion work at some future date and it would be good if we could avoid introducing bugs during that process. I wonder if we should even bother setting the variable to NULL at the end of the loop. It feels like someone might just end up mistakenly checking for NULLs even if we document that it's not safe. If we left the variable pointing to the last list element then the user of the macro is more likely to notice their broken code. It'd also save a bit of instruction space. David
Re: Row pattern recognition
> Great. I will look into this. I am impressed the simple NFA implementation. It would be nicer if it could be implemented without using recursion. > By the way, I tested my patch (v10) to handle more large data set and > tried to following query with pgbench database. On my laptop it works > with 100k rows pgbench_accounts table but with beyond the number I got ~~~ I meant 10k. > OOM killer. I would like to enhance this in the next patch. > > SELECT aid, first_value(aid) OVER w, > count(*) OVER w > FROM pgbench_accounts > WINDOW w AS ( > PARTITION BY bid > ORDER BY aid > ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > AFTER MATCH SKIP PAST LAST ROW > INITIAL > PATTERN (START UP+) > DEFINE > START AS TRUE, > UP AS aid > PREV(aid) > ); I ran this against your patch. It failed around > 60k rows. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
Re: Patch: Improve Boolean Predicate JSON Path Docs
On Oct 23, 2023, at 20:20, Erik Wienhold wrote: > I thought that you may have missed that one because I saw this change > that removes the article: > >> -In the strict mode, the specified path must exactly match the structure >> of >> +In strict mode, the specified path must exactly match the structure of Oh, didn’t realize. Fixed. > LGTM. Would you create a commitfest entry? I'll set the status to RfC. Done. https://commitfest.postgresql.org/45/4624/ Best, David v7-0001-Improve-boolean-predicate-JSON-Path-docs.patch Description: Binary data
Re: post-recovery amcheck expectations
On Mon, Oct 23, 2023 at 7:28 PM Noah Misch wrote: > > That makes sense to me. I believe that it's not possible to have a > > string of consecutive sibling pages that are all half-dead (regardless > > of the BlockNumber order of sibling pages, even). But I'd probably > > have written the fix in roughly the same way. Although...maybe you > > should try to detect a string of half-dead pages? Hard to say if it's > > worth the trouble. > > I imagined a string of half-dead siblings could arise in structure like this: > > * 1 > * /|\ > * 4 <-> 2 <-> 3 > > With events like this: > > - DELETE renders blk 4 deletable. > - Crash with concurrent VACUUM, leaving 4 half-dead after having visited 1-4. > - DELETE renders blk 2 deletable. > - Crash with concurrent VACUUM, leaving 2 half-dead after having visited 1-2. > > I didn't try to reproduce that, and something may well prevent it. FWIW a couple of factors prevent it (in the absence of corruption). These are: 1. Only VACUUM can delete pages, and in general the only possible source of half-dead pages is an unfortunately timed crash/error within VACUUM. Each interrupted VACUUM can leave behind at most one half-dead page. 2. One thing that makes VACUUM back out of deleting an empty page is the presence of a half-dead right sibling leaf page left behind by some VACUUM that was interrupted at some point in the past -- see _bt_rightsib_halfdeadflag() for details. Obviously, factors 1 and 2 together make three consecutive half-dead sibling pages impossible. I'm not quite prepared to say that even two neighboring half-dead sibling pages are an impossibility right now, but I think that it might well be. Possibly for reasons that are more accidental than anything else (is the _bt_rightsib_halfdeadflag thing a "real invariant" or just something we do because we don't want to add additional complicated handling to nbtpage.c?), so I'll avoid going into further detail for now. I'm pointing this out because it argues for softening the wording about "accept[ing] an arbitrarily-long chain of half-dead, sibling-linked pages to the left" from your patch. I was also wondering (mostly to myself) about the relationship (if any) between the _bt_rightsib_halfdeadflag/_bt_leftsib_splitflag "invariants" and the bt_child_highkey_check() check. But I don't think that it makes sense to put that in scope -- your fix seems like a strict improvement. This relationship is perhaps in scope here to the limited extent that talking about strings of consecutive half-dead pages might make it even harder to understand the design of the bt_child_highkey_check() check. On the other hand...I'm not sure that I understand every nuance of it myself. > > Suggest adding a CHECK_FOR_INTERRUPTS() call to the loop, too, just > > for good luck. > > Added. That gave me the idea to check for circular links, like other parts of > amcheck do. Good idea. -- Peter Geoghegan
Re: Synchronizing slots from primary to standby
On Tue, Oct 24, 2023 at 3:35 PM Drouvot, Bertrand wrote: > > Hi, > > On 10/24/23 7:44 AM, Ajin Cherian wrote: > > On Mon, Oct 23, 2023 at 11:22 PM Drouvot, Bertrand > > wrote: > >> > >> @@ -602,6 +602,9 @@ CreateDecodingContext(XLogRecPtr start_lsn, > >> SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn); > >> } > >> > >> + /* set failover in the slot, as requested */ > >> + slot->data.failover = ctx->failover; > >> + > >> > >> I think we can get rid of this change in CreateDecodingContext(). > >> > > Yes, I too noticed this in my testing, however just removing this from > > CreateDecodingContext will not allow us to change the slot's failover flag > > using Alter subscription. > > Oh right. > > > I am thinking of moving this change to > > StartLogicalReplication prior to calling CreateDecodingContext by > > parsing the command options in StartReplicationCmd > > without adding it to the LogicalDecodingContext. > > > > Yeah, that looks like a good place to update "failover". > > Doing more testing and I have a couple of remarks about he current behavior. > > 1) Let's imagine that: > > - there is no standby > - standby_slot_names is set to a valid slot on the primary (but due to the > above, not linked to any standby) > - then a create subscription on a subscriber WITH (failover = true) would > start the > synchronisation but never finish (means leaving a "synchronisation" slot like > "pg_32811_sync_24576_7293415241672430356" > in place coming from ReplicationSlotNameForTablesync()). > > That's expected, but maybe we should emit a warning in > WalSndWaitForStandbyConfirmation() on the primary when there is > a slot part of standby_slot_names which is not active/does not have an > active_pid attached to it? > Agreed, Will do that. > 2) When we create a subscription, another slot is created during the > subscription synchronization, namely > like "pg_16397_sync_16388_7293447291374081805" (coming from > ReplicationSlotNameForTablesync()). > > This extra slot appears to have failover also set to true. > > So, If the standby refresh the list of slot to sync when the subscription is > still synchronizing we'd see things like > on the standby: > > LOG: waiting for remote slot "mysub" LSN (0/C0034808) and catalog xmin (756) > to pass local slot LSN (0/C0034840) and and catalog xmin (756) > LOG: wait over for remote slot "mysub" as its LSN (0/C00368B0) and catalog > xmin (756) has now passed local slot LSN (0/C0034840) and catalog xmin (756) > LOG: waiting for remote slot "pg_16397_sync_16388_7293447291374081805" LSN > (0/C0034808) and catalog xmin (756) to pass local slot LSN (0/C00368E8) and > and catalog xmin (756) > WARNING: slot "pg_16397_sync_16388_7293447291374081805" disappeared from the > primary, aborting slot creation > > I'm not sure this "pg_16397_sync_16388_7293447291374081805" should have > failover set to true. If there is a failover > during the subscription creation, better to re-launch the subscription > instead? > 'Failover' property of subscription is carried to the tablesync-slot in recent versions only with the intent that if failover happens during create-sub during table-sync of large tables, then users should be able to start from that point onward on the new primary. But yes, the above scenario is highly probable where-in no activity is happening on primary and thus the table-sync slot is waiting for its creation during sync on standby. So I agree, to simplify the stuff we can skip table-sync slot syncing on standby and document this behaviour. thanks Shveta
Re: Improve WALRead() to suck data directly from WAL buffers when possible
On Tue, Oct 24, 2023 at 05:15:19PM -0700, Jeff Davis wrote: > * Are you sure that reducing the number of calls to memcpy() is a win? > I would expect that to be true only if the memcpy()s are tiny, but here > they are around XLOG_BLCKSZ. I believe this was done based on a comment > from Nathan Bossart, but I didn't really follow why that's important. > Also, if we try to use one memcpy for all of the data, it might not > interact well with my idea above to avoid taking the lock. I don't recall exactly why I suggested this, but if additional memcpy()s help in some way and don't negatively impact performance, then I retract my previous comment. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Document aggregate functions better w.r.t. ORDER BY
On Tue, Oct 24, 2023 at 1:39 PM Bruce Momjian wrote: > On Tue, Dec 13, 2022 at 07:38:15PM -0700, David G. Johnston wrote: > > All, > > > > The recent discussion surrounding aggregates and ORDER BY moved me to > look over > > our existing documentation, especially now that we've reworked the > function > > tables, to see what improvements can be had by simply documenting those > > functions where ORDER BY may change the user-visible output. I skipped > range > > aggregates for the moment but handled the others on the aggregates page > (not > > window functions). This includes the float types for sum and avg. > > > > I added a note just before the table linking back to the syntax chapter > and > > describing the newly added rules and syntax choice in the table. > > > > The nuances of floating point math suggest to me that specifying order > by for > > those is in some kind of gray area and so I've marked it optional...any > > suggestions for wording (or an xref) to explain those nuances or should > it just > > be shown non-optional like the others? Or not shown at all? > > > > The novelty of my examples is up for bikeshedding. I didn't want > anything too > > long so a subquery didn't make sense, and I was trying to avoid > duplication as > > well as multiple lines - hence creating a CTE that can be copied onto > all of > > the example queries to produce the noted result. > > > > I added a DISTINCT example to array_agg because it is the first > aggregate on > > the page and so hopefully will be seen during a cursory reading. Plus, > > array_agg is the go-to function for doing this kind of experimentation. > > I like this idea, though the examples seemed too detailed so I skipped > them. Here is the trimmed-down patch I would like to apply. > > I'd prefer to keep pointing out that the ones documented are those whose outputs will vary due to ordering. I've been sympathetic to the user comments that we don't have enough examples. Just using array_agg for that purpose, showing both DISTINCT and ORDER BY seems like a fair compromise (removes two from my original proposal). The examples in the section we tell them to go see aren't of that great quality. If you strongly dislike having the function table contain the examples we should at least improve the page we are sending them to. (As an aside to this, I've personally always found the syntax block with the 5 syntaxes shown there to be intimidating/hard-to-read). I'd at least suggest you reconsider the commentary and examples surrounding jsonb_object_agg. The same goes for the special knowledge of floating point behavior for why we've chosen to document avg/sum, something that typically doesn't care about order, as having an optional order by. David J.
Re: CRC32C Parallel Computation Optimization on ARM
On Wed, Oct 25, 2023 at 07:17:55AM +0900, Michael Paquier wrote: > If you are looking at computing the CRC of records with arbitrary > sizes, why not just generating a series with > pg_logical_emit_message() before doing a comparison with pg_waldump or > a custom replay loop to go through the records? At least it would > make the results more predictible. I tried this. pg_waldump on 2 million ~8kB records took around 8.1 seconds without the patch and around 7.4 seconds with it (an 8% improvement). pg_waldump on 1 million ~16kB records took around 3.2 seconds without the patch and around 2.4 seconds with it (a 25% improvement). Given the performance characteristics and relative simplicity of the patch, I think this could be worth doing. I suspect we'll want to do something similar for x86, too. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: MERGE ... RETURNING
On Tue, Oct 24, 2023 at 2:11 PM Jeff Davis wrote: > On Wed, 2023-08-23 at 11:58 +0100, Dean Rasheed wrote: > > Updated version attached, fixing an uninitialized-variable warning > > from the cfbot. > > I took another look and I'm still not comfortable with the special > IsMergeSupportFunction() functions. I don't object necessarily -- if > someone else wants to commit it, they can -- but I don't plan to commit > it in this form. > > Can we revisit the idea of a per-WHEN RETURNING clause? The returning > clauses could be treated kind of like a UNION, which makes sense > because it really is a union of different results (the returned tuples > from an INSERT are different than the returned tuples from a DELETE). > You can just add constants to the target lists to distinguish which > WHEN clause they came from. > > I know you rejected that approach early on, but perhaps it's worth > discussing further? > Yeah. Side benefit, the 'action_number' felt really out of place, and that neatly might solve it. It doesn't match tg_op, for example. With the current approach, return a text, or an enum? Why doesn't it match concepts that are pretty well established elsewhere? SQL has a pretty good track record for not inventing weird numbers with no real meaning (sadly, not so much the developers). Having said that, pg_merge_action() doesn't feel too bad if the syntax issues can be worked out. Very supportive of the overall goal. merlin
Re: Making aggregate deserialization (and WAL receive) functions slightly faster
On Tue, 17 Oct 2023 at 20:39, David Rowley wrote: > > On Mon, 16 Oct 2023 at 05:56, Tom Lane wrote: > > I guess the next question is whether we want to stop here or > > try to relax the requirement about NUL-termination. I'd be inclined > > to call that a separate issue deserving a separate commit, so maybe > > we should go ahead and commit this much anyway. > > I am keen to see this relaxed. I agree that a separate effort is best. I looked at the latest posted patch again today with thoughts about pushing it but there's something I'm a bit unhappy with that makes me think we should maybe do the NUL-termination relaxation in the same commit. The problem is in LogicalRepApplyLoop() the current patch adjusts the manual building of the StringInfoData to make use of initReadOnlyStringInfo() instead. The problem I have with that is that the string that's given to initReadOnlyStringInfo() comes from walrcv_receive() and on looking at the API spec for walrcv_receive_fn I see: /* * walrcv_receive_fn * * Receive a message available from the WAL stream. 'buffer' is a pointer * to a buffer holding the message received. Returns the length of the data, * 0 if no data is available yet ('wait_fd' is a socket descriptor which can * be waited on before a retry), and -1 if the cluster ended the COPY. */ i.e, no mention that the buffer will be NUL terminated upon return. Looking at pqGetCopyData3(), is see the buffer does get NUL terminated, but without the API spec mentioning this I'm not feeling good about going ahead with wrapping that up in initReadOnlyStringInfo() which Asserts the buffer will be NUL terminated. I've attached a patch which builds on the previous patch and relaxes the rule that the StringInfo must be NUL-terminated. The rule is only relaxed for StringInfos that are initialized with initReadOnlyStringInfo. On working on this I went over the locations where we've added code to add a '\0' char to the buffer. If you look at, for example, record_recv() and array_agg_deserialize() in master, we modify the StringInfo's data to set a \0 at the end of the string. I've removed that code as I *believe* this isn't required for the type's receive function. There's also an existing confusing comment in logicalrep_read_tuple() which seems to think we're just setting the NUL terminator to conform to StringInfo's practises. This is misleading as the NUL is required for LOGICALREP_COLUMN_TEXT mode as we use the type's input function instead of the receive function. You don't have to look very hard to find an input function that needs a NUL terminator. I'm a bit less confident that the type's receive function will never need to be NUL terminated. cstring_recv() came to mind as one I should look at, but on looking I see it's not required as it just reads the remaining bytes from the input StringInfo. Is it safe to assume this? or could there be some UDF receive function which requires this? David diff --git a/src/backend/replication/logical/applyparallelworker.c b/src/backend/replication/logical/applyparallelworker.c index 82f48a488e..9b37736f8e 100644 --- a/src/backend/replication/logical/applyparallelworker.c +++ b/src/backend/replication/logical/applyparallelworker.c @@ -774,10 +774,7 @@ LogicalParallelApplyLoop(shm_mq_handle *mqh) if (len == 0) elog(ERROR, "invalid message length"); - s.cursor = 0; - s.maxlen = -1; - s.data = (char *) data; - s.len = len; + initReadOnlyStringInfo(, data, len); /* * The first byte of messages sent from leader apply worker to diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c index d52c8963eb..4eaeb14148 100644 --- a/src/backend/replication/logical/proto.c +++ b/src/backend/replication/logical/proto.c @@ -879,6 +879,7 @@ logicalrep_read_tuple(StringInfo in, LogicalRepTupleData *tuple) /* Read the data */ for (i = 0; i < natts; i++) { + char *buff; charkind; int len; StringInfo value = >colvalues[i]; @@ -899,19 +900,18 @@ logicalrep_read_tuple(StringInfo in, LogicalRepTupleData *tuple) len = pq_getmsgint(in, 4); /* read length */ /* and data */ - value->data = palloc(len + 1); - pq_copymsgbytes(in, value->data, len); + buff = palloc(len + 1); + pq_copymsgbytes(in, buff, len); /* -* Not strictly necessary for LOGICALREP_COLUMN_BINARY, but -* per StringInfo practice. +
Re: Show version of OpenSSL in ./configure output
On Mon, Oct 23, 2023 at 08:44:01PM -0400, Tom Lane wrote: > OK by me. Cool, done down to 16 as this depends on c8e4030d1bdd. -- Michael signature.asc Description: PGP signature
Re: Improve WALRead() to suck data directly from WAL buffers when possible
On Sat, 2023-10-21 at 23:59 +0530, Bharath Rupireddy wrote: > I'm attaching v12 patch set with just pgperltidy ran on the new TAP > test added in 0002. No other changes from that of v11 patch set. Thank you. Comments: * It would be good to document that this is partially an optimization (read from memory first) and partially an API difference that allows reading unflushed data. For instance, walsender may benefit performance-wise (and perhaps later with the ability to read unflushed data) whereas pg_walinspect benefits primarily from reading unflushed data. * Shouldn't there be a new method in XLogReaderRoutine (e.g. read_unflushed_data), rather than having logic in WALRead()? The callers can define the method if it makes sense (and that would be a good place to document why); or leave it NULL if not. * I'm not clear on the "partial hit" case. Wouldn't that mean you found the earliest byte in the buffers, but not the latest byte requested? Is that possible, and if so under what circumstances? I added an "Assert(nread == 0 || nread == count)" in WALRead() after calling XLogReadFromBuffers(), and it wasn't hit. * If the partial hit case is important, wouldn't XLogReadFromBuffers() fill in the end of the buffer rather than the beginning? * Other functions that use xlblocks, e.g. GetXLogBuffer(), use more effort to avoid acquiring WALBufMappingLock. Perhaps you can avoid it, too? One idea is to check that XLogCtl->xlblocks[idx] is equal to expectedEndPtr both before and after the memcpy(), with appropriate barriers. That could mitigate concerns expressed by Kyotaro Horiguchi and Masahiko Sawada. * Are you sure that reducing the number of calls to memcpy() is a win? I would expect that to be true only if the memcpy()s are tiny, but here they are around XLOG_BLCKSZ. I believe this was done based on a comment from Nathan Bossart, but I didn't really follow why that's important. Also, if we try to use one memcpy for all of the data, it might not interact well with my idea above to avoid taking the lock. * Style-wise, the use of "unlikely" seems excessive, unless there's a reason to think it matters. Regards, Jeff Davis
Re: Row pattern recognition
> On Sat, Oct 21, 2023 at 7:39 PM Tatsuo Ishii wrote: >> Attached is the v10 patch. This version enhances the performance of >> pattern matching. > > Nice! I've attached a couple of more stressful tests (window > partitions of 1000 rows each). Beware that the second one runs my > desktop out of memory fairly quickly with the v10 implementation. > > I was able to carve out some time this week to implement a very basic > recursive NFA, which handles both the + and * qualifiers (attached). Great. I will look into this. > It's not production quality -- a frame on the call stack for every row > isn't going to work Yeah. > -- but with only those two features, it's pretty > tiny, and it's able to run the new stress tests with no issue. If I > continue to have time, I hope to keep updating this parallel > implementation as you add features to the StringSet implementation, > and we can see how it evolves. I expect that alternation and grouping > will ratchet up the complexity. Sounds like a plan. By the way, I tested my patch (v10) to handle more large data set and tried to following query with pgbench database. On my laptop it works with 100k rows pgbench_accounts table but with beyond the number I got OOM killer. I would like to enhance this in the next patch. SELECT aid, first_value(aid) OVER w, count(*) OVER w FROM pgbench_accounts WINDOW w AS ( PARTITION BY bid ORDER BY aid ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING AFTER MATCH SKIP PAST LAST ROW INITIAL PATTERN (START UP+) DEFINE START AS TRUE, UP AS aid > PREV(aid) ); Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
walwriter interacts quite badly with synchronous_commit=off
Hi, I recently mentioned to Heikki that I was seeing latch related wakeups being frequent enough to prevent walwriter from doing a whole lot of work. He asked me to write that set of concerns up, which seems quite fair... Here's a profile of walwriter while the following pgbench run was ongoing: c=1;psql -Xq -c 'drop table if exists testtable_logged; CREATE TABLE testtable_logged(v int not null default 0);' && PGOPTIONS='-c synchronous_commit=off' pgbench -n -c$c -j$c -Mprepared -T150 -f <(echo 'INSERT INTO testtable_logged DEFAULT VALUES;') -P1 Looking at top, walwriter is around 15-20% busy with this workload. Unfortunately a profile quickly shows that little of that work is useful: perf record --call-graph dwarf -m16M -p $(pgrep -f 'walwriter') sleep 3 - 94.42% 0.00% postgres postgres [.] AuxiliaryProcessMain AuxiliaryProcessMain - WalWriterMain + 78.26% WaitLatch + 14.01% XLogBackgroundFlush + 0.51% pgstat_report_wal 0.29% ResetLatch 0.13% pgstat_flush_io + 0.02% asm_sysvec_apic_timer_interrupt 0.01% HandleWalWriterInterrupts (inlined) Confirmed by the distribution of what syscalls are made: perf trace -m128M --summary -p $(pgrep -f 'walwriter') sleep 5 syscallcalls errors total min avg max stddev (msec)(msec)(msec)(msec) (%) --- -- - - - -- epoll_wait216610 0 3744.984 0.000 0.017 0.113 0.03% read 216602 0 333.905 0.001 0.002 0.029 0.03% fdatasync 27 094.703 1.939 3.50811.279 8.83% pwrite642998 015.646 0.004 0.005 0.027 0.45% openat 2 0 0.019 0.006 0.010 0.013 34.84% close 2 0 0.004 0.002 0.002 0.003 25.76% We're doing far more latch related work than actual work. The walwriter many many times wakes up without having to do anything. And if you increase the number of clients to e.g. c=8, it gets worse in some ways: perf trace: epoll_wait291512 0 2364.067 0.001 0.008 0.693 0.10% read 290938 0 479.837 0.001 0.002 0.020 0.05% fdatasync146 0 410.043 2.508 2.809 7.006 1.90% futex 56384 43982 183.896 0.001 0.003 2.791 1.65% pwrite64 17058 0 105.625 0.004 0.006 4.015 4.61% clock_nanosleep1 0 1.063 1.063 1.063 1.063 0.00% openat 9 0 0.072 0.006 0.008 0.014 14.35% close 9 0 0.018 0.002 0.002 0.003 5.55% Note that we 5x more lock waits (the futex calls) than writes! I think the problem is mainly that XLogSetAsyncXactLSN() wakes up walwriter whenever it is sleeping, regardless of whether the modified asyncXactLSN will lead to a write. We even wake up walwriter when we haven't changed asyncXactLSN, because our LSN is older than some other backends! So often we'll just wake up walwriter, which finds no work, immediately goes to sleep, just to be woken again. Because of the inherent delay between the checks of XLogCtl->WalWriterSleeping and Latch->is_set, we also sometimes end up with multiple processes signalling walwriter, which can be bad, because it increases the likelihood that some of the signals may be received when we are already holding WALWriteLock, delaying its release... Because of the frequent wakeups, we do something else that's not smart: We write out 8k blocks individually, many times a second. Often thousands of 8k pwrites a second. We also acquire WALWriteLock and call WaitXLogInsertionsToFinish(), even if could already know we're not going to flush! Not cheap, when you do it this many times a second. There is an absolutely basic optimization, helping a it at higher client counts: Don't wake if the new asyncXactLSN is <= the old one. But it doesn't help that much. I think the most important optimization we need is to have XLogSetAsyncXactLSN() only wake up if there is a certain amount of unflushed WAL. Unless walsender is hibernating, walsender will wake up on its own after wal_writer_delay. I think we can just reuse WalWriterFlushAfter for this. E.g. a condition like if (WriteRqstPtr <= LogwrtResult.Write + WalWriterFlushAfter * XLOG_BLCKSZ) return; drastically cuts down on the amount of wakeups, without - I think - loosing guarantees around synchronous_commit=off. 1 client: before: tps = 42926.288765 (without initial connection time) syscallcalls errors total min avg max stddev
Re: CRC32C Parallel Computation Optimization on ARM
On Wed, Oct 25, 2023 at 12:37:45AM +0300, Heikki Linnakangas wrote: > On 25/10/2023 00:18, Nathan Bossart wrote: >> Actually, since the pg_waldump benchmark likely only involves very small >> WAL records, it would make sense that there isn't much difference. >> *facepalm* > > No need to guess, pg_waldump -z will tell you what the record size is. And > you can vary it by changing the checkpoint interval and/or pgbench scale > factor: if you checkpoint frequently or if the database is larger, you get > more full-page images which makes the records larger on average, and vice > versa. If you are looking at computing the CRC of records with arbitrary sizes, why not just generating a series with pg_logical_emit_message() before doing a comparison with pg_waldump or a custom replay loop to go through the records? At least it would make the results more predictible. -- Michael signature.asc Description: PGP signature
Re: CRC32C Parallel Computation Optimization on ARM
On 25/10/2023 00:18, Nathan Bossart wrote: On Tue, Oct 24, 2023 at 04:09:54PM -0500, Nathan Bossart wrote: I'm able to reproduce the speedup with the provided benchmark on an Apple M1 Pro (which appears to have the required instructions). There was almost no change for the 512-byte case, but there was a ~60% speedup for the 4096-byte case. However, I couldn't produce any noticeable speedup with Heikki's pg_waldump benchmark [0]. I haven't had a chance to dig further, unfortunately. Assuming I'm not doing something wrong, I don't think such a result should necessarily disqualify this optimization, though. Actually, since the pg_waldump benchmark likely only involves very small WAL records, it would make sense that there isn't much difference. *facepalm* No need to guess, pg_waldump -z will tell you what the record size is. And you can vary it by changing the checkpoint interval and/or pgbench scale factor: if you checkpoint frequently or if the database is larger, you get more full-page images which makes the records larger on average, and vice versa. -- Heikki Linnakangas Neon (https://neon.tech)
Re: Add new for_each macros for iterating over a List that do not require ListCell pointer
On Tue, Oct 24, 2023 at 06:58:04PM +0200, Jelte Fennema wrote: > On Tue, 24 Oct 2023 at 18:47, Nathan Bossart wrote: >> BTW after applying your patches, initdb began failing with the following >> for me: >> >> TRAP: failed Assert("n >= 0 && n < list->length"), File: "list.c", >> Line: 770, PID: 902807 > > Oh oops... That was an off by one error in the modified > foreach_delete_current implementation. > Attached is a fixed version. Thanks, that fixed it for me, too. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: CRC32C Parallel Computation Optimization on ARM
On Tue, Oct 24, 2023 at 04:09:54PM -0500, Nathan Bossart wrote: > I'm able to reproduce the speedup with the provided benchmark on an Apple > M1 Pro (which appears to have the required instructions). There was almost > no change for the 512-byte case, but there was a ~60% speedup for the > 4096-byte case. > > However, I couldn't produce any noticeable speedup with Heikki's pg_waldump > benchmark [0]. I haven't had a chance to dig further, unfortunately. > Assuming I'm not doing something wrong, I don't think such a result should > necessarily disqualify this optimization, though. Actually, since the pg_waldump benchmark likely only involves very small WAL records, it would make sense that there isn't much difference. *facepalm* -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: CRC32C Parallel Computation Optimization on ARM
On Fri, Oct 20, 2023 at 05:18:56PM +0900, Michael Paquier wrote: > On Fri, Oct 20, 2023 at 07:08:58AM +, Xiang Gao wrote: >> This patch uses a parallel computing optimization algorithm to >> improve crc32c computing performance on ARM. The algorithm comes >> from Intel whitepaper: >> crc-iscsi-polynomial-crc32-instruction-paper. Input data is divided >> into three equal-sized blocks.Three parallel blocks (crc0, crc1, >> crc2) for 1024 Bytes.One Block: 42(BLK_LENGTH) * 8(step length: >> crc32c_u64) bytes >> >> Crc32c unitest: >> https://gist.github.com/gaoxyt/138fd53ca1eead8102eeb9204067f7e4 >> Crc32c benchmark: >> https://gist.github.com/gaoxyt/4506c10fc06b3501445e32c4257113e9 >> It gets ~2x speedup compared to linear Arm crc32c instructions. > > Interesting. Could you attached to this thread the test files you > used and the results obtained please? If this data gets deleted from > github, then it would not be possible to refer back to what you did at > the related benchmark results. > > Note that your patch is forgetting about meson; it just patches > ./configure. I'm able to reproduce the speedup with the provided benchmark on an Apple M1 Pro (which appears to have the required instructions). There was almost no change for the 512-byte case, but there was a ~60% speedup for the 4096-byte case. However, I couldn't produce any noticeable speedup with Heikki's pg_waldump benchmark [0]. I haven't had a chance to dig further, unfortunately. Assuming I'm not doing something wrong, I don't think such a result should necessarily disqualify this optimization, though. [0] https://postgr.es/m/ec487192-f6aa-509a-cacb-6642dad14209%40iki.fi -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?
> On 24 Oct 2023, at 22:34, Tom Lane wrote: > > Daniel Gustafsson writes: >> I went ahead and applied this on master, thanks for review! Now to see if >> there will be any noticeable difference in resource usage. > > I think that tools like Coverity are likely to whine about your > use of sprintf instead of snprintf. Sure, it's perfectly safe, > but that won't stop the no-sprintf-ever crowd from complaining. Fair point, that's probably quite likely to happen. I can apply an snprintf() conversion change like this in the two places introduced by this: -sprintf(s, "%d", port); +sprintf(s, sizeof(s), "%d", port); -- Daniel Gustafsson
Re: Document aggregate functions better w.r.t. ORDER BY
On Tue, Dec 13, 2022 at 07:38:15PM -0700, David G. Johnston wrote: > All, > > The recent discussion surrounding aggregates and ORDER BY moved me to look > over > our existing documentation, especially now that we've reworked the function > tables, to see what improvements can be had by simply documenting those > functions where ORDER BY may change the user-visible output. I skipped range > aggregates for the moment but handled the others on the aggregates page (not > window functions). This includes the float types for sum and avg. > > I added a note just before the table linking back to the syntax chapter and > describing the newly added rules and syntax choice in the table. > > The nuances of floating point math suggest to me that specifying order by for > those is in some kind of gray area and so I've marked it optional...any > suggestions for wording (or an xref) to explain those nuances or should it > just > be shown non-optional like the others? Or not shown at all? > > The novelty of my examples is up for bikeshedding. I didn't want anything too > long so a subquery didn't make sense, and I was trying to avoid duplication as > well as multiple lines - hence creating a CTE that can be copied onto all of > the example queries to produce the noted result. > > I added a DISTINCT example to array_agg because it is the first aggregate on > the page and so hopefully will be seen during a cursory reading. Plus, > array_agg is the go-to function for doing this kind of experimentation. I like this idea, though the examples seemed too detailed so I skipped them. Here is the trimmed-down patch I would like to apply. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you. diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 7c3e940afe..35d8924c6b 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -20273,6 +20273,18 @@ SELECT NULLIF(value, '(none)') ... aggregation. + + +The aggregates in this section all accept an optional ORDER +BY clause as described in the general syntax rules +for aggregates (see ). +input_sort_expr specifies the column names to sort +by before being supplied to the aggreagate function. For example, +array_agg(... ORDER BY ...) produces an array with +ordered values. + + + General-Purpose Aggregate Functions @@ -20310,7 +20322,7 @@ SELECT NULLIF(value, '(none)') ... array_agg -array_agg ( anynonarray ) +array_agg ( anynonarray ORDER BY input_sort_expr) anyarray @@ -20321,7 +20333,7 @@ SELECT NULLIF(value, '(none)') ... -array_agg ( anyarray ) +array_agg ( anyarray ORDER BY input_sort_expr ) anyarray @@ -20356,11 +20368,11 @@ SELECT NULLIF(value, '(none)') ... numeric -avg ( real ) +avg ( real ORDER BY input_sort_expr ) double precision -avg ( double precision ) +avg ( double precision ORDER BY input_sort_expr ) double precision @@ -20526,14 +20538,14 @@ SELECT NULLIF(value, '(none)') ... json_agg -json_agg ( anyelement ) +json_agg ( anyelement ORDER BY input_sort_expr ) json jsonb_agg -jsonb_agg ( anyelement ) +jsonb_agg ( anyelement ORDER BY input_sort_expr ) jsonb @@ -20573,7 +20585,8 @@ SELECT NULLIF(value, '(none)') ... json_object_agg ( key "any", value - "any" ) + "any" + ORDER BY input_sort_expr ) json @@ -20582,7 +20595,8 @@ SELECT NULLIF(value, '(none)') ... jsonb_object_agg ( key "any", value - "any" ) + "any" + ORDER BY input_sort_expr ) jsonb @@ -20819,7 +20833,8 @@ SELECT NULLIF(value, '(none)') ... string_agg ( value - bytea, delimiter bytea ) + bytea, delimiter bytea + ORDER BY input_sort_expr ) bytea @@ -20851,11 +20866,11 @@ SELECT NULLIF(value, '(none)') ... numeric -sum ( real ) +sum ( real ORDER BY input_sort_expr ) real -sum ( double precision ) +sum ( double precision ORDER BY input_sort_expr ) double precision @@ -20877,7 +20892,7 @@ SELECT NULLIF(value, '(none)') ... xmlagg -xmlagg ( xml ) +xmlagg ( xml ORDER BY input_sort_expr ) xml
RFC: Pluggable TOAST
Hi hackers! We need community feedback on previously discussed topic [1]. There are some long-live issues in Postgres related to the TOAST mechanics, like [2]. Some time ago we already proposed a set of patches with an API allowing to plug in different TOAST implementations into a live database. The patch set introduced a lot of code and was quite crude in some places, so after several implementations we decided to try to implement it in the production environment for further check-up. The main idea behind pluggable TOAST is make it possible to easily plug in and use different implementations of large values storage, preserving existing mechanics to keep backward compatibilitну provide easy Postgres-way give users alternative mechanics for storing large column values in a more effective way - we already have custom and very effective (up to tens and even hundreds of times faster) TOAST implementations for bytea and JSONb data types. As we see it - Pluggable TOAST proposes 1) changes in TOAST pointer itself, extending it to store custom data - current limitations of TOAST pointer were discussed in [1] and [4]; 2) API which allows calls of custom TOAST implementations for certain table columns and (a topic for discussion) certain datatypes. Custom TOAST could be also used in a not so trivial way - for example, limited columnar storage could be easily implemented and plugged in without heavy core modifications of implementation of Pluggable Storage (Table Access Methods), preserving existing data and database structure, be upgraded, replicated and so on. Any thoughts and proposals are welcome. [1] Pluggable TOAST https://www.postgresql.org/message-id/flat/224711f9-83b7-a307-b17f-4457ab73aa0a%40sigaev.ru [2] Infinite loop while acquiring new TOAST Oid https://www.postgresql.org/message-id/flat/CAN-LCVPRvRzxeUdYdDCZ7UwZQs1NmZpqBUCd%3D%2BRdMPFTyt-bRQ%40mail.gmail.com [3] JSONB Compression dictionaries https://www.postgresql.org/message-id/flat/CAJ7c6TOtAB0z1UrksvGTStNE-herK-43bj22%3D5xVBg7S4vr5rQ%40mail.gmail.com [4] Extending the TOAST pointer https://www.postgresql.org/message-id/flat/CAN-LCVMq2X%3Dfhx7KLxfeDyb3P%2BBXuCkHC0g%3D9GF%2BJD4izfVa0Q%40mail.gmail.com -- Regards, Nikita Malakhov Postgres Professional The Russian Postgres Company https://postgrespro.ru/
Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?
Daniel Gustafsson writes: > I went ahead and applied this on master, thanks for review! Now to see if > there will be any noticeable difference in resource usage. I think that tools like Coverity are likely to whine about your use of sprintf instead of snprintf. Sure, it's perfectly safe, but that won't stop the no-sprintf-ever crowd from complaining. regards, tom lane
Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?
> On 13 Sep 2023, at 01:49, Andres Freund wrote: > My patch increased user/sys time a bit (likely due to a higher number of > futile psql forks), but Daniel's doesn't. And it does show a nice overall wall > clock time saving. I went ahead and applied this on master, thanks for review! Now to see if there will be any noticeable difference in resource usage. -- Daniel Gustafsson
Re: Replace references to malloc() in libpq documentation with generic language
> On 24 Oct 2023, at 17:07, Tom Lane wrote: > Daniel Gustafsson writes: >> I do agree with this proposed change though: > >> - all the space that will be freed by . >> + all the memory that will be freed by . > > +1, seems harmless. I've pushed this part, skipping the rest. -- Daniel Gustafsson
Re: [PATCH] Improve amcheck to also check UNIQUE constraint in btree index.
On Wed, Sep 28, 2022 at 11:44 AM Aleksander Alekseev wrote: > > I think, this patch was marked as "Waiting on Author", probably, by > > mistake. Since recent changes were done without any significant code > > changes and CF bot how happy again. > > > > I'm going to move it to RfC, could I? If not, please tell why. > > I restored the "Ready for Committer" state. I don't think it's a good > practice to change the state every time the patch has a slight > conflict or something. This is not helpful at all. Such things happen > quite regularly and typically are fixed in a couple of days. This patch seems useful to me. I went through the thread, it seems that all the critics are addressed. I've rebased this patch. Also, I've run perltidy for tests, split long errmsg() into errmsg(), errdetail() and errhint(), and do other minor enchantments. I think this patch is ready to go. I'm going to push it if there are no objections. -- Regards, Alexander Korotkov 0001-Teach-contrib-amcheck-to-check-the-unique-constr-v18.patch Description: Binary data
Re: MERGE ... RETURNING
On Wed, 2023-08-23 at 11:58 +0100, Dean Rasheed wrote: > Updated version attached, fixing an uninitialized-variable warning > from the cfbot. I took another look and I'm still not comfortable with the special IsMergeSupportFunction() functions. I don't object necessarily -- if someone else wants to commit it, they can -- but I don't plan to commit it in this form. Can we revisit the idea of a per-WHEN RETURNING clause? The returning clauses could be treated kind of like a UNION, which makes sense because it really is a union of different results (the returned tuples from an INSERT are different than the returned tuples from a DELETE). You can just add constants to the target lists to distinguish which WHEN clause they came from. I know you rejected that approach early on, but perhaps it's worth discussing further? Regards, Jeff Davis
Re: Row pattern recognition
On Sat, Oct 21, 2023 at 7:39 PM Tatsuo Ishii wrote: > Attached is the v10 patch. This version enhances the performance of > pattern matching. Nice! I've attached a couple of more stressful tests (window partitions of 1000 rows each). Beware that the second one runs my desktop out of memory fairly quickly with the v10 implementation. I was able to carve out some time this week to implement a very basic recursive NFA, which handles both the + and * qualifiers (attached). It's not production quality -- a frame on the call stack for every row isn't going to work -- but with only those two features, it's pretty tiny, and it's able to run the new stress tests with no issue. If I continue to have time, I hope to keep updating this parallel implementation as you add features to the StringSet implementation, and we can see how it evolves. I expect that alternation and grouping will ratchet up the complexity. Thanks! --Jacob From fb3cbb6f99f0fe7b05027759454d7e0013225929 Mon Sep 17 00:00:00 2001 From: Jacob Champion Date: Fri, 20 Oct 2023 16:11:14 -0700 Subject: [PATCH 2/2] squash! Row pattern recognition patch (executor). - Extract pattern matching logic into match_pattern(). - Replace the str_set/initials implementation with a recursive backtracker. - Rework match_pattern in preparation for * quantifier: separate success from number of matched rows. - Handle `*` quantifier - Simplify the evaluate_pattern() signature. - Remove defineInitial and related code. --- src/backend/executor/nodeWindowAgg.c| 871 +++- src/backend/optimizer/plan/createplan.c | 4 - src/backend/parser/parse_clause.c | 31 - src/include/nodes/execnodes.h | 1 - src/include/nodes/parsenodes.h | 2 - src/include/nodes/plannodes.h | 3 - 6 files changed, 91 insertions(+), 821 deletions(-) diff --git a/src/backend/executor/nodeWindowAgg.c b/src/backend/executor/nodeWindowAgg.c index 2e1baef7ea..30abe60159 100644 --- a/src/backend/executor/nodeWindowAgg.c +++ b/src/backend/executor/nodeWindowAgg.c @@ -161,40 +161,6 @@ typedef struct WindowStatePerAggData boolrestart;/* need to restart this agg in this cycle? */ } WindowStatePerAggData; -/* - * Set of StringInfo. Used in RPR. - */ -typedef struct StringSet { - StringInfo *str_set; - Sizeset_size; /* current array allocation size in number of items */ - int set_index; /* current used size */ -} StringSet; - -/* - * Allowed subsequent PATTERN variables positions. - * Used in RPR. - * - * pos represents the pattern variable defined order in DEFINE caluase. For - * example. "DEFINE START..., UP..., DOWN ..." and "PATTERN START UP DOWN UP" - * will create: - * VariablePos[0].pos[0] = 0; START - * VariablePos[1].pos[0] = 1; UP - * VariablePos[1].pos[1] = 3; UP - * VariablePos[2].pos[0] = 2; DOWN - * - * Note that UP has two pos because UP appears in PATTERN twice. - * - * By using this strucrture, we can know which pattern variable can be followed - * by which pattern variable(s). For example, START can be followed by UP and - * DOWN since START's pos is 0, and UP's pos is 1 or 3, DOWN's pos is 2. - * DOWN can be followed by UP since UP's pos is either 1 or 3. - * - */ -#define NUM_ALPHABETS 26 /* we allow [a-z] variable initials */ -typedef struct VariablePos { - int pos[NUM_ALPHABETS]; /* postion(s) in PATTERN */ -} VariablePos; - static void initialize_windowaggregate(WindowAggState *winstate, WindowStatePerFunc perfuncstate, WindowStatePerAgg peraggstate); @@ -249,27 +215,11 @@ static void register_reduced_frame_map(WindowAggState *winstate, int64 pos, int static void clear_reduced_frame_map(WindowAggState *winstate); static void update_reduced_frame(WindowObject winobj, int64 pos); -static int64 evaluate_pattern(WindowObject winobj, int64 current_pos, - char *vname, StringInfo encoded_str, bool *result); +static bool evaluate_pattern(WindowObject winobj, int64 current_pos, +char *vname); static bool get_slots(WindowObject winobj, int64 current_pos); -//static int search_str_set(char *pattern, StringSet *str_set, int *initial_orders); -static int search_str_set(char *pattern, StringSet *str_set, VariablePos *variable_pos); -static charpattern_initial(WindowAggState *winstate, char *vname); -static int do_pattern_match(char *pattern, char *encoded_str); - -static StringSet *string_set_init(void); -static void string_set_add(StringSet *string_set, StringInfo str); -static StringInfo string_set_get(StringSet *string_set, int index); -static int
Re: Bug: RLS policy FOR SELECT is used to check new rows
Greetings, On Tue, Oct 24, 2023 at 14:42 Robert Haas wrote: > On Tue, Oct 24, 2023 at 1:46 PM Jeff Davis wrote: > > Perhaps the idea is that if there are constraints involved, the failure > > or success of an INSERT/UPDATE/DELETE could leak information that you > > don't have privileges to read. > > My recollection of this topic is pretty hazy, but like Tom, I seem to > remember it being intentional, and I think the reason had something to > do with wanting the slice of a RLS-protect table that you can see to > feel like a complete table. When you update a row in a table all of > which is visible to you, the updated row can never vanish as a result > of that update, so it was thought, if I remember correctly, that this > should also be true here. It's also similar to what happens if an > updatable view has WITH CHECK OPTION, and I think that was part of the > precedent as well. I don't know whether or not the constraint issue > that you mention here was also part of the concern, but it may have > been. This was all quite a while ago... Yes, having it be similar to a view WITH CHECK OPTION was intentional, also on not wishing for things to be able to disappear or to not get saved. The risk of a constraint possibly causing the leak of information is better than either having data just thrown away or having the constraint not provide the guarantee it’s supposed to … Thanks, Stephen (On my phone at an event currently, sorry for not digging in deeper on this..) >
Re: Bug: RLS policy FOR SELECT is used to check new rows
On Tue, Oct 24, 2023 at 1:46 PM Jeff Davis wrote: > Perhaps the idea is that if there are constraints involved, the failure > or success of an INSERT/UPDATE/DELETE could leak information that you > don't have privileges to read. My recollection of this topic is pretty hazy, but like Tom, I seem to remember it being intentional, and I think the reason had something to do with wanting the slice of a RLS-protect table that you can see to feel like a complete table. When you update a row in a table all of which is visible to you, the updated row can never vanish as a result of that update, so it was thought, if I remember correctly, that this should also be true here. It's also similar to what happens if an updatable view has WITH CHECK OPTION, and I think that was part of the precedent as well. I don't know whether or not the constraint issue that you mention here was also part of the concern, but it may have been. This was all quite a while ago... -- Robert Haas EDB: http://www.enterprisedb.com
Re: Bug: RLS policy FOR SELECT is used to check new rows
On Tue, 2023-10-24 at 11:59 -0400, Tom Lane wrote: > I'm fairly sure that it was intentional, but I don't recall the > reasoning; perhaps Stephen does. In any case, I grasp your point > that maybe we should distinguish RETURNING from not-RETURNING cases. Perhaps the idea is that if there are constraints involved, the failure or success of an INSERT/UPDATE/DELETE could leak information that you don't have privileges to read. Regards, Jeff Davis
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Tue, Oct 24, 2023 at 06:04:13PM +0200, Alvaro Herrera wrote: > Like everybody else, I like having less GUCs to configure, but going > this far to avoid them looks rather disastrous to me. IMO we should > just use Munro's older patches that gave one GUC per SLRU, and users > only need to increase the one that shows up in pg_wait_event sampling. > Someday we will get the (much more complicated) patches to move these > buffers to steal memory from shared buffers, and that'll hopefully let > use get rid of all this complexity. +1 > On the other hand, here's a somewhat crazy idea. What if, instead of > stealing buffers from shared_buffers (which causes a lot of complexity), > we allocate a common pool for all SLRUs to use? We provide a single > knob -- say, non_relational_buffers=32MB as default -- and we use a LRU > algorithm (or something) to distribute that memory across all the SLRUs. > So the ratio to use for this SLRU or that one would depend on the nature > of the workload: maybe more for multixact in this server here, but more > for subtrans in that server there; it's just the total amount that the > user would have to configure, side by side with shared_buffers (and > perhaps scale with it like wal_buffers), and the LRU would handle the > rest. The "only" problem here is finding a distribution algorithm that > doesn't further degrade performance, of course ... I think it's worth a try. It does seem simpler, and it might allow us to sidestep some concerns about scaling when the SLRU pages are in shared_buffers [0]. [0] https://postgr.es/m/ZPsaEGRvllitxB3v%40tamriel.snowman.net -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Add new for_each macros for iterating over a List that do not require ListCell pointer
On Tue, 24 Oct 2023 at 18:47, Nathan Bossart wrote: > BTW after applying your patches, initdb began failing with the following > for me: > > TRAP: failed Assert("n >= 0 && n < list->length"), File: "list.c", > Line: 770, PID: 902807 Oh oops... That was an off by one error in the modified foreach_delete_current implementation. Attached is a fixed version. v2-0002-Use-new-for_each_xyz-macros-in-a-few-places.patch Description: Binary data v2-0001-Add-macros-for-looping-through-a-list-without-nee.patch Description: Binary data
Re: LLVM 16 (opaque pointers)
On Mon, Oct 23, 2023 at 10:47:24PM -0400, Tom Lane wrote: > Andres Freund writes: > > FC 29 is well out of support, so I don't think it makes sense to invest any > > further time in this. Personally, I don't think it's useful to have years > > old > > fedora in the buildfarm... > > +1. It's good to test old LTS distros, but Fedora releases have a > short shelf life by design. I'll start retiring those old Fedora ones I have. :) Regards, Mark
Re: Add new for_each macros for iterating over a List that do not require ListCell pointer
On Tue, Oct 24, 2023 at 06:03:48PM +0200, Jelte Fennema wrote: > Many usages of the foreach macro in the Postgres codebase only use the > ListCell variable to then get its value. This adds macros that > simplify iteration code for that very common use case. Instead of > passing a ListCell you can pass a variable of the type of its > contents. This IMHO improves readability of the code by reducing the > total amount of code while also essentially forcing the use of useful > variable names. > > While this might seem like a small quality of life improvement, in > practice it turns out to be very nice to use. At Microsoft we have > been using macros very similar to these ones in the Citus codebase for > a long time now and we pretty much never use plain foreach anymore for > new code. This seems reasonable to me. > Finally, I guess there needs to be some bikeshedding on the naming. In > the Citus codebase we call them foreach_xyz instead of the > for_each_xyz naming pattern that is used in this patchset. I'm not > sure what the current stance is on if foreach should be written with > or without an underscore between for and each. Currently pg_list.h > uses both. I don't have a strong opinion on the matter, but if I had to choose, I guess I'd pick foreach_*() because these macros are most closely related to foreach(). BTW after applying your patches, initdb began failing with the following for me: TRAP: failed Assert("n >= 0 && n < list->length"), File: "list.c", Line: 770, PID: 902807 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Adding a pg_get_owned_sequence function?
On Tue, Sep 12, 2023 at 03:53:28PM +0100, Dagfinn Ilmari Mannsåker wrote: > Tom Lane writes: >> It's possible that we could get away with just summarily changing >> the argument type from text to regclass. ISTR that we did exactly >> that with nextval() years ago, and didn't get too much push-back. >> But we couldn't do the same for the return type. Also, this >> approach does nothing for the concern about the name being >> misleading. > > Maybe we should go all the way the other way, and call it > pg_get_identity_sequence() and claim that "serial" is a legacy form of > identity columns? Hm. Could we split it into two functions, pg_get_owned_sequence() and pg_get_identity_sequence()? I see that commit 3012061 [0] added support for identity columns to pg_get_serial_sequence() because folks expected that to work, so maybe that's a good reason to keep them together. If we do elect to keep them combined, I'd be okay with renaming it to pg_get_identity_sequence() along with your other proposed changes. [0] https://postgr.es/m/20170912212054.25640.55202%40wrigleys.postgresql.org -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: trying again to get incremental backup
On Tue, Oct 24, 2023 at 10:53 AM Peter Eisentraut wrote: > The easiest answer is to have it off by default. Let people figure out > what works for them. There are various factors like storage, network, > server performance, RTO that will determine what combination of full > backup, incremental backup, and WAL replay will satisfy someone's > requirements. I suppose tests could be set up to determine this to some > degree. But we could also start slow and let people figure it out > themselves. When pg_basebackup was added, it was also disabled by default. > > If we think that 7d is a good setting, then I would suggest to consider, > like 10d. Otherwise, if you do a weekly incremental backup and you have > a time change or a hiccup of some kind one day, you lose your backup > sequence. > > Another possible answer is, like, 400 days? Because why not? What is a > reasonable upper limit for this? In concept, I don't think this should even be time-based. What you should do is remove WAL summaries once you know that you've taken as many incremental backups that might use them as you're ever going to do. But PostgreSQL itself doesn't have any way of knowing what your intended backup patterns are. If your incremental backup fails on Monday night and you run it manually on Tuesday morning, you might still rerun it as an incremental backup. If it fails every night for a month and you finally realize and decide to intervene manually, maybe you want a new full backup at that point. It's been a month. But on the other hand maybe you don't. There's no time-based answer to this question that is really correct, and I think it's quite possible that your backup software might want to shut off time-based deletion altogether and make its own decisions about when to nuke summaries. However, I also don't think that's a great default setting. It could easily lead to people wasting a bunch of disk space for no reason. As far as the 7d value, I figured that nighty incremental backups would be fairly common. If we think weekly incremental backups would be common, then pushing this out to 10d would make sense. While there's no reason you couldn't take an annual incremental backup, and thus want a 400d value, it seems like a pretty niche use case. Note that whether to remove summaries is a separate question from whether to generate them in the first place. Right now, I have wal_summarize_mb controlling whether they get generated in the first place, but as I noted in another recent email, that isn't an entirely satisfying solution. > It looks like each file entry in the manifest takes about 150 bytes, so > 1 GB would allow for 1024**3/150 = 7158278 files. That seems fine for now? I suspect a few people have more files than that. They'll just have to wait to use this feature until we get incremental JSON parsing (or undo the decision to use JSON for the manifest). > The current user experience of pg_basebackup is that it waits possibly a > long time for a checkpoint, and there is an option to make it go faster, > but there is no timeout AFAICT. Is this substantially different? Could > we just let it wait forever? We could. I installed the timeout because the first versions of the feature were buggy, and I didn't like having my tests hang forever with no indication of what had gone wrong. At least in my experience so far, the time spent waiting for WAL summarization is typically quite short, because only the WAL that needs to be summarized is whatever was emitted since the last time it woke up up through the start LSN of the backup. That's probably not much, and the next time the summarizer wakes up, the file should appear within moments. So, it's a little different from the checkpoint case, where long waits are expected. > Also, does waiting for checkpoint and WAL summarization happen in > parallel? If so, what if it starts a checkpoint that might take 15 min > to complete, and then after 60 seconds it kicks you off because the WAL > summarization isn't ready. That might be wasteful. It is not parallel. The trouble is, we don't really have any way to know whether WAL summarization is going to fail for whatever reason. We don't expect that to happen, but if somebody changes the permissions on the WAL summary directory or attaches gdb to the WAL summarizer process or something of that sort, it might. We could check at the outset whether we seem to be really far behind and emit a warning. For instance, if we're 1TB behind on WAL summarization when the checkpoint is requested, chances are something is busted and we're probably not going to catch up any time soon. We could warn the user about that and let them make their own decision about whether to cancel. But, that idea won't help in unattended operation, and the threshold for "really far behind" is not very clear. It might be better to wait until we get more experience with how things actually fail before doing too much engineering here, but I'm also open to
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On 2023-Oct-11, Dilip Kumar wrote: > In my last email, I forgot to give the link from where I have taken > the base path for dividing the buffer pool in banks so giving the same > here[1]. And looking at this again it seems that the idea of that > patch was from Andrey M. Borodin and the idea of the SLRU scale factor > were introduced by Yura Sokolov and Ivan Lazarev. Apologies for > missing that in the first email. You mean [1]. [1] https://postgr.es/m/452d01f7e331458f56ad79bef537c31b%40postgrespro.ru I don't like this idea very much, because of the magic numbers that act as ratios for numbers of buffers on each SLRU compared to other SLRUs. These values, which I took from the documentation part of the patch, appear to have been selected by throwing darts at the wall: NUM_CLOG_BUFFERS= Min(128 << slru_buffers_size_scale, shared_buffers/256) NUM_COMMIT_TS_BUFFERS = Min(128 << slru_buffers_size_scale, shared_buffers/256) NUM_SUBTRANS_BUFFERS= Min(64 << slru_buffers_size_scale, shared_buffers/256) NUM_NOTIFY_BUFFERS = Min(32 << slru_buffers_size_scale, shared_buffers/256) NUM_SERIAL_BUFFERS = Min(32 << slru_buffers_size_scale, shared_buffers/256) NUM_MULTIXACTOFFSET_BUFFERS = Min(32 << slru_buffers_size_scale, shared_buffers/256) NUM_MULTIXACTMEMBER_BUFFERS = Min(64 << slru_buffers_size_scale, shared_buffers/256) ... which look pretty random already, if similar enough to the current hardcoded values. In reality, the code implements different values than what the documentation says. I don't see why would CLOG have the same number as COMMIT_TS, when the size for elements of the latter is like 32 times bigger -- however, the frequency of reads for COMMIT_TS is like 1000x smaller than for CLOG. SUBTRANS is half of CLOG, yet it is 16 times larger, and it covers the same range. The MULTIXACT ones appear to keep the current ratio among them (8/16 gets changed to 32/64). ... and this whole mess is scaled exponentially without regard to the size that each SLRU requires. This is just betting that enough memory can be wasted across all SLRUs up to the point where the one that is actually contended has sufficient memory. This doesn't sound sensible to me. Like everybody else, I like having less GUCs to configure, but going this far to avoid them looks rather disastrous to me. IMO we should just use Munro's older patches that gave one GUC per SLRU, and users only need to increase the one that shows up in pg_wait_event sampling. Someday we will get the (much more complicated) patches to move these buffers to steal memory from shared buffers, and that'll hopefully let use get rid of all this complexity. I'm inclined to use Borodin's patch last posted here [2] instead of your proposed 0001. [2] https://postgr.es/m/93236d36-b91c-4dfa-af03-99c083840...@yandex-team.ru I did skim patches 0002 and 0003 without going into too much detail; they look reasonable ideas. I have not tried to reproduce the claimed performance benefits. I think measuring this patch set with the tests posted by Shawn Debnath in [3] is important, too. [3] https://postgr.es/m/yemddpmrsojfq...@f01898859afd.ant.amazon.com On the other hand, here's a somewhat crazy idea. What if, instead of stealing buffers from shared_buffers (which causes a lot of complexity), we allocate a common pool for all SLRUs to use? We provide a single knob -- say, non_relational_buffers=32MB as default -- and we use a LRU algorithm (or something) to distribute that memory across all the SLRUs. So the ratio to use for this SLRU or that one would depend on the nature of the workload: maybe more for multixact in this server here, but more for subtrans in that server there; it's just the total amount that the user would have to configure, side by side with shared_buffers (and perhaps scale with it like wal_buffers), and the LRU would handle the rest. The "only" problem here is finding a distribution algorithm that doesn't further degrade performance, of course ... -- Álvaro HerreraBreisgau, Deutschland — https://www.EnterpriseDB.com/ "The problem with the facetime model is not just that it's demoralizing, but that the people pretending to work interrupt the ones actually working." -- Paul Graham, http://www.paulgraham.com/opensource.html
Add new for_each macros for iterating over a List that do not require ListCell pointer
Many usages of the foreach macro in the Postgres codebase only use the ListCell variable to then get its value. This adds macros that simplify iteration code for that very common use case. Instead of passing a ListCell you can pass a variable of the type of its contents. This IMHO improves readability of the code by reducing the total amount of code while also essentially forcing the use of useful variable names. While this might seem like a small quality of life improvement, in practice it turns out to be very nice to use. At Microsoft we have been using macros very similar to these ones in the Citus codebase for a long time now and we pretty much never use plain foreach anymore for new code. Finally, I guess there needs to be some bikeshedding on the naming. In the Citus codebase we call them foreach_xyz instead of the for_each_xyz naming pattern that is used in this patchset. I'm not sure what the current stance is on if foreach should be written with or without an underscore between for and each. Currently pg_list.h uses both. P.S. Similar macros for forboth/forthree are also possible, but require an exponential macro count handle all different possibilities, which might not be worth the effort since forboth/forthree are used much less often than foreach. In Citus we do have 3 forboth macros that don't require ListCell for the most common cases (foreach_ptr, foreach_ptr_oid, foreach_int_oid). But I did not want to clutter this patchset with that discussion. v1-0001-Add-macros-for-looping-through-a-list-without-nee.patch Description: Binary data v1-0002-Use-new-for_each_xyz-macros-in-a-few-places.patch Description: Binary data
Re: Bug: RLS policy FOR SELECT is used to check new rows
Dean Rasheed writes: > On Tue, 24 Oct 2023 at 09:36, Laurenz Albe wrote: >> I'd say that this error is wrong. The FOR SELECT policy should be applied >> to the WHERE condition, but certainly not to check new rows. > Yes, I had the same thought recently. I would say that the SELECT > policies should only be used to check new rows if the UPDATE has a > RETURNING clause and SELECT permissions are required on the target > relation. > In other words, it should be OK to UPDATE a row to new values that are > not visible according to the table's SELECT policies, provided that > the UPDATE command does not attempt to return those new values. That > would be consistent with what we do for INSERT. > Note, that the current behaviour goes back a long way, though it's not > quite clear whether this was intentional [1]. I'm fairly sure that it was intentional, but I don't recall the reasoning; perhaps Stephen does. In any case, I grasp your point that maybe we should distinguish RETURNING from not-RETURNING cases. regards, tom lane
Re: [PATCH] Extend ALTER OPERATOR to support adding commutator, negator, hashes, and merges
Christoph Berg writes: > Anyway, if this doesn't raise any "oh we didn't think of this" > concerns, we'll just remove the old operators in pgsphere. Well, the idea was exactly to forbid that sort of setup. However, if we get sufficient pushback maybe we should reconsider --- for example, maybe it'd be sane to enforce the restriction in ALTER but not CREATE? I'm inclined to wait and see if there are more complaints. regards, tom lane
Re: [PATCH] Extend ALTER OPERATOR to support adding commutator, negator, hashes, and merges
Re: Tom Lane > > We might be able to simply delete the @ operators, but doesn't this > > new check break the general possibility to have more than one spelling > > for the same operator? > > You can have more than one operator atop the same function. > But why didn't you make the @ operators commutators of each other, > rather than this mess? Historical something. You are right that the commutators could be fixed that way, but the negators are a different question. There is no legacy spelling for these. Anyway, if this doesn't raise any "oh we didn't think of this" concerns, we'll just remove the old operators in pgsphere. Christoph
RE: CDC/ETL system on top of logical replication with pgoutput, custom client
Hi there, hope to find you well. I have a follow-up question to this already long thread. Upon deploying my PostgreSQL logical replication fed application on a stale database, I ended up running out of space, as the replication slot is being held back till the next time that we receive a data-changing event, and we advance to that new LSN offset. I think that the solution for this is to advance our LSN offset every time a keep-alive message is received ('k' // 107). My doubt is, can the keep-alive messages be received in between open transaction events? I think not, but I would like to get your input to be extra sure as if this happens, and I commit that offset, I may introduce again faulty logic leading to data loss. In sum, something like this wouldn't happen: BEGIN LSN001 INSERT LSN002 KEEP LIVE LSN003 UPDATE LSN004 COMMIT LSN005 Correct? It has to be either: KEEP LIVE LSN001 BEGIN LSN002 INSERT LSN003 UPDATE LSN004 COMMIT LSN005 Or: BEGIN LSN001 INSERT LSN002 UPDATE LSN004 COMMIT LSN005 KEEP LIVE LSN006 LSNXXX are mere representations of LSN offsets. Thank you again. Regards, José Neves De: Amit Kapila Enviado: 8 de agosto de 2023 14:37 Para: José Neves Cc: Andres Freund ; pgsql-hack...@postgresql.org Assunto: Re: CDC/ETL system on top of logical replication with pgoutput, custom client On Mon, Aug 7, 2023 at 1:46 PM José Neves wrote: > > Humm, that's... challenging. I faced some issues after "the fix" because I > had a couple of transactions with 25k updates, and I had to split it to be > able to push to our event messaging system, as our max message size is 10MB. > Relying on commit time would mean that all transaction operations will have > the same timestamp. If something goes wrong while my worker is pushing that > transaction data chunks, I will duplicate some data in the next run, so... > this wouldn't allow me to deal with data duplication. > Is there any other way that you see to deal with it? > > Right now I only see an option, which is to store all processed LSNs on the > other side of the ETL. I'm trying to avoid that overhead. > Sorry, I don't understand your system enough to give you suggestions but if you have any questions related to how logical replication work then I might be able to help. -- With Regards, Amit Kapila.
Re: [PATCH] Extend ALTER OPERATOR to support adding commutator, negator, hashes, and merges
Christoph Berg writes: > This change is breaking pgsphere which has <@ @> operator pairs, but > for historical reasons also includes alternative spellings of these > operators (both called @ with swapped operand types) which now > explodes because we can't add them with the "proper" commutator and > negators declared (which point to the canonical <@ @> !<@ !@> > operators). Should have guessed that somebody might be depending on the previous squishy behavior. Still, I can't see how the above situation is a good idea. Commutators/negators should come in pairs, not have completely random links. I think it's only accidental that this setup isn't triggering other strange behavior. > We might be able to simply delete the @ operators, but doesn't this > new check break the general possibility to have more than one spelling > for the same operator? You can have more than one operator atop the same function. But why didn't you make the @ operators commutators of each other, rather than this mess? regards, tom lane
Re: Replace references to malloc() in libpq documentation with generic language
Daniel Gustafsson writes: >> On 24 Oct 2023, at 07:13, Gurjeet Singh wrote: >> The user does not benefit from knowing that libpq allocates some/all memory >> using malloc(). Mentioning malloc() here has a few downsides, and almost no >> benefits. > I'm not entirely convinced that replacing "malloc" with "allocated on the > heap" > improves the documentation. That was my reaction too. The underlying storage allocator *is* malloc, and C programmers know what that is, and I don't see how obfuscating that improves matters. It's true that on the miserable excuse for a platform that is Windows, you have to use PQfreemem because of Microsoft's inability to supply a standards-compliant implementation of malloc. But I'm not inclined to let that tail wag the dog. > I do agree with this proposed change though: > - all the space that will be freed by . > + all the memory that will be freed by . +1, seems harmless. regards, tom lane
Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements
On 24.10.23 14:40, Julien Rouhaud wrote: On Tue, Oct 24, 2023 at 6:57 PM Peter Eisentraut wrote: On 24.10.23 09:58, Andrei Zubkov wrote: During last moving to the current commitfest this patch have lost its reviewers list. With respect to reviewers contribution in this patch, I think reviewers list should be fixed. I don't think it's the purpose of the commitfest app to track how *has* reviewed a patch. The purpose is to plan and allocate *current* work. If we keep a bunch of reviewers listed on a patch who are not actually reviewing the patch, then that effectively blocks new reviewers from signing up and the patch will not make progress. Past reviewers should of course be listed in the commit message, the release notes, etc. as appropriate. Really? Last time this topic showed up at least one committer said that they tend to believe the CF app more than digging the thread [1], and some other hackers mentioned other usage for being kept in the reviewer list. Since no progress has been made on the CF app since I'm not sure it's the best idea to drop reviewer names from patch entries generally. There is a conflict between the two purposes. But it is clearly the case that reviewers will more likely pick up patches that have no reviewers assigned. So if you keep stale reviewer entries around, then a patch that stays around for a while will never get reviewed again. I think this is a significant problem at the moment, and I made it part of my mission during the last commitfest to clean it up. If people want to put the stale reviewer entries back in, that is possible, but I would caution against that, because that would just self-sabotage those patches.
Re: trying again to get incremental backup
On 04.10.23 22:08, Robert Haas wrote: - I would like some feedback on the generation of WAL summary files. Right now, I have it enabled by default, and summaries are kept for a week. That means that, with no additional setup, you can take an incremental backup as long as the reference backup was taken in the last week. File removal is governed by mtimes, so if you change the mtimes of your summary files or whack your system clock around, weird things might happen. But obviously this might be inconvenient. Some people might not want WAL summary files to be generated at all because they don't care about incremental backup, and other people might want them retained for longer, and still other people might want them to be not removed automatically or removed automatically based on some criteria other than mtime. I don't really know what's best here. I don't think the default policy that the patches implement is especially terrible, but it's just something that I made up and I don't have any real confidence that it's wonderful. The easiest answer is to have it off by default. Let people figure out what works for them. There are various factors like storage, network, server performance, RTO that will determine what combination of full backup, incremental backup, and WAL replay will satisfy someone's requirements. I suppose tests could be set up to determine this to some degree. But we could also start slow and let people figure it out themselves. When pg_basebackup was added, it was also disabled by default. If we think that 7d is a good setting, then I would suggest to consider, like 10d. Otherwise, if you do a weekly incremental backup and you have a time change or a hiccup of some kind one day, you lose your backup sequence. Another possible answer is, like, 400 days? Because why not? What is a reasonable upper limit for this? - It's regrettable that we don't have incremental JSON parsing; I think that means anyone who has a backup manifest that is bigger than 1GB can't use this feature. However, that's also a problem for the existing backup manifest feature, and as far as I can see, we have no complaints about it. So maybe people just don't have databases with enough relations for that to be much of a live issue yet. I'm inclined to treat this as a non-blocker, It looks like each file entry in the manifest takes about 150 bytes, so 1 GB would allow for 1024**3/150 = 7158278 files. That seems fine for now? - Right now, I have a hard-coded 60 second timeout for WAL summarization. If you try to take an incremental backup and the WAL summaries you need don't show up within 60 seconds, the backup times out. I think that's a reasonable default, but should it be configurable? If yes, should that be a GUC or, perhaps better, a pg_basebackup option? The current user experience of pg_basebackup is that it waits possibly a long time for a checkpoint, and there is an option to make it go faster, but there is no timeout AFAICT. Is this substantially different? Could we just let it wait forever? Also, does waiting for checkpoint and WAL summarization happen in parallel? If so, what if it starts a checkpoint that might take 15 min to complete, and then after 60 seconds it kicks you off because the WAL summarization isn't ready. That might be wasteful. - I'm curious what people think about the pg_walsummary tool that is included in 0006. I think it's going to be fairly important for debugging, but it does feel a little bit bad to add a new binary for something pretty niche. This seems fine. Is the WAL summary file format documented anywhere in your patch set yet? My only thought was, maybe the file format could be human-readable (more like backup_label) to avoid this. But maybe not.
Should Explain show Parallel Hash node’s total rows?
Hi, all Shall we show Parallel Hash node’s total rows of a Parallel-aware HashJoin? Ex: a non-parallel plan, table simple has 2 rows. zml=# explain select count(*) from simple r join simple s using (id); QUERY PLAN Aggregate (cost=1309.00..1309.01 rows=1 width=8) -> Hash Join (cost=617.00..1259.00 rows=2 width=0) Hash Cond: (r.id = s.id ) -> Seq Scan on simple r (cost=0.00..367.00 rows=2 width=4) -> Hash (cost=367.00..367.00 rows=2 width=4) -> Seq Scan on simple s (cost=0.00..367.00 rows=2 width=4) (6 rows) While a parallel-aware plan: zml=# explain select count(*) from simple r join simple s using (id); QUERY PLAN Finalize Aggregate (cost=691.85..691.86 rows=1 width=8) -> Gather (cost=691.63..691.84 rows=2 width=8) Workers Planned: 2 -> Partial Aggregate (cost=691.63..691.64 rows=1 width=8) -> Parallel Hash Join (cost=354.50..670.80 rows=8333 width=0) Hash Cond: (r.id = s.id ) -> Parallel Seq Scan on simple r (cost=0.00..250.33 rows=8333 width=4) -> Parallel Hash (cost=250.33..250.33 rows=8333 width=4) -> Parallel Seq Scan on simple s (cost=0.00..250.33 rows=8333 width=4) (9 rows) When initial_cost_hashjoin(), we undo the parallel division when parallel ware. It’s reasonable because a shared hash table should have all the data. And we also take parallel into account for hash plan’s total rows if it’s parallel aware. ``` if (best_path->jpath.path.parallel_aware) { hash_plan->plan.parallel_aware = true; hash_plan->rows_total = best_path->inner_rows_total; } ``` But the Parallel Hash node of plan shows the same rows with subplan, I’m wandering if it’s more reasonable to show rows_total instead of plan_rows for Parallel Hash nodes? For this example, -> Parallel Hash (rows=2) -> Parallel Seq Scan on simple s (rows=8333) Zhang Mingli HashData https://www.hashdata.xyz
Re: run pgindent on a regular basis / scripted manner
On 2023-10-18 We 05:07, Jelte Fennema wrote: I think --enable-indent-checks sounds like a good improvement to the status quo. But I'm not confident that it will help remove the cases where only a comment needs to be re-indented. Do commiters really always run check-world again when only changing a typo in a comment? I know I probably wouldn't (or at least not always). Yeah. In fact I'm betting that a lot of the offending commits we've seen come into this category. You build, you check, then you do some final polish. That's where a pre-commit hook can save you. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: [PATCH] Extend ALTER OPERATOR to support adding commutator, negator, hashes, and merges
Re: Tommy Pavlicek > I've added another patch (0002-require_unused_neg_com-v1.patch) that > prevents using a commutator or negator that's already part of a pair. Hmm. I agree with the general idea of adding sanity checks, but this might be overzealous: This change is breaking pgsphere which has <@ @> operator pairs, but for historical reasons also includes alternative spellings of these operators (both called @ with swapped operand types) which now explodes because we can't add them with the "proper" commutator and negators declared (which point to the canonical <@ @> !<@ !@> operators). https://github.com/postgrespro/pgsphere/blob/master/pgs_moc_compat.sql.in We might be able to simply delete the @ operators, but doesn't this new check break the general possibility to have more than one spelling for the same operator? Christoph
Re: run pgindent on a regular basis / scripted manner
On Tue, Oct 24, 2023 at 09:40:01AM -0400, Andrew Dunstan wrote: > Slightly off topic, but apropos your message, maybe we should recommend a > standard git commit template. I use this and then automatically remove any sections that are empty. --- |--- gitweb subject length limit |-email limit-| Reported-by: Diagnosed-by: Bug: Discussion: Author: Reviewed-by: Tested-by: Backpatch-through: -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
Re: run pgindent on a regular basis / scripted manner
On 2023-10-17 Tu 09:52, Robert Haas wrote: On Tue, Oct 17, 2023 at 6:34 AM Jelte Fennema wrote: I think *it is* dead easy to comply. If you run the following commands before committing/after rebasing, then koel should always be happy: src/tools/pgindent/pgindent src # works always but a bit slow src/tools/pgindent/pgindent $(git diff --name-only --diff-filter=ACMR) # much faster, but only works if you DID NOT change typedefs.list In isolation, that's true, but the list of mistakes that you can make while committing which will inconvenience everyone working on the project is very long. Another one that comes up frequently is forgetting to bump CATALOG_VERSION_NO, but you also need a good commit message, and good comments, and a good Discussion link in the commit message, and the right list of authors and reviewers, and to update the docs (with spaces, not tabs) and the Makefiles (with tabs, not spaces) and the meson stuff and, as if that weren't enough already, you actually need the code to work! And that includes not only working regularly but also with CLOBBER_CACHE_ALWAYS and debug_parallel_query and so on. It's very easy to miss something somewhere. I put a LOT of work into polishing my commits before I push them, and it's still not that uncommon that I screw something up. Yes, there's a lot to look out for, and you're a damn sight better at it than I am. But we should try to automate the things that can be automated, even if that leaves many tasks that can't be. I have three things in my pre-commit hook: a check for catalog updates, a check for new typedefs, and an indent check. And every one of them has saved me from doing things I should not be doing. They aren't perfect but they are useful. Slightly off topic, but apropos your message, maybe we should recommend a standard git commit template. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: More new SQL/JSON item methods
On 2023-10-19 Th 02:06, Jeevan Chalke wrote: Thanks, Peter for the comments. On Fri, Oct 6, 2023 at 5:13 PM Peter Eisentraut wrote: On 29.08.23 09:05, Jeevan Chalke wrote: > v1-0001-Implement-jsonpath-.bigint-.integer-and-.number-m.patch > > This commit implements jsonpath .bigint(), .integer(), and .number() > methods. The JSON string or a numeric value is converted to the > bigint, int4, and numeric type representation. A comment that applies to all of these: These add various keywords, switch cases, documentation entries in some order. Are we happy with that? Should we try to reorder all of that for better maintainability or readability? Yeah, that's the better suggestion. While implementing these methods, I was confused about where to put them exactly and tried keeping them in some logical place. I think once these methods get in, we can have a follow-up patch reorganizing all of these. I think it would be better to organize things how we want them before adding in more stuff. cheers andrew -- Andrew Dunstan EDB:https://www.enterprisedb.com
Re: Add trailing commas to enum definitions
On 2023-10-23 Mo 17:04, Tom Lane wrote: Nathan Bossart writes: From a long-term perspective, I think standardizing on the trailing comma style will actually improve git-blame because patches won't need to add a comma to the previous line when adding a value. Yeah, that's a good point. I had been leaning towards "this is unnecessary churn", but with that idea I'm now +1. +1. It's a fairly common practice in Perl code, too, and I often do it for exactly this reason. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Various bugs if segment containing redo pointer does not exist
On Mon, Oct 23, 2023 at 8:43 PM Andres Freund wrote: > The source of the emode=13=DEBUG2 is that that's hardcoded in > WaitForWALToBecomeAvailable(). I guess the error ought to come from > XLogPageRead(), but all that happens is this: > > case XLREAD_FAIL: > if (readFile >= 0) > close(readFile); > readFile = -1; > readLen = 0; > readSource = XLOG_FROM_ANY; > return XLREAD_FAIL; I've been under the impression that the guiding principle here is that we shouldn't error out upon hitting a condition that should only cause us to switch sources. I think WaitForWALToBecomeAvailable() is supposed to set things up so that XLogPageRead()'s call to pg_pread() will succeed. If it says it can't, then XLogPageRead() is only obliged to pass that information up to the caller, who can decide to wait longer for the data to show up, or give up, or whatever it wants to do. On the other hand, if WaitForWALToBecomeAvailable() says that it's fine to go ahead and call pg_pread() and pg_pread() then fails, then that means that we've got a problem with the WAL file other than it just not being available yet, like it's the wrong length or there was an I/O error, and those are reportable errors. Said differently, in the former case, the WAL is not broken, merely not currently available; in the latter case, it's broken. The legibility and maintainability of this code are certainly not great. The xlogreader mechanism is extremely useful, but maybe we should have done more cleanup of the underlying mechanism first. It's a giant ball of spaghetti code that is challenging to understand and almost impossible to validate thoroughly (as you just discovered). -- Robert Haas EDB: http://www.enterprisedb.com
Re: Query execution in Perl TAP tests needs work
On 2023-10-18 We 11:47, Tom Lane wrote: Robert Haas writes: On Wed, Oct 18, 2023 at 10:28 AM Tom Lane wrote: I did a bit of research on this on my favorite platforms, and did not like the results: Hmm. That's unfortunate. Is perl -MCPAN -e 'install Platypus::FFI' a viable alternative? Probably, see my followup. Interesting. OK, here's an attempt to push the cart a bit further down the road. The attached module wraps quite a lot of libpq, at least enough for most of the cases we would be interested in, I think. It also exports some constants such as connection status values, query status values, transaction status values and type Oids. It also makes the absence of FFI::Platypus not instantly fatal, but any attempt to use one of the wrapped functions will die with a message about the module being missing if it's not found. I guess the next step would be for someone to locate some of the hotspots in the TAP tests and try to convert them to using persistent connections with this gadget or similar and see how much faster we can make them. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com PqFFI.pm Description: Perl program
Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements
On Tue, Oct 24, 2023 at 6:57 PM Peter Eisentraut wrote: > > On 24.10.23 09:58, Andrei Zubkov wrote: > > During last moving to the current commitfest this patch have lost its > > reviewers list. With respect to reviewers contribution in this patch, I > > think reviewers list should be fixed. > > I don't think it's the purpose of the commitfest app to track how *has* > reviewed a patch. The purpose is to plan and allocate *current* work. > If we keep a bunch of reviewers listed on a patch who are not actually > reviewing the patch, then that effectively blocks new reviewers from > signing up and the patch will not make progress. > > Past reviewers should of course be listed in the commit message, the > release notes, etc. as appropriate. Really? Last time this topic showed up at least one committer said that they tend to believe the CF app more than digging the thread [1], and some other hackers mentioned other usage for being kept in the reviewer list. Since no progress has been made on the CF app since I'm not sure it's the best idea to drop reviewer names from patch entries generally. [1] https://www.postgresql.org/message-id/552155.1648737...@sss.pgh.pa.us
make pg_ctl start more friendly
I tried PITR recovery, and the 'recovery_target_action' guc is shutdown. I got a failure, and it told me to check the log, finally I found the result was due to guc. I think pg_ctl should print some information which told users recovery had been done. I developed a commit in my workspace. The steps below: 1. postmaster exits with code 3 if startup shutdowns because of recovery target action 2. add enum POSTMASER_RECOVERY_SHUTDOWN in pg_ctl 3. print information to stderr if the postmaster's exit code is 3 I test, and it's ok. I think this information is very useful, especially for some beginners. A good project not only needs performance, but also ease-of-use.
Re: RFC: Logging plan of the running query
On 2023-10-24 16:12, Étienne BERSAC wrote: Hi, Yeah, and when we have a situation where we want to run pg_log_query_plan(), we can run it in any environment as long as it is bundled with the core. Is it possible to get the plan of a backend programmatically without access to the logs? Something like pg_get_query_plan(pid, format) which output would be the same as EXPLAIN. Do you imagine a function like below? =# select pg_get_query_plan(100, 'plain'); QUERY PLAN --- Limit (cost=0.00..0.04 rows=1 width=273) -> Seq Scan on pg_class (cost=0.00..17.14 rows=414 width=273) If so, we once tried to implement such function for getting memory contexts. However, this attempt didn't succeed because it could lead dead lock situation[1]. I think the same problem can occur when implementing pg_get_query_plan(). [1] https://www.postgresql.org/message-id/9a50371e15e741e295accabc72a41df1%40oss.nttdata.com Regards, Étienne BERSAC Dalibo -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation
Re: trying again to get incremental backup
On Mon, Oct 23, 2023 at 7:56 PM David Steele wrote: > > I also think a lot of the use of the low-level API is > > driven by it being just too darn slow to copy the whole database, and > > incremental backup can help with that in some circumstances. > > I would argue that restore performance is *more* important than backup > performance and this patch is a step backward in that regard. Backups > will be faster and less space will be used in the repository, but > restore performance is going to suffer. If the deltas are very small the > difference will probably be negligible, but as the deltas get large (and > especially if there are a lot of them) the penalty will be more noticeable. I think an awful lot depends here on whether the repository is local or remote. If you have filesystem access to wherever the backups are stored anyway, I don't think that using pg_combinebackup to write out a new data directory is going to be much slower than copying one data directory from the repository to wherever you'd actually use the backup. It may be somewhat slower because we do need to access some data in every involved backup, but I don't think it should be vastly slower because we don't have to read every backup in its entirety. For each file, we read the (small) header of the newest incremental file and every incremental file that precedes it until we find a full file. Then, we construct a map of which blocks need to be read from which sources and read only the required blocks from each source. If all the blocks are coming from a single file (because there are no incremental for a certain file, or they contain no blocks) then we just copy the entire source file in one shot, which can be optimized using the same tricks we use elsewhere. Inevitably, this is going to read more data and do more random I/O than just a flat copy of a directory, but it's not terrible. The overall amount of I/O should be a lot closer to the size of the output directory than to the sum of the sizes of the input directories. Now, if the repository is remote, and you have to download all of those backups first, and then run pg_combinebackup on them afterward, that is going to be unpleasant, unless the incremental backups are all quite small. Possibly this could be addressed by teaching pg_combinebackup to do things like accessing data over HTTP and SSH, and relatedly, looking inside tarfiles without needing them unpacked. For now, I've left those as ideas for future improvement, but I think potentially they could address some of your concerns here. A difficulty is that there are a lot of protocols that people might want to use to push bytes around, and it might be hard to keep up with the march of progress. I do agree, though, that there's no such thing as a free lunch. I wouldn't recommend to anyone that they plan to restore from a chain of 100 incremental backups. Not only might it be slow, but the opportunities for something to go wrong are magnified. Even if you've automated everything well enough that there's no human error involved, what if you've got a corrupted file somewhere? Maybe that's not likely in absolute terms, but the more files you've got, the more likely it becomes. What I'd suggest someone do instead is periodically do pg_combinebackup full_reference_backup oldest_incremental -o new_full_reference_backup; rm -rf full_reference_backup; mv new_full_reference_backup full_reference_backup. The new full reference backup is intended to still be usable for restoring incrementals based on the incremental it replaced. I hope that, if people use the feature well, this should limit the need for really long backup chains. I am sure, though, that some people will use it poorly. Maybe there's room for more documentation on this topic. > I was concerned with the difficulty of trying to stage the correct > backups for pg_combinebackup, not whether it would recognize that the > needed data was not available and then error appropriately. The latter > is surmountable within pg_combinebackup but the former is left up to the > user. Indeed. > One note regarding the patches. I feel like > v5-0005-Prototype-patch-for-incremental-backup should be split to have > the WAL summarizer as one patch and the changes to base backup as a > separate patch. > > It might not be useful to commit one without the other, but it would > make for an easier read. Just my 2c. Yeah, maybe so. I'm not quite ready to commit to doing that split as of this writing but I will think about it and possibly do it. -- Robert Haas EDB: http://www.enterprisedb.com
Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements
On 24.10.23 09:58, Andrei Zubkov wrote: During last moving to the current commitfest this patch have lost its reviewers list. With respect to reviewers contribution in this patch, I think reviewers list should be fixed. I don't think it's the purpose of the commitfest app to track how *has* reviewed a patch. The purpose is to plan and allocate *current* work. If we keep a bunch of reviewers listed on a patch who are not actually reviewing the patch, then that effectively blocks new reviewers from signing up and the patch will not make progress. Past reviewers should of course be listed in the commit message, the release notes, etc. as appropriate.
Open a streamed block for transactional messages during decoding
Hi, While reviewing the test_decoding code, I noticed that when skip_empty_xacts option is specified, it doesn't open the streaming block( e.g. pg_output_stream_start) before streaming the transactional MESSAGE even if it's the first change in a streaming block. It looks inconsistent with what we do when streaming DML changes(e.g. pg_decode_stream_change()). Here is a small patch to open the stream block in this case. Best Regards, Hou Zhijie 0001-Open-a-streamed-block-for-transactional-messages-dur.patch Description: 0001-Open-a-streamed-block-for-transactional-messages-dur.patch
d844cd75a and postgres_fdw
Hi, I'm seeing an issue after upgrading from 12.13 to 15.4. This happens when we run a query against a foreign table (fdw on the same instance to a different database) -- but does not appear when we get rid of postgres_fdw: ERROR: cursor can only scan forward HINT: Declare it with SCROLL option to enable backward scan. CONTEXT: remote SQL command: MOVE BACKWARD ALL IN c1 SQL state: 55000 I attached the query. The name of the foreign table is "foobar.sys_user". Looks like the bug #17889, and this is the last email in that thread: https://www.postgresql.org/message-id/1852635.1682808624%40sss.pgh.pa.us OTOH, same query works (against the FDW) when we remove the following WHERE clause: WHERE tbl.table_status = 'A' AND tbl.table_id <> 1 AND tbl.table_id <> - 2 Any hints? Regards, -- Devrim Gündüz Open Source Solution Architect, PostgreSQL Major Contributor Twitter: @DevrimGunduz , @DevrimGunduzTR SELECT * FROM foobar.sys_data_table_info AS tbl INNER JOIN foobar.sys_folder_items fi ON tbl.table_id = fi.item_id AND fi.item_type = 'TABLE' LEFT JOIN foobar.sys_user_foreign uc ON tbl.created_by = uc.user_id LEFT JOIN foobar.sys_user_foreign uu ON tbl.updated_by = uu.user_id WHERE tbl.table_status = 'A' AND tbl.table_id <> 1 AND tbl.table_id <> - 2 ORDER BY updated_at DESC NULLS LAST
RE: logical decoding and replication of sequences, take 2
On Thursday, October 12, 2023 11:06 PM Tomas Vondra wrote: > Hi, I have been reviewing the patch set, and here are some initial comments. 1. I think we need to mark the RBTXN_HAS_STREAMABLE_CHANGE flag for transactional sequence change in ReorderBufferQueueChange(). 2. ReorderBufferSequenceIsTransactional It seems we call the above function once in sequence_decode() and call it again in ReorderBufferQueueSequence(), would it better to avoid the second call as the hashtable search looks not cheap. 3. The patch cleans up the sequence hash table when COMMIT or ABORT a transaction (via ReorderBufferAbort() and ReorderBufferReturnTXN()), while it doesn't seem destory the hash table when PREPARE the transaction. It's not a big porblem but would it be better to release the memory earlier by destory the table for prepare ? 4. +pg_decode_stream_sequence(LogicalDecodingContext *ctx, ReorderBufferTXN *txn, ... + /* output BEGIN if we haven't yet, but only for the transactional case */ + if (transactional) + { + if (data->skip_empty_xacts && !txndata->xact_wrote_changes) + { + pg_output_begin(ctx, data, txn, false); + } + txndata->xact_wrote_changes = true; + } I think we should call pg_output_stream_start() instead of pg_output_begin() for streaming sequence changes. 5. + /* +* Schema should be sent using the original relation because it +* also sends the ancestor's relation. +*/ + maybe_send_schema(ctx, txn, relation, relentry); The comment seems a bit misleading here, I think it was used for the partition logic in pgoutput_change(). Best Regards, Hou zj
A case for GIST supporting ORDER BY
Hi, Some time ago I’ve provided some details with the issues we face when trying to use GIST and partitioning at the same time in the postgresql-general mailing list: https://www.postgresql.org/message-id/3FA1E0A9-8393-41F6-88BD-62EEEA1EC21F%40kleczek.org GIST index and ORDER BY postgresql.org We decided to go with the solution to partition our table by: RANGE (‘2100-01-01' <-> operation_date). While it (somewhat) solves partition pruning issues described above there is another problem: It is impossible to create a unique constraint on the partitioned table. So now we cannot use INSERT … ON CONFLICT (…) DO UPDATE My question to hackers: Would it be feasible to implement ORDER BY column GIST index (only) scan for types with total order and sensible greatest and least values? Thanks, Michal
Re: Bug: RLS policy FOR SELECT is used to check new rows
On Tue, 24 Oct 2023 at 09:36, Laurenz Albe wrote: > > I'd say that this error is wrong. The FOR SELECT policy should be applied > to the WHERE condition, but certainly not to check new rows. > Yes, I had the same thought recently. I would say that the SELECT policies should only be used to check new rows if the UPDATE has a RETURNING clause and SELECT permissions are required on the target relation. In other words, it should be OK to UPDATE a row to new values that are not visible according to the table's SELECT policies, provided that the UPDATE command does not attempt to return those new values. That would be consistent with what we do for INSERT. Note, that the current behaviour goes back a long way, though it's not quite clear whether this was intentional [1]. [1] https://github.com/postgres/postgres/commit/7d8db3e8f37aec9d252353904e77381a18a2fa9f Regards, Dean
Re: Synchronizing slots from primary to standby
Hi, On 10/24/23 7:44 AM, Ajin Cherian wrote: On Mon, Oct 23, 2023 at 11:22 PM Drouvot, Bertrand wrote: @@ -602,6 +602,9 @@ CreateDecodingContext(XLogRecPtr start_lsn, SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn); } + /* set failover in the slot, as requested */ + slot->data.failover = ctx->failover; + I think we can get rid of this change in CreateDecodingContext(). Yes, I too noticed this in my testing, however just removing this from CreateDecodingContext will not allow us to change the slot's failover flag using Alter subscription. Oh right. I am thinking of moving this change to StartLogicalReplication prior to calling CreateDecodingContext by parsing the command options in StartReplicationCmd without adding it to the LogicalDecodingContext. Yeah, that looks like a good place to update "failover". Doing more testing and I have a couple of remarks about he current behavior. 1) Let's imagine that: - there is no standby - standby_slot_names is set to a valid slot on the primary (but due to the above, not linked to any standby) - then a create subscription on a subscriber WITH (failover = true) would start the synchronisation but never finish (means leaving a "synchronisation" slot like "pg_32811_sync_24576_7293415241672430356" in place coming from ReplicationSlotNameForTablesync()). That's expected, but maybe we should emit a warning in WalSndWaitForStandbyConfirmation() on the primary when there is a slot part of standby_slot_names which is not active/does not have an active_pid attached to it? 2) When we create a subscription, another slot is created during the subscription synchronization, namely like "pg_16397_sync_16388_7293447291374081805" (coming from ReplicationSlotNameForTablesync()). This extra slot appears to have failover also set to true. So, If the standby refresh the list of slot to sync when the subscription is still synchronizing we'd see things like on the standby: LOG: waiting for remote slot "mysub" LSN (0/C0034808) and catalog xmin (756) to pass local slot LSN (0/C0034840) and and catalog xmin (756) LOG: wait over for remote slot "mysub" as its LSN (0/C00368B0) and catalog xmin (756) has now passed local slot LSN (0/C0034840) and catalog xmin (756) LOG: waiting for remote slot "pg_16397_sync_16388_7293447291374081805" LSN (0/C0034808) and catalog xmin (756) to pass local slot LSN (0/C00368E8) and and catalog xmin (756) WARNING: slot "pg_16397_sync_16388_7293447291374081805" disappeared from the primary, aborting slot creation I'm not sure this "pg_16397_sync_16388_7293447291374081805" should have failover set to true. If there is a failover during the subscription creation, better to re-launch the subscription instead? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Simplify create_merge_append_path a bit for clarity
Hi! On 11.08.2023 05:31, Richard Guo wrote: As explained in the comments for generate_orderedappend_paths, we don't currently support parameterized MergeAppend paths, and it doesn't seem like going to change anytime soon. Based on that, we could simplify create_merge_append_path a bit, such as set param_info to NULL directly rather than call get_appendrel_parampathinfo() for it. We already have an Assert on that in create_merge_append_plan. I understand that the change would not make any difference for performance, it's just for clarity's sake. I agree with you, and we can indeed directly set the param_info value to NULL, and there are enough comments here to explain. I didn't find anything else to add in your patch. -- Regards, Alena Rybakina
Bug: RLS policy FOR SELECT is used to check new rows
Try this as a user with NOBYPASSRLS: CREATE TABLE rlsbug (deleted boolean); INSERT INTO rlsbug VALUES (FALSE); CREATE POLICY p_sel ON rlsbug FOR SELECT TO laurenz USING (NOT deleted); CREATE POLICY p_upd ON rlsbug FOR UPDATE TO laurenz USING (TRUE); ALTER TABLE rlsbug ENABLE ROW LEVEL SECURITY; ALTER TABLE rlsbug FORCE ROW LEVEL SECURITY; UPDATE rlsbug SET deleted = TRUE WHERE NOT deleted; ERROR: new row violates row-level security policy for table "rlsbug" I'd say that this error is wrong. The FOR SELECT policy should be applied to the WHERE condition, but certainly not to check new rows. Yours, Laurenz Albe
Re: Replace references to malloc() in libpq documentation with generic language
> On 24 Oct 2023, at 07:13, Gurjeet Singh wrote: > The user does not benefit from knowing that libpq allocates some/all memory > using malloc(). Mentioning malloc() here has a few downsides, and almost no > benefits. I'm not entirely convinced that replacing "malloc" with "allocated on the heap" improves the documentation. I do agree with this proposed change though: - all the space that will be freed by . + all the memory that will be freed by . -- Daniel Gustafsson
Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows
At Tue, 24 Oct 2023 07:37:22 +, "Hayato Kuroda (Fujitsu)" wrote in > Dear Horiguchi-san, > > Thanks for updates! I was quite not sure the Windows env, but I can post > comments. > (We need reviews by windows-friendly developers...) Indeed, I haven't managed to successfully build using Meson on Windows... > Formatting of messages for write_stderr() seem different from others. In v3, > I slightly modified for readability like below. I wanted to let you know just > in case > because you did not say anything about these changes... Ah. Sorry, I was lazy about the messages because I didn't regard this to be at that stage yet. In the attached, fixed the existing two messages, and adjusted one message to display an error code, all in the consistent format. Thanks! -- Kyotaro Horiguchi NTT Open Source Software Center >From 65013b5ccebd101f91be0e46b5209dc9cfcc7d5f Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Tue, 24 Oct 2023 11:43:57 +0900 Subject: [PATCH v5 1/3] Disable autoruns for cmd.exe on Windows On Windows, we use cmd.exe to launch the postmaster process for easy redirection setup. However, cmd.exe may execute other programs at startup due to autorun configurations. This patch adds /D flag to the launcher cmd.exe command line to disable autorun settings written in the registry. --- src/bin/pg_ctl/pg_ctl.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c index 807d7023a9..3ac2fcc004 100644 --- a/src/bin/pg_ctl/pg_ctl.c +++ b/src/bin/pg_ctl/pg_ctl.c @@ -553,11 +553,11 @@ start_postmaster(void) else close(fd); - cmd = psprintf("\"%s\" /C \"\"%s\" %s%s < \"%s\" >> \"%s\" 2>&1\"", + cmd = psprintf("\"%s\" /D /C \"\"%s\" %s%s < \"%s\" >> \"%s\" 2>&1\"", comspec, exec_path, pgdata_opt, post_opts, DEVNULL, log_file); } else - cmd = psprintf("\"%s\" /C \"\"%s\" %s%s < \"%s\" 2>&1\"", + cmd = psprintf("\"%s\" /D /C \"\"%s\" %s%s < \"%s\" 2>&1\"", comspec, exec_path, pgdata_opt, post_opts, DEVNULL); if (!CreateRestrictedProcess(cmd, , false)) -- 2.39.3 >From 18c9a319b5160721d08cc1d3a3df770a63756f65 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Tue, 24 Oct 2023 14:46:50 +0900 Subject: [PATCH v5 2/3] Improve pg_ctl postmaster process check on Windows Currently pg_ctl on Windows does not verify that it actually executed a postmaster process due to lack of process ID knowledge. This can lead to false positives in cases where another pg_ctl instance starts a different server simultaneously. This patch adds the capability to identify the process ID of the launched postmaster on Windows, similar to other OS versions, ensuring more reliable detection of concurrent server startups. --- src/bin/pg_ctl/pg_ctl.c | 102 1 file changed, 93 insertions(+), 9 deletions(-) diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c index 3ac2fcc004..221049db0e 100644 --- a/src/bin/pg_ctl/pg_ctl.c +++ b/src/bin/pg_ctl/pg_ctl.c @@ -132,6 +132,7 @@ static void adjust_data_dir(void); #ifdef WIN32 #include +#include static bool pgwin32_IsInstalled(SC_HANDLE); static char *pgwin32_CommandLine(bool); static void pgwin32_doRegister(void); @@ -142,6 +143,7 @@ static void WINAPI pgwin32_ServiceMain(DWORD, LPTSTR *); static void pgwin32_doRunAsService(void); static int CreateRestrictedProcess(char *cmd, PROCESS_INFORMATION *processInfo, bool as_service); static PTOKEN_PRIVILEGES GetPrivilegesToDelete(HANDLE hToken); +static pid_t pgwin32_find_postmaster_pid(pid_t shell_pid); #endif static pid_t get_pgpid(bool is_status_request); @@ -609,7 +611,11 @@ wait_for_postmaster_start(pid_t pm_pid, bool do_checkpoint) /* File is complete enough for us, parse it */ pid_t pmpid; time_t pmstart; - +#ifndef WIN32 + pid_t wait_pid = pm_pid; +#else + pid_t wait_pid = pgwin32_find_postmaster_pid(pm_pid); +#endif /* * Make sanity checks. If it's for the wrong PID, or the recorded * start time is before pg_ctl started, then either we are looking @@ -619,14 +625,8 @@ wait_for_postmaster_start(pid_t pm_pid, bool do_checkpoint) */ pmpid = atol(optlines[LOCK_FILE_LINE_PID - 1]); pmstart = atol(optlines[LOCK_FILE_LINE_START_TIME - 1]); - if (pmstart >= start_time - 2 && -#ifndef WIN32 -pmpid == pm_pid -#else - /* Windows can only reject standalone-backend PIDs */ -pmpid > 0 -#endif -) + + if (pmstart >= start_time - 2 && pmpid == wait_pid) { /* * OK, seems to be a valid pidfile from our child. Check the @@ -1950,6 +1950,90 @@ GetPrivilegesToDelete(HANDLE hToken) return tokenPrivs; } + +/* + * Find the PID of the launched postmaster. + * + * On Windows, the cmd.exe doesn't support the exec command. As a result, we + * don't directly get the postmaster's PID. This function identifies the PID of + * the postmaster started by the child cmd.exe. + * + * Returns the postmaster's
Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements
Hi, During last moving to the current commitfest this patch have lost its reviewers list. With respect to reviewers contribution in this patch, I think reviewers list should be fixed. regards, Andrei Zubkov Postgres Professional The Russian Postgres Company
Re: [PoC] pg_upgrade: allow to upgrade publisher node
On Tue, Oct 24, 2023 at 11:32 AM Hayato Kuroda (Fujitsu) wrote: > > > If we don't want to keep it generic then we should use something like > > 'contains_decodable_change'. 'is_change_decodable' could have suited > > here if we were checking a particular change. > > I kept the name for now. How does Bharath think? No more bikeshedding from my side. +1 for processing_required as-is. > > > 6. An optimization in count_old_cluster_logical_slots(void): Turn > > > slot_count to a function static variable so that the for loop isn't > > > required every time because the slot count is prepared in > > > get_old_cluster_logical_slot_infos only once and won't change later > > > on. Do you see any problem with the following? This saves a few CPU > > > cycles when there are large number of replication slots. > > > { > > > static int slot_count = 0; > > > static bool first_time = true; > > > > > > if (first_time) > > > { > > > for (int dbnum = 0; dbnum < old_cluster.dbarr.ndbs; dbnum++) > > > slot_count += old_cluster.dbarr.dbs[dbnum].slot_arr.nslots; > > > > > > first_time = false; > > > } > > > > > > return slot_count; > > > } > > > > > > > This may not be a problem but this is also not a function that will be > > used frequently. I am not sure if adding such code optimizations is > > worth it. > > Not addressed. count_old_cluster_logical_slots is being called 3 times during pg_upgrade and every time counting number of slots for all the databases seems redundant IMV especially given the fact that the slot count is computed once at the beginning and never changes. When the replication slots on the cluster are on the higher side, every time counting *may* prove costly. And, the use of static variables isn't a huge change requiring a different set of infra or as such, it's a simple pattern. Having said above, if others don't see a merit in it, I'm okay to withdraw my comment. > Below commands are an example of the test. > > ``` > # test PG9.5 -> patched HEAD > $ oldinstall=/home/hayato/older/pg95 make check > PROVE_TESTS='t/003_upgrade_logical_replication_slots.pl' Oh, I get it. Thanks. > Also, based on a comment [2], the upgrade function was renamed to > 'binary_upgrade_logical_slot_has_caught_up'. +1. I spent some time on the v57 patch and it looks good to me - tests are passing, no complaints from pgindent and pgperltidy. I turned the CF entry https://commitfest.postgresql.org/45/4273/ to RfC. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
RE: pg_ctl start may return 0 even if the postmaster has been already started on Windows
Dear Horiguchi-san, Thanks for updates! I was quite not sure the Windows env, but I can post comments. (We need reviews by windows-friendly developers...) > Other error cases will fit to "shouldn't occur under normal > conditions" errors. Formatting of messages for write_stderr() seem different from others. In v3, I slightly modified for readability like below. I wanted to let you know just in case because you did not say anything about these changes... ``` + /* create a process snapshot */ + hSnapshot = CreateToolhelp32Snapshot(TH32CS_SNAPPROCESS, 0); + if (hSnapshot == INVALID_HANDLE_VALUE) + { + write_stderr(_("%s: could not create a snapshot: error code %lu\n"), +progname, (unsigned long) GetLastError()); + exit(1); + } + + /* start iterating on the snapshot */ + ppe.dwSize = sizeof(PROCESSENTRY32); + if (!Process32First(hSnapshot, )) + { + write_stderr(_("%s: cound not retrieve information about the process: error code %lu\n"), +progname, GetLastError()); + exit(1); + } + ``` Best Regards, Hayato Kuroda FUJITSU LIMITED
Re: Synchronizing slots from primary to standby
Here are some review comments for patch v25-0002 (additional to v25-0002 review comments [1]) == src/backend/catalog/system_views.sql 1. @@ -1003,7 +1003,8 @@ CREATE VIEW pg_replication_slots AS L.safe_wal_size, L.two_phase, L.conflicting, -L.failover +L.failover, +L.synced_slot FROM pg_get_replication_slots() AS L LEFT JOIN pg_database D ON (L.datoid = D.oid); AFAICT the patch is missing PG DOCS descriptions for these new view attributes. == src/backend/replication/logical/launcher.c 2. slotsync_remove_obsolete_dbs + + /* + * TODO: Take care of of removal of old 'synced' slots for the dbs which + * are no longer eligible for slot-sync. + */ typo: "of of" ~~~ 3. + /* + * Make sure that concerned WAL is received before syncing slot to target + * lsn received from the primary. + * + * This check should never pass as on the primary, we have waited for + * standby's confirmation before updating the logical slot. But to take + * care of any bug in that flow, we should retain this check. + */ + if (remote_slot->confirmed_lsn > WalRcv->latestWalEnd) + { + ereport(LOG, + errmsg_internal("skipping sync of slot \"%s\" as the received slot-sync " + "lsn %X/%X is ahead of the standby position %X/%X", + remote_slot->name, + LSN_FORMAT_ARGS(remote_slot->confirmed_lsn), + LSN_FORMAT_ARGS(WalRcv->latestWalEnd))); + return; + } Would elog be better here than using ereport(LOG, errmsg_internal...); IIUC it does the same thing? == [1] https://www.postgresql.org/message-id/CAHut%2BPspseC03Fhsi%3DOqOtksagspE%2B0MVOhrhhUb64cc_4SE1w%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
Re: Synchronizing slots from primary to standby
Here are some review comments for v24-0001 == 1. GENERAL - failover slots terminology There is inconsistent terminology, such as below. Try to use the same wording everywhere. - failover logical slots - failover slots - logical failover slots - logical replication failover slots - etc. These are in many places - comments, function names, constants etc. ~~~ 2. GENERAL - THE s/primary.../the primary.../ s/standby.../the standby.../ Missing "the" problems remain in multiple places in the patch. ~~~ 3. GENERAL - messages I searched all the ereports and elogs (the full list is below only for reference). There are many little quirks: 3a. Sometimes messages say "primary"; sometimes "primary server" etc. Be consistent. 3b. /primary/the primary/ 3c. Sometimes messages include errcode and sometimes they do not; Are they deliberate or are there missing errcodes? 3d. At least one message has unwanted trailing space 3e. Sometimes using errcode and/or errmsg enclosed in parentheses; sometimes not. AFAIK it is not necessary anymore. 3f. Inconsistent terminology "slot" V "failover slots" V "failover logical slots" etc mentioned in the previous review comment #1 3g. Sometimes messages "slot creation aborted"; Sometimes "aborting slot creation". Be consistent. 3h. s/lsn/LSN/ 3i. s/move it backward/move it backwards/ 3j. Sometimes LOG message starts uppercase; Sometimes lowercase. Be consistent. 3k. typo: s/and and/and/ 3l. "worker %d" V "worker%d" ~ Messages: ereport(ERROR, (errmsg("could not receive failover slots dbinfo from the primary server: %s", pchomp(PQerrorMessage(conn->streamConn); ereport(ERROR, (errmsg("invalid response from primary server"), errdetail("Could not get failover slots dbinfo: got %d fields, " "expected 1", nfields))); ereport(ERROR, (errcode(ERRCODE_SYNTAX_ERROR), errmsg("invalid connection string syntax: %s", errcopy))); ereport(ERROR, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("replication slot-sync worker slot %d is " "empty, cannot attach", slot))); ereport(ERROR, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("replication slot-sync worker slot %d is " "already used by another worker, cannot attach", slot))); ereport(ERROR, (errmsg("could not connect to the primary server: %s", err))); ereport(ERROR, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("cannot use replication slot \"%s\" for logical decoding", NameStr(slot->data.name)), errdetail("This slot is being synced from the primary."), errhint("Specify another replication slot."))); ereport(ERROR, (errmsg("could not fetch slot info for slot \"%s\" from" " the primary: %s", remote_slot->name, res->err))); ereport(ERROR, (errmsg("could not fetch slot info for slot \"%s\" from" " the primary: %s", remote_slot->name, res->err))); ereport(ERROR, (errmsg("could not fetch invalidation cause for slot \"%s\" from" " primary: %s", slot_name, res->err))); ereport(ERROR, (errmsg("slot \"%s\" disappeared from the primary", slot_name))); ereport(ERROR, (errmsg("could not fetch failover logical slots info from the primary: %s", res->err))); ereport(ERROR, (errmsg("could not connect to the primary server: %s", err))); ereport(ERROR, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("could not map dynamic shared memory " "segment for slot-sync worker"))); ereport(ERROR, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("cannot drop replication slot \"%s\"", name), errdetail("This slot is being synced from the primary."))); ereport(ERROR, (errmsg("could not receive failover slots dbinfo from the primary server: %s", pchomp(PQerrorMessage(conn->streamConn); ereport(ERROR, (errmsg("invalid response from primary server"), errdetail("Could not get failover slots dbinfo: got %d fields, " "expected 1", nfields))); ereport(ERROR, (errcode(ERRCODE_SYNTAX_ERROR), errmsg("invalid connection string syntax: %s", errcopy))); ereport(WARNING, (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED), errmsg("out of background worker slots"), errhint("You might need to increase %s.", "max_worker_processes"))); ereport(WARNING, (errmsg("replication slot-sync worker failed to attach to " "worker-pool slot %d", worker_slot))); ereport(WARNING, errmsg("skipping slots synchronization as primary_slot_name " "is not set.")); ereport(WARNING, errmsg("skipping slots synchronization as hot_standby_feedback " "is off.")); ereport(WARNING, errmsg("skipping slots synchronization as dbname is not " "specified in primary_conninfo.")); ereport(WARNING, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("slot-sync wait for slot %s interrupted by promotion, " "slot creation aborted", remote_slot->name))); ereport(WARNING, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("slot-sync wait for slot %s interrupted by promotion, " "slot creation aborted", remote_slot->name))); ereport(WARNING, (errmsg("slot \"%s\" disappeared from the primary, aborting" " slot creation", remote_slot->name)));
Re: RFC: Logging plan of the running query
> Hi, > > Yeah, and when we have a situation where we want to run > pg_log_query_plan(), we can run it in any environment as long as it > is bundled with the core. Is it possible to get the plan of a backend programmatically without access to the logs? Something like pg_get_query_plan(pid, format) which output would be the same as EXPLAIN. Regards, Étienne BERSAC Dalibo
Re: Fix output of zero privileges in psql
On Mon, 2023-10-23 at 22:43 -0400, Tom Lane wrote: > Laurenz Albe writes: > > On Mon, 2023-10-23 at 11:37 -0700, David G. Johnston wrote: > > > I do believe that we should be against exposing, like in this case, any > > > internal > > > implementation detail that encodes something (e.g., default privileges) > > > as NULL > > > in the catalogs, to the user of the psql meta-commands. > > > Sure, it would be best to hide this implementation detail from the user. > > The correct way to do that would be to fake an ACL entry like > > "laurenz=arwdDxt/laurenz" > > if there is a NULL in the catalog, but that would add a ton of special-case > > code to psql, which does not look appealing at all. > > For better or worse, that *is* the backend's catalog representation, > and I don't think that psql would be doing our users a service by > trying to obscure the fact. They'd run into it anyway the moment > they look at the catalogs with anything but a \d-something command. ... for example with a client like pgAdmin, which is a frequent choice of many PostgreSQL beginners (they display empty privileges). Yes, it is "(default)" or NULL. The former is friendlier for beginners, the latter incurs less backward incompatibility. I could live with either solution, but I am still leaning towards NULL. I ran the regression tests with a patch that displays "(default)", and I counted 22 failures, excluding the one added by my patch. The tests can of course be fixed, but perhaps that serves as a measure of the backward incompatibility. Yours, Laurenz Albe
Re: Synchronizing slots from primary to standby
Hi, On 10/23/23 2:56 PM, shveta malik wrote: On Mon, Oct 23, 2023 at 5:52 PM Drouvot, Bertrand wrote: We are waiting for DEFAULT_NAPTIME_PER_CYCLE (3 minutes) before checking if there is new synced slot(s) to be created on the standby. Do we want to keep this behavior for V1? I think for the slotsync workers case, we should reduce the naptime in the launcher to say 30sec and retain the default one of 3mins for subscription apply workers. Thoughts? Another option could be to keep DEFAULT_NAPTIME_PER_CYCLE and create a new API on the standby that would refresh the list of sync slot at wish, thoughts? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Add trailing commas to enum definitions
On Tue, Oct 24, 2023 at 4:34 AM Nathan Bossart wrote: > > On Mon, Oct 23, 2023 at 05:55:32PM +0800, Junwang Zhao wrote: > > On Mon, Oct 23, 2023 at 2:37 PM Peter Eisentraut > > wrote: > >> Since C99, there can be a trailing comma after the last value in an enum > > > > C99 allows us to do this doesn't mean we must do this, this is not > > inconsistent IMHO, and this will pollute the git log messages, people > > may *git blame* the file and see the reason for the introduction of the > > line. > > I suspect that your concerns about git-blame could be resolved by adding > this commit to .git-blame-ignore-revs. From a long-term perspective, I > think standardizing on the trailing comma style will actually improve > git-blame because patches won't need to add a comma to the previous line > when adding a value. make sense, +1 from me now. > > -- > Nathan Bossart > Amazon Web Services: https://aws.amazon.com -- Regards Junwang Zhao
RE: [PoC] pg_upgrade: allow to upgrade publisher node
Dear Bharath, Amit, Thanks for reviewing! PSA new version. I addressed comments which have not been claimed. > On Mon, Oct 23, 2023 at 2:00 PM Bharath Rupireddy > wrote: > > > > On Mon, Oct 23, 2023 at 11:10 AM Hayato Kuroda (Fujitsu) > > wrote: > > > > > > Thank you for reviewing! PSA new version. > > > > > > 6. A nit: how about is_decodable_txn or is_decodable_change or some > > > > other instead of just a plain name processing_required? > > > > +/* Do we need to process any change in 'fast_forward' mode? */ > > > > +boolprocessing_required; > > > > > > I preferred current one. Because not only decodable txn, non-txn change > > > and > > > empty transactions also be processed. > > > > Right. It's not the txn, but the change. processing_required seems too > > generic IMV. A nit: is_change_decodable or something? > > > > If we don't want to keep it generic then we should use something like > 'contains_decodable_change'. 'is_change_decodable' could have suited > here if we were checking a particular change. I kept the name for now. How does Bharath think? > > Thanks for the patch. Here are few comments on v56 patch: > > > > 1. > > + * > > + * Although this function is currently used only during pg_upgrade, there > > are > > + * no reasons to restrict it, so IsBinaryUpgrade is not checked here. > > > > This comment isn't required IMV, because anyone looking at the code > > and callsites can understand it. Removed. > > 2. A nit: IMV "This is a special purpose ..." statement seems redundant. > > + * > > + * This is a special purpose function to ensure that the given slot can be > > + * upgraded without data loss. > > > > How about > > > > Verify that the given replication slot has consumed all the WAL changes. > > If there's any decodable WAL record after the slot's > > confirmed_flush_lsn, the slot's consumer will lose that data after the > > slot is upgraded. > > Returns true if there are no decodable WAL records after the > > confirmed_flush_lsn. Otherwise false. > > > > Personally, I find the current comment succinct and clear. I kept current one. > > 3. > > +if (PG_ARGISNULL(0)) > > +elog(ERROR, "null argument to > > binary_upgrade_validate_wal_records is not allowed"); > > > > I can see the above style is referenced from > > binary_upgrade_create_empty_extension, but IMV the following looks > > better and latest (ereport is new style than elog) > > > > ereport(ERROR, > > (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > > errmsg("replication slot name cannot be null"))); > > > > Do you have any theory for making elog to ereport? I am not completely > sure but as this and related function is used internally, so using > elog seems reasonable. Also, I find keeping it consistent with the > existing error message is also reasonable. We can change both later > together if we get a broader agreement. I kept current style. elog() was used here because I regarded it as "cannot happen" error. According to the doc [1], elog() is still used for the purpose. > > 4. The following comment seems frivolous, the code tells it all. > > Please remove the comment. > > + > > +/* No need to check this slot, seek to new one */ > > +continue; Removed. > > 5. A typo - s/gets/Gets > > + * gets the LogicalSlotInfos for all the logical replication slots of the Replaced. > > 6. An optimization in count_old_cluster_logical_slots(void): Turn > > slot_count to a function static variable so that the for loop isn't > > required every time because the slot count is prepared in > > get_old_cluster_logical_slot_infos only once and won't change later > > on. Do you see any problem with the following? This saves a few CPU > > cycles when there are large number of replication slots. > > { > > static int slot_count = 0; > > static bool first_time = true; > > > > if (first_time) > > { > > for (int dbnum = 0; dbnum < old_cluster.dbarr.ndbs; dbnum++) > > slot_count += old_cluster.dbarr.dbs[dbnum].slot_arr.nslots; > > > > first_time = false; > > } > > > > return slot_count; > > } > > > > This may not be a problem but this is also not a function that will be > used frequently. I am not sure if adding such code optimizations is > worth it. Not addressed. > > 7. A typo: s/slotname/slot name. "slot name" looks better in user > > visible messages. > > +pg_log(PG_VERBOSE, "slotname: \"%s\", plugin: \"%s\", > two_phase: %s", > > > > If we want to follow other parameters then we can even use slot_name. Changed to slot_name. Below part is replies for remained comments: >8. >+else >+{ >+test_upgrade_from_pre_PG17($old_publisher, $new_publisher, >+@pg_upgrade_cmd); >+} >Will this ever be tested in current TAP test framework? I mean, will >the TAP test framework allow testing upgrades from one PG version to >another PG version? Yes, the TAP tester allow to do
Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows
At Mon, 23 Oct 2023 08:57:19 +, "Hayato Kuroda (Fujitsu)" wrote in > > I agree with your suggestion. Ultimately, if there's a possibility > > for this to be committed, the message will be consolidated to "could > > not start server". > > Based on the suggestion, I tried to update the patch. > A new argument is_valid is added for reporting callee. Also, reporting formats > are adjusted based on other functions. How do you think? An equivalent check is already done shortly afterward in the calling function. Therefore, we can simply remove the code path for "launcher shell died", and it will work the same way. Please find the attached. Other error cases will fit to "shouldn't occur under normal conditions" errors. There is a possibility that the launcher shell terminates while postmaster is running. Even in such a case, the server continue working without any problems. I contemplated accomodating this case but the effort required seemed disproportionate to the possibility. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From 530dc21be72e4e7f5ea199a9fceee00bbb88cf75 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Tue, 24 Oct 2023 11:43:57 +0900 Subject: [PATCH v4 1/3] Disable autoruns for cmd.exe on Windows On Windows, we use cmd.exe to launch the postmaster process for easy redirection setup. However, cmd.exe may execute other programs at startup due to autorun configurations. This patch adds /D flag to the launcher cmd.exe command line to disable autorun settings written in the registry. --- src/bin/pg_ctl/pg_ctl.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c index 807d7023a9..3ac2fcc004 100644 --- a/src/bin/pg_ctl/pg_ctl.c +++ b/src/bin/pg_ctl/pg_ctl.c @@ -553,11 +553,11 @@ start_postmaster(void) else close(fd); - cmd = psprintf("\"%s\" /C \"\"%s\" %s%s < \"%s\" >> \"%s\" 2>&1\"", + cmd = psprintf("\"%s\" /D /C \"\"%s\" %s%s < \"%s\" >> \"%s\" 2>&1\"", comspec, exec_path, pgdata_opt, post_opts, DEVNULL, log_file); } else - cmd = psprintf("\"%s\" /C \"\"%s\" %s%s < \"%s\" 2>&1\"", + cmd = psprintf("\"%s\" /D /C \"\"%s\" %s%s < \"%s\" 2>&1\"", comspec, exec_path, pgdata_opt, post_opts, DEVNULL); if (!CreateRestrictedProcess(cmd, , false)) -- 2.39.3 >From 913a0fbe0937331fca6ea4099834ed3001714339 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Tue, 24 Oct 2023 14:46:50 +0900 Subject: [PATCH v4 2/3] Improve pg_ctl postmaster process check on Windows Currently pg_ctl on Windows does not verify that it actually executed a postmaster process due to lack of process ID knowledge. This can lead to false positives in cases where another pg_ctl instance starts a different server simultaneously. This patch adds the capability to identify the process ID of the launched postmaster on Windows, similar to other OS versions, ensuring more reliable detection of concurrent server startups. --- src/bin/pg_ctl/pg_ctl.c | 102 1 file changed, 93 insertions(+), 9 deletions(-) diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c index 3ac2fcc004..9c0168b075 100644 --- a/src/bin/pg_ctl/pg_ctl.c +++ b/src/bin/pg_ctl/pg_ctl.c @@ -132,6 +132,7 @@ static void adjust_data_dir(void); #ifdef WIN32 #include +#include static bool pgwin32_IsInstalled(SC_HANDLE); static char *pgwin32_CommandLine(bool); static void pgwin32_doRegister(void); @@ -142,6 +143,7 @@ static void WINAPI pgwin32_ServiceMain(DWORD, LPTSTR *); static void pgwin32_doRunAsService(void); static int CreateRestrictedProcess(char *cmd, PROCESS_INFORMATION *processInfo, bool as_service); static PTOKEN_PRIVILEGES GetPrivilegesToDelete(HANDLE hToken); +static pid_t pgwin32_find_postmaster_pid(pid_t shell_pid); #endif static pid_t get_pgpid(bool is_status_request); @@ -609,7 +611,11 @@ wait_for_postmaster_start(pid_t pm_pid, bool do_checkpoint) /* File is complete enough for us, parse it */ pid_t pmpid; time_t pmstart; - +#ifndef WIN32 + pid_t wait_pid = pm_pid; +#else + pid_t wait_pid = pgwin32_find_postmaster_pid(pm_pid); +#endif /* * Make sanity checks. If it's for the wrong PID, or the recorded * start time is before pg_ctl started, then either we are looking @@ -619,14 +625,8 @@ wait_for_postmaster_start(pid_t pm_pid, bool do_checkpoint) */ pmpid = atol(optlines[LOCK_FILE_LINE_PID - 1]); pmstart = atol(optlines[LOCK_FILE_LINE_START_TIME - 1]); - if (pmstart >= start_time - 2 && -#ifndef WIN32 -pmpid == pm_pid -#else - /* Windows can only reject standalone-backend PIDs */ -pmpid > 0 -#endif -) + + if (pmstart >= start_time - 2 && pmpid == wait_pid) { /* * OK, seems to be a valid pidfile from our child. Check the @@ -1950,6 +1950,90 @@ GetPrivilegesToDelete(HANDLE hToken) return tokenPrivs; } + +/* + * Find the PID of the launched postmaster. + * + * On Windows,