Re: Simplify some logical replication worker type checking
On Tue, Aug 1, 2023 at 12:59 PM Zhijie Hou (Fujitsu) wrote:
>
> About 2,3,4, it seems you should use "if (am_leader_apply_worker())" instead
> of "if (!am_leader_apply_worker())" because only the leader apply worker
> should invoke this function.

Hi Hou-san, Thanks for your review!

Oops. Of course, you are right. My cut/paste typo was mostly confined to the post, but there was one instance also in the patch. Fixed in v2.

--
Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment: v2-0001-Simplify-worker-type-checks.patch
Re: Faster "SET search_path"
On Sat, 2023-07-29 at 12:44 -0400, Isaac Morland wrote:
> Essentially, "just" observe efficiently (somehow) that no change is
> needed, and skip changing it?

I gave this a try and it speeds things up some more.

There might be a surprise factor with an optimization like that, though. If someone becomes accustomed to the function running fast, then changing the search_path in the caller could slow things down a lot, and it would be hard for the user to understand what happened.

Also, there are a few implementation complexities, because I think we still need to trigger the same invalidation that happens in assign_search_path().

Regards,
Jeff Davis
Re: Incorrect handling of OOM in WAL replay leading to data loss
On Tue, Aug 01, 2023 at 01:51:13PM +0900, Kyotaro Horiguchi wrote:
> I believe a database server is not supposed to be executed under such
> a memory-constrained environment.

I don't really follow this argument. The backend and the frontends are reliable on OOM, where we generate ERRORs or even FATALs depending on the code path involved. A memory-bounded environment is something that can easily happen if one's not careful enough with the sizing of the instance. For example, this error can be triggered on a standby with read-only queries that put pressure on the host's memory.

> One issue with changing that behavior is that there's not a simple way
> to detect a broken record before loading it into memory. We might be
> able to implement a fallback mechanism, for example one that loads the
> record into an already-allocated buffer (which is smaller than the
> specified length) just to verify whether it's corrupted. However, I
> question whether it's worth the additional complexity. And I'm not
> sure what to do if the first allocation fails.

Perhaps we could rely more on a fallback memory, especially if it is possible to use that for the header validation. That seems like a separate thing, still.
--
Michael
Fix compilation warnings when CFLAGS -Og is specified
Dear hackers,

# Background

Based on [1], I configured and built with the following options. (I used the Meson build system, but it can be reproduced with Autoconf/Make.)

```
$ meson setup -Dcassert=true -Ddebug=true -Dc_args=-Og ../builder
$ cd ../builder
$ ninja
```

My gcc version is 4.8.5, and ninja is 1.10.2.

# Problem

I got warnings while compiling. I found that these were reported when -Og was specified. All of the raised warnings said that a variable might be used without initialization.

```
[122/1910] Compiling C object src/backend/postgres_lib.a.p/libpq_be-fsstubs.c.o
../postgres/src/backend/libpq/be-fsstubs.c: In function ‘be_lo_export’:
../postgres/src/backend/libpq/be-fsstubs.c:537:24: warning: ‘fd’ may be used uninitialized in this function [-Wmaybe-uninitialized]
  if (CloseTransientFile(fd) != 0)
[390/1910] Compiling C object src/backend/postgres_lib.a.p/commands_trigger.c.o
In file included from ../postgres/src/backend/commands/trigger.c:14:0:
../postgres/src/backend/commands/trigger.c: In function ‘ExecCallTriggerFunc’:
../postgres/src/include/postgres.h:314:2: warning: ‘result’ may be used uninitialized in this function [-Wmaybe-uninitialized]
  return (Pointer) X;
../postgres/src/backend/commands/trigger.c:2316:9: note: ‘result’ was declared here
  Datum result;
[1023/1910] Compiling C object src/backend/postgres_lib.a.p/utils_adt_xml.c.o
../postgres/src/backend/utils/adt/xml.c: In function ‘xml_pstrdup_and_free’:
../postgres/src/backend/utils/adt/xml.c:1359:2: warning: ‘result’ may be used uninitialized in this function [-Wmaybe-uninitialized]
  return result;
[1600/1910] Compiling C object contrib/pg_stat_statements/pg_stat_statements.so.p/pg_stat_statements.c.o
../postgres/contrib/pg_stat_statements/pg_stat_statements.c: In function ‘pgss_planner’:
../postgres/contrib/pg_stat_statements/pg_stat_statements.c:960:2: warning: ‘result’ may be used uninitialized in this function [-Wmaybe-uninitialized]
  return result;
[1651/1910] Compiling C object contrib/postgres_fdw/postgres_fdw.so.p/postgres_fdw.c.o
../postgres/contrib/postgres_fdw/postgres_fdw.c: In function ‘postgresAcquireSampleRowsFunc’:
../postgres/contrib/postgres_fdw/postgres_fdw.c:5346:14: warning: ‘reltuples’ may be used uninitialized in this function [-Wmaybe-uninitialized]
  *totalrows = reltuples;
[1910/1910] Linking target src/interfaces/ecpg/test/thread/alloc
```

# Solution

PSA the patch to keep the compiler quiet. IIUC my compiler considered that assignments inside PG_TRY() might be skipped. I'm not sure whether it is a real problem, but the workarounds are harmless. Or did I miss some reason why the warnings above can be ignored?

[1]: https://wiki.postgresql.org/wiki/Developer_FAQ#:~:text=or%20MSVC%20tracepoints.-,What%20debugging%20features%20are%20available%3F,-Compile%2Dtime

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachment: keep_quiet.patch
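The warning pattern above can be reproduced standalone. PG_TRY() wraps sigsetjmp(), and gcc concludes that an assignment placed between the setjmp and the later use of a variable might be skipped by a longjmp, so the variable "may be used uninitialized". This is a minimal sketch (plain setjmp, invented function name) showing the shape of the code and the harmless workaround of initializing the variable up front:

```c
#include <setjmp.h>

/*
 * Standalone illustration of the PG_TRY() pattern (which wraps sigsetjmp):
 * gcc's -Og -Wmaybe-uninitialized can decide the assignment between
 * setjmp() and the use of 'result' might be skipped by a longjmp, so
 * initializing the variable up front is the harmless workaround.
 */
static jmp_buf env;

int
guarded_compute(int fail)
{
    int result = 0;     /* initialized: silences -Wmaybe-uninitialized */

    if (setjmp(env) == 0)
    {
        if (fail)
            longjmp(env, 1);    /* stands in for an elog(ERROR) jump */
        result = 42;            /* the assignment gcc thinks may be skipped */
    }
    else
    {
        /* error path: fall through with the default value */
    }
    return result;
}
```

Note that 'result' is only modified on the path that does not longjmp, so its value is well defined on both paths even without volatile.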
Re: Incorrect handling of OOM in WAL replay leading to data loss
At Tue, 1 Aug 2023 12:43:21 +0900, Michael Paquier wrote in
> A colleague, Ethan Mertz (in CC), has discovered that we don't handle
> correctly WAL records that are failing because of an OOM when
> allocating their required space. In the case of Ethan, we have bumped
> on the failure after an allocation failure on XLogReadRecordAlloc():
> "out of memory while trying to decode a record of length"

I believe a database server is not supposed to be executed under such a memory-constrained environment.

> In crash recovery, any records after the OOM would not be replayed.
> At quick glance, it seems to me that this can also impact standbys,
> where recovery could stop earlier than it should once a consistent
> point has been reached.

Actually, the code is assuming that an OOM happens solely due to a broken record length field. I believe we put that assumption in intentionally.

> A patch is registered in the commit fest to improve the error
> detection handling, but as far as I can see it fails to handle the OOM
> case and replaces ReadRecord() to use a WARNING in the redo loop:
> https://www.postgresql.org/message-id/20200228.160100.2210969269596489579.horikyota.ntt%40gmail.com

It doesn't change behavior except for the case where the last record is followed by zeroed trailing bytes.

> On top of my mind, any solution I can think of needs to add more
> information to XLogReaderState, where we'd track the type of
> error that happened close to errormsg_buf, which is where these errors
> are tracked, but any of that cannot be backpatched, unfortunately.

One issue with changing that behavior is that there's not a simple way to detect a broken record before loading it into memory. We might be able to implement a fallback mechanism, for example one that loads the record into an already-allocated buffer (which is smaller than the specified length) just to verify whether it's corrupted. However, I question whether it's worth the additional complexity.
And I'm not sure what to do if the first allocation fails.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
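The fallback idea discussed above can be sketched standalone. All names here are invented for illustration (none of this is the real xlogreader code): when the full-size allocation fails, the fixed-size header is copied into a small preallocated buffer and sanity-checked, so that a bogus length field can be reported as corruption while an allocation failure for a sane length is reported as a genuine OOM.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical sketch of the fallback discussed above: distinguish a
 * corrupted length field from a real OOM by validating the fixed-size
 * header in a small static buffer.  Toy types, not the xlogreader API.
 */
typedef struct RecordHeader
{
    uint32_t    tot_len;    /* claimed total record length */
    uint32_t    crc;        /* header checksum, simplified away here */
} RecordHeader;

#define MAX_SANE_RECORD_LEN (1024 * 1024)

static char fallback_buf[sizeof(RecordHeader)];

typedef enum { READ_OK, READ_OOM, READ_CORRUPT } ReadResult;

ReadResult
read_record(const char *raw, int simulate_oom)
{
    RecordHeader hdr;

    memcpy(&hdr, raw, sizeof(hdr));
    if (hdr.tot_len < sizeof(hdr) || hdr.tot_len > MAX_SANE_RECORD_LEN)
        return READ_CORRUPT;    /* bogus length: report as corruption */

    char *buf = simulate_oom ? NULL : malloc(hdr.tot_len);
    if (buf == NULL)
    {
        /* Sane length but no memory: re-check the header in the small */
        /* preallocated buffer, then surface this as a real OOM.       */
        memcpy(fallback_buf, raw, sizeof(hdr));
        return READ_OOM;
    }
    free(buf);
    return READ_OK;
}
```

The open question from the thread remains visible in the sketch: the fallback buffer only helps with header validation, not with the record body.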
Extract numeric field in JSONB more effectively
Hi:

Currently, if we want to extract a numeric field from jsonb, we need to use the following expression: cast (a->>'a' as numeric). It turns the numeric into text first and then turns the text back into numeric; see jsonb_object_field_text and JsonbValueAsText. However, the binary format of numeric in JSONB is compatible with the numeric in SQL, so I think we can have an operator to extract the numeric directly. If the value of a given field is not of numeric data type, an error will be raised; this can be documented.

In this patch, I added a new operator for this purpose. Here is the performance gain because of this:

create table tb (a jsonb);
insert into tb select '{"a": 1}'::jsonb from generate_series(1, 10)i;

current method:
select count(*) from tb where cast (a->>'a' as numeric) = 2;
167ms.

new method:
select count(*) from tb where a@->'a' = 2;
65ms.

Is this the right way to go? Testcase, document and catalog version are updated.

--
Best Regards
Andy Fan

Attachment: v1-0001-Add-jsonb-operator-to-return-a-numeric-directly.patch
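The cost being avoided can be modeled standalone. This toy sketch (invented names, a double standing in for numeric) contrasts the two paths: the "old" path serializes the binary value to text and re-parses it, which is what cast(a->>'a' as numeric) does; the "new" path returns the binary value directly, which is what the proposed @-> operator does for jsonb.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Toy model of the point above, with invented names: a container that
 * stores a numeric field in binary form.  The old path round-trips the
 * value through text; the new path reads the binary value directly.
 */
typedef struct { const char *key; double num; } ToyJsonb;

/* old path: numeric -> text -> numeric */
static double
field_via_text(const ToyJsonb *j)
{
    char buf[64];

    snprintf(buf, sizeof(buf), "%.17g", j->num);    /* the ->> step */
    return strtod(buf, NULL);                       /* the cast step */
}

/* new path: read the binary value directly, no serialization */
static double
field_direct(const ToyJsonb *j)
{
    return j->num;
}
```

Both paths yield the same value; the direct path simply skips the format/parse work, which is where the 167ms-vs-65ms difference in the benchmark above comes from.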
Re: Extension Enhancement: Buffer Invalidation in pg_buffercache
Hi,

Thanks for your feedback. We have decided to add a role for the extension to solve that problem. Concerning the pg_unwarm table idea, I think we can create a new function to do that, but I think a general user would not need to clear a whole table from the buffercache. We can use bufferid and WHERE clauses to do the same if a superuser/(specific role) requests it.

Thanks.

On Sat, 29 Jul 2023 at 02:55, Cary Huang wrote:
>
> Hello
>
> I had a look at the patch and tested it on CI bot; it compiles and tests fine
> with both autoconf and meson. I noticed that the v2 patch contains the v1 patch
> file as well. Not sure if intended, but probably put there by accident.
>
> > I don't think "invalidating" is the right terminology. Note that we already
> > have InvalidateBuffer() - but it's something we can't allow users to do, as it
> > throws away dirty buffer contents (it's used for things like dropping a
> > table).
> >
> > How about using "discarding" for this functionality?
>
> I think "invalidating" is the right terminology here; it is exactly what the
> feature is doing. It tries to invalidate a buffer ID by calling the
> InvalidateBuffer() routine inside the buffer manager, and calls FlushBuffer()
> before invalidating if the buffer is marked dirty.
>
> The problem here is that InvalidateBuffer() could be dangerous because it
> allows a user to invalidate buffers that may have data in other tables not
> owned by the current user.
>
> I think it all comes down to the purpose of this feature. Based on the
> description in this email thread, I feel like this feature should be
> categorized as a developer-only feature, to be used by PG developers to
> experiment and observe some development work by invalidating one or more
> specific buffers. If this is the case, it may be helpful to add a
> "DEVELOPER_OPTIONS" GUC, which allows or disallows TryInvalidateBuffer()
> from running, or returns an error if the user does not have this developer
> option enabled.
>
> If the purpose of this feature is for general users, then it would make sense
> to have something like pg_unwarm (exactly the opposite of pg_prewarm) that takes
> a table name (instead of a buffer ID) and drops all buffers associated with that
> table. There will be permission checks as well, so a user cannot
> pg_unwarm a table owned by someone else. A user in this case won't be able to
> invalidate a particular buffer, but he/she should not have to as a regular
> user anyway.
>
> thanks!
>
> Cary Huang
> -
> HighGo Software Inc. (Canada)
> cary.hu...@highgo.ca
> www.highgo.ca
Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
On Fri, Jul 28, 2023 at 5:22 PM Peter Smith wrote:
>
> Hi Melih,
>
> BACKGROUND
> --
>
> We wanted to compare performance for the 2 different reuse-worker
> designs, when the apply worker is already busy handling other
> replications, and then simultaneously the test table tablesyncs are
> occurring.
>
> To test this scenario, some test scripts were written (described
> below). For comparisons, the scripts were then run using a build of
> HEAD; design #1 (v21); design #2 (0718).
>
> HOW THE TEST WORKS
> --
>
> Overview:
> 1. The apply worker is made to subscribe to a 'busy_tbl'.
> 2. After the SUBSCRIPTION is created, the publisher side then loops
> (forever) doing INSERTs into that busy_tbl.
> 3. While the apply worker is now busy, the subscriber does an ALTER
> SUBSCRIPTION REFRESH PUBLICATION to subscribe to all the other test
> tables.
> 4. We time how long it takes for all tablesyncs to complete.
> 5. Repeat the above for different numbers of empty tables (10, 100, 1000,
> 2000) and different numbers of sync workers (2, 4, 8, 16).
>
> Scripts
> ---
>
> (PSA 4 scripts to implement this logic)
>
> testrun script
> - this does common setup (do_one_test_setup) and then the pub/sub
> scripts (do_one_test_PUB and do_one_test_SUB -- see below) are run in
> parallel
> - repeat 10 times
>
> do_one_test_setup script
> - init and start instances
> - ipc setup tables and procedures
>
> do_one_test_PUB script
> - ipc setup pub/sub
> - table setup
> - publishes the "busy_tbl", but then waits for the subscriber to
> subscribe to only this one
> - alters the publication to include all other tables (so the subscriber
> will see these only after the ALTER SUBSCRIPTION REFRESH PUBLICATION)
> - enters a busy INSERT loop until it is informed by the subscriber that
> the test is finished
>
> do_one_test_SUB script
> - ipc setup pub/sub
> - table setup
> - subscribes only to "busy_tbl", then informs the publisher when that
> is done (this will cause the publisher to commence the stay-busy loop)
> - after it knows the publishing busy loop has started it does
> ALTER SUBSCRIPTION REFRESH PUBLICATION
> - waits until all the tablesyncs are ready <=== This is the part that
> is timed for the test RESULT
>
> PROBLEM
> ---
>
> Looking at the output files (e.g. *.dat_PUB and *.dat_SUB) they seem
> to confirm the tests are working how we wanted.
>
> Unfortunately, there is some slot problem for the patched builds (both
> designs #1 and #2). e.g. Search "ERROR" in the *.log files and see
> many slot-related errors.
>
> Please note - running these same scripts with a HEAD build gave no such
> errors. So it appears to be a patch problem.
>

Hi,

FYI, here is some more information about the ERRORs seen. The patches were re-tested -- applied in stages (and also against the different scripts) to identify where the problem was introduced. Below are the observations:

~~~

Using the original test scripts:

1. Using only patch v21-0001 - no errors
2. Using patch v21-0001+0002 - no errors
3. Using patch v21-0001+0002+0003 - no errors

~~~

Using the "busy loop" test scripts for long transactions:

1. Using only patch v21-0001 - no errors

2. Using patch v21-0001+0002 - gives errors for the "no copy in progress" issue
e.g. ERROR: could not send data to WAL stream: no COPY in progress

3. Using patch v21-0001+0002+0003 - gives the same "no copy in progress" errors as above
e.g. ERROR: could not send data to WAL stream: no COPY in progress
- and also gives slot consistency point errors
e.g. ERROR: could not create replication slot "pg_16700_sync_16514_7261998170966054867": ERROR: could not find logical decoding starting point
e.g. LOG: could not drop replication slot "pg_16700_sync_16454_7261998170966054867" on publisher: ERROR: replication slot "pg_16700_sync_16454_7261998170966054867" does not exist

--
Kind Regards,
Peter Smith.
Fujitsu Australia
Incorrect handling of OOM in WAL replay leading to data loss
Hi all,

A colleague, Ethan Mertz (in CC), has discovered that we don't handle correctly WAL records that are failing because of an OOM when allocating their required space. In the case of Ethan, we have bumped on the failure after an allocation failure on XLogReadRecordAlloc():
"out of memory while trying to decode a record of length"

As far as I can see, PerformWalRecovery() uses LOG as elevel for its private callback in the xlogreader when going through ReadRecord(), which leads to a failure being reported, but recovery considers that the failure is the end of WAL and decides to abruptly end recovery, leading to some data loss.

In crash recovery, any records after the OOM would not be replayed. At quick glance, it seems to me that this can also impact standbys, where recovery could stop earlier than it should once a consistent point has been reached.

Attached is a patch that can be applied on HEAD to inject an error; then just run the attached script xlogreader_oom.bash, or something similar, to see the failure in the logs:
LOG:  redo starts at 0/1913CD0
LOG:  out of memory while trying to decode a record of length 57
LOG:  redo done at 0/1917358 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s

It also looks like recovery_prefetch may mitigate the issue a bit if we do a read in non-blocking mode, but that's not a strong guarantee either, especially if the host is under memory pressure.

A patch is registered in the commit fest to improve the error detection handling, but as far as I can see it fails to handle the OOM case and replaces ReadRecord() to use a WARNING in the redo loop:
https://www.postgresql.org/message-id/20200228.160100.2210969269596489579.horikyota.ntt%40gmail.com

On top of my mind, any solution I can think of needs to add more information to XLogReaderState, where we'd track the type of error that happened close to errormsg_buf, which is where these errors are tracked, but any of that cannot be backpatched, unfortunately.
Comments?
--
Michael

#!/bin/bash
killall -9 postgres

# Setup Postgres
rm -rf /tmp/data
initdb -D /tmp/data
cat <<EOL >> /tmp/data/postgresql.conf
logging_collector = on
log_min_messages = debug1
log_min_error_statement = debug1
#log_line_prefix = '%m [%p]: [%l-1] db=%d,user=%u,app=%a,client=%h '
recovery_prefetch = off
EOL
pg_ctl start -D /tmp/data
createdb $USER
psql= 100)
+ {
+ trigger_oom = true;
+
+ /* Reset counter, to not fail when shutting down WAL */
+ counter = 0;
+ }
+ }
+ }
+#endif
+
+ if (decoded == NULL || trigger_oom)
 {
 /*
 * There is no space in the decode buffer. The caller should help
--
2.40.1
Re: Support to define custom wait events for extensions
Hi,

On 2023-08-01 12:14:49 +0900, Michael Paquier wrote:
> On Tue, Aug 01, 2023 at 11:51:35AM +0900, Masahiro Ikeda wrote:
> > Thanks for committing the main patch.
> >
> > In my understanding, the rest of the work is
> > * to support WaitEventExtensionMultiple()
> > * to replace WAIT_EVENT_EXTENSION with custom wait events
> >
> > Is someone already working on these? If not, I'll consider
> > how to realize them.
>
> Note that postgres_fdw and dblink use WAIT_EVENT_EXTENSION, but have
> no dependency on shared_preload_libraries. Perhaps these could just
> use a dynamic handling, but that deserves a separate discussion because
> of the fact that they'd need shared memory without being able to
> request it. autoprewarm.c is much simpler.

This is why the scheme as implemented doesn't really make sense to me. It'd be much easier to use if we had a shared hashtable with the identifiers than what's been merged now. In plenty of cases it's not realistic for an extension library to run in each backend, while still needing to wait for things.

Greetings,

Andres Freund
Re: Support to define custom wait events for extensions
On Tue, Aug 01, 2023 at 11:51:35AM +0900, Masahiro Ikeda wrote:
> Thanks for committing the main patch.
>
> In my understanding, the rest of the work is
> * to support WaitEventExtensionMultiple()
> * to replace WAIT_EVENT_EXTENSION with custom wait events
>
> Is someone already working on these? If not, I'll consider
> how to realize them.

Note that postgres_fdw and dblink use WAIT_EVENT_EXTENSION, but have no dependency on shared_preload_libraries. Perhaps these could just use a dynamic handling, but that deserves a separate discussion because of the fact that they'd need shared memory without being able to request it. autoprewarm.c is much simpler.
--
Michael
Re: logical decoding and replication of sequences, take 2
On Mon, Jul 31, 2023 at 5:04 PM Tomas Vondra wrote:
>
> On 7/31/23 11:25, Amit Kapila wrote:
> > On Sat, Jul 29, 2023 at 5:53 PM Tomas Vondra wrote:
> >>
> >> On 7/28/23 14:44, Ashutosh Bapat wrote:
> >>> On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra wrote:
>
> Anyway, I was thinking about this a bit more, and it seems it's not as
> difficult to use the page LSN to ensure sequences don't go backwards.
> The 0005 change does that, by:
>
> 1) adding pg_sequence_state, that returns both the sequence state and
> the page LSN
>
> 2) copy_sequence returns the page LSN
>
> 3) tablesync then sets this LSN as origin_startpos (which for tables is
> just the LSN of the replication slot)
>
> AFAICS this makes it work - we start decoding at the page LSN, so that
> we skip the increments that could lead to the sequence going backwards.
>
> >>> I like this design very much. It makes things simpler than complex.
> >>> Thanks for doing this.
> >>
> >> I agree it seems simpler. It'd be good to try testing / reviewing it a
> >> bit more, so that it doesn't misbehave in some way.
> >
> > Yeah, I also think this needs a review. This is a sort of new concept
> > where we don't use the LSN of the slot (for cases where copy returned
> > a larger value of LSN) or a full_snapshot created corresponding to the
> > sync slot by the walsender. For the case of the table, we build a full
> > snapshot because we use that for copying the table, but why do we need
> > to build that for copying the sequence, especially when we directly
> > copy it from the sequence relation without caring for any snapshot?
>
> We need the slot to decode/apply changes during catchup. The main
> subscription may get ahead, and we need to ensure the WAL is not
> discarded or something like that. This applies even if the initial sync
> step does not use the slot/snapshot directly.

AFAIK, none of these needs a full_snapshot (see usage of SnapBuild->building_full_snapshot).
The full_snapshot tracks both catalog and non-catalog xacts in the snapshot, where we require tracking the non-catalog ones because we want to copy the table using that snapshot. It is relatively expensive to build a full snapshot, and we don't do that unless it is required. For the current usage of this patch, I think using CRS_NOEXPORT_SNAPSHOT would be sufficient.

--
With Regards,
Amit Kapila.
RE: Simplify some logical replication worker type checking
On Tuesday, August 1, 2023 9:36 AM Peter Smith wrote:
> PROBLEM / SOLUTION
>
> During recent reviews, I noticed some of these conditions are a bit unusual.

Thanks for the patch.

> ==
> worker.c
>
> 1. apply_worker_exit
>
> /*
>  * Reset the last-start time for this apply worker so that the launcher
>  * will restart it without waiting for wal_retrieve_retry_interval if the
>  * subscription is still active, and so that we won't leak that hash table
>  * entry if it isn't.
>  */
> if (!am_tablesync_worker())
>     ApplyLauncherForgetWorkerStartTime(MyLogicalRepWorker->subid);
>
> ~
>
> In this case, it cannot be a parallel apply worker (there is a check
> prior), so IMO it is better to simplify the condition here to below.
> This also makes the code consistent with all the subsequent
> suggestions in this post.
>
> if (am_leader_apply_worker())
>     ApplyLauncherForgetWorkerStartTime(MyLogicalRepWorker->subid);

This change looks OK to me.

> ~~~
>
> 2. maybe_reread_subscription
>
> /* Ensure we remove no-longer-useful entry for worker's start time */
> if (!am_tablesync_worker() && !am_parallel_apply_worker())
>     ApplyLauncherForgetWorkerStartTime(MyLogicalRepWorker->subid);
> proc_exit(0);
>
> ~
>
> Should simplify the above condition to say:
> if (!am_leader_apply_worker())
>
> ~~~
>
> 3. InitializeApplyWorker
>
> /* Ensure we remove no-longer-useful entry for worker's start time */
> if (!am_tablesync_worker() && !am_parallel_apply_worker())
>     ApplyLauncherForgetWorkerStartTime(MyLogicalRepWorker->subid);
> proc_exit(0);
>
> ~
>
> Ditto. Should simplify the above condition to say:
> if (!am_leader_apply_worker())
>
> ~~~
>
> 4. DisableSubscriptionAndExit
>
> /* Ensure we remove no-longer-useful entry for worker's start time */
> if (!am_tablesync_worker() && !am_parallel_apply_worker())
>     ApplyLauncherForgetWorkerStartTime(MyLogicalRepWorker->subid);
>
> ~
>
> Ditto.
> Should simplify the above condition to say:
> if (!am_leader_apply_worker())
>
> --
>
> PSA a small patch making those above-suggested changes. The 'make
> check' and TAP subscription tests are all passing OK.

About 2,3,4, it seems you should use "if (am_leader_apply_worker())" instead of "if (!am_leader_apply_worker())" because only the leader apply worker should invoke this function.

Best Regards,
Hou zj
Re: Support to define custom wait events for extensions
On 2023-07-31 19:22, Michael Paquier wrote:
> I am not sure that any of that is necessary. Anyway, I have applied
> v11 to get the basics done.

Thanks for committing the main patch.

In my understanding, the rest of the work is
* to support WaitEventExtensionMultiple()
* to replace WAIT_EVENT_EXTENSION with custom wait events

Is someone already working on these? If not, I'll consider how to realize them.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION
Re: Ignore 2PC transaction GIDs in query jumbling
On Tue, Aug 01, 2023 at 11:37:49AM +0900, Michael Paquier wrote:
> On Tue, Aug 01, 2023 at 10:22:09AM +0800, Julien Rouhaud wrote:
> > Looking at the rest of the ignored patterns, the only remaining one would be
> > DEALLOCATE, which AFAICS doesn't have a query_jumble_ignore tag for now.
>
> This one seems to be simple as well with a location field, looking
> quickly at its Node.

Agreed, it should be as trivial to implement as for the 2PC commands :)
Re: Ignore 2PC transaction GIDs in query jumbling
On Tue, Aug 01, 2023 at 10:22:09AM +0800, Julien Rouhaud wrote:
> Looking at the rest of the ignored patterns, the only remaining one would be
> DEALLOCATE, which AFAICS doesn't have a query_jumble_ignore tag for now.

This one seems to be simple as well with a location field, looking quickly at its Node.
--
Michael
Re: Inaccurate comments in ReorderBufferCheckMemoryLimit()
On Mon, Jul 31, 2023 at 8:46 PM Masahiko Sawada wrote:
>
> While reading the code, I realized that the following code comments
> might not be accurate:
>
> /*
>  * Pick the largest transaction (or subtransaction) and evict it from
>  * memory by streaming, if possible. Otherwise, spill to disk.
>  */
> if (ReorderBufferCanStartStreaming(rb) &&
>     (txn = ReorderBufferLargestStreamableTopTXN(rb)) != NULL)
> {
>     /* we know there has to be one, because the size is not zero */
>     Assert(txn && rbtxn_is_toptxn(txn));
>     Assert(txn->total_size > 0);
>     Assert(rb->size >= txn->total_size);
>
>     ReorderBufferStreamTXN(rb, txn);
> }
>
> AFAICS since ReorderBufferLargestStreamableTopTXN() returns only
> top-level transactions, the comment above the if statement is not
> right. It would not pick a subtransaction.

I think the subtransaction case is for the spill-to-disk case, as both cases are explained in the same comment.

> Also, I'm not sure that the second comment "we know there has to be
> one, because the size is not zero" is right since there might not be
> top-level transactions that are streamable.

I think this comment is probably referring to asserts related to the size, similar to the spill-to-disk case.

How about if we just remove "(or subtransaction)" from the following comment: "Pick the largest transaction (or subtransaction) and evict it from memory by streaming, if possible. Otherwise, spill to disk."? Then by referring to the streaming/spill-to-disk cases, one can understand in which cases only top-level xacts are involved and in which cases both are involved.

--
With Regards,
Amit Kapila.
Re: Ignore 2PC transaction GIDs in query jumbling
On Tue, Aug 01, 2023 at 11:00:32AM +0900, Michael Paquier wrote:
> On Tue, Aug 01, 2023 at 09:28:08AM +0800, Julien Rouhaud wrote:
> > FTR we had to entirely ignore all those statements in powa years ago to try to
> > make the tool usable in such cases for some users who were using 2PC, it would
> > be nice to be able to track them back for pg16+.
>
> That would be nice for your users.

Indeed, although I will need to make that part a runtime variable based on the server version rather than a hardcoded string, since the extension framework doesn't provide a way to do that cleanly across major pg versions.

> Are there more query patterns you'd
> be interested in grouping in the backend? I had a few folks aiming
> for CallStmt and SetStmt, but both are a bit tricky without a custom
> routine.

Looking at the rest of the ignored patterns, the only remaining one would be DEALLOCATE, which AFAICS doesn't have a query_jumble_ignore tag for now.
Re: [PATCH] Add support function for containment operators
On Sat, 2023-07-08 at 08:11 +0200, Kim Johan Andersson wrote:
> On 07-07-2023 13:20, Laurenz Albe wrote:
> > I wrote:
> > > You implement both "SupportRequestIndexCondition" and "SupportRequestSimplify",
> > > but when I experimented, the former was never called. That does not
> > > surprise me, since any expression of the shape "expr <@ range constant"
> > > can be simplified. Is the "SupportRequestIndexCondition" branch dead code?
> > > If not, do you have an example that triggers it?
>
> I would think it is dead code; I came to the same conclusion. So we can
> drop SupportRequestIndexCondition, since the simplification happens to
> take care of everything.
>
> > I had an idea about this:
> > So far, you only consider constant ranges. But if we have a STABLE range
> > expression, you could use an index scan for "expr <@ range", for example
> > Index Cond (expr >= lower(range) AND expr < upper(range)).
>
> I will try to look into this. Originally that was what I was hoping for,
> but didn't see a way of going about it.
>
> Thanks for your comments, I will also look at the locale-related
> breakage you spotted.

I have marked the patch as "returned with feedback". I encourage you to submit an improved version in a future commitfest!

Yours,
Laurenz Albe
Re: Ignore 2PC transaction GIDs in query jumbling
On Tue, Aug 01, 2023 at 09:28:08AM +0800, Julien Rouhaud wrote:
> Having an application relying on 2PC leads to pg_stat_statements being
> virtually unusable on the whole instance, so +1 for the patch.

Cool, thanks for the feedback!

> FTR we had to entirely ignore all those statements in powa years ago to try to
> make the tool usable in such cases for some users who were using 2PC, it would
> be nice to be able to track them back for pg16+.

That would be nice for your users. Are there more query patterns you'd be interested in grouping in the backend? I had a few folks aiming for CallStmt and SetStmt, but both are a bit tricky without a custom routine.
--
Michael
Simplify some logical replication worker type checking
Hi hackers,

BACKGROUND

There are 3 different types of logical replication workers:
1. leader apply workers
2. parallel apply workers
3. tablesync workers

And there are a number of places where the current worker type is checked using the am_ inline functions.

PROBLEM / SOLUTION

During recent reviews, I noticed some of these conditions are a bit unusual.

==
worker.c

1. apply_worker_exit

/*
 * Reset the last-start time for this apply worker so that the launcher
 * will restart it without waiting for wal_retrieve_retry_interval if the
 * subscription is still active, and so that we won't leak that hash table
 * entry if it isn't.
 */
if (!am_tablesync_worker())
    ApplyLauncherForgetWorkerStartTime(MyLogicalRepWorker->subid);

~

In this case, it cannot be a parallel apply worker (there is a check prior), so IMO it is better to simplify the condition here to the below. This also makes the code consistent with all the subsequent suggestions in this post.

if (am_leader_apply_worker())
    ApplyLauncherForgetWorkerStartTime(MyLogicalRepWorker->subid);

~~~

2. maybe_reread_subscription

/* Ensure we remove no-longer-useful entry for worker's start time */
if (!am_tablesync_worker() && !am_parallel_apply_worker())
    ApplyLauncherForgetWorkerStartTime(MyLogicalRepWorker->subid);
proc_exit(0);

~

Should simplify the above condition to say:
if (!am_leader_apply_worker())

~~~

3. InitializeApplyWorker

/* Ensure we remove no-longer-useful entry for worker's start time */
if (!am_tablesync_worker() && !am_parallel_apply_worker())
    ApplyLauncherForgetWorkerStartTime(MyLogicalRepWorker->subid);
proc_exit(0);

~

Ditto. Should simplify the above condition to say:
if (!am_leader_apply_worker())

~~~

4. DisableSubscriptionAndExit

/* Ensure we remove no-longer-useful entry for worker's start time */
if (!am_tablesync_worker() && !am_parallel_apply_worker())
    ApplyLauncherForgetWorkerStartTime(MyLogicalRepWorker->subid);

~

Ditto.
Should simplify the above condition to say: if (!am_leader_apply_worker()) -- PSA a small patch making those above-suggested changes. The 'make check' and TAP subscription tests are all passing OK. --- Kind Regards, Peter Smith. Fujitsu Australia v1-0001-Simplify-worker-type-checks.patch Description: Binary data
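To make the equivalence explicit, the worker-type checks are inline functions; a simplified sketch is below (illustrative stand-in types and fields, not the real worker_internal.h definitions). A leader apply worker is, by definition, a worker that is neither a tablesync worker nor a parallel apply worker, so "!am_tablesync_worker() && !am_parallel_apply_worker()" and "am_leader_apply_worker()" are interchangeable:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the real LogicalRepWorker state. */
typedef struct
{
	bool		in_use;
	int			relid;			/* non-zero only for tablesync workers */
	int			leader_pid;		/* non-zero only for parallel apply workers */
} LogicalRepWorkerSketch;

static LogicalRepWorkerSketch MyLogicalRepWorker_s;

static bool
am_tablesync_worker(void)
{
	return MyLogicalRepWorker_s.relid != 0;
}

static bool
am_parallel_apply_worker(void)
{
	return MyLogicalRepWorker_s.leader_pid != 0;
}

/* A leader apply worker is "neither of the other two types". */
static bool
am_leader_apply_worker(void)
{
	return !am_tablesync_worker() && !am_parallel_apply_worker();
}
```

Given that definition, the negated form "!am_leader_apply_worker()" reads as "tablesync or parallel apply worker", which is why the positive form is the clearer choice in the call sites above.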
Re: Ignore 2PC transaction GIDs in query jumbling
Hi, On Tue, Aug 01, 2023 at 09:38:14AM +0900, Michael Paquier wrote: > > 31de7e6 has silenced savepoint names in the query jumbling, and > something similar can be done for 2PC transactions once the GID is > ignored in TransactionStmt. This leads to the following grouping in > pg_stat_statements, for instance, which is something that matters with > workloads that heavily rely on 2PC: > COMMIT PREPARED $1 > PREPARE TRANSACTION $1 > ROLLBACK PREPARED $1 Having an application relying on 2pc leads to pg_stat_statements being virtually unusable on the whole instance, so +1 for the patch. FTR we had to entirely ignore all those statements in powa years ago to try to make the tool usable in such cases for some users who were using 2pc, it would be nice to be able to track them back for pg16+. The patch LGTM.
Re: Performance degradation on concurrent COPY into a single relation in PG16.
Hi,

On 2023-07-27 20:53:16 +1200, David Rowley wrote:
> To summarise, REL_15_STABLE can run this benchmark in 526.014 ms on my
> AMD 3990x machine. Today's REL_16_STABLE takes 530.344 ms. We're
> talking about another patch to speed up the pg_strtoint functions
> which gets this down to 483.790 ms. Do we need to do this for v16 or
> can we just leave this as it is already? The slowdown does not seem
> to be much above what we'd ordinarily classify as noise using this
> test on my machine.

I think we need to do something for 16 - it appears on recent-ish AMD the regression is quite a bit smaller than on Intel. You see something < 1%, I see more like 4%. I think there are also other cases where the slowdown is more substantial.

Besides Intel vs AMD, it also looks like the gcc version might make a difference. The code generated by gcc 13 is noticeably slower than gcc 12 for me...

> Benchmark setup:
>
> COPY (SELECT generate_series(1, 200) a, (random() * 10 - 5)::int b, 3243423 c) TO '/tmp/lotsaints.copy';
> DROP TABLE lotsaints; CREATE UNLOGGED TABLE lotsaints(a int, b int, c int);

There's a lot of larger numbers in the file, which likely reduces the impact some. And there's the overhead of actually inserting the rows into the table, making the difference appear smaller than it is. If I avoid the actual insert into the table and use more columns, I see an about 10% regression here.

COPY (SELECT generate_series(1, 1000) a, 10 b, 20 c, 30 d, 40 e, 50 f FROM generate_series(1, 1)) TO '/tmp/lotsaints_wide.copy';

psql -c 'DROP TABLE IF EXISTS lotsaints_wide; CREATE UNLOGGED TABLE lotsaints_wide(a int, b int, c int, d int, e int, f int);' && \
pgbench -n -P1 -f <( echo "COPY lotsaints_wide FROM '/tmp/lotsaints_wide.copy' WHERE false") -t 5

15:              2992.605
HEAD:            3325.201
fastpath1.patch: 2932.606
fastpath2.patch: 2783.915

Interestingly fastpath1 is slower now, even though it wasn't with earlier patches (which still is repeatable). I do not have the foggiest as to why.
Greetings, Andres Freund
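For readers following along, the kind of fast path being benchmarked can be sketched as below. This is an illustrative reimplementation, not the actual fastpath1/fastpath2 patches: handle the common plain-decimal-digits case in a tight loop, and return false so the caller falls back to the fully general parser (whitespace, hex/octal prefixes, etc.) on anything unusual:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/*
 * Hypothetical fast path for parsing a signed 32-bit decimal integer.
 * Returns true and stores the value on success; returns false to tell the
 * caller to fall back to the slower, fully-general parser.
 */
static bool
fast_strtoint32(const char *s, int *result)
{
	const char *p = s;
	bool		neg = false;
	int			acc = 0;		/* accumulate negatively so INT_MIN fits */

	if (*p == '-')
	{
		neg = true;
		p++;
	}
	else if (*p == '+')
		p++;

	if (*p < '0' || *p > '9')
		return false;			/* empty or non-digit: take the slow path */

	for (; *p >= '0' && *p <= '9'; p++)
	{
		int			digit = *p - '0';

		if (acc < (INT_MIN + digit) / 10)
			return false;		/* would overflow: slow path re-checks */
		acc = acc * 10 - digit;
	}

	if (*p != '\0')
		return false;			/* trailing junk or whitespace: slow path */

	if (!neg)
	{
		if (acc == INT_MIN)
			return false;		/* +2147483648 overflows a signed int */
		acc = -acc;
	}
	*result = acc;
	return true;
}
```

The design point is that the fast path never needs to be complete, only correct: any input it cannot certify is simply handed to the general parser, so the common COPY case (small plain integers) avoids the general parser's branchy setup cost.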
PostgreSQL 16 Beta 3 release date
Hi, The release date for PostgreSQL 16 Beta 3 is August 10, 2023, alongside the regular update release[1]. Please be sure to commit any open items[2] for the Beta 3 release before August 6, 2023 0:00 AoE[3] to give them enough time to work through the buildfarm. Thanks, Jonathan [1] https://www.postgresql.org/developer/roadmap/ [2] https://wiki.postgresql.org/wiki/PostgreSQL_16_Open_Items [3] https://en.wikipedia.org/wiki/Anywhere_on_Earth OpenPGP_signature Description: OpenPGP digital signature
Re: pg_upgrade fails with in-place tablespace
On Sat, Jul 29, 2023 at 11:10:22PM +0800, Rui Zhao wrote:
> 2) Only check the tablespace with an absolute path in pg_upgrade.
> There are also other solutions, such as supporting the creation of
> relative-path tablespaces in the function CreateTableSpace. But do we
> really need relative-path tablespaces? I think in-place tablespaces
> are enough by now. My solution may be more lightweight and harmless.

+	/* The path of the in-place tablespace is always pg_tblspc/. */
 	if (!S_ISLNK(st.st_mode))
-		PG_RETURN_TEXT_P(cstring_to_text(sourcepath));
+		PG_RETURN_TEXT_P(cstring_to_text(""));

I don't think that your solution is the correct move. Having pg_tablespace_location() return the physical location of the tablespace is very useful because that's the location where the physical files are, and client applications don't need to know the logic behind the way a path is built.

-	"	spcname != 'pg_global'");
+	"	spcname != 'pg_global' AND "
+	"	pg_catalog.pg_tablespace_location(oid) ~ '^/'");

That may not work on Windows, where the drive letter is appended at the beginning of the path, no? There is is_absolute_path() to do this job in a more portable way.
--
Michael
signature.asc Description: PGP signature
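The portability point is that an "absolute path" test cannot just match a leading '/': on Windows a path may begin with a drive letter such as C:\. The sketch below shows the idea (hypothetical helper names; the real is_absolute_path() in src/include/port.h is a macro with platform-specific branches, so the details differ):

```c
#include <assert.h>
#include <stdbool.h>

/* Directory separator test: Windows accepts both '/' and '\'. */
static bool
is_dir_sep(char c, bool windows)
{
	return c == '/' || (windows && c == '\\');
}

/*
 * Simplified absolute-path test.  On POSIX, absolute means a leading '/'.
 * On Windows it also covers "C:\...", "C:/..." and "\\server\share" forms.
 */
static bool
path_is_absolute(const char *path, bool windows)
{
	if (is_dir_sep(path[0], windows))
		return true;			/* "/x"; on Windows also "\x" and "\\share" */
	if (windows &&
		((path[0] >= 'A' && path[0] <= 'Z') ||
		 (path[0] >= 'a' && path[0] <= 'z')) &&
		path[1] == ':' &&
		is_dir_sep(path[2], true))
		return true;			/* drive-letter form, e.g. "C:\data" */
	return false;
}
```

This is why a SQL-side filter like pg_tablespace_location(oid) ~ '^/' would silently miss "C:\..." tablespace paths on Windows, whereas an is_absolute_path()-style check handles both conventions.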
Re: [feature]COPY FROM enable FORCE_NULL/FORCE_NOT_NULL on all columns
On Aug 1, 2023, at 03:35, Andrew Dunstan wrote:
> I was hoping to be able to get to it today but that's not happening. If you want to submit a revised patch as above that will be good. I hope to get to it later this week.

Hi Andrew,
Patch rebased and updated as above, thanks.

v4-0001-COPY-FROM-enable-FORCE_NULL-FORCE_NOT_NULL-on-all-co.patch Description: Binary data

Zhang Mingli
https://www.hashdata.xyz
Ignore 2PC transaction GIDs in query jumbling
Hi all,

31de7e6 has silenced savepoint names in the query jumbling, and something similar can be done for 2PC transactions once the GID is ignored in TransactionStmt. This leads to the following grouping in pg_stat_statements, for instance, which is something that matters with workloads that heavily rely on 2PC:

COMMIT PREPARED $1
PREPARE TRANSACTION $1
ROLLBACK PREPARED $1

Thoughts or comments?
--
Michael

diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index fe003ded50..2565348303 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3540,7 +3540,8 @@ typedef struct TransactionStmt
 	List	   *options;		/* for BEGIN/START commands */
 	/* for savepoint commands */
 	char	   *savepoint_name pg_node_attr(query_jumble_ignore);
-	char	   *gid;			/* for two-phase-commit related commands */
+	/* for two-phase-commit related commands */
+	char	   *gid pg_node_attr(query_jumble_ignore);
 	bool		chain;			/* AND CHAIN option */
 	/* token location, or -1 if unknown */
 	int			location pg_node_attr(query_jumble_location);
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 15ece871a0..b3bdf947b6 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -10924,7 +10924,7 @@ TransactionStmt:
 					n->kind = TRANS_STMT_PREPARE;
 					n->gid = $3;
-					n->location = -1;
+					n->location = @3;
 					$$ = (Node *) n;
 				}
 			| COMMIT PREPARED Sconst
@@ -10933,7 +10933,7 @@ TransactionStmt:
 					n->kind = TRANS_STMT_COMMIT_PREPARED;
 					n->gid = $3;
-					n->location = -1;
+					n->location = @3;
 					$$ = (Node *) n;
 				}
 			| ROLLBACK PREPARED Sconst
@@ -10942,7 +10942,7 @@ TransactionStmt:
 					n->kind = TRANS_STMT_ROLLBACK_PREPARED;
 					n->gid = $3;
-					n->location = -1;
+					n->location = @3;
 					$$ = (Node *) n;
 				}
 		;
diff --git a/contrib/pg_stat_statements/expected/utility.out b/contrib/pg_stat_statements/expected/utility.out
index 3d920fb5f7..93735d5d85 100644
--- a/contrib/pg_stat_statements/expected/utility.out
+++ b/contrib/pg_stat_statements/expected/utility.out
@@ -197,6 +197,29 @@ SELECT pg_stat_statements_reset();
 (1 row)
 
+-- Two-phase transactions
+BEGIN;
+PREPARE TRANSACTION 'stat_trans1';
+COMMIT PREPARED 'stat_trans1';
+BEGIN;
+PREPARE TRANSACTION 'stat_trans2';
+ROLLBACK PREPARED 'stat_trans2';
+SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
+ calls | rows |               query
+-------+------+------------------------------------
+     2 |    0 | BEGIN
+     1 |    0 | COMMIT PREPARED $1
+     2 |    0 | PREPARE TRANSACTION $1
+     1 |    0 | ROLLBACK PREPARED $1
+     1 |    1 | SELECT pg_stat_statements_reset()
+(5 rows)
+
+SELECT pg_stat_statements_reset();
+ pg_stat_statements_reset
+--------------------------
+ 
+(1 row)
+
 -- Savepoints
 BEGIN;
 SAVEPOINT sp1;
diff --git a/contrib/pg_stat_statements/pg_stat_statements.conf b/contrib/pg_stat_statements/pg_stat_statements.conf
index 13346e2807..0e900d7119 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.conf
+++ b/contrib/pg_stat_statements/pg_stat_statements.conf
@@ -1 +1,2 @@
 shared_preload_libraries = 'pg_stat_statements'
+max_prepared_transactions = 5
diff --git a/contrib/pg_stat_statements/sql/utility.sql b/contrib/pg_stat_statements/sql/utility.sql
index 859e57955e..87666d9135 100644
--- a/contrib/pg_stat_statements/sql/utility.sql
+++ b/contrib/pg_stat_statements/sql/utility.sql
@@ -115,6 +115,16 @@ COMMIT;
 SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
 SELECT pg_stat_statements_reset();
 
+-- Two-phase transactions
+BEGIN;
+PREPARE TRANSACTION 'stat_trans1';
+COMMIT PREPARED 'stat_trans1';
+BEGIN;
+PREPARE TRANSACTION 'stat_trans2';
+ROLLBACK PREPARED 'stat_trans2';
+SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
+SELECT pg_stat_statements_reset();
+
 -- Savepoints
 BEGIN;
 SAVEPOINT sp1;
signature.asc Description: PGP signature
Re: pltcl tests fail with FreeBSD 13.2
Andres Freund writes: > On 2023-07-31 19:11:38 -0400, Tom Lane wrote: >> Huh. Maybe worth reporting as a FreeBSD bug? > Yea. Hoping our local freebsd developer has a suggestion as to which component > to report it to, or even fix it :). You already have a reproducer using just tcl, so I'd suggest filing it against tcl and letting them drill down as needed. (It's possible that it actually is tcl that's misbehaving, seeing that our core regression tests are passing in the same environment.) regards, tom lane
Re: Getting rid of OverrideSearchPath in namespace.c
On Mon, Jul 17, 2023 at 05:11:46PM +0300, Aleksander Alekseev wrote: > > As a follow-up for the CVE-2023-2454 fix, I think that it makes sense to > > completely remove unsafe functions > > PushOverrideSearchPath()/PopOverrideSearchPath(), which are not used in the > > core now. > > Please look at the patch attached. > > > > [...] > > > > What do you think? > > +1 to remove dead code. > > The proposed patch however removes get_collation_oid(), apparently by > mistake. Other than that the patch looks fine and passes `make > installcheck-world`. > > I added an entry to the nearest CF [1]. > > > Besides that, maybe it's worth renaming three functions with "Override" in > > their names: GetOverrideSearchPath(), CopyOverrideSearchPath(), > > OverrideSearchPathMatchesCurrent(), and then maybe struct > > OverrideSearchPath. > > Noah Misch proposed the name GetSearchPathMatcher() for the former. > > +1 as well. I added the corresponding 0002 patch. Pushed both. Thanks.
Re: pltcl tests fail with FreeBSD 13.2
Hi, On 2023-07-31 19:11:38 -0400, Tom Lane wrote: > Andres Freund writes: > > On 2023-07-31 18:33:37 -0400, Tom Lane wrote: > >> (And could it be that we had one in the predecessor 13.1 image?) > > > No, I checked, and it's not in there either... It looks like the difference > > is > > that 13.1 reads the UTC zoneinfo in that case, whereas 13.2 doesn't. > > Huh. Maybe worth reporting as a FreeBSD bug? Yea. Hoping our local freebsd developer has a suggestion as to which component to report it to, or even fix it :). > In any case I don't think it's *our* bug. Agreed. An explicit tzsetup UTC isn't much to carry going forward... Greetings, Andres Freund
Re: Avoid possible memory leak (src/common/rmtree.c)
On Mon, Jul 31, 2023 at 08:10:55PM -0300, Ranier Vilela wrote: > Thanks for the commit, Michael. Sorry for the lack of update here. For the sake of the archives, this is f1e9f6b. -- Michael signature.asc Description: PGP signature
Re: CDC/ETL system on top of logical replication with pgoutput, custom client
Hi,

On 2023-07-31 21:25:06 +0000, José Neves wrote:
> Ok, if I understood you correctly, I start to see where my logic is faulty.
> Just to make sure that I got it right, taking the following example again:
>
> T-1
> INSERT LSN1-1000
> UPDATE LSN2-2000
> UPDATE LSN3-3000
> COMMIT LSN4-4000
>
> T-2
> INSERT LSN1-500
> UPDATE LSN2-1500
> UPDATE LSN3-2500
> COMMIT LSN4-5500
>
> Where data will arrive in this order:
>
> INSERT LSN1-500
> INSERT LSN1-1000
> UPDATE LSN2-1500
> UPDATE LSN2-2000
> UPDATE LSN3-2500
> UPDATE LSN3-3000
> COMMIT LSN4-4000
> COMMIT LSN4-5500

No, they won't arrive in that order. They will arrive as:

BEGIN
INSERT LSN1-1000
UPDATE LSN2-2000
UPDATE LSN3-3000
COMMIT LSN4-4000

BEGIN
INSERT LSN1-500
UPDATE LSN2-1500
UPDATE LSN3-2500
COMMIT LSN4-5500

Because T1 committed before T2. Changes are only streamed out at commit / prepare transaction (*). Within a transaction, they however *will* be ordered by LSN.

(*) Unless you use streaming mode, in which case it'll all be more complicated, as you'll also receive changes for transactions that might still abort.

> You are saying that the LSN3-3000 will never be missing, either the entire
> connection will fail at that point, or all should be received in the
> expected order (which is different from the "numeric order" of LSNs).

I'm not quite sure what you mean with the "different from the numeric order" bit...

> If the connection is down, upon restart, I will receive the entire T-1
> transaction again (well, all example data again).

Yes, unless you already acknowledged receipt up to LSN4-4000 and/or are only asking for newer transactions when reconnecting.

> In addition to that, if I commit LSN4-4000, even tho that LSN has a "bigger
> numeric value" than the ones representing INSERT and UPDATE events on T-2, I
> will be receiving the entire T-2 transaction again, as the LSN4-5500 is
> still uncommitted.

I don't quite know what you mean with "commit LSN4-4000" here.
> This makes sense to me, but just to be extra clear, I will never receive a > transaction commit before receiving all other events for that transaction. Correct. Greetings, Andres Freund
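The guarantee described above (changes arrive grouped per transaction, and a commit arrives only after all of that transaction's changes) suggests a simple consumer design: buffer changes while inside a transaction, and persist/acknowledge progress only when the COMMIT message arrives. A minimal sketch follows, with hypothetical types and names rather than the actual replication protocol structs:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified message kinds from a pgoutput-style stream. */
typedef enum { MSG_BEGIN, MSG_CHANGE, MSG_COMMIT } MsgKind;

typedef struct
{
	MsgKind		kind;
	uint64_t	lsn;			/* LSN carried by this record */
} Msg;

typedef struct
{
	int			in_txn;			/* currently inside BEGIN .. COMMIT? */
	uint64_t	flushed_commit_lsn; /* restart point safe to acknowledge */
} Consumer;

/*
 * Process one message.  Changes would be buffered (elided here) while
 * in_txn is set; only a COMMIT advances the LSN that may be acknowledged
 * upstream.  Returns the newly confirmable LSN, or 0 if unchanged.
 */
static uint64_t
consume(Consumer *c, Msg m)
{
	switch (m.kind)
	{
		case MSG_BEGIN:
			c->in_txn = 1;
			break;
		case MSG_CHANGE:
			assert(c->in_txn);	/* changes arrive only inside a transaction */
			break;
		case MSG_COMMIT:
			c->in_txn = 0;
			/* persist atomically together with the transaction's data */
			c->flushed_commit_lsn = m.lsn;
			return m.lsn;
	}
	return 0;
}
```

With this scheme the consumer never compares change LSNs across transactions at all: after T1's COMMIT at LSN 4000 it acknowledges 4000, even though T2's buffered changes carry numerically smaller LSNs, and a reconnect simply restarts from the last acknowledged commit.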
Re: Avoid possible memory leak (src/common/rmtree.c)
On Fri, Jul 28, 2023, 11:54 PM, Michael Paquier wrote: > On Tue, Jul 25, 2023 at 04:45:22PM +0200, Daniel Gustafsson wrote: > > Skimming the tree there doesn't seem to be any callers which aren't exiting or > > ereporting on failure so the real-world impact seems low. That being said, > > silencing static analyzers could be reason enough to delay allocation. > > A different reason would be out-of-core code that uses rmtree() in a > memory context where the leak would be an issue if facing a failure > continuously? Delaying the allocation after the OPENDIR() seems like > a good practice anyway. Thanks for the commit, Michael. best regards, Ranier Vilela
Re: pltcl tests fail with FreeBSD 13.2
Andres Freund writes: > On 2023-07-31 18:33:37 -0400, Tom Lane wrote: >> (And could it be that we had one in the predecessor 13.1 image?) > No, I checked, and it's not in there either... It looks like the difference is > that 13.1 reads the UTC zoneinfo in that case, whereas 13.2 doesn't. Huh. Maybe worth reporting as a FreeBSD bug? In any case I don't think it's *our* bug. regards, tom lane
Re: pltcl tests fail with FreeBSD 13.2
Hi, On 2023-07-31 18:33:37 -0400, Tom Lane wrote: > Andres Freund writes: > > I saw that CI image builds for freebsd were failing, because 13.1, used > > until > > now, is EOL. Update to 13.2, no problem, I thought. Ran a CI run against > > 13.2 > > - unfortunately that failed. In pltcl of all places. > > I tried to replicate this in a freshly-installed 13.2 VM, and could > not, which is unsurprising because the FreeBSD installation process > does not let you skip selecting a timezone --- which creates > /etc/localtime AFAICT. So I'm unconvinced that we ought to worry > about the case where that's not there. How is it that the CI image > lacks that file? I don't know why it lacks the file - the CI image is based on the google cloud image freebsd maintains / publishes ([1][2]). Which doesn't have /etc/localtime. I've now added a "tzsetup UTC" to the image generation, which fixes the test failure. > (And could it be that we had one in the predecessor 13.1 image?) No, I checked, and it's not in there either... It looks like the difference is that 13.1 reads the UTC zoneinfo in that case, whereas 13.2 doesn't. Upstream 13.1 image: truss date 2>&1|grep -E 'open|stat' ... open("/etc/localtime",O_RDONLY,0101401200) ERR#2 'No such file or directory' open("/usr/share/zoneinfo/UTC",O_RDONLY,00) = 3 (0x3) fstat(3,{ mode=-r--r--r-- ,inode=351417,size=118,blksize=32768 }) = 0 (0x0) open("/usr/share/zoneinfo/posixrules",O_RDONLY,014330460400) = 3 (0x3) fstat(3,{ mode=-r--r--r-- ,inode=356621,size=3535,blksize=32768 }) = 0 (0x0) fstat(1,{ mode=p- ,inode=696,size=0,blksize=4096 }) = 0 (0x0) Upstream 13.2 image: open("/etc/localtime",O_RDONLY,01745)ERR#2 'No such file or directory' fstat(1,{ mode=p- ,inode=658,size=0,blksize=4096 }) = 0 (0x0) Why not reading the UTC zone leads to timestamps being out of range, I do not know... Greetings, Andres Freund [1] https://wiki.freebsd.org/GoogleCloudPlatform [2] https://cloud.google.com/compute/docs/images#freebsd
Re: pltcl tests fail with FreeBSD 13.2
Andres Freund writes: > I saw that CI image builds for freebsd were failing, because 13.1, used until > now, is EOL. Update to 13.2, no problem, I thought. Ran a CI run against 13.2 > - unfortunately that failed. In pltcl of all places. I tried to replicate this in a freshly-installed 13.2 VM, and could not, which is unsurprising because the FreeBSD installation process does not let you skip selecting a timezone --- which creates /etc/localtime AFAICT. So I'm unconvinced that we ought to worry about the case where that's not there. How is it that the CI image lacks that file? (And could it be that we had one in the predecessor 13.1 image?) regards, tom lane
Re: pgsql: Fix search_path to a safe value during maintenance operations.
On Mon, 2023-07-31 at 13:17 -0400, Joe Conway wrote:
> But the analysis of the issue needs to go one step further. Even if the
> search_path does not change from the originally intended one, a newly
> created function can shadow the intended one based on argument coercion
> rules.

There are quite a few issues going down this path:

* The set of objects in each schema can change. Argument coercion is a particularly subtle one, but there are other ways that it could find the wrong object. The temp namespace also has some subtle issues.
* Schema USAGE privileges may vary over time or from caller to caller, affecting which items in the search path are searched at all. The same goes if there's an object access hook in place.
* $user should be resolved to a specific schema (or perhaps not in some cases?)
* There are other GUCs and environment that can affect function behavior. Is it worth trying to lock those down?

I agree that each of these is some potential problem, but these are much smaller problems than allowing the caller to have complete control over the search_path.

Regards, Jeff Davis
Re: pgsql: Fix search_path to a safe value during maintenance operations.
On Mon, 2023-07-31 at 12:53 -0400, Robert Haas wrote: > I agree. I think there are actually two interrelated problems here. > > One is that virtually all code needs to run with the originally > intended search_path rather than some search_path chosen at another > time and maybe by a different user. If not, it's going to break, or > compromise security, depending on the situation. The other is that > running arbitrary code written by somebody else as yourself is > basically instant death, from a security perspective. Good framing. The search_path is a particularly nasty problem in our system because it means that users can't even trust the code that they write themselves! A function author has no way to know how their own function will behave under a different search_path. > It's a little hard to imagine a world in which these problems don't > exist at all, but it somehow feels like the design of the system > pushes you toward doing this stuff incorrectly rather than doing it > correctly. For instance, you can imagine a system where when you run > CREATE OR REPLACE FUNCTION, the prevailing search_path is captured > and > automatically included in proconfig. Capturing the environment is not ideal either, in my opinion. It makes it easy to carelessly depend on a schema that others might not have USAGE privileges on, which would then create a runtime problem for other callers. Also, I don't think we could just depend on the raw search_path, we'd need to do some processing for $user, and there are probably a few other annoyances. It's one possibility and we don't have a lot of great options, so I don't want to rule it out though. If nothing else it could be a transition path to something better. > But you can maybe imagine a system where > all code associated with a table is run as the table owner in all > cases, regardless of SECURITY INVOKER/DEFINER, which I think would at > least close some holes. 
> > The difficulty is that it's a bit hard to imagine making these kinds > of definitional changes now, because they'd probably be breaking > changes for pretty significant numbers of users. I believe we can get close to a good model with minimal breakage. And when we make the problem small enough I believe other solutions will emerge. We will probably have to hedge with some compatibility GUCs. > But on the other > hand, if we don't start thinking about systemic changes here, it > feels > like we're just playing whack-a-mole. Exactly. If we can agree on where we're going then I think we can get there. Regards, Jeff Davis
RE: CDC/ETL system on top of logical replication with pgoutput, custom client
Hi Andres, thanks for your reply.

Ok, if I understood you correctly, I start to see where my logic is faulty. Just to make sure that I got it right, taking the following example again:

T-1
INSERT LSN1-1000
UPDATE LSN2-2000
UPDATE LSN3-3000
COMMIT LSN4-4000

T-2
INSERT LSN1-500
UPDATE LSN2-1500
UPDATE LSN3-2500
COMMIT LSN4-5500

Where data will arrive in this order:

INSERT LSN1-500
INSERT LSN1-1000
UPDATE LSN2-1500
UPDATE LSN2-2000
UPDATE LSN3-2500
UPDATE LSN3-3000
COMMIT LSN4-4000
COMMIT LSN4-5500

You are saying that the LSN3-3000 will never be missing, either the entire connection will fail at that point, or all should be received in the expected order (which is different from the "numeric order" of LSNs). If the connection is down, upon restart, I will receive the entire T-1 transaction again (well, all example data again). In addition to that, if I commit LSN4-4000, even though that LSN has a "bigger numeric value" than the ones representing INSERT and UPDATE events on T-2, I will be receiving the entire T-2 transaction again, as the LSN4-5500 is still uncommitted. This makes sense to me, but just to be extra clear, I will never receive a transaction commit before receiving all other events for that transaction. Are these statements correct?

> Are you using the 'streaming' mode / option to pgoutput?

No.

> Not sure what you mean with "unordered offsets"?

Ordered: EB53/E0D88188, EB53/E0D88189, EB53/E0D88190
Unordered: EB53/E0D88190, EB53/E0D88188, EB53/E0D88189

Extra question: When I get a begin message, I get a transaction starting at LSN-1000, and a transaction ending at LSN-2000. But as the example above shows, I can have data points from other transactions with LSNs in that interval. I have no way to identify to which transaction they belong, correct?

Thanks again.
Regards, José Neves

From: Andres Freund
Sent: July 31, 2023 21:39
To: José Neves
Cc: Amit Kapila; pgsql-hack...@postgresql.org
Subject: Re: CDC/ETL system on top of logical replication with pgoutput, custom client

Hi, On 2023-07-31 14:16:22 +, José Neves wrote: > Hi Amit, thanks for the reply. > > In our worker (custom pg replication client), we care only about INSERT, > UPDATE, and DELETE operations, which - sure - may be part of the issue. That seems likely. Postgres streams out changes in commit order, not in order of the changes having been made (that'd not work due to rollbacks etc). If you just disregard transactions entirely, you'll get something bogus after retries. You don't need to store the details for each commit in the target system, just up to which LSN you have processed *commit records*. E.g. if you have received and safely stored up to commit 0/1000, you need to remember that. Are you using the 'streaming' mode / option to pgoutput? > 1. We have no way to match LSN operations with the respective commit, as > they have unordered offsets. Not sure what you mean with "unordered offsets"? > Assuming that all of them were received in order, we would commit all data > with the commit message LSN4-4000 as other events would match the transaction > start and end LSN interval of it. Logical decoding sends out changes in a deterministic order and you won't see out of order data when using TCP (the entire connection can obviously fail though). Andres
Re: pgsql: Fix search_path to a safe value during maintenance operations.
On Mon, 2023-07-31 at 16:06 -0400, Robert Haas wrote: > if you > include in your search_path a schema to which some other user can > write, you are pretty much agreeing to execute code provided by that > user. Agreed on all counts here. I don't think it's reasonable for us to try to make such a setup secure, and I don't think users have much need for such a setup anyway. > One thing we might be able to do to prevent that sort of thing is to > have a feature to prevent "accidental" code execution, as in the > "function trust" mechanism proposed previously. Say I trust all users > who can SET ROLE to me and/or who inherit my privileges. Additionally > I can decide to trust users who do neither of those things by some > sort of explicit declaration. If I don't trust a user then if I do > anything that would cause code supplied by that user to get executed, > it just errors out: > > ERROR: role "rhaas" should not execute arbitrary code provided by > role "jconway" > HINT: If this should be allowed, use the TRUST command to permit it. +1, though I'm not sure we need an extensive trust mechanism beyond what we already have with the SET ROLE privilege. > And > we probably also still need to find ways to control search_path in a > lot more widely than we do today. Otherwise, even if stuff is > technically secure, it may just not work. +1. Regards, Jeff Davis
Re: Correct the documentation for work_mem
The following review has been posted through the commitfest application:

make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: not tested
Documentation: tested, passed

Hello, I've reviewed and built the documentation for the updated patch. As it stands right now I think the documentation for this section is quite clear.

> I'm wondering about adding "and more than one of these operations may
> be in progress simultaneously". Are you talking about concurrent
> sessions running other queries which are using work_mem too?

This appears to be referring to the "sort and hash" operations mentioned prior.

> If so,
> isn't that already covered by the final sentence in the quoted text
> above? if not, what is running simultaneously?

I believe the last sentence is referring to another session that is running its own sort and hash operations. So the first section you mention is describing how sort and hash operations can be in execution at the same time for a query, while the second refers to how sessions may overlap in their execution of sort and hash operations, if I am understanding this correctly. I also agree that changing "sort or hash" to "sort and hash" is a better description.

Tristen
Re: CDC/ETL system on top of logical replication with pgoutput, custom client
Hi, On 2023-07-31 14:16:22 +, José Neves wrote: > Hi Amit, thanks for the reply. > > In our worker (custom pg replication client), we care only about INSERT, > UPDATE, and DELETE operations, which - sure - may be part of the issue. That seems likely. Postgres streams out changes in commit order, not in order of the changes having been made (that'd not work due to rollbacks etc). If you just disregard transactions entirely, you'll get something bogus after retries. You don't need to store the details for each commit in the target system, just up to which LSN you have processed *commit records*. E.g. if you have received and safely stored up to commit 0/1000, you need to remember that. Are you using the 'streaming' mode / option to pgoutput? > 1. We have no way to match LSN operations with the respective commit, as > they have unordered offsets. Not sure what you mean with "unordered offsets"? > Assuming that all of them were received in order, we would commit all data > with the commit message LSN4-4000 as other events would match the transaction > start and end LSN interval of it. Logical decoding sends out changes in a deterministic order and you won't see out of order data when using TCP (the entire connection can obviously fail though). Andres
Re: pgsql: Fix search_path to a safe value during maintenance operations.
On Mon, Jul 31, 2023 at 1:18 PM Joe Conway wrote: > But the analysis of the issue needs to go one step further. Even if the > search_path does not change from the originally intended one, a newly > created function can shadow the intended one based on argument coercion > rules. Yeah, this is a complicated issue. As the system works today, if you include in your search_path a schema to which some other user can write, you are pretty much agreeing to execute code provided by that user. If that user has strictly greater privileges than you, e.g. they are the super-user, then that's fine, because they can compromise your account anyway, but otherwise, you're probably doomed. Not only can they try to capture references with similarly-named objects, they can also do things like create objects whose names are common mis-spellings of the objects that are supposed to be there and hope you access the wrong one by mistake. Maybe there are other attacks as well, but even if not, I think it's already a pretty hopeless situation. I think the UNIX equivalent would be including a directory in your PATH that is world-writable and hoping your account will stay secure. Not very likely. We have already taken an important step in terms of preventing this attack in commit b073c3ccd06e4cb845e121387a43faa8c68a7b62, which removed PUBLIC CREATE from the public schema. Before that, we were shipping a configuration analogous to a UNIX system where /usr/bin was world-writable -- something which no actual UNIX system has likely done any time in the last 40 years, because it's so clearly insane. We could maybe go a step further by changing the default search_path to not even include public, to further discourage people from using that as a catch-all where everybody can just dump their objects. 
Right now, anybody can revert that change with a single GRANT statement, and if we removed public from the default search_path as well, people would have one extra step to restore that insecure configuration. I don't know whether such a change is worthwhile, though. It would still be super-easy for users to create insecure configurations: as soon as user A can write a schema and user B has it in the search_path, B is in a lot of trouble if A turns out to be malicious. One thing we might be able to do to prevent that sort of thing is to have a feature to prevent "accidental" code execution, as in the "function trust" mechanism proposed previously. Say I trust all users who can SET ROLE to me and/or who inherit my privileges. Additionally I can decide to trust users who do neither of those things by some sort of explicit declaration. If I don't trust a user then if I do anything that would cause code supplied by that user to get executed, it just errors out: ERROR: role "rhaas" should not execute arbitrary code provided by role "jconway" HINT: If this should be allowed, use the TRUST command to permit it. Even if we do this, I still think we need the kinds of fixes that I mentioned earlier. An error like this is fine if you're trying to do something to a table owned by another user and they've got a surprise trigger on there or something, but a maintenance command like VACUUM should find a way to work that is both secure and non-astonishing. And we probably also still need to find ways to control search_path in a lot more widely than we do today. Otherwise, even if stuff is technically secure, it may just not work. -- Robert Haas EDB: http://www.enterprisedb.com
Re: pltcl tests fail with FreeBSD 13.2
Hi,

On 2023-07-31 12:15:10 -0700, Andres Freund wrote:
> It sure looks like freebsd 13.2 tcl is just busted. Notably it can't even
> parse what it generates:
>
> echo 'puts [clock scan [clock format [clock seconds] -format "%Y/%m/%d"]
> -format "%Y/%m/%d"]'|tclsh8.6
>
> Which works on 13.1 (and other operating systems), without a problem.
>
> I used truss as a very basic way to see differences between 13.1 and 13.2 -
> the big difference was .2 failing just after
> access("/etc/localtime",F_OK) ERR#2 'No such file or
> directory'
> open("/etc/localtime",O_RDONLY,077) ERR#2 'No such file or
> directory'
>
> whereas 13.1 also saw that, but then continued to
>
> issetugid() = 0 (0x0)
> open("/usr/share/zoneinfo/UTC",O_RDONLY,00) = 3 (0x3)
> fstat(3,{ mode=-r--r--r-- ,inode=351417,size=118,blksize=32768 }) = 0 (0x0)
> ...
>
> which made me test specifying the timezone explicitly:
> echo 'puts [clock scan [clock format [clock seconds] -format "%Y/%m/%d"
> -timezone "UTC"] -format "%Y/%m/%d" -timezone "UTC"]'|tclsh8.6
>
> Which, surprise, works.
>
> So does specifying the timezone via the TZ='UTC' environment variable.
>
> I guess there could be a libc behaviour change or such around timezones? I do
> see
> https://www.freebsd.org/releases/13.2R/relnotes/
> "tzcode has been upgraded to version 2022g with improved timezone change
> detection and reliability fixes."

One additional datapoint: If I configure the timezone with "tzsetup" (which creates /etc/localtime), the problem vanishes as well.

Greetings,

Andres Freund
Re: [feature]COPY FROM enable FORCE_NULL/FORCE_NOT_NULL on all columns
On 2023-07-26 We 03:03, Zhang Mingli wrote:

Hi, I've looked at this patch and it looks mostly fine, though I do not intend to commit it myself; perhaps Andrew will.

Hi Amit, thanks for the review.

A few minor things to improve:

+ If * is specified, it will be applied in all columns.
...
+ If * is specified, it will be applied in all columns.

Please write "it will be applied in" as "the option will be applied to".

+1

+ bool force_notnull_all; /* FORCE_NOT_NULL * */
...
+ bool force_null_all; /* FORCE_NULL * */

Like in the comment for force_quote, please add a "?" after * in the above comments.

+1

+ if (cstate->opts.force_notnull_all)
+ MemSet(cstate->opts.force_notnull_flags, true, num_phys_attrs * sizeof(bool));
...
+ if (cstate->opts.force_null_all)
+ MemSet(cstate->opts.force_null_flags, true, num_phys_attrs * sizeof(bool));

While I am not especially opposed to using this 1-line variant to set the flags array, it does mean that there are now different styles being used for similar code, because force_quote_flags uses a for loop:

if (cstate->opts.force_quote_all)
{
    int i;

    for (i = 0; i < num_phys_attrs; i++)
        cstate->opts.force_quote_flags[i] = true;
}

Perhaps we could fix the inconsistency by changing the force_quote_all code to use MemSet() too. I'll defer whether to do that to Andrew's judgement.

Sure, let's wait for Andrew and I will put everything in one pot then.

I was hoping to be able to get to it today but that's not happening. If you want to submit a revised patch as above that will be good. I hope to get to it later this week.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
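[Editor's note: for context, the patch under review extends COPY so that FORCE_NULL and FORCE_NOT_NULL accept * to mean all columns, mirroring the existing FORCE_QUOTE *. The following is a hypothetical usage sketch under that assumption; the table and file names are invented, and the exact syntax is whatever the committed patch ends up with.]

```sql
-- With the proposed syntax, every column is treated as FORCE_NULL, so a
-- quoted null string (e.g. "") in any column is read back as NULL:
COPY measurements FROM '/tmp/measurements.csv'
    WITH (FORMAT csv, FORCE_NULL *);

-- Today the same effect requires spelling out every column:
-- WITH (FORMAT csv, FORCE_NULL (col1, col2, col3));
```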
pltcl tests fail with FreeBSD 13.2
Hi,

I saw that CI image builds for freebsd were failing, because 13.1, used until now, is EOL. Update to 13.2, no problem, I thought. Ran a CI run against 13.2 - unfortunately that failed. In pltcl of all places.

https://api.cirrus-ci.com/v1/artifact/task/5275616266682368/testrun/build/testrun/pltcl/regress/regression.diffs

Notably both 13.1 and 13.2 are using tcl 8.6.13.

The code for the relevant function is this:

create function tcl_date_week(int4,int4,int4) returns text as $$
    return [clock format [clock scan "$2/$3/$1"] -format "%U"]
$$ language pltcl immutable;

select tcl_date_week(2010,1,26);

It doesn't surprise me that there are problems with the above clock scan, since it uses the MM/DD/ format without making that explicit. But why that doesn't work on freebsd 13.2, I can't explain. It looks like tcl specifies the MM/DD/ bit for "free format scans":
https://www.tcl.tk/man/tcl8.6/TclCmd/clock.html#M80

It sure looks like freebsd 13.2 tcl is just busted. Notably it can't even parse what it generates:

echo 'puts [clock scan [clock format [clock seconds] -format "%Y/%m/%d"] -format "%Y/%m/%d"]'|tclsh8.6

Which works on 13.1 (and other operating systems), without a problem.

I used truss as a very basic way to see differences between 13.1 and 13.2 - the big difference was .2 failing just after

access("/etc/localtime",F_OK)  ERR#2 'No such file or directory'
open("/etc/localtime",O_RDONLY,077)  ERR#2 'No such file or directory'

whereas 13.1 also saw that, but then continued to

issetugid()  = 0 (0x0)
open("/usr/share/zoneinfo/UTC",O_RDONLY,00)  = 3 (0x3)
fstat(3,{ mode=-r--r--r-- ,inode=351417,size=118,blksize=32768 }) = 0 (0x0)
...

which made me test specifying the timezone explicitly:

echo 'puts [clock scan [clock format [clock seconds] -format "%Y/%m/%d" -timezone "UTC"] -format "%Y/%m/%d" -timezone "UTC"]'|tclsh8.6

Which, surprise, works.

So does specifying the timezone via the TZ='UTC' environment variable.
I guess there could be a libc behaviour change or such around timezones? I do see
https://www.freebsd.org/releases/13.2R/relnotes/
"tzcode has been upgraded to version 2022g with improved timezone change detection and reliability fixes."

Greetings,

Andres Freund
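[Editor's note: until the underlying FreeBSD/tzcode issue is understood, the failing test function could sidestep the free-format scan entirely by making both the parse format and the timezone explicit, as Andres's workaround suggests. This is an untested sketch of such a rewrite, not a proposed regression-test change.]

```sql
create function tcl_date_week(int4,int4,int4) returns text as $$
    # parse with an explicit format and timezone instead of a free-format
    # scan, which depends on a working local timezone configuration
    return [clock format \
        [clock scan "$2/$3/$1" -format "%m/%d/%Y" -timezone "UTC"] \
        -format "%U" -timezone "UTC"]
$$ language pltcl immutable;

select tcl_date_week(2010,1,26);
```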
RE: CDC/ETL system on top of logical replication with pgoutput, custom client
Hi Euler, thank you for your reply.

Your output is exactly what mine does not look like; I don't have such an order - that is - not only under heavy load. Conditions in which this occurs make it challenging to provide detailed information, and it will also take a while to trigger. I've sent a previous email explaining what my output looks like, from a previous debug. I can gather more information if need be, but I was interested in this bit:

> Instead, Postgres provides a replication progress mechanism [1] to do it.

It's not 100% clear to me what that would look like at the code level; can you provide a high-level algorithm on how such code would work?

For reference, our implementation - to the bones - is very similar to this:
https://adam-szpilewicz.pl/cdc-replication-from-postgresql-using-go-golang

Thanks for your help.
Regards,
José Neves

From: Euler Taveira
Sent: 31 July 2023 15:27
To: José Neves; pgsql-hackers
Subject: Re: CDC/ETL system on top of logical replication with pgoutput, custom client

On Sat, Jul 29, 2023, at 8:07 PM, José Neves wrote:

I'm attempting to develop a CDC on top of Postgres, currently using 12, the last minor, with a custom client, and I'm running into issues with data loss caused by out-of-order logical replication messages.

Can you provide a test case to show this issue? Did you try in a newer version?

The problem is as follows: postgres streams A, B, D, G, K, I, P logical replication events; upon exit signal we stop consuming new events at LSN K, and we wait 30s for out-of-order events. Let's say that we only got A (and K ofc), so in the following 30s we get B, D; however, for whatever reason, G never arrived. As with pgoutput-based logical replication we have no way to calculate the next LSN, we have no idea that G was missing, so we assumed that it all arrived, committing K to the postgres slot and shutting down. In the next run, our worker will start receiving data from K forward, and G is lost forever...
Meanwhile postgres moves forward with archiving and we can't go back to check if we lost anything. And even if we could, it would be extremely inefficient.

Logical decoding provides the changes to the output plugin at commit time. You mentioned the logical replication events but didn't say which are part of the same transaction. Let's say A, B, D and K are changes from the same transaction and G, I and P are changes from another transaction. The first transaction will be available when it processes K. The second transaction will be provided when the logical decoding processes P.

You didn't say how your consumer is working. Are you sure your consumer doesn't get the second transaction? If your consumer is advancing the replication slot *after* receiving K (using pg_replication_slot_advance), it is doing it wrong. Another common problem with consumers is that they use pg_logical_slot_get_changes() but crash *before* using the data; in this case, the data is lost. It is hard to say where the problem is if you didn't provide enough information about the consumer logic and the WAL information (pg_waldump output) around the time you detect the data loss.

In sum, the issue comes from the fact that postgres will stream events with unordered LSNs on high transactional systems, and that pgoutput doesn't have access to enough information to calculate the next or last LSN, so we have no way to check if we received all the data that we are supposed to receive, risking committing an offset that we shouldn't, as we haven't yet received preceding data.

It seems very odd to me that none of the open-source CDC projects that I looked into care about this. They always assume that the next LSN received is... well, the next one, and commit that one, so upon restart they are vulnerable to the same issue. So... either I'm missing something... or we have a generalized assumption causing data loss under certain conditions all over.

Let me illustrate the current behavior.
Let's say there are 3 concurrent transactions.

Session A
==
euler=# SELECT pg_create_logical_replication_slot('repslot1', 'wal2json');
 pg_create_logical_replication_slot
------------------------------------
 (repslot1,0/369DF088)
(1 row)

euler=# create table foo (a int primary key);
CREATE TABLE
euler=# BEGIN;
BEGIN
euler=*# INSERT INTO foo (a) SELECT generate_series(1, 2);
INSERT 0 2

Session B
==
euler=# BEGIN;
BEGIN
euler=*# INSERT INTO foo (a) SELECT generate_series(11, 12);
INSERT 0 2

Session C
==
euler=# BEGIN;
BEGIN
euler=*# INSERT INTO foo (a) SELECT generate_series(21, 22);
INSERT 0 2

Session A
==
euler=*# INSERT INTO foo (a) VALUES(3);
INSERT 0 1

Session B
==
euler=*# INSERT INTO foo (a) VALUES(13);
INSERT 0 1

Session C
==
euler=*# INSERT INTO foo (a) VALUES(23);
INSERT 0 1
euler=*# COMMIT;
COMMIT

Session B
==
euler=*# COMMIT;
COMMIT

Session A
==
euler=*# COMMIT;
COMMIT

The
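[Editor's note: for a custom consumer, the "replication progress mechanism" referred to above is exposed via the replication origin functions. A high-level sketch follows; the function names are from the core API, while the origin name and the LSN value are invented for illustration. The essential point is that the consumer records, transaction by transaction, the commit LSN it has durably applied downstream, and resumes from that recorded position after a restart.]

```sql
-- One-time setup: register an origin representing this consumer.
SELECT pg_replication_origin_create('my_cdc_consumer');

-- After durably applying a whole transaction downstream, record the LSN
-- of its commit record (as reported with the COMMIT message):
SELECT pg_replication_origin_advance('my_cdc_consumer', '0/369DF0C0'::pg_lsn);

-- Inspect the recorded progress, e.g. after a crash, to know where to
-- restart the stream:
SELECT * FROM pg_replication_origin_status;
```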
Re: should frontend tools use syncfs() ?
On Mon, Jul 31, 2023 at 10:51:38AM -0700, Nathan Bossart wrote: > Here is a new version of the patch with documentation updates and a couple > other small improvements. I just realized I forgot to update the --help output for these utilities. I'll do that in the next version of the patch. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: add timing information to pg_upgrade
On Mon, Jul 31, 2023 at 11:34:57AM +0530, Bharath Rupireddy wrote:
> Either of "Checking for \"aclitem\" data type usage" or "Checking for
> \"aclitem\" data type in user tables" seems okay to me, however, I
> prefer the second wording.

Okay. I used the second wording for all the data type checks in v3. I also marked the timing strings for translation. I considered trying to extract psql's PrintTiming() so that it could be reused in other utilities, but the small differences would likely make translation difficult, and the logic isn't terribly long or sophisticated.

My only remaining concern is that this timing information could cause pg_upgrade's output to exceed 80 characters per line. I don't know if this is something that folks are very worried about in 2023, but it still seemed worth bringing up.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

>From c52513341fb2a315203e9ba5dd95761220744a74 Mon Sep 17 00:00:00 2001
From: Nathan Bossart
Date: Mon, 31 Jul 2023 11:02:27 -0700
Subject: [PATCH v3 1/2] harmonize data type status messages in pg_upgrade

---
 src/bin/pg_upgrade/check.c   | 4 ++--
 src/bin/pg_upgrade/version.c | 6 +++---
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 64024e3b9e..38a4227be0 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -1215,7 +1215,7 @@ check_for_aclitem_data_type_usage(ClusterInfo *cluster)
 {
 	char		output_path[MAXPGPATH];

-	prep_status("Checking for incompatible \"aclitem\" data type in user tables");
+	prep_status("Checking for \"aclitem\" data type in user tables");

 	snprintf(output_path, sizeof(output_path), "tables_using_aclitem.txt");
@@ -1243,7 +1243,7 @@ check_for_jsonb_9_4_usage(ClusterInfo *cluster)
 {
 	char		output_path[MAXPGPATH];

-	prep_status("Checking for incompatible \"jsonb\" data type");
+	prep_status("Checking for \"jsonb\" data type in user tables");

 	snprintf(output_path, sizeof(output_path), "%s/%s",
 			 log_opts.basedir,

diff --git a/src/bin/pg_upgrade/version.c b/src/bin/pg_upgrade/version.c
index 403a6d7cfa..9755af920f 100644
--- a/src/bin/pg_upgrade/version.c
+++ b/src/bin/pg_upgrade/version.c
@@ -181,7 +181,7 @@ old_9_3_check_for_line_data_type_usage(ClusterInfo *cluster)
 {
 	char		output_path[MAXPGPATH];

-	prep_status("Checking for incompatible \"line\" data type");
+	prep_status("Checking for \"line\" data type in user tables");

 	snprintf(output_path, sizeof(output_path), "%s/%s",
 			 log_opts.basedir,
@@ -221,7 +221,7 @@ old_9_6_check_for_unknown_data_type_usage(ClusterInfo *cluster)
 {
 	char		output_path[MAXPGPATH];

-	prep_status("Checking for invalid \"unknown\" user columns");
+	prep_status("Checking for \"unknown\" data type in user tables");

 	snprintf(output_path, sizeof(output_path), "%s/%s",
 			 log_opts.basedir,
@@ -366,7 +366,7 @@ old_11_check_for_sql_identifier_data_type_usage(ClusterInfo *cluster)
 {
 	char		output_path[MAXPGPATH];

-	prep_status("Checking for invalid \"sql_identifier\" user columns");
+	prep_status("Checking for \"sql_identifier\" data type in user tables");

 	snprintf(output_path, sizeof(output_path), "%s/%s",
 			 log_opts.basedir,
--
2.25.1

>From 99ba1903472b72480d4ace13568300f35320cbaf Mon Sep 17 00:00:00 2001
From: Nathan Bossart
Date: Thu, 27 Jul 2023 16:16:45 -0700
Subject: [PATCH v3 2/2] add timing information to pg_upgrade

---
 src/bin/pg_upgrade/util.c | 55 ++-
 1 file changed, 54 insertions(+), 1 deletion(-)

diff --git a/src/bin/pg_upgrade/util.c b/src/bin/pg_upgrade/util.c
index 21ba4c8f12..23fd5f87af 100644
--- a/src/bin/pg_upgrade/util.c
+++ b/src/bin/pg_upgrade/util.c
@@ -9,14 +9,19 @@
 #include "postgres_fe.h"

+#include
 #include

 #include "common/username.h"
 #include "pg_upgrade.h"
+#include "portability/instr_time.h"

 LogOpts		log_opts;

+static instr_time step_start;
+
 static void pg_log_v(eLogType type, const char *fmt, va_list ap) pg_attribute_printf(2, 0);
+static char *get_time_str(double time_ms);

 /*
@@ -137,6 +142,8 @@ prep_status(const
char *fmt,...)
 	/* trim strings */
 	pg_log(PG_REPORT_NONL, "%-*s", MESSAGE_WIDTH, message);
+
+	INSTR_TIME_SET_CURRENT(step_start);
 }

 /*
@@ -170,6 +177,8 @@ prep_status_progress(const char *fmt,...)
 		pg_log(PG_REPORT, "%-*s", MESSAGE_WIDTH, message);
 	else
 		pg_log(PG_REPORT_NONL, "%-*s", MESSAGE_WIDTH, message);
+
+	INSTR_TIME_SET_CURRENT(step_start);
 }

 static void
@@ -280,11 +289,55 @@ pg_fatal(const char *fmt,...)
 }

+static char *
+get_time_str(double time_ms)
+{
+	double		seconds;
+	double		minutes;
+	double		hours;
+	double		days;
+
+	if (time_ms < 1000.0)
+		return psprintf(_("%.3f ms"), time_ms);
+
+	seconds = time_ms / 1000.0;
+	minutes = floor(seconds / 60.0);
+	seconds -= 60.0 * minutes;
+	if (minutes < 60.0)
+		return psprintf(_("%.3f ms (%02d:%06.3f)"),
+						time_ms, (int) minutes, seconds);
+
+	hours =
Re: should frontend tools use syncfs() ?
On Sat, Jul 29, 2023 at 02:40:10PM -0700, Nathan Bossart wrote:
> I was about to start a new thread, but I found this one with some good
> preliminary discussion. I came to the same conclusion about introducing a
> new option instead of using syncfs() by default wherever it is available.
> The attached patch is still a work-in-progress, but it seems to behave as
> expected. I began investigating this because I noticed that the
> sync-data-directory step on pg_upgrade takes quite a while when there are
> many files, and I am looking for ways to reduce the amount of downtime
> required for pg_upgrade.
>
> The attached patch adds a new --sync-method option to the relevant frontend
> utilities, but I am not wedded to that name/approach.

Here is a new version of the patch with documentation updates and a couple other small improvements.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

>From 6b65605c90703ee3aebca0b90c771c9f14b59545 Mon Sep 17 00:00:00 2001
From: Nathan Bossart
Date: Fri, 28 Jul 2023 15:56:24 -0700
Subject: [PATCH v2 1/1] allow syncfs in frontend utilities

---
 doc/src/sgml/ref/initdb.sgml          | 27
 doc/src/sgml/ref/pg_basebackup.sgml   | 30 +
 doc/src/sgml/ref/pg_checksums.sgml    | 27
 doc/src/sgml/ref/pg_dump.sgml         | 27
 doc/src/sgml/ref/pg_rewind.sgml       | 27
 doc/src/sgml/ref/pgupgrade.sgml       | 29 +
 src/bin/initdb/initdb.c               | 11 +++-
 src/bin/pg_basebackup/pg_basebackup.c | 10 ++-
 src/bin/pg_checksums/pg_checksums.c   | 8 ++-
 src/bin/pg_dump/pg_backup.h           | 4 +-
 src/bin/pg_dump/pg_backup_archiver.c  | 13 ++--
 src/bin/pg_dump/pg_backup_archiver.h  | 1 +
 src/bin/pg_dump/pg_backup_directory.c | 2 +-
 src/bin/pg_dump/pg_dump.c             | 9 ++-
 src/bin/pg_rewind/file_ops.c          | 2 +-
 src/bin/pg_rewind/pg_rewind.c         | 8 +++
 src/bin/pg_rewind/pg_rewind.h         | 2 +
 src/bin/pg_upgrade/option.c           | 12
 src/bin/pg_upgrade/pg_upgrade.c       | 6 +-
 src/bin/pg_upgrade/pg_upgrade.h       | 1 +
 src/common/file_utils.c               | 91 ++-
 src/fe_utils/option_utils.c           | 18 ++
 src/include/common/file_utils.h       | 17 -
 src/include/fe_utils/option_utils.h   | 3 +
 src/include/storage/fd.h              | 4 ++
 src/tools/pgindent/typedefs.list      | 1 +
 26 files changed, 369 insertions(+), 21 deletions(-)

diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index 22f1011781..872fef5705 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -365,6 +365,33 @@ PostgreSQL documentation

+
+ --sync-method
+
+
+When set to fsync, which is the default,
+initdb will recursively open and synchronize all
+files in the data directory. The search for files will follow symbolic
+links for the WAL directory and each configured tablespace.
+
+
+On Linux, syncfs may be used instead to ask the
+operating system to synchronize the whole file systems that contain the
+data directory, the WAL files, and each tablespace. This may be a lot
+faster than the fsync setting, because it doesn't
+need to open each file one by one. On the other hand, it may be slower
+if a file system is shared by other applications that modify a lot of
+files, since those files will also be written to disk. Furthermore, on
+versions of Linux before 5.8, I/O errors encountered while writing data
+to disk may not be reported to initdb, and relevant
+error messages may appear only in kernel logs.
+
+
+This option has no effect when --no-sync is used.
+
+
+

 -S
 --sync-only

diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 79d3e657c3..af8eb43251 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -594,6 +594,36 @@ PostgreSQL documentation

+
+ --sync-method
+
+
+When set to fsync, which is the default,
+pg_basebackup will recursively open and synchronize
+all files in the backup directory. When the plain format is used, the
+search for files will follow symbolic links for the WAL directory and
+each configured tablespace.
+
+
+On Linux, syncfs may be used instead to ask the
+operating system to synchronize the whole file system that contains the
+backup directory. When the plain format is used,
+pg_basebackup will also synchronize the file systems
+that contain the WAL files and each tablespace. This may be a lot
+faster than the fsync setting, because it doesn't
+need to open each file one by one. On the other hand, it may be slower
+
Re: pgsql: Fix search_path to a safe value during maintenance operations.
On 7/31/23 12:53, Robert Haas wrote:

On Fri, Jun 30, 2023 at 3:41 AM Jeff Davis wrote:

I'm not sure that everyone in this thread realizes just how broken it is to depend on search_path in a functional index at all. And doubly so if it depends on a schema other than pg_catalog in the search_path.

Let's also not forget that logical replication always uses search_path=pg_catalog, so if you depend on a different search_path for any function attached to the table (not just functional indexes, also functions inside expressions or trigger functions), then those are already broken in version 15. And if a superuser is executing maintenance commands, there's little reason to think they'll have the same search path as the user that created the table.

At some point in the very near future (though I realize that point may come after version 16), we need to lock down the search path in a lot of cases (not just maintenance commands), and I don't see any way around that.

I agree. I think there are actually two interrelated problems here. One is that virtually all code needs to run with the originally intended search_path rather than some search_path chosen at another time and maybe by a different user. If not, it's going to break, or compromise security, depending on the situation. The other is that running arbitrary code written by somebody else as yourself is basically instant death, from a security perspective.

I agree too. But the analysis of the issue needs to go one step further. Even if the search_path does not change from the originally intended one, a newly created function can shadow the intended one based on argument coercion rules.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Mon, Jul 31, 2023 at 12:24 PM Alena Rybakina wrote:
> I noticed that you are going to add CNF->DNF transformation at the index
> construction stage. If I understand correctly, you will rewrite
> restrictinfo node,
> change boolean "AND" expressions to "OR" expressions, but would it be
> possible to apply such a procedure earlier?

Sort of. I haven't really added any new CNF->DNF transformations. The code you're talking about is really just checking that every index path has clauses that we know that nbtree can handle. That's a big, ugly modularity violation -- many of these details are quite specific to the nbtree index AM (in theory we could have other index AMs that are amsearcharray).

At most, v1 of the patch makes greater use of an existing transformation that takes place in the nbtree index AM, as it preprocesses scan keys for these types of queries (it's not inventing new transformations at all). This is a slightly creative interpretation, too. Tom's commit 9e8da0f7 didn't actually say anything about CNF/DNF.

> Otherwise I suppose you
> could face the problem of
> incorrect selectivity of the calculation and, consequently, the
> cardinality calculation?

I can't think of any reason why that should happen as a direct result of what I have done here. Multi-column index paths + multiple SAOP clauses are not a new thing. The number of rows returned does not depend on whether we have some columns as filter quals or not.

Of course that doesn't mean that the costing has no problems. The costing definitely has several problems right now. It also isn't necessarily okay that it's "just as good as before" if it turns out that it needs to be better now. But I don't see why it would be. (Actually, my hope is that selectivity estimation might be *less* important as a practical matter with the patch.)

> I can't clearly understand at what stage it is clear that the such a
> transformation needs to be applied?

I don't know either.
I think that most of this work needs to take place in the nbtree code, during preprocessing. But it's not so simple. There is a mutual dependency between the code that generates index paths in the planner and nbtree scan key preprocessing. The planner needs to know what kinds of index paths are possible/safe up-front, so that it can choose the fastest plan (the fastest that the index AM knows how to execute correctly). But, there are lots of small annoying nbtree implementation details that might matter, and can change.

I think we need to have nbtree register a callback, so that the planner can initialize some preprocessing early. I think that we require a "two way conversation" between the planner and the index AM.

--
Peter Geoghegan
Re: Faster "SET search_path"
On Sat, Jul 29, 2023 at 11:59 AM Jeff Davis wrote: > Unfortunately, adding a "SET search_path" clause to functions slows > them down. The attached patches close the performance gap > substantially. I haven't reviewed the code but I like the concept a lot. This is badly needed. -- Robert Haas EDB: http://www.enterprisedb.com
Re: pgsql: Fix search_path to a safe value during maintenance operations.
On Fri, Jun 30, 2023 at 3:41 AM Jeff Davis wrote:
> I'm not sure that everyone in this thread realizes just how broken it
> is to depend on search_path in a functional index at all. And doubly so
> if it depends on a schema other than pg_catalog in the search_path.
>
> Let's also not forget that logical replication always uses
> search_path=pg_catalog, so if you depend on a different search_path for
> any function attached to the table (not just functional indexes, also
> functions inside expressions or trigger functions), then those are
> already broken in version 15. And if a superuser is executing
> maintenance commands, there's little reason to think they'll have the
> same search path as the user that created the table.
>
> At some point in the very near future (though I realize that point may
> come after version 16), we need to lock down the search path in a lot
> of cases (not just maintenance commands), and I don't see any way
> around that.

I agree. I think there are actually two interrelated problems here. One is that virtually all code needs to run with the originally intended search_path rather than some search_path chosen at another time and maybe by a different user. If not, it's going to break, or compromise security, depending on the situation. The other is that running arbitrary code written by somebody else as yourself is basically instant death, from a security perspective.

It's a little hard to imagine a world in which these problems don't exist at all, but it somehow feels like the design of the system pushes you toward doing this stuff incorrectly rather than doing it correctly. For instance, you can imagine a system where when you run CREATE OR REPLACE FUNCTION, the prevailing search_path is captured and automatically included in proconfig.
Then the default behavior would be to run functions and procedures with the search_path that was in effect when they were created, rather than what we actually have currently, which is the search_path in effect at execution time.

It's a little harder to imagine something similar around all the user switching behavior, just because we have so many ways of triggering arbitrary code execution: views, triggers, event triggers, expression indexes, constraints, etc. But you can maybe imagine a system where all code associated with a table is run as the table owner in all cases, regardless of SECURITY INVOKER/DEFINER, which I think would at least close some holes.

The difficulty is that it's a bit hard to imagine making these kinds of definitional changes now, because they'd probably be breaking changes for pretty significant numbers of users. But on the other hand, if we don't start thinking about systemic changes here, it feels like we're just playing whack-a-mole.

--
Robert Haas
EDB: http://www.enterprisedb.com
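[Editor's note: a per-function version of the capture-at-creation behavior described above already exists in PostgreSQL, although it is opt-in rather than the default: CREATE FUNCTION can pin the creation-time search_path into proconfig. The function body and table name below are invented for illustration.]

```sql
-- The search_path in effect at creation time is saved with the function
-- and re-established on every call, regardless of the caller's setting:
CREATE FUNCTION fetch_config(cfg_key text) RETURNS text
    LANGUAGE sql STABLE
    SET search_path FROM CURRENT
    AS $$ SELECT value FROM app_settings WHERE name = cfg_key $$;
```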
Re: POC, WIP: OR-clause support for indexes
Hi!

I think it really helps to speed up the search with similar deep filtering compared to cluster indexes, but do we have cases where we don't use this algorithm because it takes longer than a usual index? I thought about the situation with wide indexes (with a lot of multiple columns) and having a lot of filtering predicates for them.

I think that it should be possible for the optimizer to only use multi-column SAOP index paths when there is at least likely to be some small advantage -- that's definitely my goal. Importantly, we may not really need to accurately model the costs where the new approach turns out to be much faster. The only essential thing is that we avoid cases where the new approach is much slower than the old approach. Which is possible (in at least some cases) by making the runtime behavior adaptive. The best decision that the planner can make may be no decision at all. Better to wait until runtime where at all possible, since that gives us the latest and most accurate picture of things.

But I'm not sure about this, so it seems to me that this is a problem of improper use of indexes rather.

It's hard to say how true that is. Certainly, workloads similar to the TPC-DS benchmark kinda need something like MDAM. It's just not practical to have enough indexes to support every possible query -- the benchmark is deliberately designed to have unpredictable, hard-to-optimize access paths. It seems to require having fewer, more general indexes that can support multi-dimensional access reasonably efficiently.

Of course, with OLTP it's much more likely that the workload will have predictable access patterns. That makes having exactly the right indexes much more practical. So maybe you're right there. But, I still see a lot of value in a design that is as forgiving as possible. Users really like that kind of thing in my experience.
I tend to agree with you, but a runtime estimate cannot give us an accurate picture when using indexes correctly or any other optimizations due to the unstable state of the environment in which the query is executed. I believe that a more complex analysis is needed here.

Currently, the optimizer doesn't recognize multi-column indexes with SAOPs on every column as having a valid sort order, except on the first column. It seems possible that that has consequences for your patch. (I'm really only guessing, though; don't trust anything that I say about the optimizer too much.)

Honestly, I couldn't understand your concerns very well, could you describe it in more detail?

Well, I'm not sure if there is any possible scenario where the transformation from your patch makes it possible to go from an access path that has a valid sort order (usually because there is an underlying index scan) into an access path that doesn't. In fact, the opposite situation seems more likely (which is good news) -- especially if you assume that my own patch is also present.

Going from a bitmap OR (which cannot return sorted output) to a multi-column SAOP index scan (which now can) may have significant value in some specific circumstances. Most obviously, it's really useful when it enables us to feed tuples into a GroupAggregate without a separate sort step, and without a hash aggregate (that's why I see value in combining your patch with my own patch). You just need to be careful about allowing the opposite situation to take place.

More generally, there is a need to think about strange second order effects. We want to be open to useful second order effects that make query execution much faster in some specific context, while avoiding harmful second order effects. Intuitively, I think that it should be possible to do this with the transformations performed by your patch.
In other words, "helpful serendipity" is an important advantage, while "harmful anti-serendipity" is what we really want to avoid. Ideally by making the harmful cases impossible "by construction".

I noticed only one thing there: when we have unsorted array values in a SAOP, the query takes longer than when it has a sorted array. I'll double-check it just in case and write about the results later. I am also gathering some experience with multi-column indexes using SAOPs.

--
Regards,
Alena Rybakina
Postgres Professional
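[Editor's note: for reference, the shape of query this discussion is aiming at looks roughly like the sketch below; the table, index, and values are invented. Today such multiple-SAOP quals tend to become a bitmap OR, which loses the index sort order, whereas a multi-column SAOP index scan, as proposed, could feed the GROUP BY / ORDER BY directly without a separate sort or hash aggregate.]

```sql
CREATE INDEX orders_cust_status_idx ON orders (customer_id, status);

-- Multiple SAOP clauses on successive columns of one index:
SELECT customer_id, status, count(*)
FROM orders
WHERE customer_id = ANY ('{101,205,307}')
  AND status = ANY ('{shipped,returned}')
GROUP BY customer_id, status
ORDER BY customer_id, status;
```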
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Thu, Jul 27, 2023 at 10:00 AM Matthias van de Meent wrote: > My idea is not quite block nested loop join. It's more 'restart the > index scan at the location the previous index scan ended, if > heuristics say there's a good chance that might save us time'. I'd say > it is comparable to the fast tree descent optimization that we have > for endpoint queries, and comparable to this patch's scankey > optimization, but across AM-level rescans instead of internal rescans. Yeah, I see what you mean. Seems related, even though what you've shown in your prototype patch doesn't seem like it fits into my taxonomy very neatly. (BTW, I was a little confused by the use of the term "endpoint" at first, since there is a function that uses that term to refer to a descent of the tree that happens without any insertion scan key. This path is used whenever the best we can do in _bt_first is to descend to the rightmost or leftmost page.) > The basic design of that patch is this: We keep track of how many > times we've rescanned, and the end location of the index scan. If a > new index scan hits the same page after _bt_search as the previous > scan ended, we register that. I can see one advantage that block nested loop join would retain here: it does block-based accesses on both sides of the join. Since it "looks ahead" on both sides of the join, more repeat accesses are likely to be avoided. Not too sure how much that matters in practice, though. -- Peter Geoghegan
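The rescan idea described above - restart at the location the previous index scan ended when heuristics say it might pay off - can be sketched roughly like this (a hypothetical Python model, not the prototype patch's actual code):

```python
# Toy model: remember the leaf page the previous scan ended on. If the next
# rescan would land on the same page, reuse that position instead of doing
# a full descent from the root.

class RescanCache:
    def __init__(self):
        self.last_page = None
        self.hits = 0

    def descend(self, target_page):
        if target_page == self.last_page:
            self.hits += 1          # reuse previous position, no new descent
            return "cached"
        self.last_page = target_page
        return "full-descent"

cache = RescanCache()
cache.descend(7)   # first rescan: full descent
cache.descend(7)   # same page again: cached
```

In the real design the cache hit would also feed back into the heuristic that decides whether caching is worth continuing.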
Re: Request for comment on setting binary format output per session
Dave Cramer On Mon, 10 Jul 2023 at 03:56, Daniel Gustafsson wrote: > > On 25 Apr 2023, at 16:47, Dave Cramer wrote: > > > Patch attached with comments removed > > This patch no longer applies, please submit a rebased version on top of > HEAD. > Rebased see attached > > -- > Daniel Gustafsson > > 0001-Created-protocol.h.patch Description: Binary data
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Hi, all! CNF -> DNF conversion = Like many great papers, the MDAM paper takes one core idea, and finds ways to leverage it to the hilt. Here the core idea is to take predicates in conjunctive normal form (an "AND of ORs"), and convert them into disjunctive normal form (an "OR of ANDs"). DNF quals are logically equivalent to CNF quals, but ideally suited to SAOP-array style processing by an ordered B-Tree index scan -- they reduce everything to a series of non-overlapping primitive index scans, that can be processed in keyspace order. We already do this today in the case of SAOPs, in effect. The nbtree "next array keys" state machine already materializes values that can be seen as MDAM style DNF single value predicates. The state machine works by outputting the cartesian product of each array as a multi-column index is scanned, but that could be taken a lot further in the future. We can use essentially the same kind of state machine to do everything described in the paper -- ultimately, it just needs to output a list of disjuncts, like the DNF clauses that the paper shows in "Table 3". In theory, anything can be supported via a sufficiently complete CNF -> DNF conversion framework. There will likely always be the potential for unsafe/unsupported clauses and/or types in an extensible system like Postgres, though. So we will probably need to retain some notion of safety. It seems like more of a job for nbtree preprocessing (or some suitably index-AM-agnostic version of the same idea) than the optimizer, in any case. But that's not entirely true, either (that would be far too easy). The optimizer still needs to optimize. It can't very well do that without having some kind of advanced notice of what is and is not supported by the index AM. And, the index AM cannot just unilaterally decide that index quals actually should be treated as filter/qpquals, after all -- it doesn't get a veto. So there is a mutual dependency that needs to be resolved. 
I suspect that there needs to be a two way conversation between the optimizer and nbtree code to break the dependency -- a callback that does some of the preprocessing work during planning. Tom said something along the same lines in passing, when discussing the MDAM paper last year [2]. Much work remains here. Honestly, I'm just reading and delving into this thread and other topics related to it, so excuse me if I ask you a few obvious questions. I noticed that you are going to add the CNF->DNF transformation at the index construction stage. If I understand correctly, you will rewrite the restrictinfo node, changing boolean "AND" expressions to "OR" expressions, but would it be possible to apply such a procedure earlier? Otherwise, I suppose you could face the problem of an incorrect selectivity estimate and, consequently, an incorrect cardinality estimate. I also can't clearly understand at what stage it becomes clear that such a transformation needs to be applied. -- Regards, Alena Rybakina Postgres Professional
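For readers following along, the "cartesian product of each array" behavior that the CNF -> DNF discussion builds on can be sketched in a few lines of Python (a toy model only; the real nbtree state machine materializes these values incrementally rather than up front):

```python
# Hypothetical sketch of the MDAM-style expansion: CNF quals with an array
# on each column, e.g.  a IN (1, 2) AND b IN (10, 20), expand into DNF
# single-value conjuncts that an ordered scan can process as non-overlapping
# primitive index scans in keyspace order.
from itertools import product

def cnf_arrays_to_dnf(column_arrays):
    """Expand per-column SAOP arrays into sorted DNF single-value tuples."""
    sorted_arrays = [sorted(set(vals)) for vals in column_arrays]
    return list(product(*sorted_arrays))

dnf = cnf_arrays_to_dnf([[2, 1], [20, 10]])
print(dnf)  # [(1, 10), (1, 20), (2, 10), (2, 20)] -- keyspace order
```

The combinatorial blow-up visible here is exactly why the real implementation steps through the product with a state machine instead of materializing it.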
Re: proposal: psql: show current user in prompt
On Mon, 24 Jul 2023 at 21:16, Pavel Stehule wrote: > I don't understand how it can be possible to do it without. I need to > process possible errors, and then I need to read and synchronize protocol. I > didn't inject > this feature to some oher flow, so I need to implement a complete process. I think I might be missing the reason for this then. Could you explain a bit more why you didn't inject the feature into another flow? Because it seems like it would be better to insert the logic for handling the new response packet in pqParseInput3(), and then wait on the result with PQexecFinish(). This would allow sending these messages in a pipelined mode. > For example, PQsetClientEncoding does a PQexec call, which is much more > expensive. Yeah, but you'd usually only call that once for the life of the connection. But honestly it would still be good if there was a pipelined version of that function. > Unfortunately, for this feature, I cannot change some local state variables, > but I need to change the state of the server. Own message is necessary, > because we don't want to be limited by the current transaction state, and > then we cannot reuse PQexec. I guess this is your reasoning for why it needs its own state machine, but I don't think I understand the problem. Could you expand a bit more on what you mean? Note that different message types use PQexecFinish to wait for their result, e.g. PQdescribePrepared and PQclosePrepared use PQexecFinish too and those wait for a RowDescription and a Close message respectively. I added the logic for PQclosePrepared recently, that patch might be some helpful example code: https://github.com/postgres/postgres/commit/28b5726561841556dc3e00ffe26b01a8107ee654 > The API can be changed from PQlinkParameterStatus(PGconn *conn, const char > *paramName) > > to > > PQlinkParameterStatus(PGconn *conn, int nParamNames, const char *paramNames) That would definitely address the issue with the many round trips being needed. 
But it would still leave the issue of introducing a second state machine in the libpq code. So if it's possible to combine this new code into the existing state machine, then that seems a lot better.
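The "reuse the existing state machine" suggestion comes down to the FIFO pairing that pipeline mode already relies on: each queued request is matched with the next incoming response. A minimal Python sketch of that pairing (a hypothetical class, not libpq's actual API):

```python
# Hedged sketch of the pipelining idea: instead of a second state machine,
# queue each outgoing request and pair incoming responses with requests in
# FIFO order, the way libpq pairs results with queries in pipeline mode.
from collections import deque

class PipelineConn:
    def __init__(self):
        self.pending = deque()   # requests awaiting a response

    def send(self, request):
        self.pending.append(request)

    def handle_response(self, response):
        """Pair the oldest in-flight request with this response."""
        request = self.pending.popleft()
        return (request, response)

conn = PipelineConn()
conn.send("report search_path")
conn.send("report role")
print(conn.handle_response("public"))  # pairs with the first request sent
```

With this shape, several parameter-report requests can be sent back to back without a round trip per request.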
Re: pg_upgrade fails with in-place tablespace
On Mon, Jul 31, 2023 at 5:36 PM Rui Zhao wrote: > > Hello postgres hackers, > Recently I encountered an issue: pg_upgrade fails when dealing with in-place > tablespace. As we know, pg_upgrade uses pg_dumpall to dump objects and > pg_restore to restore them. The problem seems to be that pg_dumpall is > dumping in-place tablespace as relative path, which can't be restored. > > Here is the error message of pg_upgrade: > psql:/home/postgres/postgresql/src/bin/pg_upgrade/tmp_check/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20230729T210058.329/dump/pg_upgrade_dump_globals.sql:36: > ERROR: tablespace location must be an absolute path > > To help reproduce the failure, I have attached a tap test. The test also > fails with tablespace regression, and it change the default value of > allow_in_place_tablespaces to true only for debug, so it may not be fit for > production. However, it is enough to reproduce this failure. > I have also identified a solution for this problem, which I have included in > the patch. The solution has two modifications: > 1) Make the function pg_tablespace_location returns path "" with in-place > tablespace, rather than relative path. Because the path of the in-place > tablespace is always 'pg_tblspc/'. > 2) Only check the tablespace with an absolute path in pg_upgrade. > > There are also other solutions, such as supporting the creation of > relative-path tablespace in function CreateTableSpace. But do we really need > relative-path tablespace? I think in-place tablespace is enough by now. My > solution may be more lightweight and harmless. > > Thank you for your attention to this matter. > > Best regards, > Rui Zhao Seems like allow_in_place_tablespaces is a developer-only GUC, and it is not meant for end-user usage. Check this commit: 7170f2159fb21b62c263acd458d781e2f3c3f8bb -- Regards Junwang Zhao
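The rule proposed in the patch description can be modeled in a few lines (hypothetical helper functions for illustration; the real change touches pg_tablespace_location and pg_upgrade's tablespace check):

```python
# Sketch of the proposed behavior: an in-place tablespace always lives under
# a relative 'pg_tblspc/<oid>' path, so its reported location would be '',
# and pg_upgrade would only validate tablespaces whose reported location is
# an absolute path.
import os

def reported_location(raw_path):
    """Return '' for an in-place (relative) tablespace, else the path."""
    return raw_path if os.path.isabs(raw_path) else ""

def needs_upgrade_check(raw_path):
    """pg_upgrade would only check tablespaces with an absolute location."""
    return reported_location(raw_path) != ""
```

Under this rule the dump no longer emits an unrestorable relative path, which is the failure shown in the error message above.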
Inaccurate comments in ReorderBufferCheckMemoryLimit()
Hi all, While reading the code, I realized that the following code comments might not be accurate:

/*
 * Pick the largest transaction (or subtransaction) and evict it from
 * memory by streaming, if possible.  Otherwise, spill to disk.
 */
if (ReorderBufferCanStartStreaming(rb) &&
	(txn = ReorderBufferLargestStreamableTopTXN(rb)) != NULL)
{
	/* we know there has to be one, because the size is not zero */
	Assert(txn && rbtxn_is_toptxn(txn));
	Assert(txn->total_size > 0);
	Assert(rb->size >= txn->total_size);

	ReorderBufferStreamTXN(rb, txn);
}

AFAICS since ReorderBufferLargestStreamableTopTXN() returns only top-level transactions, the comment above the if statement is not right. It would not pick a subtransaction. Also, I'm not sure that the second comment "we know there has to be one, because the size is not zero" is right since there might not be top-transactions that are streamable. I think this is why we say "Otherwise, spill to disk". I've attached a patch to fix these comments. Feedback is very welcome. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com fix_comment.patch Description: Binary data
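A toy Python model of the control flow under discussion (not the actual reorderbuffer code) makes the comment fix concrete: streaming only ever picks a top-level transaction, and spilling is the fallback when no streamable top-level transaction exists:

```python
# Hypothetical model of the memory-limit logic: prefer streaming the largest
# *top-level* streamable transaction; otherwise fall back to spilling the
# largest transaction (top-level or subtransaction) to disk.

def largest_streamable_toptxn(txns):
    tops = [t for t in txns if t["top"] and t["streamable"] and t["size"] > 0]
    return max(tops, key=lambda t: t["size"], default=None)

def pick_victim(txns, can_stream):
    if can_stream:
        txn = largest_streamable_toptxn(txns)
        if txn is not None:
            return ("stream", txn)
    # No streamable top-level transaction: spill the largest one instead.
    victim = max(txns, key=lambda t: t["size"])
    return ("spill", victim)
```

Note that the streamed victim need not be the largest transaction overall, which is exactly why the "largest transaction (or subtransaction)" wording is misleading.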
RE: CDC/ETL system on top of logical replication with pgoutput, custom client
Hi Amit, thanks for the reply. In our worker (custom pg replication client), we care only about INSERT, UPDATE, and DELETE operations, which - sure - may be part of the issue. I can only replicate this with production-level load, so it's not easy to get a real example, but as I'm understanding the issue (and building upon your exposition), we are seeing the following:

T-1
INSERT LSN1-1000
UPDATE LSN2-2000
UPDATE LSN3-3000
COMMIT LSN4-4000

T-2
INSERT LSN1-500
UPDATE LSN2-1500
UPDATE LSN3-2500
COMMIT LSN4-5500

If we miss LSN3-3000 - let's say due to a bad network - and we already received all other LSNs, we will commit LSN4-5500 to Postgres before restarting, and LSN3-3000 will never be reattempted. And there are a couple of issues with this scenario: 1. We have no way to match LSN operations with the respective commit, as they have unordered offsets. Assuming that all of them were received in order, we would commit all data with the commit message LSN4-4000, as the other events would match that transaction's start and end LSN interval. 2. We still have no way to verify that we got all data for a given transaction; we would never notice that LSN3-3000 of the first transaction is missing until we look at and analyze the resulting data. So the question: how can we prevent our worker from committing LSN4-5500 without receiving LSN3-3000? Do we even have enough information out of pgoutput to do that? PS.: when I say bad network, my suspicion is that this situation may be caused by network saturation during high-QPS periods. Data will still arrive eventually, but by that time our worker is no longer listening. Thanks again. Regards, José Neves

From: Amit Kapila
Sent: 31 July 2023 14:31
To: José Neves
Cc: pgsql-hack...@postgresql.org
Subject: Re: CDC/ETL system on top of logical replication with pgoutput, custom client

On Mon, Jul 31, 2023 at 3:06 PM José Neves wrote: > > Hi there, hope to find you well. 
> > I'm attempting to develop a CDC on top of Postgres, currently using 12, the > last minor, with a custom client, and I'm running into issues with data loss > caused by out-of-order logical replication messages. > > The problem is as follows: postgres streams A, B, D, G, K, I, P logical > replication events, upon exit signal we stop consuming new events at LSN K, > and we wait 30s for out-of-order events. Let's say that we only got A, (and K > ofc) so in the following 30s, we get B, D, however, for whatever reason, G > never arrived. As with pgoutput-based logical replication we have no way to > calculate the next LSN, we have no idea that G was missing, so we assumed > that it all arrived, committing K to postgres slot and shutdown. In the next > run, our worker will start receiving data from K forward, and G is lost > forever... > Meanwhile postgres moves forward with archiving and we can't go back to check > if we lost anything. And even if we could, would be extremely inefficient. > > In sum, the issue comes from the fact that postgres will stream events with > unordered LSNs on high transactional systems, and that pgoutput doesn't have > access to enough information to calculate the next or last LSN, so we have no > way to check if we receive all the data that we are supposed to receive, > risking committing an offset that we shouldn't as we didn't receive yet > preceding data. > As per my understanding, we stream the data in the commit LSN order and for a particular transaction, all the changes are per their LSN order. Now, it is possible that for a parallel transaction, we send some changes from a prior LSN after sending the commit of another transaction. Say we have changes as follows: T-1 change1 LSN1-1000 change2 LSN2- 2000 commit LSN3- 3000 T-2 change1 LSN1-500 change2 LSN2-1500 commit LSN3-4000 In such a case, all the changes including the commit of T-1 are sent and then all the changes including the commit of T-2 are sent. 
So, one can say that some of the changes from T-2 from prior LSN arrived after T-1's commit but that shouldn't be a problem because if restart happens after we received partial T-2, we should receive the entire T-2. It is possible that you are seeing something else but if so then please try to share a more concrete example. -- With Regards, Amit Kapila.
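Since pgoutput delivers each transaction's changes grouped between Begin and Commit messages, in commit order, a consumer can sidestep the problem by acknowledging only at commit boundaries: an interrupted transaction is simply re-delivered in full after a restart. A hypothetical Python sketch of such a consumer (illustrative only, not a real replication client):

```python
# Toy consumer: buffer changes per transaction between Begin and Commit,
# and only advance the confirmed LSN to a commit's end LSN once the whole
# transaction has been applied. Nothing is acknowledged mid-transaction.

class TxnConsumer:
    def __init__(self):
        self.open_txns = {}      # xid -> buffered (lsn, change) pairs
        self.confirmed_lsn = 0   # only moves at commit boundaries

    def on_begin(self, xid):
        self.open_txns[xid] = []

    def on_change(self, xid, lsn, change):
        self.open_txns[xid].append((lsn, change))

    def on_commit(self, xid, end_lsn):
        changes = self.open_txns.pop(xid)
        # apply(changes) would go here; only then acknowledge.
        self.confirmed_lsn = max(self.confirmed_lsn, end_lsn)
        return changes

c = TxnConsumer()
c.on_begin(1)
c.on_change(1, 1000, "insert")
c.on_change(1, 3000, "update")
c.on_commit(1, 4000)   # confirmed LSN advances only here
```

With this shape, missing one change means the commit is never processed, so the slot position never skips past the gap.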
Re: CDC/ETL system on top of logical replication with pgoutput, custom client
On Sat, Jul 29, 2023, at 8:07 PM, José Neves wrote: > I'm attempting to develop a CDC on top of Postgres, currently using 12, the > last minor, with a custom client, and I'm running into issues with data loss > caused by out-of-order logical replication messages. Can you provide a test case to show this issue? Did you try in a newer version? > The problem is as follows: postgres streams *A, B, D, G, K, I, P *logical > replication events, upon exit signal we stop consuming new events at LSN *K,* > and we wait 30s for out-of-order events. Let's say that we only got *A*, (and > *K* ofc) so in the following 30s, we get *B, D*, however, for whatever > reason, *G* never arrived. As with pgoutput-based logical replication we have > no way to calculate the next LSN, we have no idea that *G* was missing, so we > assumed that it all arrived, committing *K *to postgres slot and shutdown. In > the next run, our worker will start receiving data from *K* forward, and *G* > is lost forever... > Meanwhile postgres moves forward with archiving and we can't go back to check > if we lost anything. And even if we could, would be extremely inefficient. Logical decoding provides the changes to output plugin at commit time. You mentioned the logical replication events but didn't say which are part of the same transaction. Let's say A, B, D and K are changes from the same transaction and G, I and P are changes from another transaction. The first transaction will be available when it processes K. The second transaction will be provided when the logical decoding processes P. You didn't say how your consumer is working. Are you sure your consumer doesn't get the second transaction? If your consumer is advancing the replication slot *after* receiving K (using pg_replication_slot_advance), it is doing it wrong. Another common problem with consumer is that it uses pg_logical_slot_get_changes() but *before* using the data it crashes; in this case, the data is lost. 
It is hard to say where the problem is if you didn't provide enough information about the consumer logic and the WAL information (pg_waldump output) around the time you detect the data loss. > In sum, the issue comes from the fact that postgres will stream events with > unordered LSNs on high transactional systems, and that pgoutput doesn't have > access to enough information to calculate the next or last LSN, so we have no > way to check if we receive all the data that we are supposed to receive, > risking committing an offset that we shouldn't as we didn't receive yet > preceding data. > > It seems very either to me that none of the open-source CDC projects that I > looked into care about this. They always assume that the next LSN received > is... well the next one, and commit that one, so upon restart, they are > vulnerable to the same issue. So... either I'm missing something... or we > have a generalized assumption causing data loss under certain conditions all > over. Let me illustrate the current behavior. Let's say there are 3 concurrent transactions. 
Session A
=========
euler=# SELECT pg_create_logical_replication_slot('repslot1', 'wal2json');
 pg_create_logical_replication_slot
 (repslot1,0/369DF088)
(1 row)

euler=# create table foo (a int primary key);
CREATE TABLE
euler=# BEGIN;
BEGIN
euler=*# INSERT INTO foo (a) SELECT generate_series(1, 2);
INSERT 0 2

Session B
=========
euler=# BEGIN;
BEGIN
euler=*# INSERT INTO foo (a) SELECT generate_series(11, 12);
INSERT 0 2

Session C
=========
euler=# BEGIN;
BEGIN
euler=*# INSERT INTO foo (a) SELECT generate_series(21, 22);
INSERT 0 2

Session A
=========
euler=*# INSERT INTO foo (a) VALUES(3);
INSERT 0 1

Session B
=========
euler=*# INSERT INTO foo (a) VALUES(13);
INSERT 0 1

Session C
=========
euler=*# INSERT INTO foo (a) VALUES(23);
INSERT 0 1
euler=*# COMMIT;
COMMIT

Session B
=========
euler=*# COMMIT;
COMMIT

Session A
=========
euler=*# COMMIT;
COMMIT

The output is:

euler=# SELECT * FROM pg_logical_slot_peek_changes('repslot1', NULL, NULL, 'format-version', '2', 'include-types', '0');
    lsn     |  xid   | data
------------+--------+------
 0/369E4800 | 454539 | {"action":"B"}
 0/36A05088 | 454539 | {"action":"C"}
 0/36A05398 | 454542 | {"action":"B"}
 0/36A05398 | 454542 | {"action":"I","schema":"public","table":"foo","columns":[{"name":"a","value":21}]}
 0/36A05418 | 454542 | {"action":"I","schema":"public","table":"foo","columns":[{"name":"a","value":22}]}
 0/36A05658 | 454542 | {"action":"I","schema":"public","table":"foo","columns":[{"name":"a","value":23}]}
 0/36A057C0 | 454542 | {"action":"C"}
 0/36A05258 | 454541 | {"action":"B"}
 0/36A05258 | 454541 | {"action":"I","schema":"public","table":"foo","columns":[{"name":"a","value":11}]}
 0/36A052D8 | 454541 |
Re: Improve join_search_one_level readibilty (one line change)
Hi, On Wed, Jun 07, 2023 at 11:02:09AM +0800, 謝東霖 wrote: > Thank you, Julien, for letting me know that cfbot doesn't test txt files. > Much appreciated! Thanks for posting this v2! So unsurprisingly the cfbot is happy with this patch, since it doesn't change the behavior at all. I just have some nitpicking:

@@ -109,14 +109,14 @@ join_search_one_level(PlannerInfo *root, int level)
 	List	   *other_rels_list;
 	ListCell   *other_rels;

+	other_rels_list = joinrels[1];
+
 	if (level == 2)		/* consider remaining initial rels */
 	{
-		other_rels_list = joinrels[level - 1];
 		other_rels = lnext(other_rels_list, r);
 	}
 	else				/* consider all initial rels */
 	{
-		other_rels_list = joinrels[1];
 		other_rels = list_head(other_rels_list);
 	}

Since each branch only has a single instruction after the change, the curly braces aren't needed anymore. The only reason to keep them is if it helps readability (like if there's a big comment associated), but that's not the case here so it would be better to get rid of them. Apart from that, +1 from me for the patch; I think it helps focus attention on what actually matters here.
Re: CDC/ETL system on top of logical replication with pgoutput, custom client
On Mon, Jul 31, 2023 at 3:06 PM José Neves wrote: > > Hi there, hope to find you well. > > I'm attempting to develop a CDC on top of Postgres, currently using 12, the > last minor, with a custom client, and I'm running into issues with data loss > caused by out-of-order logical replication messages. > > The problem is as follows: postgres streams A, B, D, G, K, I, P logical > replication events, upon exit signal we stop consuming new events at LSN K, > and we wait 30s for out-of-order events. Let's say that we only got A, (and K > ofc) so in the following 30s, we get B, D, however, for whatever reason, G > never arrived. As with pgoutput-based logical replication we have no way to > calculate the next LSN, we have no idea that G was missing, so we assumed > that it all arrived, committing K to postgres slot and shutdown. In the next > run, our worker will start receiving data from K forward, and G is lost > forever... > Meanwhile postgres moves forward with archiving and we can't go back to check > if we lost anything. And even if we could, would be extremely inefficient. > > In sum, the issue comes from the fact that postgres will stream events with > unordered LSNs on high transactional systems, and that pgoutput doesn't have > access to enough information to calculate the next or last LSN, so we have no > way to check if we receive all the data that we are supposed to receive, > risking committing an offset that we shouldn't as we didn't receive yet > preceding data. > As per my understanding, we stream the data in the commit LSN order and for a particular transaction, all the changes are per their LSN order. Now, it is possible that for a parallel transaction, we send some changes from a prior LSN after sending the commit of another transaction. 
Say we have changes as follows: T-1 change1 LSN1-1000 change2 LSN2- 2000 commit LSN3- 3000 T-2 change1 LSN1-500 change2 LSN2-1500 commit LSN3-4000 In such a case, all the changes including the commit of T-1 are sent and then all the changes including the commit of T-2 are sent. So, one can say that some of the changes from T-2 from prior LSN arrived after T-1's commit but that shouldn't be a problem because if restart happens after we received partial T-2, we should receive the entire T-2. It is possible that you are seeing something else but if so then please try to share a more concrete example. -- With Regards, Amit Kapila.
Re: Adding a LogicalRepWorker type field
On Mon, Jul 31, 2023 at 3:25 PM Bharath Rupireddy wrote: > > On Mon, Jul 31, 2023 at 7:17 AM Peter Smith wrote: > > > > PROBLEM: > > > > IMO, deducing the worker's type by examining multiple different field > > values seems a dubious way to do it. This maybe was reasonable enough > > when there were only 2 types, but as more get added it becomes > > increasingly complicated. > > +1 for being more explicit in worker types. > +1. BTW, do we need the below functions (am_tablesync_worker(), am_leader_apply_worker()) after this work?

 static inline bool
 am_tablesync_worker(void)
 {
-	return OidIsValid(MyLogicalRepWorker->relid);
+	return isTablesyncWorker(MyLogicalRepWorker);
 }

 static inline bool
 am_leader_apply_worker(void)
 {
-	return (!am_tablesync_worker() &&
-			!isParallelApplyWorker(MyLogicalRepWorker));
+	return isLeaderApplyWorker(MyLogicalRepWorker);
 }

-- With Regards, Amit Kapila.
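The shape of the refactoring - one explicit type field instead of deducing the type from several other fields - can be illustrated with a small Python analogue (hypothetical names, not the C patch itself):

```python
# Toy analogue of the worker-type refactoring: store an explicit worker type
# so checks like am_tablesync_worker() become a single comparison, instead of
# inferring the type from relid / parallel-apply fields.
from enum import Enum, auto

class WorkerType(Enum):
    TABLESYNC = auto()
    LEADER_APPLY = auto()
    PARALLEL_APPLY = auto()

class LogicalRepWorker:
    def __init__(self, wtype):
        self.type = wtype

def am_tablesync_worker(w):
    return w.type is WorkerType.TABLESYNC

def am_leader_apply_worker(w):
    # Previously: "not tablesync and not parallel apply"; now one check.
    return w.type is WorkerType.LEADER_APPLY
```

Adding a fourth worker type then means adding an enum member, not revisiting every deduction site.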
Re: [BUG] Fix DETACH with FK pointing to a partitioned table fails
On 2023-Jul-05, Jehan-Guillaume de Rorthais wrote: > ALTER TABLE r ATTACH PARTITION r_1 FOR VALUES IN (1); > > The old sub-FKs (below 18289) created in this table to enforce the action > triggers on referenced partitions are not deleted when the table becomes a > partition. Because of this, we have additional and useless triggers on the > referenced partitions and we can not DETACH this partition on the referencing > side anymore: Oh, hm, interesting. Thanks for the report and patch. I found a couple of minor issues with it (most serious one: nkeys should be 3, not 2; also sysscan should use conrelid index), but I'll try and complete it so that it's ready for 2023-08-10's releases. Regards -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
Re: New PostgreSQL Contributors
On Fri, 28 Jul 2023 at 17:29, Christoph Berg wrote: > > The PostgreSQL contributors team has been looking over the community > activity and, over the first half of this year, has been recognizing > new contributors to be listed on > > https://www.postgresql.org/community/contributors/ > > New PostgreSQL Contributors: > > Ajin Cherian > Alexander Kukushkin > Alexander Lakhin > Dmitry Dolgov > Hou Zhijie > Ilya Kosmodemiansky > Melanie Plageman > Michael Banck > Michael Brewer > Paul Jungwirth > Peter Smith > Vigneshwaran C > > New PostgreSQL Major Contributors: > > Julien Rouhaud > Stacey Haysler > Steve Singer > > Thank you all for contributing to PostgreSQL to make it such a great project! +1, thank you all, and congratulations! Kind regards, Matthias van de Meent Neon (https://neon.tech)
Re: stats test intermittent failure
Hi, On Tue, Jul 11, 2023 at 3:35 AM Melanie Plageman wrote: > > Hi, > > Jeff pointed out that one of the pg_stat_io tests has failed a few times > over the past months (here on morepork [1] and more recently here on > francolin [2]). > > Failing test diff for those who prefer not to scroll: >
> +++ /home/bf/bf-build/francolin/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/results/stats.out 2023-07-07 18:48:25.976313231 +
> @@ -1415,7 +1415,7 @@
>  :io_sum_vac_strategy_after_reuses > :io_sum_vac_strategy_before_reuses;
>  ?column? | ?column?
> ----------+----------
> - t        | t
> + t        | f
>
> My theory about the test failure is that, when there is enough demand > for shared buffers, the flapping test fails because it expects buffer > access strategy *reuses* and concurrent queries already flushed those > buffers before they could be reused. Attached is a patch which I think > will fix the test while keeping some code coverage. If we count > evictions and reuses together, those should have increased. >

Yeah, I've not reproduced this issue but it's possible. IIUC if we get the buffer from the ring, we count an I/O as "reuse" even if the buffer has already been flushed/replaced. However, if the buffer in the ring is pinned by other backends, we end up evicting a buffer from outside of the ring and adding it to the ring, which is counted as "eviction". Regarding the patch, I have a comment:

-- Test that reuse of strategy buffers and reads of blocks into these reused
--- buffers while VACUUMing are tracked in pg_stat_io.
+-- buffers while VACUUMing are tracked in pg_stat_io. If there is sufficient
+-- demand for shared buffers from concurrent queries, some blocks may be
+-- evicted from the strategy ring before they can be reused. In such cases
+-- this, the backend will evict a block from a shared buffer outside of the
+-- ring and add it to the ring. This is considered an eviction and not a reuse.

The new comment seems not to be accurate if my understanding is correct. 
How about the following? Test that reuse of strategy buffers and reads of blocks into these reused buffers while VACUUMing are tracked in pg_stat_io. If there is sufficient demand for shared buffers from concurrent queries, some buffers may be pinned by other backends before they can be reused. In such cases, the backend will evict a buffer from a shared buffer outside of the ring and add it to the ring. This is considered an eviction and not a reuse. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
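A toy model of the accounting being described (not the real bufmgr code): whether an I/O counts as a "reuse" or an "eviction" depends on whether the ring buffer could be reclaimed, which is why the more robust test checks that reuses plus evictions increased rather than reuses alone:

```python
# Hypothetical sketch: if the ring slot's buffer is unpinned it is reused
# (counted as a "reuse"); if another backend has it pinned, a buffer from
# outside the ring is evicted and added to the ring instead (counted as an
# "eviction").

def get_strategy_buffer(ring_buffer, stats):
    if ring_buffer["pins"] == 0:
        stats["reuses"] += 1
        return ring_buffer
    stats["evictions"] += 1
    return {"pins": 0}   # stand-in for a buffer evicted from outside the ring

stats = {"reuses": 0, "evictions": 0}
get_strategy_buffer({"pins": 0}, stats)   # unpinned: reuse
get_strategy_buffer({"pins": 1}, stats)   # pinned elsewhere: eviction
```

Either branch increments one of the two counters, so their sum always grows under strategy I/O even when concurrent activity steals the reuses.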
Re: Avoid undefined behavior with msvc compiler (src/include/port/pg_bitutils.h)
On Mon, Jul 31, 2023 at 5:57 PM Tom Lane wrote: > > John Naylor writes: > > Works for me, so done that way for both forward and reverse variants. Since > > the return value is no longer checked in any builds, I thought about > > removing the variable containing it, but it seems best to leave it behind > > for clarity since these are not our functions. > > Hmm, aren't you risking "variable is set but not used" warnings? > Personally I'd have made these like > > (void) _BitScanReverse(, word); I'd reasoned that such a warning would have showed up in non-assert builds already, but I neglected to consider that those could have different warning settings. For the time being drongo shows green, at least, but we'll see. -- John Naylor EDB: http://www.enterprisedb.com
Re: logical decoding and replication of sequences, take 2
On 7/31/23 11:25, Amit Kapila wrote: > On Sat, Jul 29, 2023 at 5:53 PM Tomas Vondra > wrote: >> >> On 7/28/23 14:44, Ashutosh Bapat wrote: >>> On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra >>> wrote: Anyway, I was thinking about this a bit more, and it seems it's not as difficult to use the page LSN to ensure sequences don't go backwards. The 0005 change does that, by: 1) adding pg_sequence_state, that returns both the sequence state and the page LSN 2) copy_sequence returns the page LSN 3) tablesync then sets this LSN as origin_startpos (which for tables is just the LSN of the replication slot) AFAICS this makes it work - we start decoding at the page LSN, so that we skip the increments that could lead to the sequence going backwards. >>> >>> I like this design very much. It makes things simpler than complex. >>> Thanks for doing this. >>> >> >> I agree it seems simpler. It'd be good to try testing / reviewing it a >> bit more, so that it doesn't misbehave in some way. >> > > Yeah, I also think this needs a review. This is a sort of new concept > where we don't use the LSN of the slot (for cases where copy returned > a larger value of LSN) or a full_snapshot created corresponding to the > sync slot by Walsender. For the case of the table, we build a full > snapshot because we use that for copying the table but why do we need > to build that for copying the sequence especially when we directly > copy it from the sequence relation without caring for any snapshot? > We need the slot to decode/apply changes during catchup. The main subscription may get ahead, and we need to ensure the WAL is not discarded or something like that. This applies even if the initial sync step does not use the slot/snapshot directly. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
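A rough sketch of the design being reviewed (hypothetical function names, not the patch's actual code): the copy step reports the page LSN of the copied sequence state, and tablesync uses that LSN as origin_startpos so decoding skips increments already reflected in the copy:

```python
# Toy model: starting decoding at the copied page's LSN means increments
# that happened at or before the copy are skipped, so the synced sequence
# can never move backwards.

def copy_sequence(seq):
    """Copy the current state and report the page LSN it was read at."""
    return dict(seq["state"]), seq["page_lsn"]

def start_tablesync(seq):
    state, page_lsn = copy_sequence(seq)
    origin_startpos = page_lsn      # decode only changes after the copy
    return state, origin_startpos

def should_apply(change_lsn, origin_startpos):
    return change_lsn > origin_startpos
```

This mirrors how table sync uses the slot's LSN as its start position, just with the sequence page's LSN standing in for it.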
Re: Avoid undefined behavior with msvc compiler (src/include/port/pg_bitutils.h)
John Naylor writes: > On Sun, Jul 30, 2023 at 9:45 PM Tom Lane wrote: >> That's basically equivalent to the existing Assert(non_zero). >> I think it'd be okay to drop that one and instead have >> the same Assert condition as other platforms, but having both >> would be redundant. > Works for me, so done that way for both forward and reverse variants. Since > the return value is no longer checked in any builds, I thought about > removing the variable containing it, but it seems best to leave it behind > for clarity since these are not our functions. Hmm, aren't you risking "variable is set but not used" warnings? Personally I'd have made these like (void) _BitScanReverse(, word); Anybody trying to understand this code is going to have to look up the man page for _BitScanReverse anyway, so I'm not sure that keeping the variable adds much readability. However, if no version of MSVC actually issues such a warning, it's moot. regards, tom lane
Re: Support to define custom wait events for extensions
On Mon, Jul 31, 2023 at 3:54 PM Michael Paquier wrote:
>
> On Mon, Jul 31, 2023 at 05:10:21PM +0900, Kyotaro Horiguchi wrote:
> > +/*
> > + * Return the name of a wait event ID for an extension.
> > + */
> > +static const char *
> > +GetWaitEventExtensionIdentifier(uint16 eventId)
> >
> > This looks inconsistent. Shouldn't it be GetWaitEventExtensionName()?
>
> This is an inspiration from GetLWLockIdentifier(), which is kind of OK
> by me. If there is a consensus in changing that, fine by me, of
> course.

+1 to GetWaitEventExtensionIdentifier for consistency with LWLock's
counterpart.

-- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Support to define custom wait events for extensions
On Mon, Jul 31, 2023 at 05:10:21PM +0900, Kyotaro Horiguchi wrote:
> +/*
> + * Return the name of a wait event ID for an extension.
> + */
> +static const char *
> +GetWaitEventExtensionIdentifier(uint16 eventId)
>
> This looks inconsistent. Shouldn't it be GetWaitEventExtensionName()?

This is an inspiration from GetLWLockIdentifier(), which is kind of OK
by me. If there is a consensus in changing that, fine by me, of course.

-- Michael signature.asc Description: PGP signature
Re: Support to define custom wait events for extensions
On Mon, Jul 31, 2023 at 01:37:49PM +0530, Bharath Rupireddy wrote: > Do you think it's worth adding a note here in the docs about an > external module defining more than one custom wait event? A pseudo > code if possible or just a note? Also, how about a XXX comment atop > WaitEventExtensionNew and/or WaitEventExtensionRegisterName on the > possibility of extending the functions to support allocation of more > than one custom wait events? I am not sure that any of that is necessary. Anyway, I have applied v11 to get the basics done. Now, I agree that a WaitEventExtensionMultiple() may come in handy, particularly for postgres_fdw that uses WAIT_EVENT_EXTENSION three times. -- Michael signature.asc Description: PGP signature
Missing comments/docs about custom scan path
Hi, While working on [1], I noticed $SUBJECT: commit e7cb7ee14 failed to update comments for the CustomPath struct in pathnodes.h, and commit f49842d1e failed to update docs about custom scan path callbacks in custom-scan.sgml, IIUC. Attached are patches for updating these, which I created separately for ease of review (patch update-custom-scan-path-comments.patch for the former and patch update-custom-scan-path-docs.patch for the latter). In the second patch I used almost the same text as for the ReparameterizeForeignPathByChild callback function in fdwhandler.sgml. Best regards, Etsuro Fujita [1] https://www.postgresql.org/message-id/CAPmGK16ah9JtyVPtdqu6d%3DQGkRX%3DRAzoYQfX7%3DLZ%2BKnqwBfftg%40mail.gmail.com update-custom-scan-path-comments.patch Description: Binary data update-custom-scan-path-docs.patch Description: Binary data
Re: Adding a LogicalRepWorker type field
On Mon, Jul 31, 2023 at 7:17 AM Peter Smith wrote:
>
> PROBLEM:
>
> IMO, deducing the worker's type by examining multiple different field
> values seems a dubious way to do it. This maybe was reasonable enough
> when there were only 2 types, but as more get added it becomes
> increasingly complicated.

+1 for being more explicit in worker types. I also think that we must
move away from am_{parallel_apply, tablesync, leader_apply}_worker(void)
to Is{ParallelApply, TableSync, LeaderApply}Worker(MyLogicalRepWorker),
just to be a little more consistent and to reduce confusion around the
different logical replication worker type related functions.

> AFAIK none of the complications is necessary anyway; the worker type
> is already known at launch time, and it never changes during the life
> of the process.
>
> The attached Patch 0001 introduces a new enum 'type' field, which is
> assigned during the launch.
>
> This change not only simplifies the code, but it also permits other
> code optimizations, because now we can switch on the worker enum type,
> instead of using cascading if/else. (see Patch 0002).

Some comments:

1.
+/* Different kinds of workers */
+typedef enum LogicalRepWorkerType
+{
+    LR_WORKER_TABLESYNC,
+    LR_WORKER_APPLY,
+    LR_WORKER_APPLY_PARALLEL
+} LogicalRepWorkerType;

Can these names be more readable? How about something like
LR_TABLESYNC_WORKER, LR_APPLY_WORKER, LR_PARALLEL_APPLY_WORKER?

2.
-#define isParallelApplyWorker(worker) ((worker)->leader_pid != InvalidPid)
+#define isLeaderApplyWorker(worker) ((worker)->type == LR_WORKER_APPLY)
+#define isParallelApplyWorker(worker) ((worker)->type == LR_WORKER_APPLY_PARALLEL)
+#define isTablesyncWorker(worker) ((worker)->type == LR_WORKER_TABLESYNC)

Can the above start with "Is" instead of "is", similar to
IsLogicalWorker and IsLogicalParallelApplyWorker?

3.
+    worker->type =
+        OidIsValid(relid) ? LR_WORKER_TABLESYNC :
+        is_parallel_apply_worker ? LR_WORKER_APPLY_PARALLEL :
+        LR_WORKER_APPLY;

Perhaps, an if-else is better for readability?

if (OidIsValid(relid))
    worker->type = LR_WORKER_TABLESYNC;
else if (is_parallel_apply_worker)
    worker->type = LR_WORKER_APPLY_PARALLEL;
else
    worker->type = LR_WORKER_APPLY;

4.
+/* Different kinds of workers */
+typedef enum LogicalRepWorkerType
+{
+    LR_WORKER_TABLESYNC,
+    LR_WORKER_APPLY,
+    LR_WORKER_APPLY_PARALLEL
+} LogicalRepWorkerType;

Have a LR_WORKER_UNKNOWN = 0 and set it in logicalrep_worker_cleanup()?

5.
+    case LR_WORKER_APPLY:
+        return (rel->state == SUBREL_STATE_READY ||
+                (rel->state == SUBREL_STATE_SYNCDONE &&
+                 rel->statelsn <= remote_final_lsn));

Not necessary, but a good idea to have a default: clause and error out
saying wrong logical replication worker type?

6.
+    case LR_WORKER_APPLY_PARALLEL:
+        /*
+         * Skip for parallel apply workers because they only operate on tables
+         * that are in a READY state. See pa_can_start() and
+         * should_apply_changes_for_rel().
+         */
+        break;

I'd rather keep this if-else as-is instead of a switch case with
nothing for the parallel apply worker.

-- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Performance degradation on concurrent COPY into a single relation in PG16.
On Thu, Jul 27, 2023 at 7:17 AM David Rowley wrote:
>
> It would be really good if someone with another newish Intel CPU
> could test this too.

I ran the lotsaints test from the last email on an i7-10750H (~3 years
old) and got these results (gcc 13.1, turbo off):

REL_15_STABLE:
latency average = 956.453 ms
latency stddev = 4.854 ms

REL_16_STABLE @ 695f5deb7902 (28-JUL-2023):
latency average = 999.354 ms
latency stddev = 3.611 ms

master @ 39055cb4cc (31-JUL-2023):
latency average = 995.310 ms
latency stddev = 5.176 ms

master + revert c1308ce2d (the replace-palloc0 fix):
latency average = 1080.909 ms
latency stddev = 8.707 ms

master + pg_strtoint_fastpath1.patch:
latency average = 938.146 ms
latency stddev = 9.354 ms

master + pg_strtoint_fastpath2.patch:
latency average = 902.808 ms
latency stddev = 3.957 ms

For me, PG16 seems to regress from PG15, and the second patch seems
faster than the first.

-- John Naylor EDB: http://www.enterprisedb.com
Re: pg_rewind fails with in-place tablespace
Sorry for the delay in responding to this matter, as I have been waiting
for another similar subject to be approved by a moderator.

Upon review, I am satisfied with the proposed solution and believe that
checking for an absolute path is better than hard-coding "pg_tblspc/". I
think we have successfully resolved this issue in the pg_rewind case.

However, I would like to bring your attention to another issue:
pg_upgrade fails with in-place tablespace. That other thread is still
waiting to be approved. I have tested all the tools in src/bin with
in-place tablespaces, and I believe this is the final issue.

Thank you for your understanding and assistance.

Best regards,
Rui Zhao

--
From: Michael Paquier
Sent: Monday, July 31, 2023, 06:49
To: 赵锐(惜元)
Cc: pgsql-hackers ; Thomas Munro
Subject: Re: pg_rewind fails with in-place tablespace

On Fri, Jul 28, 2023 at 04:54:56PM +0900, Michael Paquier wrote:
> I am finishing with the attached. Thoughts?

Applied this one as bf22792 on HEAD, without a backpatch as in-place
tablespaces are around for developers. If there are opinions in favor
of a backpatch, feel free of course.

-- Michael
CDC/ETL system on top of logical replication with pgoutput, custom client
Hi there, I hope this finds you well.

I'm attempting to develop a CDC on top of Postgres, currently using 12
(the latest minor release), with a custom client, and I'm running into
issues with data loss caused by out-of-order logical replication
messages.

The problem is as follows: Postgres streams logical replication events
A, B, D, G, K, I, P. Upon an exit signal, we stop consuming new events
at LSN K, and we wait 30s for out-of-order events. Let's say that we
only got A (and K, of course), so in the following 30s we get B and D;
however, for whatever reason, G never arrived. As with pgoutput-based
logical replication we have no way to calculate the next LSN, we have no
idea that G was missing, so we assumed that it all arrived, committing K
to the Postgres slot and shutting down. In the next run, our worker will
start receiving data from K forward, and G is lost forever... Meanwhile
Postgres moves forward with archiving, and we can't go back to check if
we lost anything. And even if we could, it would be extremely
inefficient.

In sum, the issue comes from the fact that Postgres will stream events
with unordered LSNs on highly transactional systems, and that pgoutput
doesn't have access to enough information to calculate the next or last
LSN, so we have no way to check whether we received all the data that we
were supposed to receive, risking committing an offset that we
shouldn't, as we haven't yet received preceding data.

It seems very odd to me that none of the open-source CDC projects that I
looked into care about this. They always assume that the next LSN
received is... well, the next one, and commit that one, so upon restart
they are vulnerable to the same issue. So... either I'm missing
something, or we have a generalized assumption causing data loss under
certain conditions all over.

Am I missing any Postgres mechanism that will allow me to at least
detect that I'm missing data?

Thanks in advance for any clues on how to deal with this. It has been
driving me nuts.

Regards, José Neves
pg_upgrade fails with in-place tablespace
Hello postgres hackers, Recently I encountered an issue: pg_upgrade fails when dealing with in-place tablespaces. As we know, pg_upgrade uses pg_dumpall to dump objects and pg_restore to restore them. The problem seems to be that pg_dumpall dumps an in-place tablespace as a relative path, which can't be restored. Here is the error message from pg_upgrade: psql:/home/postgres/postgresql/src/bin/pg_upgrade/tmp_check/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20230729T210058.329/dump/pg_upgrade_dump_globals.sql:36: ERROR: tablespace location must be an absolute path To help reproduce the failure, I have attached a TAP test. The test also fails with the tablespace regression, and it changes the default value of allow_in_place_tablespaces to true only for debugging, so it may not be fit for production. However, it is enough to reproduce this failure. I have also identified a solution for this problem, which I have included in the patch. The solution has two modifications: 1) Make the function pg_tablespace_location return the path "" for an in-place tablespace, rather than a relative path, because the path of an in-place tablespace is always 'pg_tblspc/'. 2) Only check tablespaces with an absolute path in pg_upgrade. There are also other solutions, such as supporting the creation of relative-path tablespaces in the function CreateTableSpace. But do we really need relative-path tablespaces? I think in-place tablespaces are enough for now. My solution may be more lightweight and harmless. Thank you for your attention to this matter. Best regards, Rui Zhao 0001-Fix-pg_upgrade-fails-with-in-place-tablespace.patch Description: Binary data
Re: logical decoding and replication of sequences, take 2
On Sat, Jul 29, 2023 at 5:53 PM Tomas Vondra wrote: > > On 7/28/23 14:44, Ashutosh Bapat wrote: > > On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra > > wrote: > >> > >> Anyway, I was thinking about this a bit more, and it seems it's not as > >> difficult to use the page LSN to ensure sequences don't go backwards. > >> The 0005 change does that, by: > >> > >> 1) adding pg_sequence_state, that returns both the sequence state and > >>the page LSN > >> > >> 2) copy_sequence returns the page LSN > >> > >> 3) tablesync then sets this LSN as origin_startpos (which for tables is > >>just the LSN of the replication slot) > >> > >> AFAICS this makes it work - we start decoding at the page LSN, so that > >> we skip the increments that could lead to the sequence going backwards. > >> > > > > I like this design very much. It makes things simpler than complex. > > Thanks for doing this. > > > > I agree it seems simpler. It'd be good to try testing / reviewing it a > bit more, so that it doesn't misbehave in some way. > Yeah, I also think this needs a review. This is a sort of new concept where we don't use the LSN of the slot (for cases where copy returned a larger value of LSN) or a full_snapshot created corresponding to the sync slot by Walsender. For the case of the table, we build a full snapshot because we use that for copying the table but why do we need to build that for copying the sequence especially when we directly copy it from the sequence relation without caring for any snapshot? -- With Regards, Amit Kapila.
Re: postgres_fdw: wrong results with self join + enable_nestloop off
On Fri, Jul 28, 2023 at 4:56 PM Etsuro Fujita wrote:
> Cool! I pushed the first patch after polishing it a little bit, so
> here is a rebased version of the second patch, in which I modified the
> ForeignPath and CustomPath cases in reparameterize_path_by_child() to
> reflect the new members fdw_restrictinfo and custom_restrictinfo, for
> safety, and tweaked a comment a bit.

Hmm, it seems that ForeignPath for a foreign join does not support
parameterized paths for now, as in postgresGetForeignJoinPaths() we have
this check:

    /*
     * This code does not work for joins with lateral references, since those
     * must have parameterized paths, which we don't generate yet.
     */
    if (!bms_is_empty(joinrel->lateral_relids))
        return;

And in create_foreign_join_path() we just set the path.param_info to NULL.

    pathnode->path.param_info = NULL;   /* XXX see above */

So I doubt that it's necessary to adjust fdw_restrictinfo in
reparameterize_path_by_child, because it seems to me that
fdw_restrictinfo must be empty there. Maybe we can add an Assert there
as below:

-            ADJUST_CHILD_ATTRS(fpath->fdw_restrictinfo);
+
+            /*
+             * Parameterized foreign joins are not supported. So this
+             * ForeignPath cannot be a foreign join and fdw_restrictinfo
+             * must be empty.
+             */
+            Assert(fpath->fdw_restrictinfo == NIL);

That being said, it's also no harm to handle fdw_restrictinfo in
reparameterize_path_by_child as the patch does. So I'm OK we do that for
safety.

Thanks
Richard
Re: Support to define custom wait events for extensions
At Mon, 31 Jul 2023 16:28:16 +0900, Michael Paquier wrote in
> Attaching a v11 based on Bharath's feedback and yours, for now. I
> have also applied the addition of the two masking variables in
> wait_event.c separately with 7395a90.

+/*
+ * Return the name of a wait event ID for an extension.
+ */
+static const char *
+GetWaitEventExtensionIdentifier(uint16 eventId)

This looks inconsistent. Shouldn't it be GetWaitEventExtensionName()?

regards.

-- Kyotaro Horiguchi NTT Open Source Software Center
Re: Support to define custom wait events for extensions
On Mon, Jul 31, 2023 at 12:58 PM Michael Paquier wrote:
>
> Attaching a v11 based on Bharath's feedback and yours, for now. I
> have also applied the addition of the two masking variables in
> wait_event.c separately with 7395a90.

+uint32 WaitEventExtensionNew(void)
+
+ Next, each process needs to associate the wait event allocated previously
+ to a user-facing custom string, which is something done by calling:
+
+void WaitEventExtensionRegisterName(uint32 wait_event_info, const char *wait_event_name)
+
+ An example can be found in src/test/modules/worker_spi
+ in the PostgreSQL source tree.
+

Do you think it's worth adding a note here in the docs about an
external module defining more than one custom wait event? A pseudo
code if possible or just a note? Also, how about a XXX comment atop
WaitEventExtensionNew and/or WaitEventExtensionRegisterName on the
possibility of extending the functions to support allocation of more
than one custom wait events?

-- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Avoid undefined behavior with msvc compiler (src/include/port/pg_bitutils.h)
On Sun, Jul 30, 2023 at 9:45 PM Tom Lane wrote: > > John Naylor writes: > > It seems that we should have "Assert(word != 0);" at the top, which matches > > the other platforms anyway, so I'll add that. > > That's basically equivalent to the existing Assert(non_zero). > I think it'd be okay to drop that one and instead have > the same Assert condition as other platforms, but having both > would be redundant. Works for me, so done that way for both forward and reverse variants. Since the return value is no longer checked in any builds, I thought about removing the variable containing it, but it seems best to leave it behind for clarity since these are not our functions. -- John Naylor EDB: http://www.enterprisedb.com
Re: Support to define custom wait events for extensions
On 2023-07-31 16:28, Michael Paquier wrote:
> On Mon, Jul 31, 2023 at 03:53:27PM +0900, Masahiro Ikeda wrote:
>> /* This should only be called for user-defined wait event. */
>> if (eventId < NUM_BUILTIN_WAIT_EVENT_EXTENSION)
>>     ereport(ERROR,
>>             errcode(ERRCODE_INVALID_PARAMETER_VALUE),
>>             errmsg("invalid wait event ID %u", eventId));
>>
>> I was just wondering if it should also check the eventId
>> that has been allocated though it needs to take the spinlock
>> and GetWaitEventExtensionIdentifier() doesn't take it into account.
>
> What kind of extra check do you have in mind? Once WAIT_EVENT_ID_MASK
> is applied, we already know that we don't have something larger than
> PG_UINT16_MAX, or perhaps you want to cross-check this number with
> what nextId holds in shared memory and that we don't have a number
> between nextId and PG_UINT16_MAX? I am not sure that we need to care
> about this much in this code path, and I'd rather avoid taking the
> spinlock an extra time just for a cross-check.

OK. I assumed to check that we don't have a number between nextId and
PG_UINT16_MAX.

Regards,

-- Masahiro Ikeda NTT DATA CORPORATION
Re: Support to define custom wait events for extensions
On Mon, Jul 31, 2023 at 03:53:27PM +0900, Masahiro Ikeda wrote:
> I think the order in which they are mentioned should be matched. I mean that
> - so an LWLock or Extension wait
> + so an Extension or LWLock wait

Makes sense.

> /* This should only be called for user-defined wait event. */
> if (eventId < NUM_BUILTIN_WAIT_EVENT_EXTENSION)
>     ereport(ERROR,
>             errcode(ERRCODE_INVALID_PARAMETER_VALUE),
>             errmsg("invalid wait event ID %u", eventId));
>
> I was just wondering if it should also check the eventId
> that has been allocated though it needs to take the spinlock
> and GetWaitEventExtensionIdentifier() doesn't take it into account.

What kind of extra check do you have in mind? Once WAIT_EVENT_ID_MASK is
applied, we already know that we don't have something larger than
PG_UINT16_MAX, or perhaps you want to cross-check this number with what
nextId holds in shared memory and that we don't have a number between
nextId and PG_UINT16_MAX? I am not sure that we need to care about this
much in this code path, and I'd rather avoid taking the spinlock an
extra time just for a cross-check.

Attaching a v11 based on Bharath's feedback and yours, for now. I have
also applied the addition of the two masking variables in wait_event.c
separately with 7395a90.

-- Michael

From 23b6c4c36fbf23ca9c6fb4f42a3f943be44689a6 Mon Sep 17 00:00:00 2001
From: Michael Paquier
Date: Mon, 31 Jul 2023 16:04:20 +0900
Subject: [PATCH v11] Support custom wait events for wait event type "Extension"

Two backend routines are added to allow extensions to allocate and
define custom wait events, all of these being allocated as of type
"Extension":
* WaitEventExtensionNew(), that allocates a wait event ID computed from
  a counter in shared memory.
* WaitEventExtensionRegisterName(), to associate a custom string to the
  wait event registered.

Note that this includes an example of how to use this new facility in
worker_spi, with tests for various scenarios.
The use of these routines should be extended across more in-core
modules, but this is left as work for future patches.
---
 src/include/utils/wait_event.h                |  26 +++
 src/backend/storage/ipc/ipci.c                |   3 +
 .../activity/generate-wait_event_types.pl     |   7 +-
 src/backend/utils/activity/wait_event.c       | 178 +-
 .../utils/activity/wait_event_names.txt       |   2 +-
 .../modules/worker_spi/t/001_worker_spi.pl    |  34 +++-
 .../modules/worker_spi/worker_spi--1.0.sql    |   5 +
 src/test/modules/worker_spi/worker_spi.c      | 104 +-
 doc/src/sgml/monitoring.sgml                  |  11 +-
 doc/src/sgml/xfunc.sgml                       |  45 +
 src/tools/pgindent/typedefs.list              |   2 +
 11 files changed, 393 insertions(+), 24 deletions(-)

diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 4517425f84..aad8bc08fa 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -38,6 +38,32 @@ extern void pgstat_reset_wait_event_storage(void);
 extern PGDLLIMPORT uint32 *my_wait_event_info;

+/* --
+ * Wait Events - Extension
+ *
+ * Use this category when the server process is waiting for some condition
+ * defined by an extension module.
+ *
+ * Extensions can define their own wait events in this category. First,
+ * they should call WaitEventExtensionNew() to get one or more wait event
+ * IDs that are allocated from a shared counter. These can be used directly
+ * with pgstat_report_wait_start() or equivalent. Next, each individual
+ * process should call WaitEventExtensionRegisterName() to associate a wait
+ * event string to the number allocated previously.
+ */
+typedef enum
+{
+	WAIT_EVENT_EXTENSION = PG_WAIT_EXTENSION,
+	WAIT_EVENT_EXTENSION_FIRST_USER_DEFINED
+} WaitEventExtension;
+
+extern void WaitEventExtensionShmemInit(void);
+extern Size WaitEventExtensionShmemSize(void);
+
+extern uint32 WaitEventExtensionNew(void);
+extern void WaitEventExtensionRegisterName(uint32 wait_event_info,
+										   const char *wait_event_name);
+
 /* --
  * pgstat_report_wait_start() -
  *
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index cc387c00a1..5551afffc0 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -49,6 +49,7 @@
 #include "storage/spin.h"
 #include "utils/guc.h"
 #include "utils/snapmgr.h"
+#include "utils/wait_event.h"

 /* GUCs */
 int			shared_memory_type = DEFAULT_SHARED_MEMORY_TYPE;
@@ -142,6 +143,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, SyncScanShmemSize());
 	size = add_size(size, AsyncShmemSize());
 	size = add_size(size, StatsShmemSize());
+	size = add_size(size, WaitEventExtensionShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -301,6 +303,7 @@ CreateSharedMemoryAndSemaphores(void)
Re: Support to define custom wait events for extensions
On Mon, Jul 31, 2023 at 12:07:40PM +0530, Bharath Rupireddy wrote: > We're not giving up and returning an unknown state in the v10 patch > unlike v9, no? This comment needs to change. Right. Better to be consistent with lwlock.c here. >> No, it cannot because we need the custom wait event string to be >> loaded in the same connection as the one querying pg_stat_activity. >> A different thing that can be done here is to use background_psql() >> with a query_until(), though I am not sure that this is worth doing >> here. > > -1 to go to the background_psql and query_until route. However, > enhancing the comment might help "we need the custom wait event string > to be loaded in the same connection as .". Having said this, I > don't quite understand the point of having worker_spi_init() when we > say clearly how to use shmem hooks and custom wait events. If its > intention is to show loading of shared memory to a backend via a > function, do we really need it in worker_spi? I don't mind removing it > if it's not adding any significant value. It seems to initialize the state of the worker_spi, so attaching a function to this stuff makes sense to me, just for the sake of testing all that. -- Michael signature.asc Description: PGP signature
Re: Support to define custom wait events for extensions
On 2023-07-31 10:10, Michael Paquier wrote:
> Attached is a new version.

Thanks for all the improvements. I have some comments for v10.

(1)
-    Extensions can add LWLock types to the list shown in
-    . In some cases, the name
+    Extensions can add Extension and
+    LWLock types
+    to the list shown in and
+    . In some cases, the name
     assigned by an extension will not be available in all server processes;
-    so an LWLock wait event might be reported as
-    just extension rather than the
+    so an LWLock or Extension wait
+    event might be reported as just
+    extension rather than the
     extension-assigned name.

I think the order in which they are mentioned should be matched. I mean that
- so an LWLock or Extension wait
+ so an Extension or LWLock wait

(2)
/* This should only be called for user-defined wait event. */
if (eventId < NUM_BUILTIN_WAIT_EVENT_EXTENSION)
    ereport(ERROR,
            errcode(ERRCODE_INVALID_PARAMETER_VALUE),
            errmsg("invalid wait event ID %u", eventId));

I was just wondering if it should also check the eventId that has been
allocated, though it needs to take the spinlock and
GetWaitEventExtensionIdentifier() doesn't take it into account.

Regards,

-- Masahiro Ikeda NTT DATA CORPORATION
Re: Support to define custom wait events for extensions
On Mon, Jul 31, 2023 at 6:40 AM Michael Paquier wrote:
>
> You are right that I am making things inconsistent here. Having a
> behavior close to the existing LWLock and use "extension" when the
> event cannot be found makes the most sense. I have been a bit
> confused with the wording though of this part of the docs, though, as
> LWLocks don't directly use the custom wait event APIs.

+ * calling WaitEventExtensionRegisterName() in the current process, in
+ * which case give up and return an unknown state.

We're not giving up and returning an unknown state in the v10 patch
unlike v9, no? This comment needs to change.

> > 4.
> > + Add-ins can define custom wait events under the wait event type
> >
> > I see a few instances of Add-ins/add-in in xfunc.sgml. Isn't it better
> > to use the word extension given that glossary defines what an
> > extension is
> > https://www.postgresql.org/docs/current/glossary.html#GLOSSARY-EXTENSION?
>
> An extension is an Add-in, and may be loaded, but it is possible to
> have modules that just need to be LOAD'ed (with command line or just
> shared_preload_libraries) so an add-in may not always be an extension.
> I am not completely sure if add-ins is the best term, but it covers
> both, and that's consistent with the existing docs. Perhaps the same
> area of the docs should be refreshed, but that looks like a separate
> patch for me. For now, I'd rather use a consistent term, and this one
> sounds OK to me.
>
> [1]: https://www.postgresql.org/docs/devel/extend-extensions.html.

The "external module" seems the right wording here. Use of "add-ins" is
fine by me for this patch.

> Okay by me that it may be a better idea to use ereport(ERROR) in the
> long run, so changed this way. I have introduced a
> WAIT_EVENT_CLASS_MASK and a WAIT_EVENT_ID_MASK as we now use
> 0xFF000000 and 0x0000FFFF in three places of this file. This should
> just be a patch on its own.

Yeah, I don't mind these macros going along or before or after the
custom wait events feature.

> Yes, this was mentioned upthread. I am not completely sure yet how
> much we need to do for this interface, but surely it would be faster
> to have a Multiple() interface that returns an array made of N numbers
> requested (rather than a rank of them). For now, I'd rather just aim
> for simplicity for the basics.

+1 to be simple for now. If any such requests come in future, I'm sure
we can always get back to it.

> > 9.
> > # The expected result is a special pattern here with a newline coming from the
> > # first query where the shared memory state is set.
> > $result = $node->poll_query_until(
> >     'postgres',
> >     qq[SELECT worker_spi_init(); SELECT wait_event FROM
> >     pg_stat_activity WHERE backend_type ~ 'worker_spi';],
> >     qq[
> > worker_spi_main]);
> >
> > This test doesn't have to be that complex with the result being a
> > special pattern, SELECT worker_spi_init(); can just be within a
> > separate safe_psql.
>
> No, it cannot because we need the custom wait event string to be
> loaded in the same connection as the one querying pg_stat_activity.
> A different thing that can be done here is to use background_psql()
> with a query_until(), though I am not sure that this is worth doing
> here.

-1 to go to the background_psql and query_until route. However,
enhancing the comment might help: "we need the custom wait event string
to be loaded in the same connection as .". Having said this, I don't
quite understand the point of having worker_spi_init() when we say
clearly how to use shmem hooks and custom wait events. If its intention
is to show loading of shared memory to a backend via a function, do we
really need it in worker_spi? I don't mind removing it if it's not
adding any significant value.

-- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: add timing information to pg_upgrade
On Sun, Jul 30, 2023 at 2:44 AM Nathan Bossart wrote:
>
> On Sat, Jul 29, 2023 at 12:17:40PM +0530, Bharath Rupireddy wrote:
> > While on this, I noticed a thing unrelated to your patch that there's
> > no space between the longest status message of size 60 bytes and ok -
> > 'Checking for incompatible "aclitem" data type in user tablesok
> > 23.932 ms'. I think MESSAGE_WIDTH needs to be bumped up - 64 or more.
>
> Good catch. I think I'd actually propose just removing "in user tables" or
> the word "incompatible" from these messages to keep them succinct.

Either "Checking for \"aclitem\" data type usage" or "Checking for
\"aclitem\" data type in user tables" seems okay to me; however, I
prefer the second wording.

-- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com