Re: WAL Insertion Lock Improvements
On Thu, Jul 20, 2023 at 02:38:29PM +0530, Bharath Rupireddy wrote:
> On Fri, Jul 14, 2023 at 4:17 AM Andres Freund wrote:
>> I think this commit does too many things at once.
>
> I've split the patch into three - 1) Make insertingAt 64-bit atomic.
> 2) Have better commenting on why there's no memory barrier or spinlock
> in and around LWLockWaitForVar call sites. 3) Have a quick exit for
> LWLockUpdateVar.

FWIW, I was kind of already OK with 0001, as it shows most of the gains
observed, while 0003 had a limited impact:
https://www.postgresql.org/message-id/CALj2ACWgeOPEKVY-TEPvno%3DVatyzrb-vEEP8hN7QqrQ%3DyPRupA%40mail.gmail.com

It is kind of a no-brainer to replace the spinlocks with atomic reads
and writes there.

>> I don't think:
>>      * NB: LWLockConflictsWithVar (which is called from
>>      * LWLockWaitForVar) relies on the spinlock used above in this
>>      * function and doesn't use a memory barrier.
>>
>> helps to understand why any of this is safe to a meaningful degree.
>>
>> The existing comments obviously aren't sufficient to explain this, but
>> the reason it's, I think, safe today, is that we are only waiting for
>> insertions that started before WaitXLogInsertionsToFinish() was called.
>> The lack of memory barriers in the loop means that we might see locks
>> as "unused" that have *since* become used, which is fine, because they
>> can only be for later insertions that we wouldn't want to wait on
>> anyway.
>
> Right.

FWIW, I always have a hard time coming back to this code and seeing it
rely on undocumented assumptions, with code in lwlock.c while we need to
keep an eye on xlog.c (we take a spinlock there while the variable wait
logic relies on it for ordering @-@).  So the proposal of getting more
documentation in place via 0002 goes in the right direction.

>> This comment seems off to me.  Using atomics doesn't guarantee not
>> needing locking.  It just guarantees that we are reading a non-torn
>> value.
>
> Modified the comment.

-   /*
-    * Read value using the lwlock's wait list lock, as we can't generally
-    * rely on atomic 64 bit reads/stores. TODO: On platforms with a way to
-    * do atomic 64 bit reads/writes the spinlock should be optimized away.
-    */
-   LWLockWaitListLock(lock);
-   value = *valptr;
-   LWLockWaitListUnlock(lock);
+   /* Reading atomically avoids getting a torn value */
+   value = pg_atomic_read_u64(valptr);

Should this specify that this is specifically important for platforms
where reading a uint64 could lead to a torn value read, if you apply
this term in this context?  Sounding picky, I would make that a bit
longer, say something like this:
"Reading this value atomically is safe even on platforms where uint64
cannot be read without observing a torn value."
Only xlogprefetcher.c uses the term "torn" for a value, by the way, but
for a write.

>>> @@ -1605,9 +1602,14 @@ LWLockConflictsWithVar(LWLock *lock,
>>>   *
>>>   * Note: this function ignores shared lock holders; if the lock is held
>>>   * in shared mode, returns 'true'.
>>> + *
>>> + * Be careful that LWLockConflictsWithVar() does not include a memory barrier,
>>> + * hence the caller of this function may want to rely on an explicit barrier or
>>> + * a spinlock to avoid memory ordering issues.
>>>   */
>>
>> s/careful/aware/?
>>
>> s/spinlock/implied barrier via spinlock or lwlock/?
>
> Done.

Okay to mention a LWLock here, even if the sole caller of this routine
relies on a spinlock.

0001 looks OK-ish seen from here.  Thoughts?
--
Michael
Re: Support worker_spi to execute the function dynamically.
On Fri, Jul 21, 2023 at 8:38 AM Michael Paquier wrote:
>
> On Thu, Jul 20, 2023 at 03:44:12PM +0530, Bharath Rupireddy wrote:
> > I don't think that change is correct. The worker_spi essentially shows
> > how to start bg workers with RegisterBackgroundWorker and dynamic bg
> > workers with RegisterDynamicBackgroundWorker. If shared_preload_libraries
> > = worker_spi is not specified there, you will fail to start the
> > RegisterBackgroundWorker workers. Does giving an initdb-time database
> > name to worker_spi.database work there? If the database for the bg
> > workers doesn't exist, changing bgw_restart_time from BGW_NEVER_RESTART
> > to say 1 will help to see the bg workers coming up eventually.
>
> Yeah, it does not move the needle by much. I think that we are looking
> at switching this module to use a TAP test in the long term, instead,
> where it would be possible to test the scenarios we want to look at
> *with* and *without* shared_preload_libraries, especially with the
> custom wait events for extensions in mind if we add our tests in this
> module.

Okay. Here's a quick patch for adding TAP tests to the worker_spi
module. We can change it to taste. Thoughts?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

v1-0001-Add-TAP-tests-to-worker_spi-module.patch
Description: Binary data
Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
On Fri, Jul 21, 2023 at 7:30 AM Peter Smith wrote:
>
> ~~~
>
> 2. StartLogRepWorker
>
> /* Common function to start the leader apply or tablesync worker. */
> void
> StartLogRepWorker(int worker_slot)
> {
>     /* Attach to slot */
>     logicalrep_worker_attach(worker_slot);
>
>     /* Setup signal handling */
>     pqsignal(SIGHUP, SignalHandlerForConfigReload);
>     pqsignal(SIGTERM, die);
>     BackgroundWorkerUnblockSignals();
>
>     /*
>      * We don't currently need any ResourceOwner in a walreceiver process, but
>      * if we did, we could call CreateAuxProcessResourceOwner here.
>      */
>
>     /* Initialise stats to a sanish value */
>     MyLogicalRepWorker->last_send_time = MyLogicalRepWorker->last_recv_time =
>         MyLogicalRepWorker->reply_time = GetCurrentTimestamp();
>
>     /* Load the libpq-specific functions */
>     load_file("libpqwalreceiver", false);
>
>     InitializeLogRepWorker();
>
>     /* Connect to the origin and start the replication. */
>     elog(DEBUG1, "connecting to publisher using connection string \"%s\"",
>          MySubscription->conninfo);
>
>     /*
>      * Setup callback for syscache so that we know when something changes in
>      * the subscription relation state.
>      */
>     CacheRegisterSyscacheCallback(SUBSCRIPTIONRELMAP,
>                                   invalidate_syncing_table_states,
>                                   (Datum) 0);
> }
>
> ~
>
> 2a.
> The function name seems a bit misleading because it is not really
> "starting" anything here - it is just more "initialization" code,
> right? Nor is it common to all kinds of LogRepWorker. Maybe the
> function could be named something else like 'InitApplyOrSyncWorker()'.
> -- see also #2c
>

How about SetupLogRepWorker? The other thing I noticed is that we don't
seem to be consistent in naming functions in these files. For example,
shall we make all exposed functions follow camel case (like
InitializeLogRepWorker) and static functions follow _ style (like
run_apply_worker), or the other possibility is to use _ style for all
functions except maybe the entry functions like ApplyWorkerMain()? I
don't know if there is already a pattern, but if not then let's form it
now, so that the code looks consistent.

> ~
>
> 2b.
> Should this have Assert to ensure this is only called from leader
> apply or tablesync? -- see also #2c
>
> ~
>
> 2c.
> IMO maybe the best/tidiest way to do this is not to introduce a new
> function at all. Instead, just put all this "common init" code into
> the existing "common init" function ('InitializeLogRepWorker') and
> execute it only if (am_tablesync_worker() || am_leader_apply_worker())
> { }.
>

I don't like 2c much because it will make InitializeLogRepWorker() have
two kinds of initializations.

> ==
> src/include/replication/worker_internal.h
>
> 3.
> extern void pa_xact_finish(ParallelApplyWorkerInfo *winfo,
>                            XLogRecPtr remote_lsn);
> +extern void set_stream_options(WalRcvStreamOptions *options,
> +                               char *slotname,
> +                               XLogRecPtr *origin_startpos);
> +
> +extern void start_apply(XLogRecPtr origin_startpos);
> +extern void DisableSubscriptionAndExit(void);
> +extern void StartLogRepWorker(int worker_slot);
>
> This placement (esp. with the missing whitespace) seems to be grouping
> the set_stream_options with the other 'pa' externs, which are all
> under the comment "/* Parallel apply worker setup and interactions */".
>
> Putting all these up near the other "extern void
> InitializeLogRepWorker(void)" might be less ambiguous.
>

+1. Also, note that they should be in the same order as they are in the
.c files.

--
With Regards,
Amit Kapila.
PATCH: Add REINDEX tag to event triggers
Added my v1 patch to add REINDEX to event triggers.

I originally built this against pg15 but rebased to master for the patch
to hopefully make it easier for maintainers to merge. The rebase was
automatic, so it should be easy to include in any recent version. I'd
love for this to land in pg16 if possible.

I added regression tests and they are passing. I also formatted the code
using the project tools. Docs are included too.

A few notes:
1. I tried not to touch the function parameters too much and opted to
create a convenience function that makes it easy to attach index or
table OIDs to the current event trigger list.
2. I made sure both concurrent and regular reindex of an index, table,
and database work as expected.
3. The ddl end command will make available all indexes that were
modified by the reindex command. This is a large list when you run
"reindex database" but works really well. I debated returning records
of the table or DB that were reindexed, but that doesn't really make
sense. Returning each index fits my use case of building an audit
record around the index lifecycle (hence the motivation for the patch).

Thanks for your time,
Garrett Thornburg

v1-0001-Add-REINDEX-tag-to-event-triggers.patch
Description: Binary data
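For context on the audit use case described above, a rough sketch of what consuming a REINDEX tag from a ddl_command_end trigger might look like (hypothetical usage, not code from the patch; pg_event_trigger_ddl_commands() is the existing function for inspecting commands at ddl_command_end, and the audit table and function names here are invented):

```sql
-- Sketch: log each reindexed index into an audit table, assuming the
-- patch makes REINDEX fire ddl_command_end event triggers.
CREATE TABLE index_audit (objid oid, tag text, at timestamptz DEFAULT now());

CREATE OR REPLACE FUNCTION log_reindex() RETURNS event_trigger
LANGUAGE plpgsql AS $$
DECLARE
    r record;
BEGIN
    -- With "reindex database" this loop can visit a large number of rows.
    FOR r IN SELECT * FROM pg_event_trigger_ddl_commands()
    LOOP
        INSERT INTO index_audit (objid, tag) VALUES (r.objid, r.command_tag);
    END LOOP;
END;
$$;

CREATE EVENT TRIGGER track_reindex ON ddl_command_end
    WHEN TAG IN ('REINDEX')
    EXECUTE FUNCTION log_reindex();
```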
Re: POC: GROUP BY optimization
On 20/7/2023 18:46, Tomas Vondra wrote:
> On 7/20/23 08:37, Andrey Lepikhov wrote:
>> On 3/10/2022 21:56, Tom Lane wrote:
>>> Revert "Optimize order of GROUP BY keys".
>>>
>>> This reverts commit db0d67db2401eb6238ccc04c6407a4fd4f985832 and
>>> several follow-on fixes.
>>> ...
>>> Since we're hard up against the release deadline for v15, let's
>>> revert these changes for now.  We can always try again later.
>>
>> It may be time to restart the project. As a first step, I rebased the
>> patch on the current master. It wasn't trivial because of some recent
>> optimizations (a29eab, 1349d27 and 8d83a5d).
>> Now, let's repeat the review and rewrite the current patch according
>> to the reasons given in the revert commit.
>
> I think the fundamental task is to make the costing more reliable, and
> the commit message 443df6e2db points out a couple of challenges in this
> area. Not sure how feasible it is to address enough of them ...
>
> 1) procost = 1.0 - I guess we could make this more realistic by doing
> some microbenchmarks and tuning the costs for the most expensive cases.
>
> 2) estimating quicksort comparisons - This relies on ndistinct
> estimates, and I'm not sure how much more reliable we can make those.
> Probably not much :-( Not sure what to do about this, the only thing I
> can think of is to track "reliability" of the estimates and only do the
> reordering if we have high confidence in the estimates. That means
> we'll miss some optimization opportunities, but it should limit the
> risk.

For me personally, the most challenging issue is:
3. Imbalance induced by the cost_sort() changes. It may increase or
decrease the contribution of sorting to the total cost compared to
other factors and change the choice of sorted paths.

--
regards,
Andrey Lepikhov
Postgres Professional
Re: Buildfarm failures on chipmunk
At Fri, 21 Jul 2023 09:10:54 +0530, vignesh C wrote in
> postmaster.log shows the following error:
> 2023-07-20 08:21:00.343 GMT [27063] LOG:  unrecognized configuration
> parameter "force_parallel_mode" in file
> "/home/pgbfarm/buildroot/HEAD/pgsql.build/src/test/regress/tmp_check/data/postgresql.conf"
> line 835
> 2023-07-20 08:21:00.344 GMT [27063] FATAL:  configuration file
> "/home/pgbfarm/buildroot/HEAD/pgsql.build/src/test/regress/tmp_check/data/postgresql.conf"
> contains errors
>
> Thoughts?
>
> [1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2023-07-20%2006%3A48%3A23

> $Script_Config = {
> ...
>     'extra_config' => {
>         'DEFAULT' => [
>             ...
>         'HEAD' => [
>             'force_parallel_mode = regress'
>         ]
>     },

The BF config needs a fix for HEAD.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: Buildfarm failures on chipmunk
Hi,

> On Jul 21, 2023, at 11:40, vignesh C wrote:
>
> Hi,
>
> Recently chipmunk has failed with the following errors at [1]:
>
> make -C '../../..' DESTDIR='/home/pgbfarm/buildroot/HEAD/pgsql.build'/tmp_install install >'/home/pgbfarm/buildroot/HEAD/pgsql.build'/tmp_install/log/install.log 2>&1
> make -j1 checkprep >>'/home/pgbfarm/buildroot/HEAD/pgsql.build'/tmp_install/log/install.log 2>&1
> echo "# +++ regress check in src/test/regress +++" && PATH="/home/pgbfarm/buildroot/HEAD/pgsql.build/tmp_install/home/pgbfarm/buildroot/HEAD/inst/bin:/home/pgbfarm/buildroot/HEAD/pgsql.build/src/test/regress:$PATH" LD_LIBRARY_PATH="/home/pgbfarm/buildroot/HEAD/pgsql.build/tmp_install/home/pgbfarm/buildroot/HEAD/inst/lib" ../../../src/test/regress/pg_regress --temp-instance=./tmp_check --inputdir=. --bindir= --temp-config=/tmp/buildfarm-BewC4d/bfextra.conf --no-locale --port=5678 --dlpath=. --max-concurrent-tests=20 --port=5678 --schedule=./parallel_schedule
> # +++ regress check in src/test/regress +++
> # postmaster failed, examine "/home/pgbfarm/buildroot/HEAD/pgsql.build/src/test/regress/log/postmaster.log" for the reason
> Bail out!
> GNUmakefile:118: recipe for target 'check' failed
> make: *** [check] Error 2
> log files for step check:
>
> postmaster.log shows the following error:
> 2023-07-20 08:21:00.343 GMT [27063] LOG:  unrecognized configuration parameter "force_parallel_mode" in file "/home/pgbfarm/buildroot/HEAD/pgsql.build/src/test/regress/tmp_check/data/postgresql.conf" line 835
> 2023-07-20 08:21:00.344 GMT [27063] FATAL:  configuration file "/home/pgbfarm/buildroot/HEAD/pgsql.build/src/test/regress/tmp_check/data/postgresql.conf" contains errors
>
> [1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2023-07-20%2006%3A48%3A23
>
> Thoughts?
>
> Regards,
> Vignesh

It seems `force_parallel_mode` has been removed in 5352ca22e.
Config should use `debug_parallel_query` instead.
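Concretely, the fix is a one-line rename in the animal's extra_config (the GUC was renamed, and `regress` remains a valid setting for the new name):

```
# buildfarm extra_config for HEAD, before (now rejected at startup):
force_parallel_mode = regress

# after (the GUC's new name on PostgreSQL 16 and later):
debug_parallel_query = regress
```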
Zhang Mingli https://www.hashdata.xyz
Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Some review comments for v21-0002.

On Thu, Jul 20, 2023 at 11:41 PM Melih Mutlu wrote:
>
> Hi,
>
> Attached the updated patches with recent reviews addressed.
>
> See below for my comments:
>
> Peter Smith wrote on Wed, 19 Jul 2023, 06:08:
>>
>> 5.
>> + /* Found a table for next iteration */
>> + finish_sync_worker(true);
>> +
>> + StartTransactionCommand();
>> + ereport(LOG,
>> + (errmsg("logical replication worker for subscription \"%s\" will be reused to sync table \"%s\" with relid %u.",
>> + MySubscription->name,
>> + get_rel_name(MyLogicalRepWorker->relid),
>> + MyLogicalRepWorker->relid)));
>> + CommitTransactionCommand();
>> +
>> + done = false;
>> + break;
>> + }
>> + LWLockRelease(LogicalRepWorkerLock);
>>
>> 5b.
>> Isn't there a missing call to that LWLockRelease, if the 'break' happens?
>
> Lock is already released before break, if that's the lock you meant:
>
>> /* Update worker state for the next table */
>> MyLogicalRepWorker->relid = rstate->relid;
>> MyLogicalRepWorker->relstate = rstate->state;
>> MyLogicalRepWorker->relstate_lsn = rstate->lsn;
>> LWLockRelease(LogicalRepWorkerLock);
>>
>> /* Found a table for next iteration */
>> finish_sync_worker(true);
>> done = false;
>> break;

Sorry, I misread the code. You are right.

==
src/backend/replication/logical/tablesync.c

1.
+ if (!reuse_worker)
+ {
+ ereport(LOG,
+ (errmsg("logical replication table synchronization worker for subscription \"%s\" has finished",
+ MySubscription->name)));
+ }
+ else
+ {
+ ereport(LOG,
+ (errmsg("logical replication worker for subscription \"%s\" will be reused to sync table \"%s\" with relid %u.",
+ MySubscription->name,
+ get_rel_name(MyLogicalRepWorker->relid),
+ MyLogicalRepWorker->relid)));
+ }

1a.
We know this must be a tablesync worker, so I think that second errmsg
should also be saying "logical replication table synchronization
worker".

~

1b.
Since this is if/else anyway, is it simpler to be positive and say
"if (reuse_worker)" instead of the negative "if (!reuse_worker)"?

~~~

2. run_tablesync_worker

 {
+	MyLogicalRepWorker->relsync_completed = false;
+
+	/* Start table synchronization. */
 	start_table_sync(origin_startpos, );

This still contains the added comment that I'd previously posted I
didn't think was adding anything useful. Also, I don't think this
comment exists in the HEAD code.

==
src/backend/replication/logical/worker.c

3. LogicalRepApplyLoop

+ /*
+ * apply_dispatch() may have gone into apply_handle_commit()
+ * which can call process_syncing_tables_for_sync.
+ *
+ * process_syncing_tables_for_sync decides whether the sync of
+ * the current table is completed. If it is completed,
+ * streaming must be already ended. So, we can break the loop.
+ */
+ if (am_tablesync_worker() &&
+ MyLogicalRepWorker->relsync_completed)
+ {
+ endofstream = true;
+ break;
+ }
+

Maybe just personal taste, but IMO it is better to rearrange like below
because then there is no reason to read the long comment except for
tablesync workers.

if (am_tablesync_worker())
{
    /*
     * apply_dispatch() may have gone into apply_handle_commit()
     * which can call process_syncing_tables_for_sync.
     *
     * process_syncing_tables_for_sync decides whether the sync of
     * the current table is completed. If it is completed,
     * streaming must be already ended. So, we can break the loop.
     */
    if (MyLogicalRepWorker->relsync_completed)
    {
        endofstream = true;
        break;
    }
}

~~~

4. LogicalRepApplyLoop

+
+ /*
+ * If relsync_completed is true, this means that the tablesync
+ * worker is done with synchronization. Streaming has already been
+ * ended by process_syncing_tables_for_sync. We should move to the
+ * next table if needed, or exit.
+ */
+ if (am_tablesync_worker() &&
+ MyLogicalRepWorker->relsync_completed)
+ endofstream = true;

Ditto the same comment about rearranging the condition, as #3 above.

==
src/include/replication/worker_internal.h

5.
+ /*
+ * Indicates whether tablesync worker has completed syncing its assigned
+ * table.
+ */
+ bool relsync_completed;
+

Isn't it better to arrange this to be adjacent to the other relXXX
fields, so they all clearly belong to the "Used for initial table
synchronization." group? For example, something like:

/* Used for initial table synchronization. */
Oid         relid;
char        relstate;
XLogRecPtr  relstate_lsn;
slock_t     relmutex;
bool        relsync_completed; /* has tablesync finished syncing the
                                  assigned table? */

--
Kind Regards,
Peter Smith.
Fujitsu Australia
Buildfarm failures on chipmunk
Hi,

Recently chipmunk has failed with the following errors at [1]:

make -C '../../..' DESTDIR='/home/pgbfarm/buildroot/HEAD/pgsql.build'/tmp_install install >'/home/pgbfarm/buildroot/HEAD/pgsql.build'/tmp_install/log/install.log 2>&1
make -j1 checkprep >>'/home/pgbfarm/buildroot/HEAD/pgsql.build'/tmp_install/log/install.log 2>&1
echo "# +++ regress check in src/test/regress +++" && PATH="/home/pgbfarm/buildroot/HEAD/pgsql.build/tmp_install/home/pgbfarm/buildroot/HEAD/inst/bin:/home/pgbfarm/buildroot/HEAD/pgsql.build/src/test/regress:$PATH" LD_LIBRARY_PATH="/home/pgbfarm/buildroot/HEAD/pgsql.build/tmp_install/home/pgbfarm/buildroot/HEAD/inst/lib" ../../../src/test/regress/pg_regress --temp-instance=./tmp_check --inputdir=. --bindir= --temp-config=/tmp/buildfarm-BewC4d/bfextra.conf --no-locale --port=5678 --dlpath=. --max-concurrent-tests=20 --port=5678 --schedule=./parallel_schedule
# +++ regress check in src/test/regress +++
# postmaster failed, examine "/home/pgbfarm/buildroot/HEAD/pgsql.build/src/test/regress/log/postmaster.log" for the reason
Bail out!
GNUmakefile:118: recipe for target 'check' failed
make: *** [check] Error 2
log files for step check:

postmaster.log shows the following error:
2023-07-20 08:21:00.343 GMT [27063] LOG:  unrecognized configuration parameter "force_parallel_mode" in file "/home/pgbfarm/buildroot/HEAD/pgsql.build/src/test/regress/tmp_check/data/postgresql.conf" line 835
2023-07-20 08:21:00.344 GMT [27063] FATAL:  configuration file "/home/pgbfarm/buildroot/HEAD/pgsql.build/src/test/regress/tmp_check/data/postgresql.conf" contains errors

[1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2023-07-20%2006%3A48%3A23

Thoughts?

Regards,
Vignesh
Re: Use of additional index columns in rows filtering
On Thu, Jul 20, 2023 at 4:35 AM Tomas Vondra wrote:
> I think the SAOP patch may need to be much more careful about this, but
> for this patch it's simpler because it doesn't really change any of the
> index internals, or the indexscan in general.

It's true that the SAOP patch needs relatively complicated
infrastructure to assess whether or not the technique is safe with a
given set of quals. You cannot safely get an ordered index scan for
something like "select * from table where a < 5 and b in (1,2,3) order
by a, b". With or without my patch. My patch isn't really changing all
that much about the behavior in nbtree, as these things go. It's
*surprising* how little has to change about the high level structure of
index scans, in fact.

(Actually, I'm glossing over a lot. The MDAM paper describes techniques
that'd make even the really challenging cases safe, through a process
of converting quals from conjunctive normal form into disjunctive
normal form. This is more or less the form that the state machine
implemented by _bt_advance_array_keys() produces already, today. But
even with all this, practically all of the heavy lifting takes place
before the index scan even begins, during preprocessing -- so you still
require surprisingly few changes to index scans themselves.)

> If I simplify that a bit, we're just reordering the clauses in a way to
> maybe eliminate the heap fetch. The main risk seems to be that this
> will force an expensive qual to the front of the list, just because it
> can be evaluated on the index tuple.

My example query might have been poorly chosen, because it involved a
limit. What I'm thinking of is more general than that.

> But the difference would need to be worse than what we save by not
> doing the I/O - considering how expensive the I/O is, that seems
> unlikely. Could happen for expensive quals that don't really eliminate
> many rows, I guess.

That sounds like the same principle that I was talking about. I think
that it can be pushed quite far, though. I am mostly talking about the
worst case, and it seems like you might not be.

You can easily construct examples where some kind of skew causes big
problems with a multi-column index. I'm thinking of indexes whose
leading columns are low cardinality, and queries where including the
second column as an index qual looks kind of marginal to the optimizer.
Each grouping represented in the most significant index column might
easily have its own unique characteristics; the distribution of values
in subsequent columns might be quite inconsistent across each grouping,
in whatever way.

Since nothing stops a qual on a lower order column having a wildly
different selectivity for one particular grouping, it might not make
sense to say that a problem in this area is due to a bad selectivity
estimate. Even if we have perfect estimates, what good can they do if
the optimal strategy is to vary our strategy at runtime, *within* an
individual index scan, as different parts of the key space (different
groupings) are traversed through? To skip or not to skip, say. This
isn't about picking the cheapest plan, really.

That's another huge advantage of index quals -- they can (at least in
principle) allow you to skip over big parts of the index when it just
ends up making sense, in whatever way, for whatever reason. In the
index, and in the heap. Often both. You'd likely always prefer to err
in the direction of having more index quals rather than fewer, when
doing so doesn't substantially change the plan itself. It could be very
cheap insurance, even without any limit. (It would probably also be a
lot faster, but it needn't be.)

> Anyway, I see this as extension of what order_qual_clauses() does. The
> main issue is that even order_qual_clauses() admits the estimates are
> somewhat unreliable, so we can't expect to make perfect decisions.

The attribute value independence assumption is wishful thinking, in no
small part -- it's quite surprising that it works as well as it does,
really.

--
Peter Geoghegan
Re: Support worker_spi to execute the function dynamically.
On Thu, Jul 20, 2023 at 03:44:12PM +0530, Bharath Rupireddy wrote:
> I don't think that change is correct. The worker_spi essentially shows
> how to start bg workers with RegisterBackgroundWorker and dynamic bg
> workers with RegisterDynamicBackgroundWorker. If shared_preload_libraries
> = worker_spi is not specified there, you will fail to start the
> RegisterBackgroundWorker workers. Does giving an initdb-time database
> name to worker_spi.database work there? If the database for the bg
> workers doesn't exist, changing bgw_restart_time from BGW_NEVER_RESTART
> to say 1 will help to see the bg workers coming up eventually.

Yeah, it does not move the needle by much. I think that we are looking
at switching this module to use a TAP test in the long term, instead,
where it would be possible to test the scenarios we want to look at
*with* and *without* shared_preload_libraries, especially with the
custom wait events for extensions in mind if we add our tests in this
module.

It does not change the fact that Ikeda-san is right about the launch of
dynamic workers with this module being broken, so I have applied v1
with the comment I have suggested. This will ease a bit the
implementation of any follow-up test scenarios, while avoiding an
incorrect pattern in this template module.

--
Michael
Re: Use COPY for populating all pgbench tables
On Thu, Jul 20, 2023 at 02:22:51PM -0500, Tristan Partin wrote:
> Thanks for your testing Michael. I went ahead and added a test to make
> sure that this behavior doesn't regress accidentally, but I am
> struggling to get the test to fail using the previous version of this
> patch. Do you have any advice? This is my first time writing a test
> for Postgres. I can recreate the issue outside of the test script, but
> not within it for whatever reason.

We're all here to learn, and writing TAP tests is important these days
for a lot of patches.

+# Check that the pgbench_branches and pgbench_tellers filler columns are filled
+# with NULLs
+foreach my $table ('pgbench_branches', 'pgbench_tellers') {
+	my ($ret, $out, $err) = $node->psql(
+		'postgres',
+		"SELECT COUNT(1) FROM $table;
+SELECT COUNT(1) FROM $table WHERE filler is NULL;",
+		extra_params => ['--no-align', '--tuples-only']);
+
+	is($ret, 0, "psql $table counts status is 0");
+	is($err, '', "psql $table counts stderr is empty");
+	if ($out =~ m/^(\d+)\n(\d+)$/g) {
+		is($1, $2, "psql $table filler column filled with NULLs");
+	} else {
+		fail("psql $table stdout m/^(\\d+)\n(\\d+)$/g");
+	}
+}

This is IMO hard to parse, and I'd rather add some checks for the
accounts and history tables as well. Let's use four simple SQL queries
like what I posted upthread (no data for history inserted after
initialization), as of the attached. I'd be tempted to apply that first
as a separate patch, because we've never added coverage for that and we
have specific expectations in the code from this filler column for
tpc-b. And this needs to cover both client-side and server-side data
generation. Note that the indentation was a bit incorrect, so fixed
while on it.

Attached is a v7, with these tests (should be a patch on its own but
I'm lazy to split this morning) and some more adjustments that I have
done while going through the patch. What do you think?
--
Michael

From 211d66fc39338d522a72722fc3674e306826fd37 Mon Sep 17 00:00:00 2001
From: Michael Paquier
Date: Fri, 21 Jul 2023 11:07:31 +0900
Subject: [PATCH v7] Use COPY instead of INSERT for populating all tables

COPY is a better interface for bulk loading and/or high latency
connections. Previously COPY was only used for the pgbench_accounts
table because the amount of data was so much larger than the other
tables. However, as already said there are also gains to be had in the
high latency case, such as loading data across continents.
---
 src/bin/pgbench/pgbench.c                    | 155 +++
 src/bin/pgbench/t/001_pgbench_with_server.pl |  35 +
 doc/src/sgml/ref/pgbench.sgml                |   9 +-
 3 files changed, 133 insertions(+), 66 deletions(-)

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 087baa7d57..539c2795e2 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -835,6 +835,8 @@ static void add_socket_to_set(socket_set *sa, int fd, int idx);
 static int	wait_on_socket_set(socket_set *sa, int64 usecs);
 static bool socket_has_input(socket_set *sa, int fd, int idx);
 
+/* callback used to build rows for COPY during data loading */
+typedef void (*initRowMethod) (PQExpBufferData *sql, int64 curr);
 
 /* callback functions for our flex lexer */
 static const PsqlScanCallbacks pgbench_callbacks = {
@@ -4859,17 +4861,45 @@ initTruncateTables(PGconn *con)
 					 "pgbench_tellers");
 }
 
-/*
- * Fill the standard tables with some data generated and sent from the client
- */
 static void
-initGenerateDataClientSide(PGconn *con)
+initBranch(PQExpBufferData *sql, int64 curr)
 {
-	PQExpBufferData sql;
+	/* "filler" column uses NULL */
+	printfPQExpBuffer(sql,
+					  INT64_FORMAT "\t0\t\\N\n",
+					  curr + 1);
+}
+
+static void
+initTeller(PQExpBufferData *sql, int64 curr)
+{
+	/* "filler" column uses NULL */
+	printfPQExpBuffer(sql,
+					  INT64_FORMAT "\t" INT64_FORMAT "\t0\t\\N\n",
+					  curr + 1, curr / ntellers + 1);
+}
+
+static void
+initAccount(PQExpBufferData *sql, int64 curr)
+{
+	/* "filler" column defaults to blank padded empty string */
+	printfPQExpBuffer(sql,
+					  INT64_FORMAT "\t" INT64_FORMAT "\t0\t\n",
+					  curr + 1, curr / naccounts + 1);
+}
+
+static void
+initPopulateTable(PGconn *con, const char *table, int64 base,
+				  initRowMethod init_row)
+{
+	int			n;
+	int			k;
+	int			chars = 0;
 	PGresult   *res;
-	int			i;
-	int64		k;
-	char	   *copy_statement;
+	PQExpBufferData sql;
+	char		copy_statement[256];
+	const char *copy_statement_fmt = "copy %s from stdin";
+	int64		total = base * scale;
 
 	/* used to track elapsed time and estimate of the remaining time */
 	pg_time_usec_t start;
@@ -4878,50 +4908,24 @@ initGenerateDataClientSide(PGconn *con)
 	/* Stay on the same line if reporting to a terminal */
 	char		eol = isatty(fileno(stderr)) ? '\r' : '\n';
 
-	fprintf(stderr, "generating data (client-side)...\n");
-
-	/*
-	 * we do all of this in one transaction to
Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Some review comments for v21-0001

==
src/backend/replication/logical/worker.c

1. InitializeLogRepWorker

  if (am_tablesync_worker())
  	ereport(LOG,
- 			(errmsg("logical replication table synchronization worker for subscription \"%s\", table \"%s\" has started",
+ 			(errmsg("logical replication worker for subscription \"%s\", table \"%s\" has started",
  					MySubscription->name,
  					get_rel_name(MyLogicalRepWorker->relid))));

I think this should not be changed. IIUC that decision for using the
generic worker name for translations was only when the errmsg was in
shared code where the worker type was not clear from existing
conditions. See also previous review comments [1].

~~~

2. StartLogRepWorker

/* Common function to start the leader apply or tablesync worker. */
void
StartLogRepWorker(int worker_slot)
{
    /* Attach to slot */
    logicalrep_worker_attach(worker_slot);

    /* Setup signal handling */
    pqsignal(SIGHUP, SignalHandlerForConfigReload);
    pqsignal(SIGTERM, die);
    BackgroundWorkerUnblockSignals();

    /*
     * We don't currently need any ResourceOwner in a walreceiver process, but
     * if we did, we could call CreateAuxProcessResourceOwner here.
     */

    /* Initialise stats to a sanish value */
    MyLogicalRepWorker->last_send_time = MyLogicalRepWorker->last_recv_time =
        MyLogicalRepWorker->reply_time = GetCurrentTimestamp();

    /* Load the libpq-specific functions */
    load_file("libpqwalreceiver", false);

    InitializeLogRepWorker();

    /* Connect to the origin and start the replication. */
    elog(DEBUG1, "connecting to publisher using connection string \"%s\"",
         MySubscription->conninfo);

    /*
     * Setup callback for syscache so that we know when something changes in
     * the subscription relation state.
     */
    CacheRegisterSyscacheCallback(SUBSCRIPTIONRELMAP,
                                  invalidate_syncing_table_states,
                                  (Datum) 0);
}

~

2a.
The function name seems a bit misleading because it is not really
"starting" anything here - it is just more "initialization" code,
right? Nor is it common to all kinds of LogRepWorker. Maybe the
function could be named something else like 'InitApplyOrSyncWorker()'.
-- see also #2c

~

2b.
Should this have Assert to ensure this is only called from leader
apply or tablesync? -- see also #2c

~

2c.
IMO maybe the best/tidiest way to do this is not to introduce a new
function at all. Instead, just put all this "common init" code into
the existing "common init" function ('InitializeLogRepWorker') and
execute it only if (am_tablesync_worker() || am_leader_apply_worker())
{ }.

==
src/include/replication/worker_internal.h

3.
extern void pa_xact_finish(ParallelApplyWorkerInfo *winfo,
                           XLogRecPtr remote_lsn);
+extern void set_stream_options(WalRcvStreamOptions *options,
+                               char *slotname,
+                               XLogRecPtr *origin_startpos);
+
+extern void start_apply(XLogRecPtr origin_startpos);
+extern void DisableSubscriptionAndExit(void);
+extern void StartLogRepWorker(int worker_slot);

This placement (esp. with the missing whitespace) seems to be grouping
the set_stream_options with the other 'pa' externs, which are all
under the comment "/* Parallel apply worker setup and interactions */".

Putting all these up near the other "extern void
InitializeLogRepWorker(void)" might be less ambiguous.

--
[1] worker name in errmsg -
https://www.postgresql.org/message-id/CAA4eK1%2B%2BwkxxMjsPh-z2aKa9ZjNhKsjv0Tnw%2BTVX-hCBkDHusw%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia
Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
On Thu, Jul 20, 2023 at 11:41 PM Melih Mutlu wrote:
>
> Hi,
>
> Attached the updated patches with recent reviews addressed.
>
> See below for my comments:
>
> Peter Smith wrote on Wed, 19 Jul 2023 at 06:08:
>>
>> Some review comments for v19-0001
>>
>> 2. LogicalRepSyncTableStart
>>
>> /*
>>  * Finally, wait until the leader apply worker tells us to catch up and
>>  * then return to let LogicalRepApplyLoop do it.
>>  */
>> wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
>>
>> ~
>>
>> Should LogicalRepApplyLoop still be mentioned here, since that is
>> static in worker.c? Maybe it is better to refer instead to the common
>> 'start_apply' wrapper? (see also #5a below)
>
> Isn't LogicalRepApplyLoop static on HEAD and also mentioned in the exact
> comment in tablesync.c while the common "start_apply" function also exists?
> I'm not sure how such a change would be related to this patch.

Fair enough. I thought it was questionable for one module to refer to another module's static functions, but you are correct - it is not really related to your patch. Sorry for the noise.

--
Kind Regards,
Peter Smith.
Fujitsu Australia
Re: doc: improve the restriction description of using indexes on REPLICA IDENTITY FULL table.
On Wed, Jul 19, 2023 at 5:09 PM Önder Kalacı wrote: > > Hi Masahiko, Amit, all > >> I've updated the patch. > > > I think the flow is much nicer now compared to the HEAD. I really don't have > any > comments regarding the accuracy of the code changes, all looks good to me. > Overall, I cannot see any behavioral changes as you already alluded to. Thank you for reviewing the patch. > > Maybe few minor notes regarding the comments: > >> /* >> + * And must reference the remote relation column. This is because if it >> + * doesn't, the sequential scan is favorable over index scan in most >> + * cases.. >> + */ > > > I think the reader might have lost the context (or say in the future due to > another refactor etc). So maybe start with: > >> /* And the leftmost index field must refer to the ... Fixed. > > > Also, now in IsIndexUsableForReplicaIdentityFull() some of the conditions > have comments > some don't. Should we comment on the missing ones as well, maybe such as: > >> /* partial indexes are not support * >> if (indexInfo->ii_Predicate != NIL) > > and, >> >> /* all indexes must have at least one attribute */ >> Assert(indexInfo->ii_NumIndexAttrs >= 1); Agreed. But I don't think the latter comment is necessary as it's obvious. > > > >> >>> >>> >>> BTW, IsIndexOnlyExpression() is not necessary but the current code >>> still works fine. So do we need to backpatch it to PG16? I'm thinking >>> we can apply it to only HEAD. >> >> Either way is fine but I think if we backpatch it then the code >> remains consistent and the backpatching would be easier. > > > Yeah, I also have a slight preference for backporting. It could make it > easier to maintain the code > in the future in case of another backport(s). With the cost of making it > slightly harder for you now :) Agreed. I've attached the updated patch. I'll push it early next week, barring any objections. 
Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com v4-0001-Remove-unnecessary-checks-for-indexes-for-REPLICA.patch Description: Binary data
Re: logicalrep_message_type throws an error
On Fri, Jul 21, 2023 at 1:39 AM Ashutosh Bapat wrote:
>
> On Thu, Jul 20, 2023 at 9:11 AM Amit Kapila wrote:
> > > >
> > > > Okay, changed it accordingly.
> > >
> > > oops, forgot to attach the patch.
> >
> > WFM
>
> @@ -3367,7 +3367,7 @@ apply_dispatch(StringInfo s)
>  default:
>  ereport(ERROR,
>  (errcode(ERRCODE_PROTOCOL_VIOLATION),
> - errmsg("invalid logical replication message type \"%c\"", action)));
> + errmsg("invalid logical replication message type \"??? (%d)\"", action)));
>
> I think we should report character here since that's what is visible
> in the code and also the message types are communicated as characters
> not integers. Makes debugging easier. Context may report integer as an
> additional data point.

I think it could confuse us if an invalid message is not a printable character.

Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Re: Row pattern recognition
On 7/21/23 01:36, Jacob Champion wrote:
> There's also the issue of backtracking in the face of reclassification,
> as I think Vik was alluding to upthread.

We definitely need some kind of backtracking or other reclassification method.

> (I've attached two failing tests against v2, to hopefully better
> illustrate, along with what I _think_ should be the correct results.)

Almost. You are matching 07-01-2023 but the condition is "price > 100".
--
Vik Fearing
Re: Row pattern recognition
Hi Ishii-san,

On 7/19/23 22:15, Tatsuo Ishii wrote:
> Currently my patch has a limitation for the sake of simple
> implementation: a pattern like "A+" is parsed and analyzed in the raw
> parser. This makes subsequent process much easier because the pattern
> element variable (in this case "A") and the quantifier (in this case
> "+") is already identified by the raw parser. However there are much
> more cases are allowed in the standard as you already pointed out. For
> those cases probably we should give up to parse PATTERN items in the
> raw parser, and instead the raw parser just accepts the elements as
> Sconst?

Is there a concern that the PATTERN grammar can't be represented in Bison? I thought it was all context-free... Or is the concern that the parse tree of the pattern is hard to feed into a regex engine?

> Any comments, especially on the PREV/NEXT implementation part is
> welcome. Currently the DEFINE expression like "price > PREV(price)" is
> prepared in ExecInitWindowAgg using ExecInitExpr, tweaking var->varno
> in Var node so that PREV uses OUTER_VAR, NEXT uses INNER_VAR. Then
> evaluate the expression in ExecWindowAgg using ExecEvalExpr, setting
> previous row TupleSlot to ExprContext->ecxt_outertuple, and next row
> TupleSlot to ExprContext->ecxt_innertuple. I think this is temporary
> hack and should be gotten ride of before v1 is committed. Better idea?

I'm not familiar enough with this code yet to offer very concrete suggestions, sorry... But at some point in the future, we need to be able to skip forward and backward from arbitrary points, like

    DEFINE B AS B.price > PREV(FIRST(A.price), 3)

so there won't be just one pair of "previous and next" tuples. Maybe that can help clarify the design? It feels like it'll need to eventually be a "real" function that operates on the window state, even if it doesn't support all of the possible complexities in v1.
--

Taking a closer look at the regex engine:

It looks like the + qualifier has trouble when it meets the end of the frame. For instance, try removing the last row of the 'stock' table in rpr.sql; some of the final matches will disappear unexpectedly. Or try a pattern like

    PATTERN ( a+ )
    DEFINE a AS TRUE

which doesn't seem to match anything in my testing.

There's also the issue of backtracking in the face of reclassification, as I think Vik was alluding to upthread. The pattern

    PATTERN ( a+ b+ )
    DEFINE a AS col = 2,
           b AS col = 2

doesn't match a sequence of values (2 2 ...) with the current implementation, even with a dummy row at the end to avoid the end-of-frame bug. (I've attached two failing tests against v2, to hopefully better illustrate, along with what I _think_ should be the correct results.)

I'm not quite understanding the match loop in evaluate_pattern(). It looks like we're building up a string to pass to the regex engine, but by the time we call regexp_instr, don't we already know whether or not the pattern will match based on the expression evaluation we've done?
Thanks, --Jacobdiff --git a/src/test/regress/expected/rpr.out b/src/test/regress/expected/rpr.out index 6bf8818911..68e9a98684 100644 --- a/src/test/regress/expected/rpr.out +++ b/src/test/regress/expected/rpr.out @@ -230,6 +230,79 @@ SELECT company, tdate, price, rpr(price) OVER w FROM stock company2 | 07-10-2023 | 1300 | (20 rows) +-- match everything +SELECT company, tdate, price, rpr(price) OVER w FROM stock + WINDOW w AS ( + PARTITION BY company + ORDER BY tdate + ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING + AFTER MATCH SKIP TO NEXT ROW + INITIAL + PATTERN (A+) + DEFINE + A AS TRUE +); + company | tdate| price | rpr +--++---+- + company1 | 07-01-2023 | 100 | 100 + company1 | 07-02-2023 | 200 | + company1 | 07-03-2023 | 150 | + company1 | 07-04-2023 | 140 | + company1 | 07-05-2023 | 150 | + company1 | 07-06-2023 |90 | + company1 | 07-07-2023 | 110 | + company1 | 07-08-2023 | 130 | + company1 | 07-09-2023 | 120 | + company1 | 07-10-2023 | 130 | + company2 | 07-01-2023 |50 | 50 + company2 | 07-02-2023 | 2000 | + company2 | 07-03-2023 | 1500 | + company2 | 07-04-2023 | 1400 | + company2 | 07-05-2023 | 1500 | + company2 | 07-06-2023 |60 | + company2 | 07-07-2023 | 1100 | + company2 | 07-08-2023 | 1300 | + company2 | 07-09-2023 | 1200 | + company2 | 07-10-2023 | 1300 | +(20 rows) + +-- backtracking with reclassification of rows +SELECT company, tdate, price, rpr(price) OVER w FROM stock + WINDOW w AS ( + PARTITION BY company + ORDER BY tdate + ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING + AFTER MATCH SKIP TO NEXT ROW + INITIAL + PATTERN (A+ B+) + DEFINE + A AS price > 100, + B AS price > 100 +); + company | tdate| price | rpr +--++---+-- + company1 | 07-01-2023 | 100 | 100 + company1 | 07-02-2023 | 200 | +
Re: Atomic ops for unlogged LSN
Based on your feedback, I’ve updated the patch with additional comments. 1. Explain the two cases when writing to the control file, and 2. a bit more emphasis on unloggedLSNs not being valid after a crash. Thanks to y’all. * John v2-0001-Use-atomic-ops-for-unloggedLSNs.patch Description: v2-0001-Use-atomic-ops-for-unloggedLSNs.patch
Re: Performance degradation on concurrent COPY into a single relation in PG16.
On Thu, 20 Jul 2023 at 20:37, Dean Rasheed wrote:
>
> On Thu, 20 Jul 2023 at 00:56, David Rowley wrote:
>
> I agree with the principle though. In the attached updated patch, I
> replaced that test with a simpler one:
>
> +    /*
> +     * Process the number's digits. We optimize for decimal input (expected to
> +     * be the most common case) first. Anything that doesn't start with a base
> +     * prefix indicator must be decimal.
> +     */
> +
> +    /* process decimal digits */
> +    if (likely(ptr[0] != '0' || !isalpha((unsigned char) ptr[1])))
>
> So hopefully any compiler should only need to do the one comparison
> against '0' for any non-zero decimal input.
>
> Doing that didn't give any measurable performance improvement for me,
> but it did at least not make it noticeably worse, and seems more
> likely to generate better code with not-so-smart compilers. I'd be
> interested to know how that performs for you (and if your compiler
> really does generate 3 "first digit is '0'" tests for the unpatched
> code).

That seems better.
I benchmarked it on two machines:

gcc12.2/linux/amd3990x

create table a (a int);
insert into a select x from generate_series(1,1000)x;
copy a to '/tmp/a.dump';

master @ ab29a7a9c
postgres=# truncate a; copy a from '/tmp/a.dump';
Time: 2226.912 ms (00:02.227)
Time: 2214.168 ms (00:02.214)
Time: 2206.481 ms (00:02.206)
Time: 2219.640 ms (00:02.220)
Time: 2205.004 ms (00:02.205)

master + pg_strtoint32_base_10_first.v2.patch
postgres=# truncate a; copy a from '/tmp/a.dump';
Time: 2067.108 ms (00:02.067)
Time: 2070.401 ms (00:02.070)
Time: 2073.423 ms (00:02.073)
Time: 2065.407 ms (00:02.065)
Time: 2066.536 ms (00:02.067)

(~7% faster)

apple m2 pro/clang

master @ 9089287a
postgres=# truncate a; copy a from '/tmp/a.dump';
Time: 1286.369 ms (00:01.286)
Time: 1300.534 ms (00:01.301)
Time: 1295.661 ms (00:01.296)
Time: 1296.404 ms (00:01.296)
Time: 1268.361 ms (00:01.268)
Time: 1259.321 ms (00:01.259)

master + pg_strtoint32_base_10_first.v2.patch
postgres=# truncate a; copy a from '/tmp/a.dump';
Time: 1253.519 ms (00:01.254)
Time: 1235.256 ms (00:01.235)
Time: 1269.501 ms (00:01.270)
Time: 1267.801 ms (00:01.268)
Time: 1275.758 ms (00:01.276)
Time: 1261.478 ms (00:01.261)

(a bit noisy but avg of ~1.8% faster)

David
Re: Bytea PL/Perl transform
>Friday, 14 July 2023, 23:27 +03:00 от Tom Lane : > >=?UTF-8?B?SXZhbiBQYW5jaGVua28=?= < w...@mail.ru > writes: >> Четверг, 6 июля 2023, 14:48 +03:00 от Peter Eisentraut < >> pe...@eisentraut.org >: >>> If the transform deals with a built-in type, then they should just be >>> added to the respective pl extension directly. > >> The new extension bytea_plperl can be easily moved into plperl now, but what >> should be do with the existing ones, namely jsonb_plperl and bool_plperl ? >> If we leave them where they are, it would be hard to explain why some >> transforms are inside plperl while other ones live separately. If we move >> them into plperl also, wouldn’t it break some compatibility? > >It's kind of a mess, indeed. But I think we could make plperl 1.1 >contain the additional transforms and just tell people they have >to drop the obsolete extensions before they upgrade to 1.1. >Fortunately, it doesn't look like functions using a transform >have any hard dependency on the transform, so "drop extension >jsonb_plperl" followed by "alter extension plperl update" should >work without cascading to all your plperl functions. > >Having said that, we'd still be in the position of having to >explain why some transforms are packaged with plperl and others >not. The distinction between built-in and contrib types might >be obvious to us hackers, but I bet a lot of users see it as >pretty artificial. So maybe the existing packaging design is >fine and we should just look for a way to reduce the code >duplication. The code duplication between different transforms is not in C code, but mostly in copy-pasted Makefile, *.control, *.sql and meson-build. These files could be generated from some universal templates. But, keeping in mind Windows compatibility and presence of three build system, this hardly looks like a simplification. Probably at present time it would be better to keep the existing code duplication until plperl 1.1. 
Nevertheless, dealing with code duplication is a wider task than the bytea transform, so let me suggest to keep this extension in the present form. If we decide what to do with the duplication, it would be another patch. The mesonified and rebased version of the transform patch is attached. > >regards, tom lane > Regards, Ivandiff --git a/contrib/Makefile b/contrib/Makefile index bbf220407b..bb997dda69 100644 --- a/contrib/Makefile +++ b/contrib/Makefile @@ -78,9 +78,9 @@ ALWAYS_SUBDIRS += sepgsql endif ifeq ($(with_perl),yes) -SUBDIRS += bool_plperl hstore_plperl jsonb_plperl +SUBDIRS += bool_plperl bytea_plperl hstore_plperl jsonb_plperl else -ALWAYS_SUBDIRS += bool_plperl hstore_plperl jsonb_plperl +ALWAYS_SUBDIRS += bool_plperl bytea_plperl hstore_plperl jsonb_plperl endif ifeq ($(with_python),yes) diff --git a/contrib/bytea_plperl/bytea_plperl--1.0.sql b/contrib/bytea_plperl/bytea_plperl--1.0.sql new file mode 100644 index 00..6544b2ac85 --- /dev/null +++ b/contrib/bytea_plperl/bytea_plperl--1.0.sql @@ -0,0 +1,19 @@ +/* contrib/bytea_plperl/bytea_plperl--1.0.sql */ + +-- complain if script is sourced in psql, rather than via CREATE EXTENSION +\echo Use "CREATE EXTENSION bytea_plperl" to load this file. 
\quit + +CREATE FUNCTION bytea_to_plperl(val internal) RETURNS internal +LANGUAGE C STRICT IMMUTABLE +AS 'MODULE_PATHNAME'; + +CREATE FUNCTION plperl_to_bytea(val internal) RETURNS bytea +LANGUAGE C STRICT IMMUTABLE +AS 'MODULE_PATHNAME'; + +CREATE TRANSFORM FOR bytea LANGUAGE plperl ( +FROM SQL WITH FUNCTION bytea_to_plperl(internal), +TO SQL WITH FUNCTION plperl_to_bytea(internal) +); + +COMMENT ON TRANSFORM FOR bytea LANGUAGE plperl IS 'transform between bytea and Perl'; diff --git a/contrib/bytea_plperl/bytea_plperl.c b/contrib/bytea_plperl/bytea_plperl.c new file mode 100644 index 00..7ff16040c9 --- /dev/null +++ b/contrib/bytea_plperl/bytea_plperl.c @@ -0,0 +1,36 @@ +/* + * contrib/bytea_plperl/bytea_plperl.c + */ + +#include "postgres.h" + +#include "fmgr.h" +#include "plperl.h" +#include "varatt.h" + +PG_MODULE_MAGIC; + +PG_FUNCTION_INFO_V1(bytea_to_plperl); +PG_FUNCTION_INFO_V1(plperl_to_bytea); + +Datum +bytea_to_plperl(PG_FUNCTION_ARGS) +{ + dTHX; + bytea *in = PG_GETARG_BYTEA_PP(0); + return PointerGetDatum(newSVpvn_flags( (char *) VARDATA_ANY(in), VARSIZE_ANY_EXHDR(in), 0 )); +} + +Datum +plperl_to_bytea(PG_FUNCTION_ARGS) +{ + dTHX; + bytea *result; + STRLEN len; + SV *in = (SV *) PG_GETARG_POINTER(0); + char *ptr = SvPVbyte(in, len); + result = palloc(VARHDRSZ + len ); + SET_VARSIZE(result, VARHDRSZ + len ); + memcpy(VARDATA_ANY(result), ptr,len ); + PG_RETURN_BYTEA_P(result); +} diff --git a/contrib/bytea_plperl/bytea_plperl.control b/contrib/bytea_plperl/bytea_plperl.control new file mode 100644 index 00..9ff0f2a8dd --- /dev/null +++ b/contrib/bytea_plperl/bytea_plperl.control @@ -0,0 +1,7 @@ +# bytea_plperl extension +comment = 'transform between bytea and plperl' +default_version = '1.0' +module_pathname =
Re: In Postgres 16 BETA, should the ParseNamespaceItem have the same index as it's RangeTableEntry?
Hello, Thank you for the help guys and I'm so sorry for my late response. Indeed, the error relies on the ResultRelInfo. In GetResultRTEPermissionInfo() function, it does a checking on the relinfo->ri_RootResultRelInfo variable. I believe that it should go inside this scope: if (relinfo->ri_RootResultRelInfo) { /* * For inheritance child result relations (a partition routing target * of an INSERT or a child UPDATE target), this returns the root * parent's RTE to fetch the RTEPermissionInfo because that's the only * one that has one assigned. */ rti = relinfo->ri_RootResultRelInfo->ri_RangeTableIndex; } The relinfo contains: {type = T_ResultRelInfo, ri_RangeTableIndex = 5, ri_RelationDesc = 0x7f44e3308cc8, ri_NumIndices = 0, ri_IndexRelationDescs = 0x0, ri_IndexRelationInfo = 0x0, ri_RowIdAttNo = 0, ri_extraUpdatedCols = 0x0, ri_projectNew = 0x0, ri_newTupleSlot = 0x0, ri_oldTupleSlot = 0x0, ri_projectNewInfoValid = false, ri_TrigDesc = 0x0, ri_TrigFunctions = 0x0, ri_TrigWhenExprs = 0x0, ri_TrigInstrument = 0x0, ri_ReturningSlot = 0x0, ri_TrigOldSlot = 0x0, ri_TrigNewSlot = 0x0, ri_FdwRoutine = 0x0, ri_FdwState = 0x0, ri_usesFdwDirectModify = false, ri_NumSlots = 0, ri_NumSlotsInitialized = 0, ri_BatchSize = 0, ri_Slots = 0x0, ri_PlanSlots = 0x0, ri_WithCheckOptions = 0x0, ri_WithCheckOptionExprs = 0x0, ri_ConstraintExprs = 0x0, ri_GeneratedExprsI = 0x0, ri_GeneratedExprsU = 0x0, ri_NumGeneratedNeededI = 0, ri_NumGeneratedNeededU = 0, ri_returningList = 0x0, ri_projectReturning = 0x0, ri_onConflictArbiterIndexes = 0x0, ri_onConflict = 0x0, ri_matchedMergeAction = 0x0, ri_notMatchedMergeAction = 0x0, ri_PartitionCheckExpr = 0x0, ri_ChildToRootMap = 0x0, ri_ChildToRootMapValid = false, ri_RootToChildMap = 0x0, ri_RootToChildMapValid = false, ri_RootResultRelInfo = 0x0, ri_PartitionTupleSlot = 0x0, ri_CopyMultiInsertBuffer = 0x0, ri_ancestorResultRels = 0x0} Since relinfo->ri_RootResultRelInfo = 0x0, the rti will have no value and Postgres will interpret that the 
ResultRelInfo must've been created only for filtering triggers and the relation is not being inserted into. The relinfo variable is created with the create_entity_result_rel_info() function:

ResultRelInfo *
create_entity_result_rel_info(EState *estate, char *graph_name,
                              char *label_name)
{
    RangeVar *rv;
    Relation label_relation;
    ResultRelInfo *resultRelInfo;
    ParseState *pstate = make_parsestate(NULL);

    resultRelInfo = palloc(sizeof(ResultRelInfo));

    if (strlen(label_name) == 0)
    {
        rv = makeRangeVar(graph_name, AG_DEFAULT_LABEL_VERTEX, -1);
    }
    else
    {
        rv = makeRangeVar(graph_name, label_name, -1);
    }

    label_relation = parserOpenTable(pstate, rv, RowExclusiveLock);

    // initialize the resultRelInfo
    InitResultRelInfo(resultRelInfo, label_relation,
                      list_length(estate->es_range_table), NULL,
                      estate->es_instrument);

    // open the parse state
    ExecOpenIndices(resultRelInfo, false);

    free_parsestate(pstate);

    return resultRelInfo;
}

In this case, how can we get the relinfo->ri_RootResultRelInfo to store the appropriate data?

Thank you,
Matheus Farias

On Tue, Jul 18, 2023 at 06:58, Amit Langote wrote:
> Hello,
>
> On Sat, Jul 15, 2023 at 4:43 AM Farias de Oliveira wrote:
> > I believe I have found something interesting that might be the root of
> > the problem with RTEPermissionInfo. But I do not know how to fix it
> > exactly.
In AGE's code, the execution of it goes through a function called > analyze_cypher_clause() which does the following: > > > > It ends up going inside other functions and changing it more a bit, but > at the end of one of these functions it assigns values to some members of > the query: > > > > query->targetList = lappend(query->targetList, tle); > > query->rtable = pstate->p_rtable; > > query->jointree = makeFromExpr(pstate->p_joinlist, NULL); > > > > I assume that here is missing the assignment of query->rteperminfos to > be the same as pstate->p_rteperminfos, but the pstate has the following > values: > > > > {pstate = {parentParseState = 0x0, p_sourcetext = 0x2b06ef0 "MATCH (n) > SET n.i = 3", p_rtable = 0x2c6e590, > > p_rteperminfos = 0x0, p_joinexprs = 0x0, p_nullingrels = 0x0, > p_joinlist = 0x2c6e678, p_namespace = 0x2c6e6c8, > > p_lateral_active = false, p_ctenamespace = 0x0, p_future_ctes = 0x0, > p_parent_cte = 0x0, p_target_relation = 0x0, > > p_target_nsitem = 0x0, p_is_insert = false, p_windowdefs = 0x0, > p_expr_kind = EXPR_KIND_NONE, p_next_resno = 3, > > p_multiassign_exprs = 0x0, p_locking_clause = 0x0, > p_locked_from_parent = false, p_resolve_unknowns = true, > > p_queryEnv = 0x0, p_hasAggs = false, p_hasWindowFuncs = false, > p_hasTargetSRFs = false,
Re: Use COPY for populating all pgbench tables
On Wed Jul 19, 2023 at 10:07 PM CDT, Michael Paquier wrote:
> So this patch causes pgbench to not stick with its historical behavior,
> and the change is incompatible with the comments because the tellers and
> branches tables don't use NULL for their filler attribute anymore.

Great find. This was simply a matter of me not understanding the COPY command properly. Relevant documentation snippet:

> Specifies the string that represents a null value. The default is \N
> (backslash-N) in text format, and an unquoted empty string in CSV format.
> You might prefer an empty string even in text format for cases where you
> don't want to distinguish nulls from empty strings. This option is not
> allowed when using binary format.

This new revision populates the column with the NULL value.

psql (17devel)
Type "help" for help.

tristan957=# select count(1) from pgbench_branches;
 count
-------
     1
(1 row)

tristan957=# select count(1) from pgbench_branches where filler is null;
 count
-------
     1
(1 row)

Thanks for your testing, Michael. I went ahead and added a test to make sure that this behavior doesn't regress accidentally, but I am struggling to get the test to fail using the previous version of this patch. Do you have any advice? This is my first time writing a test for Postgres. I can recreate the issue outside of the test script, but not within it for whatever reason.

--
Tristan Partin
Neon (https://neon.tech)

From cf84b3ea2d7b583aaee3f80807b26ef4521db0f6 Mon Sep 17 00:00:00 2001
From: Tristan Partin
Date: Tue, 23 May 2023 11:48:16 -0500
Subject: [PATCH v6] Use COPY instead of INSERT for populating all tables

COPY is a better interface for bulk loading and/or high latency
connections. Previously COPY was only used for the pgbench_accounts
table because the amount of data was so much larger than the other
tables. However, as already said there are also gains to be had in the
high latency case, such as loading data across continents.
--- doc/src/sgml/ref/pgbench.sgml| 8 +- src/bin/pgbench/pgbench.c| 151 +++ src/bin/pgbench/t/001_pgbench_with_server.pl | 18 +++ 3 files changed, 110 insertions(+), 67 deletions(-) diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml index 850028557d..4424595cc6 100644 --- a/doc/src/sgml/ref/pgbench.sgml +++ b/doc/src/sgml/ref/pgbench.sgml @@ -231,10 +231,10 @@ pgbench options d extensively through a COPY. pgbench uses the FREEZE option with version 14 or later of PostgreSQL to speed up -subsequent VACUUM, unless partitions are enabled. -Using g causes logging to print one message -every 100,000 rows while generating data for the -pgbench_accounts table. +subsequent VACUUM. If partitions are enabled for +the pgbench_accounts table, the FREEZE option is +not enabled. Using g causes logging to print one +message every 100,000 rows while generating data for all tables. With G (server-side data generation), diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c index 087baa7d57..ef01d049a1 100644 --- a/src/bin/pgbench/pgbench.c +++ b/src/bin/pgbench/pgbench.c @@ -4859,17 +4859,44 @@ initTruncateTables(PGconn *con) "pgbench_tellers"); } -/* - * Fill the standard tables with some data generated and sent from the client - */ +typedef void initRow(PQExpBufferData *sql, int64 curr); + static void -initGenerateDataClientSide(PGconn *con) +initBranch(PQExpBufferData *sql, int64 curr) +{ + printfPQExpBuffer(sql, + INT64_FORMAT "\t0\t\\N\n", + curr + 1); +} + +static void +initTeller(PQExpBufferData *sql, int64 curr) +{ + printfPQExpBuffer(sql, + INT64_FORMAT "\t" INT64_FORMAT "\t0\t\\N\n", + curr + 1, curr / ntellers + 1); +} + +static void +initAccount(PQExpBufferData *sql, int64 curr) { + /* "filler" column defaults to blank padded empty string */ + printfPQExpBuffer(sql, + INT64_FORMAT "\t" INT64_FORMAT "\t0\t\n", + curr + 1, curr / naccounts + 1); +} + +static void +initPopulateTable(PGconn *con, const char *table, int64 base, initRow 
*init_row) +{ + int n; + int k; + int chars = 0; + PGresult *res; PQExpBufferData sql; - PGresult *res; - int i; - int64 k; - char *copy_statement; + char copy_statement[256]; + const char *copy_statement_fmt = "copy %s from stdin"; + const int64 total = base * scale; /* used to track elapsed time and estimate of the remaining time */ pg_time_usec_t start; @@ -4878,50 +4905,24 @@ initGenerateDataClientSide(PGconn *con) /* Stay on the same line if reporting to a terminal */ char eol = isatty(fileno(stderr)) ? '\r' : '\n'; - fprintf(stderr, "generating data (client-side)...\n"); - - /* - * we do all of this in one transaction to enable the backend's - * data-loading optimizations - */ -
Re: Inefficiency in parallel pg_restore with many tables
Here is a work-in-progress patch set for converting ready_list to a priority queue. On my machine, Tom's 100k-table example [0] takes 11.5 minutes without these patches and 1.5 minutes with them. One item that requires more thought is binaryheap's use of Datum. AFAICT the Datum definitions live in postgres.h and aren't available to frontend code. I think we'll either need to move the Datum definitions to c.h or to adjust binaryheap to use "void *". [0] https://postgr.es/m/3612876.1689443232%40sss.pgh.pa.us -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From bbf7b32f0abd0e9e2f7a0759cc79fcccbb5f9281 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 19 Jul 2023 15:54:00 -0700 Subject: [PATCH v2 1/4] misc binaryheap fixes --- src/backend/postmaster/pgarch.c | 12 ++-- src/backend/replication/logical/reorderbuffer.c | 2 +- 2 files changed, 7 insertions(+), 7 deletions(-) diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c index 46af349564..c28e6f2070 100644 --- a/src/backend/postmaster/pgarch.c +++ b/src/backend/postmaster/pgarch.c @@ -108,7 +108,7 @@ static ArchiveModuleState *archive_module_state; * archive_status. Minimizing the number of directory scans when there are * many files to archive can significantly improve archival rate. * - * arch_heap is a max-heap that is used during the directory scan to track + * arch_heap is a min-heap that is used during the directory scan to track * the highest-priority files to archive. After the directory scan * completes, the file names are stored in ascending order of priority in * arch_files. pgarch_readyXlog() returns files from arch_files until it @@ -248,7 +248,7 @@ PgArchiverMain(void) arch_files = palloc(sizeof(struct arch_files_state)); arch_files->arch_files_size = 0; - /* Initialize our max-heap for prioritizing files to archive. */ + /* Initialize our min-heap for prioritizing files to archive. 
*/ arch_files->arch_heap = binaryheap_allocate(NUM_FILES_PER_DIRECTORY_SCAN, ready_file_comparator, NULL); @@ -624,7 +624,7 @@ pgarch_readyXlog(char *xlog) basename[basenamelen] = '\0'; /* - * Store the file in our max-heap if it has a high enough priority. + * Store the file in our min-heap if it has a high enough priority. */ if (arch_files->arch_heap->bh_size < NUM_FILES_PER_DIRECTORY_SCAN) { @@ -644,15 +644,15 @@ pgarch_readyXlog(char *xlog) * Remove the lowest priority file and add the current one to the * heap. */ - arch_file = DatumGetCString(binaryheap_remove_first(arch_files->arch_heap)); + arch_file = DatumGetCString(binaryheap_first(arch_files->arch_heap)); strcpy(arch_file, basename); - binaryheap_add(arch_files->arch_heap, CStringGetDatum(arch_file)); + binaryheap_replace_first(arch_files->arch_heap, CStringGetDatum(arch_file)); } } FreeDir(rldir); /* If no files were found, simply return. */ - if (arch_files->arch_heap->bh_size == 0) + if (binaryheap_empty(arch_files->arch_heap)) return false; /* diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c index 26d252bd87..05bbcecd29 100644 --- a/src/backend/replication/logical/reorderbuffer.c +++ b/src/backend/replication/logical/reorderbuffer.c @@ -1381,7 +1381,7 @@ ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state) int32 off; /* nothing there anymore */ - if (state->heap->bh_size == 0) + if (binaryheap_empty(state->heap)) return NULL; off = DatumGetInt32(binaryheap_first(state->heap)); -- 2.25.1 >From 8f0339263dd21bb04c1b96a0e850a9b57a2e3f8b Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Thu, 20 Jul 2023 09:45:03 -0700 Subject: [PATCH v2 2/4] make binaryheap available to frontend --- src/backend/lib/Makefile | 1 - src/backend/lib/meson.build | 1 - src/common/Makefile | 1 + src/{backend/lib => common}/binaryheap.c | 15 src/common/meson.build | 1 + src/include/lib/binaryheap.h | 29 6 files changed, 46 insertions(+), 2 
deletions(-) rename src/{backend/lib => common}/binaryheap.c (96%) diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile index 9dad31398a..b6cefd9cca 100644 --- a/src/backend/lib/Makefile +++ b/src/backend/lib/Makefile @@ -13,7 +13,6 @@ top_builddir = ../../.. include $(top_builddir)/src/Makefile.global OBJS = \ - binaryheap.o \ bipartite_match.o \ bloomfilter.o \ dshash.o \ diff --git a/src/backend/lib/meson.build b/src/backend/lib/meson.build index 974cab8776..b4e88f54ae 100644 --- a/src/backend/lib/meson.build +++ b/src/backend/lib/meson.build @@ -1,7 +1,6 @@ # Copyright (c) 2022-2023, PostgreSQL Global Development Group backend_sources += files( - 'binaryheap.c', 'bipartite_match.c', 'bloomfilter.c', 'dshash.c', diff --git
Re: There should be a way to use the force flag when restoring databases
On Thu, Jul 20, 2023 at 2:10 AM Daniel Gustafsson wrote: > > > On 19 Jul 2023, at 19:28, Gurjeet Singh wrote: > > > > The docs for 'pg_restore --create` say "Create the database before > > restoring into it. If --clean is also specified, drop and recreate the > > target database before connecting to it." > > > > If we provided a force option, it may then additionally say: "If the > > --clean and --force options are specified, DROP DATABASE ... WITH > > FORCE command will be used to drop the database." > > pg_restore --clean refers to dropping any pre-existing database objects and > not > just databases, but --force would only apply to databases. > > I wonder if it's worth complicating pg_restore with that when running dropdb > --force before pg_restore is an option for those wanting to use WITH FORCE. Fair point. But the same argument could've been applied to --clean option, as well; why overload the meaning of --clean and make it drop database, when a dropdb before pg_restore was an option. IMHO, if pg_restore offers to drop database, providing an option to the user to do it forcibly is not that much of a stretch, and within reason for the user to expect it to be there, like Joan did. Best regards, Gurjeet http://Gurje.et
Re: logical decoding and replication of sequences, take 2
FWIW, there are two questions related to the switch to XLOG_SMGR_CREATE.

1) Does smgr_decode() need to do the same block as sequence_decode()?

    /* Skip the change if already processed (per the snapshot). */
    if (transactional &&
        !SnapBuildProcessChange(builder, xid, buf->origptr))
        return;
    else if (!transactional &&
             (SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT ||
              SnapBuildXactNeedsSkip(builder, buf->origptr)))
        return;

I don't think it does. Also, we don't have any transactional flag here. Or rather, everything is transactional ...

2) Currently, the sequences hash table is in the reorderbuffer, i.e. global. I was thinking maybe we should have it in the transaction (because we need to do cleanup at the end). It seems a bit inconvenient, because then we'd need to either search the htabs in all subxacts, or transfer the entries to the top-level xact (OTOH, we already do that with snapshots), and clean up on abort.

What do you think?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: logicalrep_message_type throws an error
On Thu, Jul 20, 2023 at 9:11 AM Amit Kapila wrote:
> > > Okay, changed it accordingly.
> >
> > oops, forgot to attach the patch.
>
> WFM

@@ -3367,7 +3367,7 @@ apply_dispatch(StringInfo s)
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
-					 errmsg("invalid logical replication message type \"%c\"", action)));
+					 errmsg("invalid logical replication message type \"??? (%d)\"", action)));

I think we should report the character here, since that's what is visible in the code, and the message types are communicated as characters, not integers. That makes debugging easier. The context may report the integer as an additional data point.

--
Best Wishes,
Ashutosh Bapat
Re: Atomic ops for unlogged LSN
> what happens if … reader here stores the old unloggedLSN value
> to the control file and the server restarts (clean exit). So, the old
> value is loaded back to unloggedLSN upon restart and the callers of
> GetFakeLSNForUnloggedRel() will see an old/repeated value too. Is it a
> problem?

First, a clarification. The value being saved is the "next" unlogged LSN, not one which has already been used. (We are doing "fetch and add", not "add and fetch".)

You have a good point about shutdown and startup. It is vital that we don't repeat an unlogged LSN. This situation could easily happen if other readers were active while we were shutting down.

> With an atomic variable, it is guaranteed that the readers
> don't see a torn-value, but no synchronization is provided.

The atomic increment also ensures the sequence of values is correct; specifically, we don't see repeated values like we might with a conventional increment. As a side effect, the instruction enforces a memory barrier, but we are not relying on a barrier in this case.
Re: remaining sql/json patches
On 2023-Jul-21, Amit Langote wrote: > I’m thinking of pushing 0001 and 0002 tomorrow barring objections. 0001 looks reasonable to me. I think you asked whether to squash that one with the other bugfix commit for the same code that you already pushed to master; I think there's no point in committing as separate patches, because the first one won't show up in the git_changelog output as a single entity with the one in 16, so it'll just be additional noise. I've looked at 0002 at various points in time and I think it looks generally reasonable. I think your removal of a couple of newlines (where originally two appear in sequence) is unwarranted; that the name to_json[b]_worker is ugly for exported functions (maybe "datum_to_json" would be better, or you may have better ideas); and that the omission of the stock comment in the new stanzas in FigureColnameInternal() is strange. But I don't have anything serious. Do add some ecpg tests ... Also, remember to pgindent and bump catversion, if you haven't already. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "No hay hombre que no aspire a la plenitud, es decir, la suma de experiencias de que un hombre es capaz"
Re: pg15.3: dereference null *plan in CachedPlanIsSimplyValid/plpgsql
Justin Pryzby writes:
> A production instance crashed like so.

Oh, interesting. Apparently we got past the "Has cache invalidation fired on this plan?" check:

    if (!plansource->is_valid ||
        plan != plansource->gplan ||
        !plan->is_valid)
        return false;

because "plan" and "plansource->gplan" were *both* null, which is a case the code wasn't expecting. plancache.c seems to be paranoid about gplan possibly being null everywhere else but here :-(

> Note that the prior query seems to have timed out in the same / similar
> plpgsql statement:

Hm. Perhaps that helps explain how we got into this state. It must be some weird race condition, given the lack of prior reports.

Anyway, adding a check for plan not being null seems like the obvious fix.

regards, tom lane
Re: remaining sql/json patches
On Thu, Jul 20, 2023 at 17:19 Amit Langote wrote: > On Wed, Jul 19, 2023 at 5:17 PM Amit Langote > wrote: > > On Wed, Jul 19, 2023 at 12:53 AM Alvaro Herrera > wrote: > > > On 2023-Jul-18, Amit Langote wrote: > > > > > > > Attached updated patches. In 0002, I removed the mention of the > > > > RETURNING clause in the JSON(), JSON_SCALAR() documentation, which I > > > > had forgotten to do in the last version which removed its support in > > > > code. > > > > > > > I think 0001 looks ready to go. Alvaro? > > > > > > It looks reasonable to me. > > > > Thanks for taking another look. > > > > I will push this tomorrow. > > Pushed. > > > > > Also, I've been wondering if it isn't too late to apply the following > > > > to v16 too, so as to make the code look similar in both branches: > > > > > > Hmm. > > > > > > > 785480c953 Pass constructName to transformJsonValueExpr() > > > > > > I think 785480c953 can easily be considered a bugfix on 7081ac46ace8, > so > > > I agree it's better to apply it to 16. > > > > OK. > > Pushed to 16. > > > > > b6e1157e7d Don't include CaseTestExpr in JsonValueExpr.formatted_expr > > > > > > I feel a bit uneasy about this one. It seems to assume that > > > formatted_expr is always set, but at the same time it's not obvious > that > > > it is. (Maybe this aspect just needs some more commentary). > > > > Hmm, I agree that the comments about formatted_expr could be improved > > further, for which I propose the attached. Actually, staring some > > more at this, I'm inclined to change makeJsonValueExpr() to allow > > callers to pass it the finished 'formatted_expr' rather than set it by > > themselves. > > > > > I agree > > > that it would be better to make both branches identical, because if > > > there's a problem, we are better equipped to get a fix done to both. 
> > > > > > As for the removal of makeCaseTestExpr(), I agree -- of the six callers > > > of makeNode(CastTestExpr), only two of them would be able to use the > new > > > function, so it doesn't look of general enough usefulness. > > > > OK, so you agree with back-patching this one too, though perhaps only > > after applying something like the aforementioned patch. > > I looked at this some more and concluded that it's fine to think that > all JsonValueExpr nodes leaving the parser have their formatted_expr > set. I've updated the commentary some more in the patch attached as > 0001. > > Rebased SQL/JSON patches also attached. I've fixed the JSON_TABLE > syntax synopsis in the documentation as mentioned in my other email. I’m thinking of pushing 0001 and 0002 tomorrow barring objections. > -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
pg15.3: dereference null *plan in CachedPlanIsSimplyValid/plpgsql
A production instance crashed like so.

< 2023-07-20 05:41:36.557 MDT >LOG: server process (PID 106001) was terminated by signal 11: Segmentation fault
[168417.649549] postmaster[106001]: segfault at 12 ip 008f1d40 sp 7ffe885502e0 error 4 in postgres[40+7dc000]

Core was generated by `postgres: main main 127.0.0.1(51936) SELECT'.

(gdb) bt
#0  CachedPlanIsSimplyValid (plansource=0x2491398, plan=0x0, owner=0x130a198) at plancache.c:1441
#1  0x7fcab2e93d15 in exec_eval_simple_expr (rettypmod=0x7ffe88550374, rettype=0x7ffe88550370, isNull=0x7ffe8855036f, result=, expr=0x2489dc8, estate=0x7ffe885507f0) at pl_exec.c:6067
#2  exec_eval_expr (estate=0x7ffe885507f0, expr=0x2489dc8, isNull=0x7ffe8855036f, rettype=0x7ffe88550370, rettypmod=0x7ffe88550374) at pl_exec.c:5696
#3  0x7fcab2e97d42 in exec_assign_expr (estate=estate@entry=0x7ffe885507f0, target=0x16250f8, expr=0x2489dc8) at pl_exec.c:5028
#4  0x7fcab2e98675 in exec_stmt_assign (stmt=0x2489d80, stmt=0x2489d80, estate=0x7ffe885507f0) at pl_exec.c:2155
#5  exec_stmts (estate=estate@entry=0x7ffe885507f0, stmts=0x2489ad8) at pl_exec.c:2019
#6  0x7fcab2e9b672 in exec_stmt_block (estate=estate@entry=0x7ffe885507f0, block=block@entry=0x248af48) at pl_exec.c:1942
#7  0x7fcab2e9b6cb in exec_toplevel_block (estate=estate@entry=0x7ffe885507f0, block=0x248af48) at pl_exec.c:1633
#8  0x7fcab2e9bf84 in plpgsql_exec_function (func=func@entry=0x1035388, fcinfo=fcinfo@entry=0x40a76b8, simple_eval_estate=simple_eval_estate@entry=0x0, simple_eval_resowner=simple_eval_resowner@entry=0x0, procedure_resowner=procedure_resowner@entry=0x0, atomic=atomic@entry=true) at pl_exec.c:622
#9  0x7fcab2ea7454 in plpgsql_call_handler (fcinfo=0x40a76b8) at pl_handler.c:277
#10 0x00658307 in ExecAggPlainTransByRef (setno=, aggcontext=, pergroup=0x41a3358, pertrans=, aggstate=) at execExprInterp.c:4379
#11 ExecInterpExpr (state=0x40a7870, econtext=0x1843a58, isnull=) at execExprInterp.c:1770
#12 0x00670282 in ExecEvalExprSwitchContext (isNull=0x7ffe88550b47, econtext=, state=) at ../../../src/include/executor/executor.h:341
#13 advance_aggregates (aggstate=, aggstate=) at nodeAgg.c:824
#14 agg_fill_hash_table (aggstate=0x1843630) at nodeAgg.c:2544
#15 ExecAgg (pstate=0x1843630) at nodeAgg.c:2154
#16 0x00662c58 in ExecProcNodeInstr (node=0x1843630) at execProcnode.c:480
#17 0x0067c7f3 in ExecProcNode (node=0x1843630) at ../../../src/include/executor/executor.h:259
#18 ExecHashJoinImpl (parallel=false, pstate=0x1843390) at nodeHashjoin.c:268
#19 ExecHashJoin (pstate=0x1843390) at nodeHashjoin.c:621
#20 0x00662c58 in ExecProcNodeInstr (node=0x1843390) at execProcnode.c:480
#21 0x0067c7f3 in ExecProcNode (node=0x1843390) at ../../../src/include/executor/executor.h:259
#22 ExecHashJoinImpl (parallel=false, pstate=0x18430f0) at nodeHashjoin.c:268
#23 ExecHashJoin (pstate=0x18430f0) at nodeHashjoin.c:621
#24 0x00662c58 in ExecProcNodeInstr (node=0x18430f0) at execProcnode.c:480
#25 0x0068de76 in ExecProcNode (node=0x18430f0) at ../../../src/include/executor/executor.h:259
#26 ExecSort (pstate=0x1842ee0) at nodeSort.c:149
#27 0x00662c58 in ExecProcNodeInstr (node=0x1842ee0) at execProcnode.c:480
#28 0x0066ced9 in ExecProcNode (node=0x1842ee0) at ../../../src/include/executor/executor.h:259
#29 fetch_input_tuple (aggstate=aggstate@entry=0x1842908) at nodeAgg.c:563
#30 0x006701b2 in agg_retrieve_direct (aggstate=0x1842908) at nodeAgg.c:2346
#31 ExecAgg (pstate=0x1842908) at nodeAgg.c:2161
#32 0x00662c58 in ExecProcNodeInstr (node=0x1842908) at execProcnode.c:480
#33 0x0065be12 in ExecProcNode (node=0x1842908) at ../../../src/include/executor/executor.h:259
#34 ExecutePlan (execute_once=, dest=0x5fb39f8, direction=, numberTuples=0, sendTuples=true, operation=CMD_SELECT, use_parallel_mode=, planstate=0x1842908, estate=0x1842298) at execMain.c:1636
#35 standard_ExecutorRun (queryDesc=0x2356ab8, direction=, count=0, execute_once=) at execMain.c:363
#36 0x7fcab32bd67d in pgss_ExecutorRun (queryDesc=0x2356ab8, direction=ForwardScanDirection, count=0, execute_once=) at pg_stat_statements.c:1010
#37 0x7fcab30b7781 in explain_ExecutorRun (queryDesc=0x2356ab8, direction=ForwardScanDirection, count=0, execute_once=) at auto_explain.c:320
#38 0x007d010e in PortalRunSelect (portal=portal@entry=0xf79598, forward=forward@entry=true, count=0, count@entry=9223372036854775807, dest=dest@entry=0x5fb39f8) at pquery.c:924
#39 0x007d15bf in PortalRun (portal=, count=9223372036854775807, isTopLevel=, run_once=, dest=0x5fb39f8, altdest=0x5fb39f8, qc=0x7ffe885512d0) at pquery.c:768
#40 0x007cd417 in exec_simple_query ( query_string=0x15ee788 "-- report: thresrept: None: E-UTRAN/eNB KPIs: ['2023-07-19 2023-07-20', '0 23', '4', '{\"report_type\": \"hourly\",
Re: sslinfo extension - add notbefore and notafter timestamps
> On 17 Jul 2023, at 20:26, Cary Huang wrote:
>>> Perhaps calling "tm2timestamp(_time, 0, NULL, )" without checking
>>> the return code would be just fine. I see some other usages of
>>> tm2timestamp() in other code areas also skip checking the return code.
>>
>> I think we want to know about any failures, but we can probably make it into an
>> elog() instead, as it should never fail.
>
> Yes, sure. I have corrected the error message to elog(ERROR, "timestamp out
> of range") on a rare tm2timestamp() failure.

I went over this again and ended up pushing it along with a catversion bump. Due to a mistake in my testing I didn't however catch that it was using an API only present in OpenSSL 1.1.1 and higher, which caused build failures when using older OpenSSL versions, so I ended up reverting it again (leaving the certificate changes in place) to keep the buildfarm green. Will look closer at an implementation which works across all supported versions of OpenSSL when I have more time.

--
Daniel Gustafsson
Re: Printing backtrace of postgres processes
> On 11 Jan 2023, at 15:44, Bharath Rupireddy > wrote: > > On Wed, Nov 30, 2022 at 11:43 AM Bharath Rupireddy > wrote: >> >> I'm attaching the v22 patch set for further review. > > Needed a rebase due to 216a784829c2c5f03ab0c43e009126cbb819e9b2. > Attaching v23 patch set for further review. This thread has stalled for well over 6 months with the patch going from CF to CF. From skimming the thread it seems that a lot of the details have been ironed out with most (all?) objections addressed. Is any committer interested in picking this up? If not we should probably mark it returned with feedback. -- Daniel Gustafsson
Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"
This patch no longer applies and needs a rebase. Given where we are in the commitfest, do you think this patch has the potential to go in or should it be moved? -- Daniel Gustafsson
Re: psql: Add role's membership options to the \du+ command
On 7/19/23 1:44 PM, Pavel Luzanov wrote: On 19.07.2023 19:47, Tom Lane wrote: And done, with some minor editorialization. Thanks to everyone who participated in the work. Special thanks to David for moving forward this patch for a long time, and to Tom for taking commit responsibilities. [RMT] +1; thanks to everyone for seeing this through! Jonathan
Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Hi,

Attached are the updated patches with the recent reviews addressed. See below for my comments:

Peter Smith wrote on Wed, Jul 19, 2023 at 06:08:
> Some review comments for v19-0001
>
> 2. LogicalRepSyncTableStart
>
> /*
> * Finally, wait until the leader apply worker tells us to catch up and
> * then return to let LogicalRepApplyLoop do it.
> */
> wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
>
> ~
>
> Should LogicalRepApplyLoop still be mentioned here, since that is
> static in worker.c? Maybe it is better to refer instead to the common
> 'start_apply' wrapper? (see also #5a below)

Isn't LogicalRepApplyLoop static on HEAD and also mentioned in the exact same comment in tablesync.c, while the common "start_apply" function also exists? I'm not sure how such a change would be related to this patch.

---
5.
> + /* Found a table for next iteration */
> + finish_sync_worker(true);
> +
> + StartTransactionCommand();
> + ereport(LOG,
> + (errmsg("logical replication worker for subscription \"%s\" will be
> reused to sync table \"%s\" with relid %u.",
> + MySubscription->name,
> + get_rel_name(MyLogicalRepWorker->relid),
> + MyLogicalRepWorker->relid)));
> + CommitTransactionCommand();
> +
> + done = false;
> + break;
> + }
> + LWLockRelease(LogicalRepWorkerLock);

> 5b.
> Isn't there a missing call to that LWLockRelease, if the 'break' happens?

The lock is already released before the break, if that's the lock you meant:

> /* Update worker state for the next table */
> MyLogicalRepWorker->relid = rstate->relid;
> MyLogicalRepWorker->relstate = rstate->state;
> MyLogicalRepWorker->relstate_lsn = rstate->lsn;
> LWLockRelease(LogicalRepWorkerLock);
> /* Found a table for next iteration */
> finish_sync_worker(true);
> done = false;
> break;

---
2.
> As for the publisher node, this patch allows to reuse logical
> walsender processes
> after the streaming is done once.
> ~
> Is this paragraph even needed?
> Since the connection is reused then it
> already implies the other end (the walsender) is being reused, right?

I actually see no harm in explaining this explicitly.

Thanks,
--
Melih Mutlu
Microsoft

Attachments:
v21-0001-Refactor-to-split-Apply-and-Tablesync-Workers.patch
v21-0002-Reuse-Tablesync-Workers.patch
v21-0003-Reuse-connection-when-tablesync-workers-change-t.patch
Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
On Thu, Jul 20, 2023 at 5:12 PM Melih Mutlu wrote:
>
> Peter Smith wrote on Thu, Jul 20, 2023 at 05:41:
>>
>> 7. InitializeLogRepWorker
>>
>> if (am_tablesync_worker())
>> ereport(LOG,
>> - (errmsg("logical replication worker for subscription \"%s\", table
>> \"%s\" has started",
>> + (errmsg("logical replication worker for subscription \"%s\", table
>> \"%s\" with relid %u has started",
>> MySubscription->name,
>> - get_rel_name(MyLogicalRepWorker->relid;
>> + get_rel_name(MyLogicalRepWorker->relid),
>> + MyLogicalRepWorker->relid)));
>>
>> But this is certainly a tablesync worker so the message here should
>> say "logical replication table synchronization worker" like the HEAD
>> code used to do.
>>
>> It seems this mistake was introduced in patch v20-0001.
>
> I'm a bit confused here. Wasn't it decided to use "logical replication
> worker" regardless of the worker's type [1]? That's why I made this
> change. If that's not the case here, I'll put it back.
>

I feel that where the worker type is clear, it is better to use it, unless doing so can lead to translation issues.

--
With Regards,
Amit Kapila.
Re: Extracting cross-version-upgrade knowledge from buildfarm client
On 2023-07-20 Th 05:52, Alvaro Herrera wrote: On 2023-Jul-19, Andrew Dunstan wrote: The result you report suggest to me that somehow the old version is no longer a PostgreSQL::Version object. Here's the patch I suggest: Ahh, okay, that makes more sense; and yes, it does work. Your patch LGTM cheers andrew -- Andrew Dunstan EDB:https://www.enterprisedb.com
Re: Add SPLIT PARTITION/MERGE PARTITIONS commands
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: tested, passed

This is just a rebase. I checked it with make and meson, ran manual SPLIT and MERGE on list and range partitions, and the documentation fits.

The new status of this patch is: Ready for Committer
Re: POC: GROUP BY optimization
On 7/20/23 08:37, Andrey Lepikhov wrote:
> On 3/10/2022 21:56, Tom Lane wrote:
>> Revert "Optimize order of GROUP BY keys".
>>
>> This reverts commit db0d67db2401eb6238ccc04c6407a4fd4f985832 and
>> several follow-on fixes.
>> ...
>> Since we're hard up against the release deadline for v15, let's
>> revert these changes for now. We can always try again later.
>
> It may be time to restart the project. As a first step, I rebased the
> patch on the current master. It wasn't trivial because of some latest
> optimizations (a29eab, 1349d27 and 8d83a5d).
> Now, let's repeat the review and rewrite the current patch according to
> the reasons given in the revert commit.

I think the fundamental task is to make the costing more reliable, and the commit message 443df6e2db points out a couple of challenges in this area. Not sure how feasible it is to address enough of them ...

1) procost = 1.0 - I guess we could make this more realistic by doing some microbenchmarks and tuning the costs for the most expensive cases.

2) estimating quicksort comparisons - This relies on ndistinct estimates, and I'm not sure how much more reliable we can make those. Probably not much :-( Not sure what to do about this; the only thing I can think of is to track the "reliability" of the estimates and only do the reordering if we have high confidence in the estimates. That means we'll miss some optimization opportunities, but it should limit the risk.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Hi,

Peter Smith wrote on Thu, Jul 20, 2023 at 05:41:
> 7. InitializeLogRepWorker
>
> if (am_tablesync_worker())
> ereport(LOG,
> - (errmsg("logical replication worker for subscription \"%s\", table
> \"%s\" has started",
> + (errmsg("logical replication worker for subscription \"%s\", table
> \"%s\" with relid %u has started",
> MySubscription->name,
> - get_rel_name(MyLogicalRepWorker->relid;
> + get_rel_name(MyLogicalRepWorker->relid),
> + MyLogicalRepWorker->relid)));
>
> But this is certainly a tablesync worker so the message here should
> say "logical replication table synchronization worker" like the HEAD
> code used to do.
>
> It seems this mistake was introduced in patch v20-0001.

I'm a bit confused here. Wasn't it decided to use "logical replication worker" regardless of the worker's type [1]? That's why I made this change. If that's not the case here, I'll put it back.

[1] https://www.postgresql.org/message-id/flat/CAHut%2BPt1xwATviPGjjtJy5L631SGf3qjV9XUCmxLu16cHamfgg%40mail.gmail.com

Thanks,
--
Melih Mutlu
Microsoft
Re: Use of additional index columns in rows filtering
On 7/19/23 23:38, Peter Geoghegan wrote: > On Wed, Jul 19, 2023 at 1:46 PM Jeff Davis wrote: >> On Wed, 2023-07-19 at 20:03 +0200, Tomas Vondra wrote: >>> Makes sense, I also need to think about maybe not having duplicate >>> clauses in the two lists. What annoys me on that it partially >>> prevents >>> the cost-based reordering done by order_qual_clauses(). So maybe we >>> should have three lists ... Also, some of the expressions count be >>> fairly expensive. >> >> Can we just calculate the costs of the pushdown and do it when it's a >> win? If the random_page_cost savings exceed the costs from evaluating >> the clause earlier, then push down. > > My patch that teaches nbtree to execute ScalarArrayOps intelligently > (by dynamically choosing to not re-descend the btree to perform > another "primitive index scan" when the data we need is located on the > same leaf page as the current ScalarArrayOps arrays) took an > interesting turn recently -- one that seems related. > > I found that certain kinds of queries are dramatically faster once you > teach the optimizer to accept that multi-column ScalarArrayOps can be > trusted to return tuples in logical/index order, at least under some > circumstances. For example: > > pg@regression: [583930]=# create index order_by_saop on > tenk1(two,four,twenty); > CREATE INDEX > > pg@regression: [583930]=# EXPLAIN (ANALYZE, BUFFERS) > select ctid, thousand from tenk1 > where two in (0,1) and four in (1,2) and twenty in (1,2) > order by two, four, twenty limit 20; > > This shows "Buffers: shared hit=1377" on HEAD, versus "Buffers: shared > hit=13" with my patch. All because we can safely terminate the scan > early now. The vast majority of the buffer hits the patch will avoid > are against heap pages, even though I started out with the intention > of eliminating unnecessary repeat index page accesses. 
> > Note that build_index_paths() currently refuses to allow SAOP clauses > to be recognized as ordered with a multi-column index and a query with > a clause for more than the leading column -- that is something that > the patch needs to address (to get this particular improvement, at > least). Allowing such an index path to have useful pathkeys is > typically safe (in the sense that it won't lead to wrong answers to > queries), but we still make a conservative assumption that they can > lead to wrong answers. There are comments about "equality constraints" > that describe the restrictions right now. > > But it's not just the question of basic correctness -- the optimizer > is very hesitant to use multi-column SAOPs, even in cases that don't > care about ordering. So it's also (I think, implicitly) a question of > *risk*. The risk of getting very inefficient SAOP execution in nbtree, > when it turns out that a huge number of "primitive index scans" are > needed. But, if nbtree is taught to "coalesce together" primitive > index scans at runtime, that risk goes way down. > > Anyway, this seems related to what you're talking about because the > relationship between selectivity and ordering seems particularly > important in this context. And because it suggests that there is at > least some scope for adding "run time insurance" to the executor, > which is valuable in the optimizer if it bounds the potential > downside. If you can be practically certain that there is essentially > zero risk of serious problems when the costing miscalculates (for a > limited subset of cases), then life might be a lot easier -- clearly > we should be biased in one particular direction with a case that has > that kind of general profile. > > My current understanding of the optimizer side of things -- > particularly things like costing for "filter quals/pqquals" versus > costing for "true index quals" -- is rather basic. That will have to > change. 
Curious to hear your thoughts (if any) on how what you're > discussing may relate to what I need to do with my patch. Right now my > patch assumes that making SAOP clauses into proper index quals (that > usually preserve index ordering) is an unalloyed good (when safe!). > This assumption is approximately true on average, as far as I can > tell. But it's probably quite untrue in various specific cases, that > somebody is bound to care about. > I think the SAOP patch may need to be much more careful about this, but for this patch it's simpler because it doesn't really change any of the index internals, or the indexscan in general. If I simplify that a bit, we're just reordering the clauses in a way to maybe eliminate the heap fetch. The main risk seems to be that this will force an expensive qual to the front of the list, just because it can be evaluated on the index tuple. But the difference would need to be worse than what we save by not doing the I/O - considering how expensive the I/O is, that seems unlikely. Could happen for expensive quals that don't really eliminate many rows, I guess. Anyway, I see this as extension of what
Re: Synchronizing slots from primary to standby
On Fri, Jun 16, 2023 at 3:26 PM Amit Kapila wrote: > > On Mon, Apr 17, 2023 at 7:37 PM Drouvot, Bertrand > wrote: > > > 3. As mentioned in the initial email, I think it would be better to > replace LIST_SLOTS command with a SELECT query. > I had a look at this thread. I am interested to work on this and can spend some time addressing the comments given here. I tried to replace LIST_SLOTS command with a SELECT query. Attached rebased patch and PoC patch for LIST_SLOTS removal. For LIST_SLOTs cmd removal, below are the points where more analysis is needed. 1) I could not use the exposed libpqwalreceiver's functions walrcv_exec/libpqrcv_exec in LogicalRepLauncher to run select query instead of LIST_SLOTS cmd. This is because libpqrcv_exec() needs database connection but since in LogicalReplauncher, we do not have any (MyDatabseId is not set), so the API gives an error. Thus to make it work for the time-being, I used 'libpqrcv_PQexec' which is not dependent upon database connection. But since it is not exposed "yet" to other layers, I temporarily added the new code to libpqwalreceiver.c itself. In fact I reused the existing function wrapper libpqrcv_list_slots and changed the functionality to get info using select query rather than list_slots. 2) While using connect API walrcv_connect/libpqrcv_connect(), we need to tell it whether it is for logical or physical replication. In the existing patch, where we were using LIST_SLOTS cmd, we have this connection made with logical=false. But now since we need to run select query to get the same info, using connection with logical=false gives error on primary while executing select query. "ERROR: cannot execute SQL commands in WAL sender for physical replication". And thus in ApplyLauncherStartSlotSync(), I have changed connect API to use logical=true for the time being. I noticed that in the existing patch, it was using logical=false in ApplyLauncherStartSlotSync() while logical=true in synchronize_slots(). 
Possibly due to the same fact that logical=false connection will not allow synchronize_slots() to run select query on primary while it worked for ApplyLauncherStartSlotSync() as it was running list_slots cmd instead of select query. I am exploring further on these points to figure out which one is the correct way to deal with these. Meanwhile posting this WIP patch for early feedback. I will try addressing other comments as well in next versions.

thanks
Shveta

Attachments:
v1-0001-Remove-list_slots-command.patch
v7-0001-Synchronize-logical-replication-slots-from-primar.patch
Re: Partial aggregates pushdown
fujii.y...@df.mitsubishielectric.co.jp wrote on 2023-07-19 03:43:

> Hi Mr. Pyhalov, hackers.
> 3) I modified the patch to safely do a partial aggregate pushdown for
> queries which contain having clauses.

Hi. Sorry, but I don't see how it could work. For example, the attached test returns a wrong result:

CREATE FUNCTION f() RETURNS INT AS $$
begin
  return 10;
end
$$ LANGUAGE PLPGSQL;

SELECT b, sum(a) FROM pagg_tab GROUP BY b HAVING sum(a) < f() ORDER BY 1;
  b | sum
----+-----
  0 |   0
 10 |   0
 20 |   0
 30 |   0
 40 |   0
(5 rows)

In fact the above query should have returned 0 rows, as

SELECT b, sum(a) FROM pagg_tab GROUP BY b ORDER BY 1;
  b | sum
----+------
  0 |  600
  1 |  660
  2 |  720
  3 |  780
  4 |  840
  5 |  900
  6 |  960
  7 | 1020
  8 | 1080
  9 | 1140
 10 |  600
 11 |  660
 12 |  720

shows no such rows.

Or, on the same data

SELECT b, sum(a) FROM pagg_tab GROUP BY b HAVING sum(a) > 660 ORDER BY 1;

you'll get 0 rows. But

SELECT b, sum(a) FROM pagg_tab GROUP BY b;

gives:

  b | sum
----+------
 42 |  720
 29 | 1140
  4 |  840
 34 |  840
 41 |  660
  0 |  600
 40 |  600

The issue is that you can't calculate a "partial" having. You should compare the full aggregate in the filter, but that's not possible at the level of one partition. And you have this in the plans:

Finalize GroupAggregate
  Output: pagg_tab.b, avg(pagg_tab.a), max(pagg_tab.a), count(*)
  Group Key: pagg_tab.b
  Filter: (sum(pagg_tab.a) < 700)
  ->  Sort
        Output: pagg_tab.b, (PARTIAL avg(pagg_tab.a)), (PARTIAL max(pagg_tab.a)), (PARTIAL count(*)), (PARTIAL sum(pagg_tab.a))
        Sort Key: pagg_tab.b
        ->  Append
              ->  Foreign Scan
                    Output: pagg_tab.b, (PARTIAL avg(pagg_tab.a)), (PARTIAL max(pagg_tab.a)), (PARTIAL count(*)), (PARTIAL sum(pagg_tab.a))
                    Filter: ((PARTIAL sum(pagg_tab.a)) < 700)   <--- here we can't compare anything yet, sum is incomplete
                    Relations: Aggregate on (public.fpagg_tab_p1 pagg_tab)
                    Remote SQL: SELECT b, avg_p_int4(a), max(a), count(*), sum(a) FROM public.pagg_tab_p1 GROUP BY 1
              ->  Foreign Scan
                    Output: pagg_tab_1.b, (PARTIAL avg(pagg_tab_1.a)), (PARTIAL max(pagg_tab_1.a)), (PARTIAL count(*)), (PARTIAL sum(pagg_tab_1.a))
                    Filter: ((PARTIAL sum(pagg_tab_1.a)) < 700)
                    Relations: Aggregate on (public.fpagg_tab_p2 pagg_tab_1)
                    Remote SQL: SELECT b, avg_p_int4(a), max(a), count(*), sum(a) FROM public.pagg_tab_p2 GROUP BY 1
              ->  Foreign Scan
                    Output: pagg_tab_2.b, (PARTIAL avg(pagg_tab_2.a)), (PARTIAL max(pagg_tab_2.a)), (PARTIAL count(*)), (PARTIAL sum(pagg_tab_2.a))
                    Filter: ((PARTIAL sum(pagg_tab_2.a)) < 700)
                    Relations: Aggregate on (public.fpagg_tab_p3 pagg_tab_2)
                    Remote SQL: SELECT b, avg_p_int4(a), max(a), count(*), sum(a) FROM public.pagg_tab_p3 GROUP BY 1

In foreign_grouping_ok()

6586         if (IsA(expr, Aggref))
6587         {
6588             if (partial)
6589             {
6590                 mark_partial_aggref((Aggref *) expr, AGGSPLIT_INITIAL_SERIAL);
6591                 continue;
6592             }
6593             else if (!is_foreign_expr(root, grouped_rel, expr))
6594                 return false;
6595
6596             tlist = add_to_flat_tlist(tlist, list_make1(expr));
6597         }

at least you shouldn't do anything with expr, if is_foreign_expr() returned false.

If we restrict pushing down queries with havingQuals, I'm not quite sure how Aggref can appear in local_conds.

As for the changes in planner.c (setGroupClausePartial()) I have several questions.

1) Why don't we add non_group_exprs to pathtarget->exprs when partial_target->exprs is not set?

2) We replace extra->partial_target->exprs with partial_target->exprs after processing. Why are we sure that after this tleSortGroupRef is correct?
--
Best regards,
Alexander Pyhalov,
Postgres Professional

diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 1824cb67fe9..c6f613019c3 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -3025,6 +3025,22 @@
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT b, avg(a), max(a), count(*) FROM pagg_tab GROUP BY b HAVING sum(a) < 700 ORDER BY 1;
 SELECT b, avg(a), max(a), count(*) FROM pagg_tab GROUP BY b HAVING sum(a) < 700 ORDER BY 1;
+CREATE FUNCTION f() RETURNS INT AS $$
+begin
+  return 10;
+end
+$$ LANGUAGE PLPGSQL;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT b, sum(a) FROM pagg_tab GROUP BY b HAVING sum(a) < f() ORDER BY 1;
+SELECT b,
Re: inconsistency between the VM page visibility status and the visibility status of the page
On 7/20/23 08:44, yanhui.xiong wrote:
> Hi,
>
> I had an issue here. When executing a SELECT statement using an
> index-only scan, I got a wrong row count because there may be an
> inconsistency between the VM page visibility status and the visibility
> status of the page: the VM bit is set while the page-level bit is clear.
>
> I read the code and noted that there is a chance for this to happen,
> but how does it happen?
>
> The code clears the page-level visibility bit and the VM bit at the
> same time, so I do not understand how it happens.

Well, by only looking at the code you're assuming two things:

1) the code is correct
2) the environment is perfect

Either (or both) of these assumptions may be wrong. There certainly could be some subtle bug in the visibility map code, who knows. Or maybe this is due to some sort of data corruption outside Postgres. It's impossible to answer without you digging much deeper and providing much more information. What's the environment? What hardware? What PG version? How long has it been running? Any crashes? Any other cases of similar issues? What does the page look like in pageinspect?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
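As a sketch of that digging: the contrib/pg_visibility module exposes both the VM bits and the page-level flag, so it can confirm whether the two really disagree. The table name `t` and block number 0 below are placeholders.

```sql
-- Hypothetical investigation using contrib/pg_visibility
CREATE EXTENSION pg_visibility;

-- Compare the VM bits against the page header's PD_ALL_VISIBLE for block 0:
-- a row with all_visible = true and pd_all_visible = false would confirm
-- the reported inconsistency.
SELECT all_visible, all_frozen, pd_all_visible FROM pg_visibility('t', 0);

-- TIDs that are not actually all-visible despite their page's VM bit
-- being set (empty result means the VM is consistent).
SELECT * FROM pg_check_visible('t');
```

pg_visibility() reads both sources for a single block, while pg_check_visible() scans the whole relation for contradictions, which is the faster way to tell whether this is a one-off corrupted page or a systematic problem.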
Re: Support worker_spi to execute the function dynamically.
On Thu, Jul 20, 2023 at 2:38 PM Masahiro Ikeda wrote:
>
> Thanks for discussing the patch. I updated the patch per your comments:
> * v2-0001-Support-worker_spi-to-execute-the-function-dynamical.patch
>
> I found another thing that is better changed. Though the tests assumed
> "shared_preload_libraries = worker_spi", the background workers failed
> to be launched in the initialization phase because the database is not
> created yet.
>
> ```
> # make check   # in src/test/modules/worker_spi
> # cat log/postmaster.log   # in src/test/modules/worker_spi/
> 2023-07-20 17:58:47.958 JST worker_spi[853620] FATAL:  database "contrib_regression" does not exist
> 2023-07-20 17:58:47.958 JST worker_spi[853621] FATAL:  database "contrib_regression" does not exist
> 2023-07-20 17:58:47.959 JST postmaster[853612] LOG:  background worker "worker_spi" (PID 853620) exited with exit code 1
> 2023-07-20 17:58:47.959 JST postmaster[853612] LOG:  background worker "worker_spi" (PID 853621) exited with exit code 1
> ```
>
> It's better to remove "shared_preload_libraries = worker_spi" from the
> test configuration. I misunderstood that two background workers would
> be launched and waiting at the start of the test.

I don't think that change is correct. The worker_spi essentially shows how to start bg workers with RegisterBackgroundWorker and dynamic bg workers with RegisterDynamicBackgroundWorker. If shared_preload_libraries = worker_spi is not specified there, you will fail to start the workers registered via RegisterBackgroundWorker. Does giving an initdb-time database name to worker_spi.database work there? If the database for the bg workers doesn't exist, changing bgw_restart_time from BGW_NEVER_RESTART to, say, 1 will help the bg workers come up eventually. I think it's worth adding test cases for the expected number of bg workers (after creating the worker_spi extension) and dynamic bg workers (after calling worker_spi_launch()).
Also, to distinguish bg workers and dynamic bg workers, you can change bgw_type in worker_spi_launch to "worker_spi dynamic worker".

-/* get the configuration */
+/* Get the configuration */

-/* set up common data for all our workers */
+/* Set up common data for all our workers */

These unrelated changes had better stay as-is, because the postgres code has both commenting styles, /* Get */ and /* get */; IOW, single-line comments starting with both uppercase and lowercase.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
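A sketch of what such test cases could query, assuming bgw_type is changed as suggested (pg_stat_activity reports a background worker's bgw_type as its backend_type; the expected counts are illustrative):

```sql
-- Count the statically registered workers started at postmaster startup.
SELECT count(*) FROM pg_stat_activity
WHERE backend_type = 'worker_spi';

-- Launch a dynamic worker, then count workers of the distinct bgw_type.
SELECT worker_spi_launch(10);
SELECT count(*) FROM pg_stat_activity
WHERE backend_type = 'worker_spi dynamic worker';
```

With the two kinds of workers carrying different bgw_type values, a regression test can assert on each count independently instead of only on the total.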
Re: Extracting cross-version-upgrade knowledge from buildfarm client
On 2023-Jul-19, Andrew Dunstan wrote:

> The result you report suggests to me that somehow the old version is no
> longer a PostgreSQL::Version object. Here's the patch I suggest:

Ahh, okay, that makes more sense; and yes, it does work.

--
Álvaro Herrera  PostgreSQL Developer — https://www.EnterpriseDB.com/

From 055b0d89f7c6667b79ba0b97f54ea5aef64dfbb1 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera
Date: Wed, 19 Jul 2023 17:54:09 +0200
Subject: [PATCH v2] Compare only major versions in AdjustUpgrade.pm

Because PostgreSQL::Version is very nuanced about development version
numbers, the comparison to 16beta2 makes it think that that release is
older than 16, therefore applying a database tweak that doesn't work
there. Fix by having AdjustUpgrade create a separate PostgreSQL::Version
object that only contains the major version number. While at it, have
it ensure that the objects given are of the expected type.

Co-authored-by: Andrew Dunstan
Discussion: https://postgr.es/m/20230719110504.zbu74o54bqqlsufb@alvherre.pgsql
---
 src/test/perl/PostgreSQL/Test/AdjustUpgrade.pm | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/src/test/perl/PostgreSQL/Test/AdjustUpgrade.pm b/src/test/perl/PostgreSQL/Test/AdjustUpgrade.pm
index a241d2ceff..64305fdcce 100644
--- a/src/test/perl/PostgreSQL/Test/AdjustUpgrade.pm
+++ b/src/test/perl/PostgreSQL/Test/AdjustUpgrade.pm
@@ -76,6 +76,10 @@ sub adjust_database_contents
 	my ($old_version, %dbnames) = @_;
 	my $result = {};
 
+	die "wrong type for \$old_version\n"
+	  unless $old_version->isa("PostgreSQL::Version");
+	$old_version = PostgreSQL::Version->new($old_version->major);
+
 	# remove dbs of modules known to cause pg_upgrade to fail
 	# anything not builtin and incompatible should clean up its own db
 	foreach my $bad_module ('test_ddl_deparse', 'tsearch2')
@@ -262,6 +266,10 @@ sub adjust_old_dumpfile
 {
 	my ($old_version, $dump) = @_;
 
+	die "wrong type for \$old_version\n"
+	  unless $old_version->isa("PostgreSQL::Version");
+	$old_version = PostgreSQL::Version->new($old_version->major);
+
 	# use Unix newlines
 	$dump =~ s/\r\n/\n/g;
@@ -579,6 +587,10 @@ sub adjust_new_dumpfile
 {
 	my ($old_version, $dump) = @_;
 
+	die "wrong type for \$old_version\n"
+	  unless $old_version->isa("PostgreSQL::Version");
+	$old_version = PostgreSQL::Version->new($old_version->major);
+
 	# use Unix newlines
 	$dump =~ s/\r\n/\n/g;
-- 
2.39.2
Re: Support worker_spi to execute the function dynamically.
On Thu, Jul 20, 2023 at 2:59 PM Michael Paquier wrote:
>
> On Thu, Jul 20, 2023 at 05:54:55PM +0900, Masahiro Ikeda wrote:
> > Yes, you're right. When I tried using worker_spi to test wait events,
> > I found the behavior. And thanks a lot for your patch. I wasn't aware
> > of that way. I'll merge your patch into the tests for wait events.
>
> Be careful when using that. I have not spent more than a few minutes
> to show my point, but what I sent lacks a shmem_request_hook in
> _PG_init(), for example, to request an amount of shared memory equal
> to the size of the state structure.

I think the preferred way to grab a chunk of shared memory for an external module is by using shmem_request_hook and shmem_startup_hook. The shared memory for wait events can use them too.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: Support worker_spi to execute the function dynamically.
On Thu, Jul 20, 2023 at 05:54:55PM +0900, Masahiro Ikeda wrote:
> Yes, you're right. When I tried using worker_spi to test wait events,
> I found the behavior. And thanks a lot for your patch. I wasn't aware
> of that way. I'll merge your patch into the tests for wait events.

Be careful when using that. I have not spent more than a few minutes to show my point, but what I sent lacks a shmem_request_hook in _PG_init(), for example, to request an amount of shared memory equal to the size of the state structure.

--
Michael
Re: There should be a way to use the force flag when restoring databases
> On 19 Jul 2023, at 19:28, Gurjeet Singh wrote:
>
> On Tue, Jul 18, 2023 at 12:53 AM Joan wrote:
>>
>> Since Postgres 13 there's the option to do a FORCE when dropping a
>> database (so it disconnects current users). Documentation here:
>> https://www.postgresql.org/docs/current/sql-dropdatabase.html
>>
>> I am currently using dir format for the output:
>> pg_dump -d "bdname" -F d -j 4 -v -f /tmp/dir
>>
>> And restoring the database with:
>> pg_restore -d postgres -C -c --exit-on-error -F d -j 3 -v /tmp/dir
>>
>> Having an option to add the FORCE option to either the dump generated
>> by pg_dump, or in pg_restore, would be very useful when restoring
>> databases to other servers, since it would avoid having to do scripting.
>>
>> In my specific case I am using this to periodically refresh a development
>> environment with data from production servers for a small database (~200M).
>
> Making force-drop a part of pg_dump output may be dangerous, and not
> provide much flexibility at restore time.
>
> Adding a force option to pg_restore feels like providing the right tradeoff.
>
> The docs for 'pg_restore --create` say "Create the database before
> restoring into it. If --clean is also specified, drop and recreate the
> target database before connecting to it."
>
> If we provided a force option, it may then additionally say: "If the
> --clean and --force options are specified, DROP DATABASE ... WITH
> FORCE command will be used to drop the database."

pg_restore --clean refers to dropping any pre-existing database objects and not just databases, but --force would only apply to databases. I wonder if it's worth complicating pg_restore with that when running dropdb --force before pg_restore is an option for those wanting to use WITH FORCE.
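For reference, here is a sketch of the manual route being weighed against a new pg_restore option, using the WITH (FORCE) syntax available since v13 and the "bdname" database from the example above:

```sql
-- Force-disconnect any remaining sessions and drop the database up front,
-- instead of relying on pg_restore --clean to do the drop.
DROP DATABASE bdname WITH (FORCE);

-- Equivalently, from the shell: dropdb --force bdname
-- Then run pg_restore with -C (and without -c), e.g.:
--   pg_restore -d postgres -C --exit-on-error -F d -j 3 -v /tmp/dir
```

Because the drop happens before pg_restore connects, the restore no longer fails when leftover sessions are still attached to the target database, which is the scripting step the original request wanted to avoid.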
Re: WAL Insertion Lock Improvements
On Fri, Jul 14, 2023 at 4:17 AM Andres Freund wrote:
>
> Hi,
>
> On 2023-07-13 14:04:31 -0700, Andres Freund wrote:
> > From b74b6e953cb5a7e7ea1a89719893f6ce9e231bba Mon Sep 17 00:00:00 2001
> > From: Bharath Rupireddy
> > Date: Fri, 19 May 2023 15:00:21 +0000
> > Subject: [PATCH v8] Optimize WAL insertion lock acquisition and release
> >
> > This commit optimizes WAL insertion lock acquisition and release
> > in the following way:
>
> I think this commit does too many things at once.

I've split the patch into three: 1) Make insertingAt a 64-bit atomic. 2) Have better commenting on why there's no memory barrier or spinlock in and around the LWLockWaitForVar call sites. 3) Have a quick exit for LWLockUpdateVar.

> > 1. WAL insertion lock's variable insertingAt is currently read and
> > written with the help of lwlock's wait list lock to avoid
> > torn reads/writes. This wait list lock can become a point of
> > contention on highly concurrent write workloads. Therefore, make
> > insertingAt a 64-bit atomic which inherently provides torn-free
> > reads/writes.
>
> "inherently" is a bit strong, given that we emulate 64bit atomics where not
> available...

Modified.

> > 2. LWLockUpdateVar currently acquires lwlock's wait list lock even when
> > there are no waiters at all. Add a fastpath exit to LWLockUpdateVar when
> > there are no waiters to avoid unnecessary locking.
>
> I don't think there's enough of an explanation for why this isn't racy.
>
> The reason it's, I think, safe, is that anyone using LWLockConflictsWithVar()
> will do so twice in a row, with a barrier inbetween. But that really relies on
> what I explain in the paragraph below:

The twice-in-a-row lock acquisition protocol used by LWLockWaitForVar is what lets us have a quick exit in LWLockUpdateVar, because LWLockWaitForVar ensures waiters are added to the wait queue even if LWLockUpdateVar thinks there aren't any. Is my understanding correct here?
> > It also adds notes on why LWLockConflictsWithVar doesn't need a
> > memory barrier as far as its current usage is concerned.
>
> I don't think:
>  * NB: LWLockConflictsWithVar (which is called from
>  * LWLockWaitForVar) relies on the spinlock used above in this
>  * function and doesn't use a memory barrier.
>
> helps to understand why any of this is safe to a meaningful degree.
>
> The existing comments obviously aren't sufficient to explain this, but
> the reason it's, I think, safe today, is that we are only waiting for
> insertions that started before WaitXLogInsertionsToFinish() was called. The
> lack of memory barriers in the loop means that we might see locks as "unused"
> that have *since* become used, which is fine, because they can only be for
> later insertions that we wouldn't want to wait on anyway.

Right.

> Not taking a lock to acquire the current insertingAt value means that we might
> see an older insertingAt value. Which should also be fine, because if we read a
> too-old value, we'll add ourselves to the queue, which contains atomic
> operations.

Right. An older value adds ourselves to the queue in LWLockWaitForVar, and we should be woken up eventually by LWLockUpdateVar. This matches my understanding. I used more or less your above wording in the 0002 patch.

> >  /*
> > - * Read value using the lwlock's wait list lock, as we can't generally
> > - * rely on atomic 64 bit reads/stores. TODO: On platforms with a way to
> > - * do atomic 64 bit reads/writes the spinlock should be optimized away.
> > + * Reading the value atomically ensures that we don't need any explicit
> > + * locking. Note that in general, 64 bit atomic APIs in postgres inherently
> > + * provide explicit locking for the platforms without atomics support.
> >  */
>
> This comment seems off to me. Using atomics doesn't guarantee not needing
> locking. It just guarantees that we are reading a non-torn value.

Modified the comment.
> > @@ -1605,9 +1602,14 @@ LWLockConflictsWithVar(LWLock *lock,
> >  *
> >  * Note: this function ignores shared lock holders; if the lock is held
> >  * in shared mode, returns 'true'.
> > + *
> > + * Be careful that LWLockConflictsWithVar() does not include a memory barrier,
> > + * hence the caller of this function may want to rely on an explicit barrier or
> > + * a spinlock to avoid memory ordering issues.
> >  */
>
> s/careful/aware/?
>
> s/spinlock/implied barrier via spinlock or lwlock/?

Done.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

v9-0001-Optimize-WAL-insertion-lock-acquisition-and-relea.patch
v9-0002-Improve-commets-in-and-around-LWLockWaitForVar-ca.patch
v9-0003-Have-a-quick-exit-for-LWLockUpdateVar-when-there-.patch
Re: Support worker_spi to execute the function dynamically.
Hi,

On 2023-07-20 13:50, Bharath Rupireddy wrote:
> On Thu, Jul 20, 2023 at 10:09 AM Michael Paquier wrote:
> > On Thu, Jul 20, 2023 at 09:43:37AM +0530, Bharath Rupireddy wrote:
> > > +1. However, a comment above helps one to understand why some GUCs are
> > > defined before if (!process_shared_preload_libraries_in_progress). As
> > > this is an example extension, it will help understand the reasoning
> > > better. I know we will have it in the commit message, but a direct
> > > comment helps:
> > >
> > > /*
> > >  * Note that this GUC is defined irrespective of worker_spi shared library
> > >  * presence in shared_preload_libraries. It's possible to create the
> > >  * worker_spi extension and use functions without it being specified in
> > >  * shared_preload_libraries. If we return from here without defining this
> > >  * GUC, the dynamic workers launched by worker_spi_launch() will keep
> > >  * crashing and restarting.
> > >  */
> >
> > WFM to be more talkative here and document things, but I don't think
> > that's it. How about a simple "These GUCs are defined even if this
> > library is not loaded with shared_preload_libraries, for
> > worker_spi_launch()."
>
> LGTM.

Thanks for discussing the patch. I updated the patch per your comments:
* v2-0001-Support-worker_spi-to-execute-the-function-dynamical.patch

I found another thing that is better changed. Though the tests assumed "shared_preload_libraries = worker_spi", the background workers failed to be launched in the initialization phase because the database is not created yet.
```
# make check   # in src/test/modules/worker_spi
# cat log/postmaster.log   # in src/test/modules/worker_spi/
2023-07-20 17:58:47.958 JST worker_spi[853620] FATAL:  database "contrib_regression" does not exist
2023-07-20 17:58:47.958 JST worker_spi[853621] FATAL:  database "contrib_regression" does not exist
2023-07-20 17:58:47.959 JST postmaster[853612] LOG:  background worker "worker_spi" (PID 853620) exited with exit code 1
2023-07-20 17:58:47.959 JST postmaster[853612] LOG:  background worker "worker_spi" (PID 853621) exited with exit code 1
```

It's better to remove "shared_preload_libraries = worker_spi" from the test configuration. I misunderstood that two background workers would be launched and waiting at the start of the test.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

From 02ef0d8daddd43305842987a6aeac6732b2415a9 Mon Sep 17 00:00:00 2001
From: Masahiro Ikeda
Date: Thu, 20 Jul 2023 10:34:50 +0900
Subject: [PATCH] Support worker_spi to execute the function dynamically.

Currently, the database name to connect to is initialized only when
process_shared_preload_libraries_in_progress is true. It means that
worker_spi_launch() fails if shared_preload_libraries is empty because
the database to connect to is NULL. The patch changes this so that the
database name is always initialized when _PG_init() is called. We can
call worker_spi_launch() and launch SPI workers dynamically.

It also changes the configuration for the test because we don't need
"shared_preload_libraries = worker_spi" anymore, and background workers
launched in the initialization phase have always failed because the
"contrib_regression" database is not created yet.
---
 src/test/modules/worker_spi/Makefile     |  4 ++--
 src/test/modules/worker_spi/dynamic.conf |  3 +--
 src/test/modules/worker_spi/worker_spi.c | 27 ++--
 3 files changed, 19 insertions(+), 15 deletions(-)

diff --git a/src/test/modules/worker_spi/Makefile b/src/test/modules/worker_spi/Makefile
index cbf9b2e37f..c2d32d9b72 100644
--- a/src/test/modules/worker_spi/Makefile
+++ b/src/test/modules/worker_spi/Makefile
@@ -8,10 +8,10 @@ PGFILEDESC = "worker_spi - background worker example"
 
 REGRESS = worker_spi
 
-# enable our module in shared_preload_libraries for dynamic bgworkers
 REGRESS_OPTS = --temp-config $(top_srcdir)/src/test/modules/worker_spi/dynamic.conf
 
-# Disable installcheck to ensure we cover dynamic bgworkers.
+# Disable because these tests require "worker_spi.database = contrib_regression",
+# which typical installcheck users do not have (e.g. buildfarm clients).
 NO_INSTALLCHECK = 1
 
 ifdef USE_PGXS
diff --git a/src/test/modules/worker_spi/dynamic.conf b/src/test/modules/worker_spi/dynamic.conf
index bfe015f664..d9fd463a53 100644
--- a/src/test/modules/worker_spi/dynamic.conf
+++ b/src/test/modules/worker_spi/dynamic.conf
@@ -1,2 +1 @@
-shared_preload_libraries = worker_spi
-worker_spi.database = contrib_regression
+worker_spi.database = contrib_regression
\ No newline at end of file
diff --git a/src/test/modules/worker_spi/worker_spi.c b/src/test/modules/worker_spi/worker_spi.c
index 7227cfaa45..f691fbec9f 100644
--- a/src/test/modules/worker_spi/worker_spi.c
+++ b/src/test/modules/worker_spi/worker_spi.c
@@ -282,7 +282,12 @@ _PG_init(void)
 {
 	BackgroundWorker worker;
 
-	/* get the configuration */
+	/* Get the configuration */
+
+	/*
+	 * These GUCs are defined even if this library is not loaded with
+	 * shared_preload_libraries, for
Re: Support worker_spi to execute the function dynamically.
On 2023-07-20 12:55, Michael Paquier wrote: On Thu, Jul 20, 2023 at 11:15:51AM +0900, Masahiro Ikeda wrote: While I'm working on the thread[1], I found that the function of worker_spi module fails if 'shared_preload_libraries' doesn't have worker_spi. I guess that you were patching worker_spi to register dynamically a wait event and embed that in a TAP test or similar without loading it in shared_preload_libraries? FWIW, you could use a trick like what I am attaching here to load a wait event dynamically with the custom wait event API. You would need to make worker_spi_init_shmem() a bit more aggressive with an extra hook to reserve a shmem area size, but that's enough to show the custom wait event in the same backend as the one that launches a worker_spi dynamically, while demonstrating how the API can be used in this case. Yes, you're right. When I tried using worker_spi to test wait event, I found the behavior. And thanks a lot for your patch. I wasn't aware of the way. I'll merge your patch to the tests for wait events. Regards, -- Masahiro Ikeda NTT DATA CORPORATION
Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Hi Peter, Peter Smith , 20 Tem 2023 Per, 07:10 tarihinde şunu yazdı: > Hi, I had a look at the latest 3 patch (v20-0003). > > Although this patch was recently modified, the updates are mostly only > to make it compatible with the updated v20-0002 patch. Specifically, > the v20-0003 updates did not yet address my review comments from > v17-0003 [1]. > Yes, I only addressed your reviews for 0001 and 0002, and rebased 0003 in latest patches as stated here [1]. I'll update the patch soon according to recent reviews, including yours for 0003. [1] https://www.postgresql.org/message-id/CAGPVpCTvALKEXe0%3DN-%2BiMmVxVQ-%2BP8KZ_1qQ1KsSSZ-V9wJ5hw%40mail.gmail.com Thanks for the reminder. -- Melih Mutlu Microsoft
Re: Performance degradation on concurrent COPY into a single relation in PG16.
On Thu, 20 Jul 2023 at 00:56, David Rowley wrote:
>
> I noticed that 6fcda9aba added quite a lot of conditions that need to
> be checked before we process a normal decimal integer string. I think
> we could likely do better and code it to assume that most strings will
> be decimal and put that case first rather than last.

That sounds sensible, but ...

> In the attached, I've changed that for the 32-bit version only. A
> more complete patch would need to do the 16 and 64-bit versions too.
>
> Without that, we need to check if the first digit is '0' a total of 3
> times and also check if the 2nd digit is any of 'x', 'X', 'o', 'O',
> 'b' or 'B'.

That's not what I see. For me, the compiler (gcc 11, using either -O2 or -O3) is smart enough to spot that the first part of each of the 3 checks is the same, and it only tests whether the first digit is '0' once, before immediately jumping to the decimal parsing code. I didn't test other compilers though, so I can believe that different compilers might not be so smart.

OTOH, this test in your patch:

+	/* process decimal digits */
+	if (isdigit((unsigned char) ptr[0]) &&
+		(isdigit((unsigned char) ptr[1]) || ptr[1] == '_' || ptr[1] == '\0' || isspace(ptr[1])))

is doing more work than it needs to, and actually makes things noticeably worse for me. It needs to do a minimum of 2 isdigit() checks before it will parse the input as a decimal, whereas before (for me at least) it just did one simple comparison of ptr[0] against '0'.

I agree with the principle though. In the attached updated patch, I replaced that test with a simpler one:

+	/*
+	 * Process the number's digits. We optimize for decimal input (expected to
+	 * be the most common case) first. Anything that doesn't start with a base
+	 * prefix indicator must be decimal.
+	 */
+
+	/* process decimal digits */
+	if (likely(ptr[0] != '0' || !isalpha((unsigned char) ptr[1])))

So hopefully any compiler should only need to do the one comparison against '0' for any non-zero decimal input. Doing that didn't give any measurable performance improvement for me, but it did at least not make it noticeably worse, and seems more likely to generate better code with not-so-smart compilers.

I'd be interested to know how that performs for you (and if your compiler really does generate 3 "first digit is '0'" tests for the unpatched code).

Regards,
Dean

---

Here were my test results (where P1 is the "fix_COPY_DEFAULT.patch"), and I tested COPY loading 50M rows:

HEAD + P1
= 7137.966 ms, 7193.190 ms, 7094.491 ms, 7123.520 ms

HEAD + P1 + pg_strtoint32_base_10_first.patch
= 7561.658 ms, 7548.282 ms, 7551.360 ms, 7560.671 ms

HEAD + P1 + pg_strtoint32_base_10_first.v2.patch
= 7238.775 ms, 7184.937 ms, 7123.257 ms, 7159.618 ms

diff --git a/src/backend/utils/adt/numutils.c b/src/backend/utils/adt/numutils.c
new file mode 100644
index 471fbb7..e2320e6
--- a/src/backend/utils/adt/numutils.c
+++ b/src/backend/utils/adt/numutils.c
@@ -314,113 +314,124 @@ pg_strtoint32_safe(const char *s, Node *
 	else if (*ptr == '+')
 		ptr++;
 
-	/* process digits */
-	if (ptr[0] == '0' && (ptr[1] == 'x' || ptr[1] == 'X'))
+	/*
+	 * Process the number's digits. We optimize for decimal input (expected to
+	 * be the most common case) first. Anything that doesn't start with a base
+	 * prefix indicator must be decimal.
+	 */
+
+	/* process decimal digits */
+	if (likely(ptr[0] != '0' || !isalpha((unsigned char) ptr[1])))
 	{
-		firstdigit = ptr += 2;
+		firstdigit = ptr;
 		while (*ptr)
 		{
-			if (isxdigit((unsigned char) *ptr))
+			if (isdigit((unsigned char) *ptr))
 			{
-				if (unlikely(tmp > -(PG_INT32_MIN / 16)))
+				if (unlikely(tmp > -(PG_INT32_MIN / 10)))
 					goto out_of_range;
-				tmp = tmp * 16 + hexlookup[(unsigned char) *ptr++];
+				tmp = tmp * 10 + (*ptr++ - '0');
 			}
 			else if (*ptr == '_')
 			{
-				/* underscore must be followed by more digits */
+				/* underscore may not be first */
+				if (unlikely(ptr == firstdigit))
+					goto invalid_syntax;
+				/* and it must be followed by more digits */
 				ptr++;
-				if (*ptr == '\0' || !isxdigit((unsigned char) *ptr))
+				if (*ptr == '\0' || !isdigit((unsigned char) *ptr))
 					goto invalid_syntax;
 			}
 			else
 				break;
 		}
 	}
-	else if (ptr[0] == '0' && (ptr[1] == 'o' || ptr[1] == 'O'))
+	/* process hex digits */
+	else if (ptr[1] == 'x' || ptr[1] == 'X')
 	{
 		firstdigit = ptr += 2;
 		while (*ptr)
 		{
-			if (*ptr >= '0' && *ptr <= '7')
+			if (isxdigit((unsigned char) *ptr))
 			{
-				if (unlikely(tmp > -(PG_INT32_MIN / 8)))
+				if (unlikely(tmp > -(PG_INT32_MIN / 16)))
 					goto out_of_range;
-				tmp = tmp * 8 + (*ptr++ - '0');
+				tmp = tmp * 16 + hexlookup[(unsigned char) *ptr++];
 			}
 			else if (*ptr == '_')
 			{
 				/* underscore must be followed by more digits */
 				ptr++;
-				if (*ptr == '\0' || *ptr <
Re: Assertion failure in SnapBuildInitialSnapshot()
> On 9 Feb 2023, at 07:32, Masahiko Sawada wrote: > I've attached the patch of this idea for discussion. Amit, Andres: have you had a chance to look at the updated version of this patch? -- Daniel Gustafsson
Re: Atomic ops for unlogged LSN
On Thu, Jul 20, 2023 at 12:25 AM John Morris wrote:
>
> >> Why don't we just use a barrier when around reading the value? It's not like
> >> CreateCheckPoint() is frequent?
>
> One reason is that a barrier isn't needed, and adding unnecessary barriers
> can also be confusing.
>
> With respect to the "debug only" comment in the original code, whichever
> value is written to the structure during a checkpoint will be reset when
> restarting. Nobody will ever see that value. We could just as easily write
> a zero.
>
> Shutting down is a different story. It isn't stated, but we require the
> unlogged LSN to be quiescent before shutting down. This patch doesn't
> change that requirement.
>
> We could also argue that memory ordering doesn't matter when filling in a
> conventional, unlocked structure. But the fact that we have only two cases,
> 1) checkpoint, where the value is ignored, and 2) shutdown, where it is
> quiescent, makes the additional argument unnecessary.
>
> Would you be more comfortable if I added comments describing the two cases?
> My intent was to be minimalistic, but the original code could use better
> explanation.

Here, the case for unloggedLSN is that there can exist multiple backends incrementing/writing it (via GetFakeLSNForUnloggedRel), and there can exist one reader at a time in CreateCheckPoint, under LWLockAcquire(ControlFileLock, LW_EXCLUSIVE). Does incrementing unloggedLSN atomically (with an implied memory barrier from pg_atomic_fetch_add_u64) help synchronize the multiple writers and readers? With a spinlock, writer-reader synchronization is guaranteed. With an atomic variable, it is guaranteed that readers don't see a torn value, but no synchronization is provided.

If the above understanding is right, what happens if a reader sees an old unloggedLSN value? The reader here stores the old unloggedLSN value into the control file, and the server restarts (clean exit).
So the old value is loaded back into unloggedLSN upon restart, and the callers of GetFakeLSNForUnloggedRel() will see an old/repeated value too. Is that a problem?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: logical decoding and replication of sequences, take 2
Thanks Tomas for the updated patches. Here are my comments on the 0006 patch as well as the 0002 patch.

On Wed, Jul 19, 2023 at 4:23 PM Tomas Vondra wrote:
>
> I think this is an accurate description of what the current patch does.
> And I think it's a reasonable behavior.
>
> My point is that if this gets released in PG17, it'll be difficult to
> change, even if it does not change the on-disk format.

Yes, I agree. And I don't see any problem even if we are not able to change it.

> I need to think about this behavior a bit more, and maybe check how
> difficult implementing it would be.

Ok.

In most of the comments and in the documentation, there are some phrases which do not look accurate. A change to a sequence is being referred to as a "sequence increment". While ascending sequences are common, PostgreSQL supports descending sequences as well, where the changes will be decrements. But that's not the only case: a sequence may be restarted with an older value, in which case the change could be an increment or a decrement. I think the correct usage is 'changes to a sequence' or 'sequence changes'.

A sequence being assigned a new relfilenode is referred to as the sequence being created. This is confusing. When an existing sequence is ALTERed, we will not "create" a new sequence, but we will "create" a new relfilenode and "assign" it to that sequence.

PFA such edits in the 0002 and 0006 patches. Let me know if those look correct. I think we need similar changes to the documentation and comments in other places.

> I did however look at the proposed alternative to the "created" flag.
> The attached 0006 part ditches the flag with XLOG_SMGR_CREATE decoding.
> The smgr_decode code needs a review (I'm not sure the
> skipping/fast-forwarding part is correct), but it seems to be working
> fine overall, although we need to ensure the WAL record has the correct XID.

Briefly describing the patch: when decoding a XLOG_SMGR_CREATE WAL record, it adds the relfilenode mentioned in it to the sequences hash.
When decoding a sequence change record, it checks whether the relfilenode in the WAL record is in the hash table. If it is, the sequence change is deemed transactional; otherwise it is non-transactional. The change looks good to me. It simplifies the logic to decide whether a sequence change is transactional or not.

In sequence_decode() we skip sequence changes when fast forwarding. Given that smgr_decode() is only to supplement sequence_decode(), I think it's correct to do the same in smgr_decode() as well. Similarly for skipping when we don't have a full snapshot.

Some minor comments on the 0006 patch:

+	/* make sure the relfilenode creation is associated with the XID */
+	if (XLogLogicalInfoActive())
+		GetCurrentTransactionId();

I think this change is correct and is in line with similar changes in 0002. But I looked at other places from where DefineRelation() is called. For regular tables it is called from ProcessUtilitySlow(), which in turn does not call GetCurrentTransactionId(). I am wondering whether we are just discovering a class of bugs caused by not associating an xid with a newly created relfilenode.

+	/*
+	 * If we don't have snapshot or we are just fast-forwarding, there is no
+	 * point in decoding changes.
+	 */
+	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT ||
+		ctx->fast_forward)
+		return;

This code block is repeated.

+void
+ReorderBufferAddRelFileLocator(ReorderBuffer *rb, TransactionId xid,
+                               RelFileLocator rlocator)
+{
... snip ...
+
+	/* sequence changes require a transaction */
+	if (xid == InvalidTransactionId)
+		return;

IIUC, with your changes in DefineSequence() in this patch, this should not happen, so this condition will never be true. But in case it ever does happen, this code will not add the relfilelocator to the hash table and we will deem the sequence change as non-transactional. Isn't it better to just throw an error and stop replication if that (ever) happens?
Also some comments on the 0002 patch:

@@ -405,8 +405,19 @@ fill_seq_fork_with_data(Relation rel, HeapTuple tuple, ForkNumber forkNum)

 	/* check the comment above nextval_internal()'s equivalent call. */
 	if (RelationNeedsWAL(rel))
+	{
 		GetTopTransactionId();

+		/*
+		 * Make sure the subtransaction has a XID assigned, so that the
+		 * sequence increment WAL record is properly associated with it.
+		 * This matters for increments of sequences created/altered in the
+		 * transaction, which are handled as transactional.
+		 */
+		if (XLogLogicalInfoActive())
+			GetCurrentTransactionId();
+	}
+

I think we should separately commit the changes which add a call to GetCurrentTransactionId(). That looks like an existing bug/anomaly which can stay irrespective of this patch.

+/*
+ * To support logical decoding of sequences, we require the sequence
+ * callback. We decide it here, but only check
Re: [feature]COPY FROM enable FORCE_NULL/FORCE_NOT_NULL on all columns
Hi, On Jul 7, 2023 at 18:00 +0800, Damir Belyalov wrote:
> The V2 patch still had some errors when applying doc/src/sgml/ref/copy.sgml; rebased and fixed it in the V3 patch.

Thanks a lot for the review. I have updated https://commitfest.postgresql.org/43/3896/ to status Ready for Committer, thanks again.

Zhang Mingli
www.hashdata.xyz
Re: remaining sql/json patches
Hello,

On Thu, Jul 20, 2023 at 10:35 AM jian he wrote:
> On Tue, Jul 18, 2023 at 5:11 PM Amit Langote wrote:
> > > On 7/17/23 at 07:00, jian he wrote:
> > > > hi. seems there is no explanation about json_api_common_syntax in
> > > > functions-json.html
> > > >
> > > > I can get the json_query full synopsis from functions-json.html as follows:
> > > > json_query ( context_item, path_expression [ PASSING { value AS
> > > > varname } [, ...]] [ RETURNING data_type [ FORMAT JSON [ ENCODING UTF8
> > > > ] ] ] [ { WITHOUT | WITH { CONDITIONAL | [UNCONDITIONAL] } } [ ARRAY ]
> > > > WRAPPER ] [ { KEEP | OMIT } QUOTES [ ON SCALAR STRING ] ] [ { ERROR |
> > > > NULL | EMPTY { [ ARRAY ] | OBJECT } | DEFAULT expression } ON EMPTY ]
> > > > [ { ERROR | NULL | EMPTY { [ ARRAY ] | OBJECT } | DEFAULT expression }
> > > > ON ERROR ])
> > > >
> > > > seems there isn't a full synopsis for json_table? only a partial
> > > > one-by-one explanation.
> >
> > I looked through the history of the docs portion of the patch and it
> > looks like the synopsis for JSON_TABLE(...) used to be there but was
> > taken out during one of the doc reworks [1].
> >
> > I've added it back in the patch as I agree that it would help to have
> > it. Though, I am not totally sure where I've put it is the right
> > place for it. JSON_TABLE() is a beast that won't fit into the table
> > that JSON_QUERY() et al are in, so maybe that's how it will have to
> > be? I have no better idea.
>
> the attached screenshot renders the json_table syntax as almost plain html.

It looks fine. Thanks for checking.

> based on the syntax, I am kind of confused by the following 2 cases:
> --1
> SELECT * FROM JSON_TABLE(jsonb '1', '$'
>     COLUMNS (a int PATH 'strict $.a' default 1 ON EMPTY default 2 on error)
>     ERROR ON ERROR) jt;
>
> --2
> SELECT * FROM JSON_TABLE(jsonb '1', '$'
>     COLUMNS (a int PATH 'strict $.a' default 1 ON EMPTY default 2 on error)) jt;
>
> the first one should yield syntax error?

No.
Actually, the synopsis missed the optional ON ERROR clause that can appear after COLUMNS(...). Will fix it. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Re: There should be a way to use the force flag when restoring databases
Hi Gurjeet, that would be great. All the cases where a FORCE won't apply make total sense (either complex scenarios or permission issues):
> It doesn't terminate if prepared transactions, active logical replication
> slots or subscriptions are present in the target database.
> This will fail if the current user has no permissions to terminate other
> connections

Regards

Message from Gurjeet Singh on Wed, Jul 19, 2023 at 19:28:
> On Tue, Jul 18, 2023 at 12:53 AM Joan wrote:
> >
> > Since postgres 13 there's the option to do a FORCE when dropping a
> > database (so it disconnects current users). Documentation here:
> > https://www.postgresql.org/docs/current/sql-dropdatabase.html
> >
> > I am currently using dir format for the output:
> >    pg_dump -d "bdname" -F d -j 4 -v -f /tmp/dir
> >
> > And restoring the database with:
> >    pg_restore -d postgres -C -c --exit-on-error -F d -j 3 -v /tmp/dir
> >
> > Having an option to add the FORCE option to either the dump generated by
> > pg_dump, or to pg_restore, would be very useful when restoring databases
> > to other servers, as it would avoid having to do scripting.
> >
> > In my specific case I am using this to periodically refresh a
> > development environment with data from production servers for a small
> > database (~200M).
>
> Making force-drop a part of pg_dump output may be dangerous, and not
> provide much flexibility at restore time.
>
> Adding a force option to pg_restore feels like providing the right
> tradeoff.
>
> The docs for 'pg_restore --create' say "Create the database before
> restoring into it. If --clean is also specified, drop and recreate the
> target database before connecting to it."
>
> If we provided a force option, it may then additionally say: "If the
> --clean and --force options are specified, the DROP DATABASE ... WITH
> FORCE command will be used to drop the database."
>
> Using WITH FORCE is not a guarantee, as the DROP DATABASE docs clarify
> the conditions under which it may fail.
> > Best regards, > Gurjeet > http://Gurje.et >
Re: POC: GROUP BY optimization
On 3/10/2022 21:56, Tom Lane wrote:
> Revert "Optimize order of GROUP BY keys".
>
> This reverts commit db0d67db2401eb6238ccc04c6407a4fd4f985832 and
> several follow-on fixes.
> ...
> Since we're hard up against the release deadline for v15, let's
> revert these changes for now. We can always try again later.

It may be time to restart the project. As a first step, I rebased the patch on the current master. It wasn't trivial because of some recent optimizations (a29eab, 1349d27 and 8d83a5d). Now, let's repeat the review and rewrite the current patch according to the reasons given in the revert commit.

--
regards, Andrey Lepikhov
Postgres Professional

From 913d55ee887dccfeba360f5f44ed347dd1ba9044 Mon Sep 17 00:00:00 2001
From: "Andrey V. Lepikhov"
Date: Fri, 14 Jul 2023 10:29:36 +0700
Subject: [PATCH] Explore alternative orderings of GROUP BY keys
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

When evaluating a query with a multi-column GROUP BY clause using sort, the cost may be heavily dependent on the order in which the keys are compared when building the groups. Grouping does not imply any ordering, so we're allowed to compare the keys in arbitrary order, and a Hash Agg leverages this. But for Group Agg, we simply compared keys in the order as specified in the query. This commit explores alternative orderings of the keys, trying to find a cheaper one.

In principle, we might generate grouping paths for all permutations of the keys and leave the rest to the optimizer. But that might get very expensive, so we try to pick only a couple of interesting orderings based on both local and global information.

When planning the grouping path, we explore statistics (number of distinct values, cost of the comparison function) for the keys and reorder them to minimize comparison costs. Intuitively, it may be better to perform more expensive comparisons (for complex data types etc.) last, because maybe the cheaper comparisons will be enough.
Similarly, the higher the cardinality of a key, the lower the probability we'll need to compare more keys. The patch generates and costs various orderings, picking the cheapest ones.

The ordering of group keys may interact with other parts of the query, some of which may not be known while planning the grouping. E.g. there may be an explicit ORDER BY clause, or some other ordering-dependent operation, higher up in the query, and using the same ordering may allow using either incremental sort or even eliminate the sort entirely.

The patch generates orderings and picks those minimizing the comparison cost (for various pathkeys), and then adds orderings that might be useful for operations higher up in the plan (ORDER BY, etc.). Finally, it always keeps the ordering specified in the query, on the assumption the user might have additional insights.

This introduces a new GUC enable_group_by_reordering, so that the optimization may be disabled if needed.
---
 .../postgres_fdw/expected/postgres_fdw.out    |  15 +-
 src/backend/optimizer/path/costsize.c         | 363 ++-
 src/backend/optimizer/path/equivclass.c       |  13 +-
 src/backend/optimizer/path/pathkeys.c         | 602 ++
 src/backend/optimizer/plan/planner.c          | 467 --
 src/backend/optimizer/util/pathnode.c         |   4 +-
 src/backend/utils/adt/selfuncs.c              |  38 +-
 src/backend/utils/misc/guc_tables.c           |  10 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/nodes/pathnodes.h                 |  10 +
 src/include/optimizer/cost.h                  |   4 +-
 src/include/optimizer/paths.h                 |   7 +
 src/include/utils/selfuncs.h                  |   5 +
 src/test/regress/expected/aggregates.out      | 244 ++-
 .../regress/expected/incremental_sort.out     |   2 +-
 src/test/regress/expected/join.out            |  51 +-
 src/test/regress/expected/merge.out           |  15 +-
 .../regress/expected/partition_aggregate.out  |  44 +-
 src/test/regress/expected/partition_join.out  |  75 +--
 src/test/regress/expected/sysviews.out        |   3 +-
 src/test/regress/expected/union.out           |  60 +-
 src/test/regress/sql/aggregates.sql           |  99 +++
 src/test/regress/sql/incremental_sort.sql
 |   2 +-
 23 files changed, 1760 insertions(+), 374 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 852b5b4707..3f3c181ac8 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -9593,13 +9593,16 @@ SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER J
 -- left outer join + nullable clause
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.a,t2.b,t2.c FROM fprt1 t1 LEFT JOIN (SELECT * FROM fprt2 WHERE a < 10) t2 ON (t1.a = t2.b and t1.b = t2.a) WHERE t1.a < 10 ORDER BY 1,2,3;
-
Re: ObjectIdGetDatum() missing from SearchSysCache*() callers
On Tue, Jul 18, 2023 at 07:27:02AM +0900, Michael Paquier wrote:
> On Mon, Jul 17, 2023 at 05:33:42PM +0300, Aleksander Alekseev wrote:
> > I can't be 100% sure but it looks like that's all of them. PFA the
> > updated patch v2.
>
> Thanks. Yes, this stuff is easy to miss. I was just grepping for a
> few patterns and missed these two.

Spotted a few more of these after a second look. One for subscriptions:

src/backend/commands/alter.c: if (SearchSysCacheExists2(SUBSCRIPTIONNAME, MyDatabaseId,

And two for transforms:

src/backend/utils/cache/lsyscache.c: tup = SearchSysCache2(TRFTYPELANG, typid, langid);
src/backend/utils/cache/lsyscache.c: tup = SearchSysCache2(TRFTYPELANG, typid, langid);

And applied the whole. Thanks for looking and spotting more of these inconsistencies!
--
Michael