Re: Parallel worker hangs while handling errors.
Thanks for your comments, Bharath.

On Mon, Jul 27, 2020 at 10:13 AM Bharath Rupireddy wrote:
> 1. Do we need "worker" as a function argument in
> update_parallel_worker_sigmask(BackgroundWorker *worker, ? Since
> MyBgworkerEntry is a global variable, can't we have a local variable
> instead?

Fixed, I have moved the worker check to the caller function.

> 2. Instead of update_parallel_worker_sigmask() serving only for
> parallel workers, can we make it generic, so that for any bgworker,
> given a signal it unblocks it, although there's no current use case
> for a bg worker unblocking a single signal other than a parallel
> worker doing it for SIGUSR1 for this hang issue. Please note that we
> have BackgroundWorkerBlockSignals() and
> BackgroundWorkerUnblockSignals().

Fixed. I have slightly modified the changes to split this into
BackgroundWorkerRemoveBlockSignal & BackgroundWorkerAddBlockSignal.
This keeps it consistent with BackgroundWorkerBlockSignals() and
BackgroundWorkerUnblockSignals().

> If okay, with the BackgroundWorkerUpdateSignalMask() function, please
> note that we might have to add it in bgworker.sgml as well.

Included the documentation. Attached the updated patch for the same.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

From f06a5966d754d6622a3425c6cac1ad72314832ef Mon Sep 17 00:00:00 2001
From: Vignesh C
Date: Tue, 28 Jul 2020 10:48:17 +0530
Subject: [PATCH v4] Fix for Parallel worker hangs while handling errors.

The worker is unable to receive signals while handling an error. The
worker hangs in this case because the signals are masked with
sigprocmask when the worker is started; they are normally unblocked by
calling BackgroundWorkerUnblockSignals in ParallelWorkerMain. But after
an error the worker has jumped back to the sigsetjmp in
StartBackgroundWorker, where the signals are still in the blocked
state, so the signal is never received by the worker process.
Authors: Vignesh C, Bharath Rupireddy
---
 doc/src/sgml/bgworker.sgml          |  6 +-
 src/backend/postmaster/bgworker.c   | 20
 src/backend/postmaster/postmaster.c | 17 +
 src/include/postmaster/bgworker.h   |  4
 4 files changed, 46 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/bgworker.sgml b/doc/src/sgml/bgworker.sgml
index 6e1cf12..d900fc9 100644
--- a/doc/src/sgml/bgworker.sgml
+++ b/doc/src/sgml/bgworker.sgml
@@ -210,7 +210,11 @@ typedef struct BackgroundWorker
    allow the process to customize its signal handlers, if necessary.
    Signals can be unblocked in the new process by calling
    BackgroundWorkerUnblockSignals and blocked by calling
-   BackgroundWorkerBlockSignals.
+   BackgroundWorkerBlockSignals.  A particular signal can
+   be unblocked in the new process by calling
+   BackgroundWorkerRemoveBlockSignal(int signum)
+   and blocked by calling
+   BackgroundWorkerAddBlockSignal(int signum).

diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index beb5e85..458b513 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -747,6 +747,17 @@ StartBackgroundWorker(void)
 	 */
 	if (sigsetjmp(local_sigjmp_buf, 1) != 0)
 	{
+		/*
+		 * In case of parallel workers, unblock SIGUSR1 signal, it was blocked
+		 * when the postmaster forked us. Leader process will send SIGUSR1 signal
+		 * to the worker process (worker process will be in waiting state as
+		 * there is no space available) to indicate shared memory space is freed
+		 * up. Once the signal is received worker process will start populating
+		 * the error message further.
+		 */
+		if ((worker->bgw_flags & BGWORKER_CLASS_PARALLEL) != 0)
+			BackgroundWorkerRemoveBlockSignal(SIGUSR1);
+
 		/* Since not using PG_TRY, must reset error stack by hand */
 		error_context_stack = NULL;
@@ -757,6 +768,15 @@ StartBackgroundWorker(void)
 		EmitErrorReport();

 		/*
+		 * Undo the unblocking of SIGUSR1 which was done above, so as to
+		 * not cause any further issues from unblocking SIGUSR1 during
+		 * the execution of callbacks and other processing that will be
+		 * done during proc_exit().
+		 */
+		if ((worker->bgw_flags & BGWORKER_CLASS_PARALLEL) != 0)
+			BackgroundWorkerAddBlockSignal(SIGUSR1);
+
+		/*
 		 * Do we need more cleanup here? For shmem-connected bgworkers, we
 		 * will call InitProcess below, which will install ProcKill as exit
 		 * callback. That will take care of releasing locks, etc.

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index dec0258..751cd16 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -5765,6 +5765,23 @@ BackgroundWorkerUnblockSignals(void)
 	PG_SETMASK(&UnBlockSig);
 }

+/*
+ * Block/unblock a particular signal in a background worker.
+ */
+void
+BackgroundWorkerAddBlockSignal(int signum)
+{
+	sigaddset(&BlockSig, signum);
+	PG_SETMASK(&BlockSig);
+}
+
+void
+BackgroundWorkerRemoveBlockSignal(int signum)
+{
+
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Sun, Jul 26, 2020 at 11:04 AM Dilip Kumar wrote: > > > Today, I have again looked at the first patch > > (v42-0001-Extend-the-logical-decoding-output-plugin-API-wi) and didn't > > find any more problems with it so planning to commit the same unless > > you or someone else want to add more to it. Just for ease of others, > > "the next patch extends the logical decoding output plugin API with > > stream methods". It adds seven methods to the output plugin API, > > adding support for streaming changes for large in-progress > > transactions. The methods are stream_start, stream_stop, stream_abort, > > stream_commit, stream_change, stream_message, and stream_truncate. > > Most of this is a simple extension of the existing methods, with the > > semantic difference that the transaction (or subtransaction) is > > incomplete and may be aborted later (which is something the regular > > API does not really need to deal with). > > > > This also extends the 'test_decoding' plugin, implementing these new > > stream methods. The stream_start/start_stop are used to demarcate a > > chunk of changes streamed for a particular toplevel transaction. > > > > This commit simply adds these new APIs and the upcoming patch to > > "allow the streaming mode in ReorderBuffer" will use these APIs. > > LGTM > Pushed. Feel free to submit the remaining patches. -- With Regards, Amit Kapila.
Re: HashAgg's batching counter starts at 0, but Hash's starts at 1.
On Tue, 28 Jul 2020 at 15:21, Peter Geoghegan wrote: > Separately, I wonder what your opinion is about what should happen for > the partial sort related EXPLAIN ANALYZE format open item, described > here: > > https://www.postgresql.org/message-id/flat/20200619040358.GZ17995%40telsasoft.com#b20bd205851a0390220964f7c31b23d1 > > ISTM that EXPLAIN ANALYZE for incremental sort manages to show the > same information as the sort case, aggregated across each tuplesort in > a fairly sensible way. > > (No activity over on the incremental sort thread, so I thought I'd ask > again here, while I was reminded of that issue.) TBH, I've not really looked at that. Tom did mention his view on this in [1]. I think that's a pretty good policy. However, I've not looked at the incremental sort EXPLAIN output enough to know how it'll best apply there. David [1] https://www.postgresql.org/message-id/2276865.1593102811%40sss.pgh.pa.us
Re: HashAgg's batching counter starts at 0, but Hash's starts at 1.
On Tue, 28 Jul 2020 at 15:01, Justin Pryzby wrote: > > On Tue, Jul 28, 2020 at 12:54:35PM +1200, David Rowley wrote: > > On Mon, 27 Jul 2020 at 14:54, Justin Pryzby wrote: > > > It's unrelated to hashAgg vs hashJoin, but I also noticed that this is > > > shown > > > only conditionally: > > > > > > if (es->format != EXPLAIN_FORMAT_TEXT) > > > { > > > if (es->costs && aggstate->hash_planned_partitions > 0) > > > { > > > ExplainPropertyInteger("Planned Partitions", NULL, > > > > > > aggstate->hash_planned_partitions, es); > > > > > > That was conditional since it was introduced at 1f39bce02: > > > > > > if (es->costs && aggstate->hash_planned_partitions > 0) > > > { > > > ExplainPropertyInteger("Planned Partitions", NULL, > > > > > > aggstate->hash_planned_partitions, es); > > > } > > > > > > I think 40efbf870 should've handled this, too. > > > > hmm. I'm not sure. I think this should follow the same logic as what > > "Disk Usage" follows, and right now we don't show Disk Usage unless we > > spill. > > Huh ? I'm referring to non-text format, which is what you changed, on the > reasoning that the same plan *could* spill: Oh, right. ... (Sudden bout of confusion due to lack of sleep) Looks like it'll just need this line: if (es->costs && aggstate->hash_planned_partitions > 0) changed to: if (es->costs) I think we'll likely need to maintain not showing that property with explain (costs off) as it'll be a bit more difficult to write regression tests if we display it regardless of that option. David
Re: HashAgg's batching counter starts at 0, but Hash's starts at 1.
On Mon, Jul 27, 2020 at 08:20:45PM -0700, Peter Geoghegan wrote: > On Mon, Jul 27, 2020 at 5:54 PM David Rowley wrote: > > hmm. I'm not sure. I think this should follow the same logic as what > > "Disk Usage" follows, and right now we don't show Disk Usage unless we > > spill. Since we only use partitions when spilling, I don't think it > > makes sense to show the estimated partitions when we don't plan on > > spilling. > > I'm confused about what the guiding principles for EXPLAIN ANALYZE > output (text or otherwise) are. I don't know of a guideline for text format, but the issues I've raised for HashAgg and IncrSort are based on previous commits which change to "show field even if its value is zero" for non-text format: 7d91b604d9b5d6ec8c19c57a9ffd2f27129cdd94 8ebb69f85445177575684a0ba5cfedda8d840a91 3ec20c7091e97a554e7447ac2b7f4ed795631395 > > I think if we change this then we should change Disk Usage too. > > However, I don't think we should as Sort will only show "Disk" if the > > sort spills. I think Hash Agg should follow that. > > I don't follow your remarks here. > > Separately, I wonder what your opinion is about what should happen for > the partial sort related EXPLAIN ANALYZE format open item, described > here: > > https://www.postgresql.org/message-id/flat/20200619040358.GZ17995%40telsasoft.com#b20bd205851a0390220964f7c31b23d1 > > ISTM that EXPLAIN ANALYZE for incremental sort manages to show the > same information as the sort case, aggregated across each tuplesort in > a fairly sensible way. > > (No activity over on the incremental sort thread, so I thought I'd ask > again here, while I was reminded of that issue.) Note that I did mail recently, on this 2nd thread: https://www.postgresql.org/message-id/20200723141454.gf4...@telsasoft.com -- Justin
Re: stress test for parallel workers
Thomas Munro writes: > Hehe, the dodgy looking magic numbers *were* wrong: > - * The kernel signal delivery code writes up to about 1.5kB > + * The kernel signal delivery code writes a bit over 4KB > https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20200724092528.1578671-2-...@ellerman.id.au/ ... and, having seemingly not learned a thing, they just replaced them with new magic numbers. Mumble sizeof() mumble. Anyway, I guess the interesting question for us is how long it will take for this fix to propagate into real-world systems. I don't have much of a clue about the Linux kernel workflow, anybody want to venture a guess? regards, tom lane
Re: HashAgg's batching counter starts at 0, but Hash's starts at 1.
On Mon, Jul 27, 2020 at 5:54 PM David Rowley wrote: > hmm. I'm not sure. I think this should follow the same logic as what > "Disk Usage" follows, and right now we don't show Disk Usage unless we > spill. Since we only use partitions when spilling, I don't think it > makes sense to show the estimated partitions when we don't plan on > spilling. I'm confused about what the guiding principles for EXPLAIN ANALYZE output (text or otherwise) are. > I think if we change this then we should change Disk Usage too. > However, I don't think we should as Sort will only show "Disk" if the > sort spills. I think Hash Agg should follow that. I don't follow your remarks here. Separately, I wonder what your opinion is about what should happen for the partial sort related EXPLAIN ANALYZE format open item, described here: https://www.postgresql.org/message-id/flat/20200619040358.GZ17995%40telsasoft.com#b20bd205851a0390220964f7c31b23d1 ISTM that EXPLAIN ANALYZE for incremental sort manages to show the same information as the sort case, aggregated across each tuplesort in a fairly sensible way. (No activity over on the incremental sort thread, so I thought I'd ask again here, while I was reminded of that issue.) -- Peter Geoghegan
Re: HashAgg's batching counter starts at 0, but Hash's starts at 1.
On Tue, Jul 28, 2020 at 12:54:35PM +1200, David Rowley wrote: > On Mon, 27 Jul 2020 at 14:54, Justin Pryzby wrote: > > It's unrelated to hashAgg vs hashJoin, but I also noticed that this is shown > > only conditionally: > > > > if (es->format != EXPLAIN_FORMAT_TEXT) > > { > > if (es->costs && aggstate->hash_planned_partitions > 0) > > { > > ExplainPropertyInteger("Planned Partitions", NULL, > > > > aggstate->hash_planned_partitions, es); > > > > That was conditional since it was introduced at 1f39bce02: > > > > if (es->costs && aggstate->hash_planned_partitions > 0) > > { > > ExplainPropertyInteger("Planned Partitions", NULL, > > > > aggstate->hash_planned_partitions, es); > > } > > > > I think 40efbf870 should've handled this, too. > > hmm. I'm not sure. I think this should follow the same logic as what > "Disk Usage" follows, and right now we don't show Disk Usage unless we > spill. Huh ? I'm referring to non-text format, which is what you changed, on the reasoning that the same plan *could* spill: 40efbf8706cdd96e06bc4d1754272e46d9857875 if (es->format != EXPLAIN_FORMAT_TEXT) { if (es->costs && aggstate->hash_planned_partitions > 0) { ExplainPropertyInteger("Planned Partitions", NULL, aggstate->hash_planned_partitions, es); } ... /* EXPLAIN ANALYZE */ ExplainPropertyInteger("Peak Memory Usage", "kB", memPeakKb, es); - if (aggstate->hash_batches_used > 0) - { ExplainPropertyInteger("Disk Usage", "kB", aggstate->hash_disk_used, es); ExplainPropertyInteger("HashAgg Batches", NULL, aggstate->hash_batches_used, es); > Since we only use partitions when spilling, I don't think it > makes sense to show the estimated partitions when we don't plan on > spilling. In any case, my thinking is that we *should* show PlannedPartitions=0, specifically to indicate *that* we didn't plan to spill. > I think if we change this then we should change Disk Usage too. > However, I don't think we should as Sort will only show "Disk" if the > sort spills. 
> I think Hash Agg should follow that.
>
> For the patch I posted yesterday, I'll go ahead in push it in about 24
> hours unless there are any objections.
Re: Online checksums patch - once again
On Thu, Jun 25, 2020 at 11:43:00AM +0200, Daniel Gustafsson wrote:
> The attached v19 fixes a few doc issues I had missed.

+ They can also be enabled or disabled at a later timne, either as an offline
=> time

+ * awaiting shutdown, but we can continue turning off checksums anyway
=> a waiting

+ * We are starting a checksumming process scratch, and need to start by
=> FROM scratch

+ * to inprogress new relations will set relhaschecksums in pg_class so it
=> inprogress COMMA

+ * Relation no longer exist. We don't consider this an error since
=> exists

+ * so when the cluster comes back up processing will habe to be resumed.
=> have

+ "completed one pass over all databases for checksum enabling, %i databases processed",
=> I think it will be confusing for "one" to be hardcoded; it'll say "one"
over and over.

+ * still exist.
=> exists

In many places you refer to "datachecksumsworker" (sums), but in nine places
you refer to datachecksumworker (sum).

+ ProcessSingleRelationFork(Relation reln, ForkNumber forkNum, BufferAccessStrategy strategy)
+ {
+     BlockNumber numblocks = RelationGetNumberOfBlocksInFork(reln, forkNum);
=> I think looping over numblocks is safe since new blocks are intended to
be written with checksums, right? Maybe it's good to say that here.

+ BlockNumber b;
blknum will be easier to grep for

+ (errmsg("background worker \"datachecksumsworker\" starting for database oid %d",
=> Should be %u or similar (several of these)

Some questions:

It looks like you rewrite every page, even if it already has a correct
checksum, to handle replicas. I wonder if it's possible/reasonable/good to
skip pages with a correct checksum when wal_level=minimal?

It looks like it's not possible to change the checksum delay while a
checksum worker is already running. That may be important to allow:
1) decreased delay during slow periods; 2) increased delay if the process is
significantly done, but needs to be throttled to avoid disrupting the
production environment.
Have you collaborated with Julien about this one? His patch adds new GUCs:
https://www.postgresql.org/message-id/20200714090808.GA20780@nol

checksum_cost_delay
checksum_cost_page
checksum_cost_limit

Maybe you'd say that Julien's pg_check_relation() should accept parameters
instead of adding GUCs. I think you should be in agreement on that. It'd be
silly if the verification function added three GUCs and allowed adjusting the
throttle midcourse, but the checksum writer process didn't use them.

If you used something like that, I guess you'd also want to distinguish
checksum_cost_page_read vs write. Possibly, the GUCs part should be a
preliminary shared patch 0001 that you both used.

-- 
Justin
Re: max_slot_wal_keep_size and wal_keep_segments
On 2020/07/20 21:21, David Steele wrote: On 7/20/20 6:02 AM, Fujii Masao wrote: On 2020/07/20 13:48, Fujii Masao wrote: On 2020/07/17 20:24, David Steele wrote: On 7/17/20 5:11 AM, Fujii Masao wrote: On 2020/07/14 20:30, David Steele wrote: On 7/14/20 12:00 AM, Fujii Masao wrote: The patch was no longer applied cleanly because of recent commit. So I updated the patch. Attached. Barring any objection, I will commit this patch. This doesn't look right: + the most recent megabytes + WAL files plus one WAL file are How about: + megabytes of + WAL files plus one WAL file are Thanks for the comment! Isn't it better to keep "most recent" part? If so, what about either of the followings? 1. megabytes of WAL files plus one WAL file that were most recently generated are kept all time. 2. megabytes + bytes of WAL files that were most recently generated are kept all time. "most recent" seemed implied to me, but I see your point. How about: + the most recent megabytes of + WAL files plus one additional WAL file are I adopted this and pushed the patch. Thanks! Also we need to update the release note for v13. What about adding the following? Rename configuration parameter wal_keep_segments to wal_keep_size. This allows how much WAL files to retain for the standby server, by bytes instead of the number of files. If you previously used wal_keep_segments, the following formula will give you an approximately equivalent setting: wal_keep_size = wal_keep_segments * wal_segment_size (typically 16MB) I would rework that first sentence a bit. How about: + This determines how much WAL to retain for the standby server, + specified in megabytes rather than number of files. The rest looks fine to me. Thanks for the review! I adopted your suggestion in the updated version of the patch and pushed it. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
printing oid with %d
+JsonEncodeDateTime(char *buf, Datum value, Oid typid) ... + elog(ERROR, "unknown jsonb value datetime type oid %d", typid); I think this should be %u. commit cc4feded0a31d2b732d4ea68613115cb720e624e Author: Andrew Dunstan Date: Tue Jan 16 19:07:13 2018 -0500 Centralize json and jsonb handling of datetime types
Re: stress test for parallel workers
On Wed, Dec 11, 2019 at 3:22 PM Thomas Munro wrote: > On Tue, Oct 15, 2019 at 4:50 AM Tom Lane wrote: > > > Filed at > > > https://bugzilla.kernel.org/show_bug.cgi?id=205183 > > For the curious-and-not-subscribed, there's now a kernel patch > proposed for this. We guessed pretty close, but the problem wasn't > those dodgy looking magic numbers, it was that the bad stack expansion > check only allows for user space to expand the stack > (FAULT_FLAG_USER), and here the kernel itself wants to build a stack > frame. Hehe, the dodgy looking magic numbers *were* wrong: - * The kernel signal delivery code writes up to about 1.5kB + * The kernel signal delivery code writes a bit over 4KB https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20200724092528.1578671-2-...@ellerman.id.au/
Re: expose parallel leader in CSV and log_line_prefix
On Sun, Jul 26, 2020 at 01:54:27PM -0500, Justin Pryzby wrote:
> +
> + %P
> + For a parallel worker, this is the Process ID of its leader
> + process.
> +
> + no
> +

Let's keep this as simple as possible and consistent with the surrounding
descriptions here and with what we have for pg_stat_activity, say: "Process
ID of the parallel group leader, if this process is a parallel query worker."

> + case 'P':
> +     if (MyProc)
> +     {
> +         PGPROC *leader = MyProc->lockGroupLeader;
> +
> +         if (leader == NULL || leader->pid == MyProcPid)
> +             /* padding only */
> +             appendStringInfoSpaces(buf,
> +                                    padding > 0 ? padding : -padding);
> +         else if (padding != 0)
> +             appendStringInfo(buf, "%*d", padding, leader->pid);
> +         else
> +             appendStringInfo(buf, "%d", leader->pid);

It seems to me we should document here that the check on MyProcPid ensures
that this prints the leader PID only for parallel workers, and discards the
leader itself.

> + appendStringInfoChar(&buf, ',');
> +
> + /* leader PID */
> + if (MyProc)
> + {
> +     PGPROC *leader = MyProc->lockGroupLeader;
> +
> +     if (leader && leader->pid != MyProcPid)
> +         appendStringInfo(&buf, "%d", leader->pid);
> + }
> +

Same here.

Except for those nits, I have tested the patch and things behave as we want
(including padding and docs), so this looks good to me.
--
Michael
Re: Default setting for enable_hashagg_disk
On Mon, Jul 27, 2020 at 5:10 PM Jeff Davis wrote: > I noticed that one of the conditionals, "cheapest_total_path != NULL", > was already redundant with the outer conditional before your patch. I > guess that was just a mistake which your patch corrects along the way? Makes sense. > Anyway, the patch looks good to me. We can have a separate discussion > about pessimizing the costing, if necessary. Pushed the hashagg_avoid_disk_plan patch -- thanks! -- Peter Geoghegan
Re: HashAgg's batching counter starts at 0, but Hash's starts at 1.
On Mon, 27 Jul 2020 at 14:54, Justin Pryzby wrote:
> It's unrelated to hashAgg vs hashJoin, but I also noticed that this is shown
> only conditionally:
>
> if (es->format != EXPLAIN_FORMAT_TEXT)
> {
>     if (es->costs && aggstate->hash_planned_partitions > 0)
>     {
>         ExplainPropertyInteger("Planned Partitions", NULL,
>                                aggstate->hash_planned_partitions, es);
>
> That was conditional since it was introduced at 1f39bce02:
>
> if (es->costs && aggstate->hash_planned_partitions > 0)
> {
>     ExplainPropertyInteger("Planned Partitions", NULL,
>                            aggstate->hash_planned_partitions, es);
> }
>
> I think 40efbf870 should've handled this, too.

hmm. I'm not sure. I think this should follow the same logic as what
"Disk Usage" follows, and right now we don't show Disk Usage unless we
spill. Since we only use partitions when spilling, I don't think it
makes sense to show the estimated partitions when we don't plan on
spilling.

I think if we change this then we should change Disk Usage too.
However, I don't think we should, as Sort will only show "Disk" if the
sort spills. I think Hash Agg should follow that.

For the patch I posted yesterday, I'll go ahead and push it in about 24
hours unless there are any objections.

David
Re: [POC] Fast COPY FROM command for the table with foreign partitions
On 27.07.2020 21:34, Alexey Kondratov wrote:
> Hi Andrey,
>
> On 2020-07-23 09:23, Andrey V. Lepikhov wrote:
> > On 7/16/20 2:14 PM, Amit Langote wrote:
> > > Amit Langote
> > > EnterpriseDB: http://www.enterprisedb.com
> >
> > Version 5 of the patch. With changes caused by Amit's comments.
>
> Just got a segfault with your v5 patch by deleting from a foreign
> table. It seems that ShutdownForeignScan inside node->fdwroutine
> doesn't have a correct pointer to the required function. I haven't had
> a chance to look closer at the code, but you can easily reproduce this
> error with the attached script (patched Postgres binaries should be
> available in the PATH). It works well with master and fails with your
> patch applied.

I used master a3ab7a707d and the v5 version of the patch with your
script. No errors found. Can you check your test case?

-- 
regards,
Andrey Lepikhov
Postgres Professional
Re: Default setting for enable_hashagg_disk
On Mon, 2020-07-27 at 11:30 -0700, Peter Geoghegan wrote: > The v4-0001-Remove-hashagg_avoid_disk_plan-GUC.patch changes are > surprisingly complicated. It would be nice if you could take a look > at > that aspect (or confirm that it's included in your review). I noticed that one of the conditionals, "cheapest_total_path != NULL", was already redundant with the outer conditional before your patch. I guess that was just a mistake which your patch corrects along the way? Anyway, the patch looks good to me. We can have a separate discussion about pessimizing the costing, if necessary. Regards, Jeff Davis
Re: deferred primary key and logical replication
On Fri, 24 Jul 2020 at 05:16, Ajin Cherian wrote: > The patch no longer applies, because of additions in the test source. > Otherwise, I have tested the patch and confirmed that updates and deletes > on tables with deferred primary keys work with logical replication. > > The new status of this patch is: Waiting on Author > Thanks for testing. I attached a rebased patch. -- Euler Taveira http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services From c1ad1581962a56c83183bec1501df6f54406db89 Mon Sep 17 00:00:00 2001 From: Euler Taveira Date: Sun, 19 Apr 2020 20:04:39 -0300 Subject: [PATCH] Table with deferred PK is not updatable in logical replication In logical replication, an UPDATE or DELETE cannot be executed if a table has a primary key that is marked as deferred. RelationGetIndexList does not fill rd_replidindex accordingly. The consequence is that UPDATE or DELETE cannot be executed in a deferred PK table. Deferrable primary key cannot prevent a primary key to be used as replica identity. --- src/backend/utils/cache/relcache.c | 7 +++ src/test/subscription/t/001_rep_changes.pl | 21 +++-- 2 files changed, 22 insertions(+), 6 deletions(-) diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c index a2453cf1f4..8996279f3b 100644 --- a/src/backend/utils/cache/relcache.c +++ b/src/backend/utils/cache/relcache.c @@ -4556,12 +4556,11 @@ RelationGetIndexList(Relation relation) result = lappend_oid(result, index->indexrelid); /* - * Invalid, non-unique, non-immediate or predicate indexes aren't - * interesting for either oid indexes or replication identity indexes, - * so don't check them. + * Invalid, non-unique or predicate indexes aren't interesting for + * either oid indexes or replication identity indexes, so don't check + * them. 
*/ if (!index->indisvalid || !index->indisunique || - !index->indimmediate || !heap_attisnull(htup, Anum_pg_index_indpred, NULL)) continue; diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl index 3f8318fc7c..f164d23a94 100644 --- a/src/test/subscription/t/001_rep_changes.pl +++ b/src/test/subscription/t/001_rep_changes.pl @@ -3,7 +3,7 @@ use strict; use warnings; use PostgresNode; use TestLib; -use Test::More tests => 23; +use Test::More tests => 24; # Initialize publisher node my $node_publisher = get_new_node('publisher'); @@ -38,6 +38,8 @@ $node_publisher->safe_psql('postgres', "CREATE TABLE tab_full_pk (a int primary key, b text)"); $node_publisher->safe_psql('postgres', "ALTER TABLE tab_full_pk REPLICA IDENTITY FULL"); +$node_publisher->safe_psql('postgres', + "CREATE TABLE tab_deferred_pk (a int PRIMARY KEY DEFERRABLE)"); # Let this table with REPLICA IDENTITY NOTHING, allowing only INSERT changes. $node_publisher->safe_psql('postgres', "CREATE TABLE tab_nothing (a int)"); $node_publisher->safe_psql('postgres', @@ -66,13 +68,17 @@ $node_subscriber->safe_psql('postgres', "CREATE TABLE tab_include (a int, b text, CONSTRAINT covering PRIMARY KEY(a) INCLUDE(b))" ); +# replication of the table with deferrable primary key +$node_subscriber->safe_psql('postgres', + "CREATE TABLE tab_deferred_pk (a int PRIMARY KEY DEFERRABLE)"); + # Setup logical replication my $publisher_connstr = $node_publisher->connstr . 
' dbname=postgres'; $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub"); $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub_ins_only WITH (publish = insert)"); $node_publisher->safe_psql('postgres', - "ALTER PUBLICATION tap_pub ADD TABLE tab_rep, tab_full, tab_full2, tab_mixed, tab_include, tab_nothing, tab_full_pk" + "ALTER PUBLICATION tap_pub ADD TABLE tab_rep, tab_full, tab_full2, tab_mixed, tab_include, tab_nothing, tab_full_pk, tab_deferred_pk" ); $node_publisher->safe_psql('postgres', "ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins"); @@ -122,6 +128,13 @@ $node_publisher->safe_psql('postgres', "DELETE FROM tab_include WHERE a > 20"); $node_publisher->safe_psql('postgres', "UPDATE tab_include SET a = -a"); +$node_publisher->safe_psql('postgres', + "INSERT INTO tab_deferred_pk VALUES (1),(2),(3)"); +$node_publisher->safe_psql('postgres', + "UPDATE tab_deferred_pk SET a = 11 WHERE a = 1"); +$node_publisher->safe_psql('postgres', + "DELETE FROM tab_deferred_pk WHERE a = 2"); + $node_publisher->wait_for_catchup('tap_sub'); $result = $node_subscriber->safe_psql('postgres', @@ -145,6 +158,10 @@ $result = $node_subscriber->safe_psql('postgres', is($result, qq(20|-20|-1), 'check replicated changes with primary key index with included columns'); +$result = $node_subscriber->safe_psql('postgres', + "SELECT count(*), min(a), max(a) FROM tab_deferred_pk"); +is($result, qq(2|3|11), 'check replicated changes with deferred primary key'); + # insert some duplicate rows
Re: Default setting for enable_hashagg_disk
On Mon, Jul 27, 2020 at 12:52 PM Alvaro Herrera wrote: > On 2020-Jul-27, Peter Geoghegan wrote: > > The v4-0001-Remove-hashagg_avoid_disk_plan-GUC.patch changes are > > surprisingly complicated. It would be nice if you could take a look at > > that aspect (or confirm that it's included in your review). > > I think you mean "it replaces surprisingly complicated code with > straightforward code". Right? Because in the previous code, there was > a lot of effort going into deciding whether the path needed to be > generated; the new code just generates the path always. Yes, that's what I meant. It's a bit tricky. For example, I have removed a redundant "cheapest_total_path != NULL" test in create_partial_grouping_paths() (two, actually). But these two tests were always redundant. I have to wonder if I missed the point. Though it seems likely that that was just an accident. Accretions of code over time made the code work like that; nothing more. -- Peter Geoghegan
Re: Default setting for enable_hashagg_disk
On 2020-Jul-27, Peter Geoghegan wrote:

> The v4-0001-Remove-hashagg_avoid_disk_plan-GUC.patch changes are
> surprisingly complicated. It would be nice if you could take a look at
> that aspect (or confirm that it's included in your review).

I think you mean "it replaces surprisingly complicated code with
straightforward code". Right? Because in the previous code, there was
a lot of effort going into deciding whether the path needed to be
generated; the new code just generates the path always.

Similarly the code to decide allow_hash in create_distinct_path, which
used to be nontrivial, could (if you wanted) be simplified down to a
single boolean condition. Previously, it was nontrivial only because it
needed to consider memory usage -- not anymore.

But maybe you're talking about something more subtle that I'm just too
blind to see.

-- 
Álvaro Herrera        https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Default setting for enable_hashagg_disk
On Mon, Jul 27, 2020 at 11:24 AM Alvaro Herrera wrote: > > Are you proposing that I just put the prototype in miscadmin.h, while > > leaving the implementation where it is (in nodeHash.c)? > > Yes, that's in the part of my reply you didn't quote: > > : It remains strange to have the function in executor > : implementation, but I don't offhand see a better place, so maybe it's > : okay where it is. Got it. I tried putting the prototype in miscadmin.h, and I now agree that that's the best way to do it -- that's how I do it in the attached revision. No other changes. The v4-0001-Remove-hashagg_avoid_disk_plan-GUC.patch changes are surprisingly complicated. It would be nice if you could take a look at that aspect (or confirm that it's included in your review). -- Peter Geoghegan v4-0001-Remove-hashagg_avoid_disk_plan-GUC.patch Description: Binary data v4-0002-Add-hash_mem_multiplier-GUC.patch Description: Binary data
Re: Default setting for enable_hashagg_disk
On 2020-Jul-27, Peter Geoghegan wrote: > On Mon, Jul 27, 2020 at 10:30 AM Alvaro Herrera > wrote: > > On 2020-Jul-23, Peter Geoghegan wrote: > > I notice you put the prototype for get_hash_mem in nodeHash.h. This > > would be fine if not for the fact that optimizer needs to call the > > function too, which means now optimizer have to include executor headers > > -- not a great thing. I'd move the prototype elsewhere to avoid this, > > and I think miscadmin.h is a decent place for the prototype, next to > > work_mem and m_w_m. > > The location of get_hash_mem() is awkward, Yes. > but there is no obvious alternative. Agreed. > Are you proposing that I just put the prototype in miscadmin.h, while > leaving the implementation where it is (in nodeHash.c)? Yes, that's in the part of my reply you didn't quote: : It remains strange to have the function in executor : implementation, but I don't offhand see a better place, so maybe it's : okay where it is. > [...] moving the implementation of get_hash_mem() to either of those > two files seems worse to me. Sure. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Default setting for enable_hashagg_disk
On Mon, Jul 27, 2020 at 10:30 AM Alvaro Herrera wrote: > On 2020-Jul-23, Peter Geoghegan wrote: > I notice you put the prototype for get_hash_mem in nodeHash.h. This > would be fine if not for the fact that optimizer needs to call the > function too, which means now optimizer have to include executor headers > -- not a great thing. I'd move the prototype elsewhere to avoid this, > and I think miscadmin.h is a decent place for the prototype, next to > work_mem and m_w_m. The location of get_hash_mem() is awkward, but there is no obvious alternative. Are you proposing that I just put the prototype in miscadmin.h, while leaving the implementation where it is (in nodeHash.c)? Maybe that sounds like an odd question, but bear in mind that the natural place to put the implementation of a function declared in miscadmin.h is either utils/init/postinit.c or utils/init/miscinit.c -- moving the implementation of get_hash_mem() to either of those two files seems worse to me. That said, there is an existing oddball case in miscadmin.h, right at the end -- the two functions whose implementation is in access/transam/xlog.c. So I can see an argument for adding another oddball case (i.e. moving the prototype to the end of miscadmin.h without changing anything else). > Other than that admittedly trivial complaint, I found nothing to > complain about in this patch. Great. Thanks for the review. My intention is to commit hash_mem_multiplier on Wednesday. We need to move on from this, and get the release out the door. -- Peter Geoghegan
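[Editor's note: the hash_mem_multiplier design discussed in this thread amounts to scaling work_mem for hash-based executor nodes. The following is a minimal stand-alone sketch of that idea only; the function name, signature, and clamping ceiling here are illustrative assumptions, not the get_hash_mem() committed to nodeHash.c.]

```c
#include <assert.h>

/*
 * Rough sketch of the hash_mem_multiplier idea: hash-based nodes get
 * work_mem (in kilobytes) scaled by a multiplier GUC.  The name and the
 * overflow clamp below are assumptions for illustration, not the actual
 * PostgreSQL implementation.
 */
static int
get_hash_mem_sketch(int work_mem_kb, double hash_mem_multiplier)
{
	double		hash_mem = (double) work_mem_kb * hash_mem_multiplier;
	const double max_kb = 2147483647.0;	/* keep the result in int range */

	if (hash_mem > max_kb)
		hash_mem = max_kb;
	return (int) hash_mem;
}
```

With the default multiplier of 1.0 the behavior is unchanged from plain work_mem; raising it gives hash aggregation and hash joins more memory without touching sort-based nodes.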
Re: Default setting for enable_hashagg_disk
On 2020-Jul-23, Peter Geoghegan wrote: > Attached is v3 of the hash_mem_multiplier patch series, which now has > a preparatory patch that removes hashagg_avoid_disk_plan. I notice you put the prototype for get_hash_mem in nodeHash.h. This would be fine if not for the fact that optimizer needs to call the function too, which means now optimizer have to include executor headers -- not a great thing. I'd move the prototype elsewhere to avoid this, and I think miscadmin.h is a decent place for the prototype, next to work_mem and m_w_m. It remains strange to have the function in executor implementation, but I don't offhand see a better place, so maybe it's okay where it is. Other than that admittedly trivial complaint, I found nothing to complain about in this patch. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [BUG] Error in BRIN summarization
On 2020-Jul-27, Anastasia Lubennikova wrote: > Here is the updated version of the fix. > The problem can be reproduced on all supported versions, so I suggest to > backpatch it. > Code slightly changed in v12, so here are two patches: one for versions 9.5 > to 11 and another for versions from 12 to master. Hi Anastasia, thanks for this report and fix. I was considering this last week and noticed that the patch changes the ABI of heap_get_root_tuples, which may be problematic in back branches. I suggest that for unreleased branches (12 and prior) we need to create a new function with the new signature, and keep heap_get_root_tuples unchanged. In 13 and master we don't need that trick, so we can keep the code as you have it in this version of the patch.

OffsetNumber
heap_get_root_tuples_new(Page page, OffsetNumber *root_offsets)
{
	.. full implementation ...
}

/* ABI compatibility only */
void
heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
{
	(void) heap_get_root_tuples_new(page, root_offsets);
}

(I was also considering whether it needs to be a loop to reobtain root tuples, in case a concurrent transaction can create a new item while we're checking that item; but I don't think that can really happen for one individual tuple.) Thanks

--
Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [POC] Fast COPY FROM command for the table with foreign partitions
Hi Andrey, On 2020-07-23 09:23, Andrey V. Lepikhov wrote: On 7/16/20 2:14 PM, Amit Langote wrote: Amit Langote EnterpriseDB: http://www.enterprisedb.com Version 5 of the patch. With changes caused by Amit's comments. Just got a segfault with your v5 patch by deleting from a foreign table. Here is a part of backtrace: * frame #0: 0x0001029069ec postgres`ExecShutdownForeignScan(node=0x7ff28c8909b0) at nodeForeignscan.c:385:3 frame #1: 0x0001028e7b06 postgres`ExecShutdownNode(node=0x7ff28c8909b0) at execProcnode.c:779:4 frame #2: 0x00010299b3fa postgres`planstate_walk_members(planstates=0x7ff28c8906d8, nplans=1, walker=(postgres`ExecShutdownNode at execProcnode.c:752), context=0x) at nodeFuncs.c:3998:7 frame #3: 0x00010299b010 postgres`planstate_tree_walker(planstate=0x7ff28c8904c0, walker=(postgres`ExecShutdownNode at execProcnode.c:752), context=0x) at nodeFuncs.c:3914:8 frame #4: 0x0001028e7ab7 postgres`ExecShutdownNode(node=0x7ff28c8904c0) at execProcnode.c:771:2 (lldb) f 0 frame #0: 0x0001029069ec postgres`ExecShutdownForeignScan(node=0x7ff28c8909b0) at nodeForeignscan.c:385:3 382 FdwRoutine *fdwroutine = node->fdwroutine; 383 384 if (fdwroutine->ShutdownForeignScan) -> 385 fdwroutine->ShutdownForeignScan(node); 386 } (lldb) p node->fdwroutine->ShutdownForeignScan (ShutdownForeignScan_function) $1 = 0x7f7f7f7f7f7f7f7f It seems that ShutdownForeignScan inside node->fdwroutine doesn't have a correct pointer to the required function. I haven't had a chance to look closer on the code, but you can easily reproduce this error with the attached script (patched Postgres binaries should be available in the PATH). It works well with master and fails with your patch applied. 
Regards

--
Alexey Kondratov
Postgres Professional https://www.postgrespro.com
Russian Postgres Company

#!/usr/bin/env sh
pg_ctl -D node1 stop > /dev/null
pg_ctl -D node2 stop > /dev/null
rm -rf node1 node2
rm node1.log node2.log

initdb -D node1
initdb -D node2
echo "port = 5433" >> node2/postgresql.conf

pg_ctl -D node1 -l node1.log start
pg_ctl -D node2 -l node2.log start

createdb
createdb -p5433

psql -p5433 -c "CREATE TABLE test (id INT) PARTITION BY HASH (id)"
psql -c "CREATE EXTENSION IF NOT EXISTS postgres_fdw"
psql -c "CREATE SERVER node2 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (port '5433'); CREATE USER MAPPING FOR current_user SERVER node2"
psql -c "CREATE FOREIGN TABLE test(id INT) SERVER node2"
psql -c "DELETE FROM test"
Re: Should we remove a fallback promotion? take 2
On 2020/07/27 17:53, Hamid Akhtar wrote: Applying the patch to the current master branch throws 9 hunks. AFAICT, the patch is good otherwise. So you think that the patch can be marked as Ready for Committer? Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: Display individual query in pg_stat_activity
On Mon, Jul 27, 2020 at 4:28 PM Jeremy Schneider wrote: > On 7/27/20 07:57, Dave Page wrote: > > I'm not sure I'd want that to happen, as it could make it much harder to > track the activity back to a query in the application layer or server logs. > > Perhaps a separate field could be added for the current statement, or a > value to indicate what the current statement number in the query is? > > > Might be helpful to give some specifics about circumstances where strings > can appear in pg_stat_activity.query with multiple statements. > > 1) First of all, IIUC multiple statements are only supported in the first > place by the simple protocol and PLs. Anyone using parameterized > statements (bind variables) should be unaffected by this. > > 2) My read of the official pg JDBC driver is that even for batch > operations it currently iterates and sends each statement individually. I > don't think the JDBC driver has the capability to send multiple statements, > so java apps using this driver should be unaffected. > That is just one of a number of different popular drivers of course. > > 3) psql -c will always send the string as a single "simple protocol" > request. Scripts will be impacted. > > 4) PLs also seem to have a code path that can put multiple statements in > pg_stat_activity when parallel slaves are launched. PL code will be > impacted. > > 5) pgAdmin uses the simple protocol and when a user executes a block of > statements, pgAdmin seems to send the whole block as a single "simple > protocol" request. Tools like pgAdmin will be impacted. > It does. It also prepends some queries with comments, specifically to allow users to filter them out when they're analysing logs (a feature requested by users, not just something we thought was a good idea). I'm assuming that this patch would also strip those? > > At the application layer, it doesn't seem problematic to me if PostgreSQL > reports each query one at a time. 
IMO most people will find this to be a > more useful behavior and they will still find their queries in their app > code or app logs. > I think there are arguments to be made for both approaches. > > However at the PostgreSQL logging layer this is a good call-out. I just > did a quick test on 14devel to double-check my assumption and it does seem > that PostgreSQL logs the entire combined query for psql -c. I think it > would be better for PostgreSQL to report queries individually in the log > too - for example pgBadger summaries will be even more useful if they > report information for each individual query rather than a single big block > of multiple queries. > > Given how small this patch is, it seems worthwhile to at least investigate > whether the logging component could be addressed just as easily. > > -Jeremy > > -- > Jeremy Schneider > Database Engineer > Amazon Web Services > > > -- Dave Page Blog: http://pgsnake.blogspot.com Twitter: @pgsnake EDB: http://www.enterprisedb.com
Re: Default setting for enable_hashagg_disk
On Sun, Jul 26, 2020 at 11:34 AM Tomas Vondra wrote: > That's 1.6GB, if I read it right. Which is more than 200MB ;-) Sigh. That solves that "mystery": the behavior that my sorted vs random example exhibited is a known limitation in hash aggs that spill (and an acceptable one). The memory usage is reported on accurately by EXPLAIN ANALYZE. -- Peter Geoghegan
Re: hashagg slowdown due to spill changes
On Sun, Jul 26, 2020 at 4:17 PM Jeff Davis wrote: > I saw less of an improvement than you or Andres (perhaps just more > noise). But given that both you and Andres are reporting a measurable > improvement, then I went ahead and committed it. Thank you. Thanks! -- Peter Geoghegan
Re: [BUG] Error in BRIN summarization
On 23.07.2020 20:39, Anastasia Lubennikova wrote: One of our clients caught an error "failed to find parent tuple for heap-only tuple at (50661,130) in table "tbl"" in PostgreSQL v12. Steps to reproduce (REL_12_STABLE): 1) Create table with primary key, create brin index, fill table with some initial data: create table tbl (id int primary key, a int) with (fillfactor=50); create index idx on tbl using brin (a) with (autosummarize=on); insert into tbl select i, i from generate_series(0,10) as i; 2) Run script test_brin.sql using pgbench: pgbench postgres -f ../review/brin_test.sql -n -T 120 The script is a bit messy because I was trying to reproduce a problematic workload. Though I didn't manage to simplify it. The idea is that it inserts new values into the table to produce unindexed pages and also updates some values to trigger HOT-updates on these pages. 3) Open psql session and run brin_summarize_new_values select brin_summarize_new_values('idx'::regclass::oid); \watch 2 Wait a bit. And in psql you will see the ERROR. This error is caused by the problem with root_offsets array bounds. It occurs if a new HOT tuple was inserted after we've collected root_offsets, and thus we don't have root_offset for tuple's offnum. Concurrent insertions are possible, because brin_summarize_new_values() only holds ShareUpdateLock on table and no lock (only pin) on the page. The draft fix is in the attachments. It saves root_offsets_size and checks that we only access valid fields. Patch also adds some debug messages, just to ensure that problem was caught. TODO: - check if heapam_index_validate_scan() has the same problem - code cleanup - test other PostgreSQL versions [1] https://www.postgresql.org/message-id/flat/CA%2BTgmoYgwjmmjK24Qxb_vWAu8_Hh7gfVFcr3%2BR7ocdLvYOWJXg%40mail.gmail.com Here is the updated version of the fix. The problem can be reproduced on all supported versions, so I suggest to backpatch it.
Code slightly changed in v12, so here are two patches: one for versions 9.5 to 11 and another for versions from 12 to master. As for heapam_index_validate_scan(), I've tried to reproduce the same error with CREATE INDEX CONCURRENTLY, but haven't found any problem with it. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index fc19f40a2e3..cb29a833663 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1176,6 +1176,7 @@ heapam_index_build_range_scan(Relation heapRelation,
 	BlockNumber previous_blkno = InvalidBlockNumber;
 	BlockNumber root_blkno = InvalidBlockNumber;
 	OffsetNumber root_offsets[MaxHeapTuplesPerPage];
+	OffsetNumber root_offsets_size = 0;

 	/*
 	 * sanity checks
@@ -1341,6 +1342,11 @@ heapam_index_build_range_scan(Relation heapRelation,
 		 * buffer continuously while visiting the page, so no pruning
 		 * operation can occur either.
 		 *
+		 * It is essential, though, to check the root_offsets_size bound
+		 * before accessing the array, because concurrently inserted HOT tuples
+		 * don't have a valid cached root offset and we need to build the map
+		 * once again for them.
+		 *
 		 * Also, although our opinions about tuple liveness could change while
 		 * we scan the page (due to concurrent transaction commits/aborts),
 		 * the chain root locations won't, so this info doesn't need to be
@@ -1355,7 +1361,7 @@ heapam_index_build_range_scan(Relation heapRelation,
 			Page		page = BufferGetPage(hscan->rs_cbuf);

 			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-			heap_get_root_tuples(page, root_offsets);
+			root_offsets_size = heap_get_root_tuples(page, root_offsets);
 			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_UNLOCK);

 			root_blkno = hscan->rs_cblock;
@@ -1643,6 +1649,22 @@ heapam_index_build_range_scan(Relation heapRelation,
 				rootTuple = *heapTuple;
 				offnum = ItemPointerGetOffsetNumber(&heapTuple->t_self);

+				/*
+				 * As we do not hold buffer lock, concurrent insertion can happen.
+				 * If so, collect the map once again to find the root offset for
+				 * the new tuple.
+				 */
+				if (root_offsets_size < offnum)
+				{
+					Page	page = BufferGetPage(hscan->rs_cbuf);
+
+					LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
+					root_offsets_size = heap_get_root_tuples(page, root_offsets);
+					LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_UNLOCK);
+				}
+
+				Assert(root_offsets_size >= offnum);
+
 				if (!OffsetNumberIsValid(root_offsets[offnum - 1]))
 					ereport(ERROR,
 							(errcode(ERRCODE_DATA_CORRUPTED),
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 0efe3ce9995..5ddb123f30c 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -740,12 +740,15 @@ heap_page_prune_execute(Buffer buffer,
 * Note: The information collected here is valid only as long as
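[Editor's note: the shape of the fix above can be modeled outside the server: track how many entries of the cached root-offsets map are valid, and rebuild the map before any lookup past that bound. Everything below is a toy model with invented names, not the heapam code itself.]

```c
#include <assert.h>

typedef unsigned short OffsetNumber;

#define MAX_TUPLES_PER_PAGE 291	/* stand-in for MaxHeapTuplesPerPage */

/* Toy "collect root offsets": pretend every tuple is its own HOT root. */
static OffsetNumber
collect_root_offsets(OffsetNumber *map, OffsetNumber page_max_offnum)
{
	for (OffsetNumber i = 0; i < page_max_offnum; i++)
		map[i] = (OffsetNumber) (i + 1);
	return page_max_offnum;		/* becomes root_offsets_size */
}

/*
 * Look up a tuple's root offset, rebuilding the cached map first when the
 * requested offset lies beyond what was collected -- the bound check the
 * patch adds before every array access.
 */
static OffsetNumber
lookup_root_offset(OffsetNumber *map, OffsetNumber *map_size,
				   OffsetNumber page_max_offnum, OffsetNumber offnum)
{
	if (*map_size < offnum)
		*map_size = collect_root_offsets(map, page_max_offnum);
	assert(*map_size >= offnum);
	return map[offnum - 1];
}

/* Scenario: map built while page held 10 tuples; concurrent inserts grow it. */
static int
demo_concurrent_insert(void)
{
	OffsetNumber map[MAX_TUPLES_PER_PAGE];
	OffsetNumber map_size = collect_root_offsets(map, 10);

	/* the page now holds 12 tuples; offset 11 is past the cached bound */
	return lookup_root_offset(map, &map_size, 12, 11);
}
```

The point of the model is only the control flow: without the `*map_size < offnum` check, the lookup would read a stale (uninitialized) slot, which is exactly the "failed to find parent tuple" failure mode reported above.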
Re: Display individual query in pg_stat_activity
Hi On Mon, Jul 27, 2020 at 3:40 PM Drouvot, Bertrand wrote: > Hi hackers, > > I've attached a patch to display individual query in the pg_stat_activity > query field when multiple SQL statements are currently displayed. > > *Motivation:* > > When multiple statements are displayed then we don’t know which one is > currently running. > I'm not sure I'd want that to happen, as it could make it much harder to track the activity back to a query in the application layer or server logs. Perhaps a separate field could be added for the current statement, or a value to indicate what the current statement number in the query is? -- Dave Page Blog: http://pgsnake.blogspot.com Twitter: @pgsnake EDB: http://www.enterprisedb.com
Re: display offset along with block number in vacuum errors
On Fri, Jul 24, 2020 at 11:18:43PM +0530, Mahendra Singh Thalor wrote: > Hi hackers, > We discussed in another email thread[1], that it will be helpful if we can > display offset along with block number in vacuum error. Here, proposing a > patch to add offset along with block number in vacuum errors. Thanks. I happened to see both threads, only by chance. I'd already started writing the same as your 0001, which is essentially the same as yours. Here:

@@ -1924,14 +1932,22 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 		BlockNumber tblk;
 		OffsetNumber toff;
 		ItemId		itemid;
+		LVSavedErrInfo loc_saved_err_info;

 		tblk = ItemPointerGetBlockNumber(&vacrelstats->dead_tuples->itemptrs[tupindex]);
 		if (tblk != blkno)
 			break;				/* past end of tuples for this block */
 		toff = ItemPointerGetOffsetNumber(&vacrelstats->dead_tuples->itemptrs[tupindex]);
+
+		/* Update error traceback information */
+		update_vacuum_error_info(vacrelstats, &loc_saved_err_info,
+								 VACUUM_ERRCB_PHASE_VACUUM_HEAP,
+								 blkno, toff);
 		itemid = PageGetItemId(page, toff);
 		ItemIdSetUnused(itemid);
 		unused[uncnt++] = toff;
+
+		/* Revert to the previous phase information for error traceback */
+		restore_vacuum_error_info(vacrelstats, &loc_saved_err_info);
 	}

I'm not sure why you use restore_vacuum_error_info() at all. It's already called at the end of lazy_vacuum_page() (and others) to allow functions to clean up after their own state changes, rather than requiring callers to do it. I don't think you should use it in a loop, nor introduce another LVSavedErrInfo. Since phase and blkno are already set, I think you should just set vacrelstats->offnum = toff, rather than calling update_vacuum_error_info().
Similar to what's done in lazy_vacuum_heap():

	tblk = ItemPointerGetBlockNumber(&vacrelstats->dead_tuples->itemptrs[tupindex]);
	vacrelstats->blkno = tblk;

I think you should also do:

@@ -2976,6 +2984,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
 		ItemId		itemid;
 		HeapTupleData tuple;

+		vacrelstats->offnum = offnum;

I'm not sure, but maybe you'd also want to do the same in more places:

@@ -2024,6 +2030,7 @@ lazy_check_needs_freeze(Buffer buf, bool *hastup)
@@ -2790,6 +2797,7 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)

-- Justin
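[Editor's note: the update/restore convention discussed here — a callee saves the caller's error-callback state, overwrites it for its own phase, and restores it before returning — can be shown in miniature. The struct and function names below mimic vacuumlazy.c but are illustrative only.]

```c
#include <assert.h>

/* Minimal model of the vacuum error-traceback state; illustrative names. */
typedef struct ErrInfo
{
	int			phase;
	unsigned	blkno;
	unsigned short offnum;
} ErrInfo;

/* Save the caller's context into *saved, then install the new context. */
static void
update_err_info(ErrInfo *cur, ErrInfo *saved,
				int phase, unsigned blkno, unsigned short offnum)
{
	*saved = *cur;
	cur->phase = phase;
	cur->blkno = blkno;
	cur->offnum = offnum;
}

/* Put the caller's context back -- the callee cleans up after itself. */
static void
restore_err_info(ErrInfo *cur, const ErrInfo *saved)
{
	*cur = *saved;
}

/* A callee that reports per-tuple context, then reverts before returning. */
static unsigned short
demo_offnum_after_callee(void)
{
	ErrInfo		cur = {1, 42, 0};	/* caller: scanning block 42, no offset */
	ErrInfo		saved;

	update_err_info(&cur, &saved, 2, 42, 7);	/* per-tuple work at offnum 7 */
	restore_err_info(&cur, &saved);
	return cur.offnum;			/* back to the caller's value */
}
```

This is also why Justin objects to calling the pair inside a per-tuple loop: when phase and blkno are already correct, assigning the offset field directly is equivalent and avoids the save/restore churn.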
Re: shared tempfile was not removed on statement_timeout
On Mon, Jul 27, 2020 at 08:00:46PM +1200, Thomas Munro wrote: > On Tue, Jul 21, 2020 at 4:33 PM Justin Pryzby wrote: > > /* > > * clean up a spool structure and its substructures. > > */ > > static void > > _bt_spooldestroy(BTSpool *btspool) > > { > > + void *fileset = tuplesort_shared_fileset(btspool->sortstate); > > + if (fileset) > > + SharedFileSetDeleteAll(fileset); > > tuplesort_end(btspool->sortstate); > > pfree(btspool); > > } > > Why can't tuplesort_end do it? Because then I think the parallel workers remove their own files, with tests failing like: +ERROR: could not open temporary file "0.0" from BufFile "0": No such file or directory I looked around a bit more and came up with this, which works, but I don't know enough to say if it's right.

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 5f6420efb2..f89d42f475 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -3344,6 +3344,7 @@ walkdir(const char *path,
 		struct stat fst;
 		int			sret;

+		usleep(9);
 		CHECK_FOR_INTERRUPTS();

 		if (strcmp(de->d_name, ".") == 0 ||
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 3c49476483..c6e5e6d00b 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -1387,6 +1387,9 @@ tuplesort_free(Tuplesortstate *state)
 void
 tuplesort_end(Tuplesortstate *state)
 {
+	if (state->shared && state->shared->workersFinished == state->nParticipants)
+		SharedFileSetDeleteAll(&state->shared->fileset);
+
 	tuplesort_free(state);

 	/*
Re: Index Skip Scan (new UniqueKeys)
> On Tue, Jul 21, 2020 at 04:35:55PM -0700, Peter Geoghegan wrote: > > > Well, it's obviously wrong, thanks for noticing. What is necessary is to > > compare two index tuples, the start and the next one, to test if they're > > the same (in which case if I'm not mistaken probably we can compare item > > pointers). I've got this question when I was about to post a new version > > with changes to address feedback from Andy, now I'll combine them and > > send a cumulative patch. > > This sounds like approximately the same problem as the one that > _bt_killitems() has to deal with as of Postgres 13. This is handled in > a way that is admittedly pretty tricky, even though the code does not > need to be 100% certain that it's "the same" tuple. Deduplication kind > of makes that a fuzzy concept. In principle there could be one big > index tuple instead of 5 tuples, even though the logical contents of > the page have not been changed between the time we recording heap TIDs > in local and the time _bt_killitems() tried to match on those heap > TIDs to kill_prior_tuple-kill some index tuples -- a concurrent > deduplication pass could do that. Your code needs to be prepared for > stuff like that. > > Ultimately posting list tuples are just a matter of understanding the > on-disk representation -- a "Small Matter of Programming". Even > without deduplication there are potential hazards from the physical > deletion of LP_DEAD-marked tuples in _bt_vacuum_one_page() (which is > not code that runs in VACUUM, despite the name). Make sure that you > hold a buffer pin on the leaf page throughout, because you need to do > that to make sure that VACUUM cannot concurrently recycle heap TIDs. > If VACUUM *is* able to concurrently recycle heap TIDs then it'll be > subtly broken. _bt_killitems() is safe because it either holds on to a > pin or gives up when the LSN changes at all. 
(ISTM that your only > choice is to hold on to a leaf page pin, since you cannot just decide > to give up in the way that _bt_killitems() sometimes can.) I see, thanks for clarification. You're right, in this part of implementation there is no way to give up if LSN changes like _bt_killitems does. As far as I can see the leaf page is already pinned all the time between reading relevant tuples and comparing them, I only need to handle posting list tuples.
Re: Should we remove a fallback promotion? take 2
Applying the patch to the current master branch throws 9 hunks. AFAICT, the patch is good otherwise. On Wed, Jun 3, 2020 at 3:20 PM Fujii Masao wrote: > > > On 2020/06/03 12:06, Kyotaro Horiguchi wrote: > > At Wed, 3 Jun 2020 09:43:17 +0900, Fujii Masao < > masao.fu...@oss.nttdata.com> wrote in > >> I will change the status back to Needs Review. > > Thanks for the review! > > > record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1, > false); > > if (record != NULL) > > { > > - fast_promoted = true; > > + promoted = true; > > > > Even if we missed the last checkpoint record, we don't give up > > promotion and continue fall-back promotion but the variable "promoted" > > stays false. That is confusiong. > > > > How about changing it to fallback_promotion, or some names with more > > behavior-specific name like immediate_checkpoint_needed? > > > I like doEndOfRecoveryCkpt or something, but I have no strong opinion > about that flag naming. So I'm ok with immediate_checkpoint_needed > if others also like that, too. > > Regards, > > -- > Fujii Masao > Advanced Computing Technology Center > Research and Development Headquarters > NTT DATA CORPORATION > -- Highgo Software (Canada/China/Pakistan) URL : www.highgo.ca ADDR: 10318 WHALLEY BLVD, Surrey, BC CELL:+923335449950 EMAIL: mailto:hamid.akh...@highgo.ca SKYPE: engineeredvirus
Re: shared tempfile was not removed on statement_timeout
On Tue, Jul 21, 2020 at 4:33 PM Justin Pryzby wrote: > /* > * clean up a spool structure and its substructures. > */ > static void > _bt_spooldestroy(BTSpool *btspool) > { > + void *fileset = tuplesort_shared_fileset(btspool->sortstate); > + if (fileset) > + SharedFileSetDeleteAll(fileset); > tuplesort_end(btspool->sortstate); > pfree(btspool); > } Why can't tuplesort_end do it?
Re: display offset along with block number in vacuum errors
Thanks Michael for looking into this. On Sat, 25 Jul 2020 at 15:02, Michael Paquier wrote: > > On Fri, Jul 24, 2020 at 11:18:43PM +0530, Mahendra Singh Thalor wrote: > > In commit b61d161(Introduce vacuum errcontext to display additional > > information), we added vacuum errcontext to display additional > > information(block number) so that in case of vacuum error, we can identify > > which block we are getting error. Addition to block number, if we can > > display offset, then it will be more helpful for users. So to display > > offset, here proposing two different methods(Thanks Robert for suggesting > > these 2 methods): > > new_rel_pages = count_nondeletable_pages(onerel, vacrelstats); > vacrelstats->blkno = new_rel_pages; > + vacrelstats->offnum = InvalidOffsetNumber; > > Adding more context would be interesting for some cases, but not all > contrary to what your patch does in some code paths like > lazy_truncate_heap() as you would show up an offset of 0 for an > invalid value. This could confuse some users. Note that we are > careful enough to not print a context message if the block number is > invalid. Okay. I agree with you. In case of invalid offset, we can skip offset printing. I will do this change in the next patch. Thanks and Regards Mahendra Singh Thalor EnterpriseDB: http://www.enterprisedb.com
Re: display offset along with block number in vacuum errors
On Sat, 25 Jul 2020 at 02:49, Mahendra Singh Thalor wrote: > > Hi hackers, > We discussed in another email thread[1], that it will be helpful if we can > display offset along with block number in vacuum error. Here, proposing a > patch to add offset along with block number in vacuum errors. > > In commit b61d161(Introduce vacuum errcontext to display additional > information), we added vacuum errcontext to display additional > information(block number) so that in case of vacuum error, we can identify > which block we are getting error. Addition to block number, if we can > display offset, then it will be more helpful for users. So to display offset, > here proposing two different methods(Thanks Robert for suggesting these 2 > methods): > > Method 1: We can report the TID as well as the block number in errcontext. > - errcontext("while scanning block %u of relation \"%s.%s\"", > - errinfo->blkno, errinfo->relnamespace, errinfo->relname); > + errcontext("while scanning block %u and offset %u of relation \"%s.%s\"", > + errinfo->blkno, errinfo->offnum, errinfo->relnamespace, errinfo->relname); > > Above fix requires more calls to update_vacuum_error_info(). Attaching > v01_0001 patch for this method. > > Method 2: We can improve the error messages by passing the relevant TID to > heap_prepare_freeze_tuple and having it report the TID as part of the error > message or in the error detail. > ereport(ERROR, > (errcode(ERRCODE_DATA_CORRUPTED), > - errmsg_internal("found xmin %u from before relfrozenxid %u", > + errmsg_internal("for block %u and offnum %u, found xmin %u from before > relfrozenxid %u", > + ItemPointerGetBlockNumber(tid), > + ItemPointerGetOffsetNumber(tid), > xid, relfrozenxid))); > > Attaching v01_0002 patch for this method. > > Please let me know your thoughts. > +1 for adding offset in error messages. I had a look at 0001 patch. 
You've set the vacuum error info but I think an error won't happen during setting itemids unused:

@@ -1924,14 +1932,22 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 		BlockNumber tblk;
 		OffsetNumber toff;
 		ItemId		itemid;
+		LVSavedErrInfo loc_saved_err_info;

 		tblk = ItemPointerGetBlockNumber(&vacrelstats->dead_tuples->itemptrs[tupindex]);
 		if (tblk != blkno)
 			break;				/* past end of tuples for this block */
 		toff = ItemPointerGetOffsetNumber(&vacrelstats->dead_tuples->itemptrs[tupindex]);
+
+		/* Update error traceback information */
+		update_vacuum_error_info(vacrelstats, &loc_saved_err_info,
+								 VACUUM_ERRCB_PHASE_VACUUM_HEAP,
+								 blkno, toff);
 		itemid = PageGetItemId(page, toff);
 		ItemIdSetUnused(itemid);
 		unused[uncnt++] = toff;
+
+		/* Revert to the previous phase information for error traceback */
+		restore_vacuum_error_info(vacrelstats, &loc_saved_err_info);
 	}

 	PageRepairFragmentation(page);

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Postgres-native method to identify if a tuple is frozen
Thanks for the help. I'd seen the heap_page_items functions, but wanted to avoid the superuser requirement and wondered if this was going to be a performant method of finding the freeze column (we're scanning some billions of rows). Fwiw, we think we'll probably go with a tiny extension that exposes the frozen state exactly. For reference, this is the basic sketch:

Datum
frozen(PG_FUNCTION_ARGS)
{
	Oid			reloid = PG_GETARG_OID(0);
	ItemPointer	tid = PG_GETARG_ITEMPOINTER(1);
	Relation	rel;
	HeapTupleData tuple;
	Buffer		buf;
	int			result;

	// Open table and snapshot- ensuring we later close them
	rel = heap_open(reloid, AccessShareLock);

	// Initialise the tuple data with a tid that matches our input
	ItemPointerCopy(tid, &(tuple.t_self));

#if PG_MAJOR < 12
	if (!heap_fetch(rel, SnapshotAny, &tuple, &buf, true, NULL))
#else
	if (!heap_fetch(rel, SnapshotAny, &tuple, &buf))
#endif
	{
		result = 3;
	}
	else
	{
		result = HeapTupleHeaderXminFrozen(tuple.t_data);
	}

	// Close any opened resources here
	heap_close(rel, AccessShareLock);
	ReleaseBuffer(buf);

	PG_RETURN_INT32(result);
}

On Tue, 21 Jul 2020 at 13:22, Amit Kapila wrote: > On Mon, Jul 20, 2020 at 9:07 PM Lawrence Jones > wrote: > > > > > > So we hit the question: how can we identify if a tuple is frozen? I know > the tuple has both committed and aborted hint bits set, but accessing those > bits seems to require superuser functions and are unlikely to be that fast. > > > > Are there system columns (similar to xmin, tid, cid) that we don't know > about? > > > > I think the way to get that information is to use pageinspect > extension and use some query like below but you are right that you > need superuser privilege for that: > > SELECT t_ctid, raw_flags, combined_flags > FROM heap_page_items(get_raw_page('pg_class', 0)), >LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2) > WHERE t_infomask IS NOT NULL OR t_infomask2 IS NOT NULL; > > -- > With Regards, > Amit Kapila. > EnterpriseDB: http://www.enterprisedb.com >
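[Editor's note: the HeapTupleHeaderXminFrozen() test used in the sketch above reduces to checking that two xmin hint bits are set simultaneously in the tuple header's infomask. The mask values below match access/htup_details.h; the helper function itself is illustrative.]

```c
#include <assert.h>
#include <stdint.h>

/* xmin hint-bit masks, as defined in access/htup_details.h */
#define HEAP_XMIN_COMMITTED	0x0100
#define HEAP_XMIN_INVALID	0x0200
#define HEAP_XMIN_FROZEN	(HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID)

/*
 * A tuple's xmin is frozen iff BOTH hint bits are set; either bit alone
 * means ordinary committed or aborted xmin, which is why testing the bits
 * individually is not enough.
 */
static int
xmin_frozen(uint16_t t_infomask)
{
	return (t_infomask & HEAP_XMIN_FROZEN) == HEAP_XMIN_FROZEN;
}
```

This "both bits set" encoding is also what pageinspect's heap_tuple_infomask_flags() decodes, which is why the combined_flags column distinguishes HEAP_XMIN_FROZEN from the two raw bits.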
Re: Transactions involving multiple postgres foreign servers, take 2
On Thu, 23 Jul 2020 at 22:51, Muhammad Usama wrote:
>
> On Wed, Jul 22, 2020 at 12:42 PM Masahiko Sawada wrote:
>>
>> On Sat, 18 Jul 2020 at 01:55, Fujii Masao wrote:
>> >
>> > On 2020/07/16 14:47, Masahiko Sawada wrote:
>> > > On Tue, 14 Jul 2020 at 11:19, Fujii Masao wrote:
>> > >>
>> > >> On 2020/07/14 9:08, Masahiro Ikeda wrote:
>> > I've attached the latest version patches. I've incorporated the review
>> > comments I got so far and improved locking strategy.
>> > >>>
>> > >>> Thanks for updating the patch!
>> > >>
>> > >> +1
>> > >> I'm interested in these patches and now studying them. While checking
>> > >> the behaviors of the patched PostgreSQL, I got three comments.
>> > >
>> > > Thank you for testing this patch!
>> > >
>> > >>
>> > >> 1. We can access the foreign table even during recovery in the HEAD.
>> > >> But in the patched version, when I did that, I got the following error.
>> > >> Is this intentional?
>> > >>
>> > >> ERROR: cannot assign TransactionIds during recovery
>> > >
>> > > No, it should be fixed. I'm going to fix this by not collecting
>> > > participants for atomic commit during recovery.
>> >
>> > Thanks for trying to fix the issues!
>> >
>> > I'd like to report one more issue. When I started a new transaction
>> > in the local server, executed INSERT in the remote server via
>> > postgres_fdw and then quit psql, I got the following assertion failure.
>> >
>> > TRAP: FailedAssertion("fdwxact", File: "fdwxact.c", Line: 1570)
>> > 0   postgres        0x00010d52f3c0  ExceptionalCondition + 160
>> > 1   postgres        0x00010cefbc49  ForgetAllFdwXactParticipants + 313
>> > 2   postgres        0x00010cefff14  AtProcExit_FdwXact + 20
>> > 3   postgres        0x00010d313fe3  shmem_exit + 179
>> > 4   postgres        0x00010d313e7a  proc_exit_prepare + 122
>> > 5   postgres        0x00010d313da3  proc_exit + 19
>> > 6   postgres        0x00010d35112f  PostgresMain + 3711
>> > 7   postgres        0x00010d27bb3a  BackendRun + 570
>> > 8   postgres        0x00010d27af6b  BackendStartup + 475
>> > 9   postgres        0x00010d279ed1  ServerLoop + 593
>> > 10  postgres        0x00010d277940  PostmasterMain + 6016
>> > 11  postgres        0x00010d1597b9  main + 761
>> > 12  libdyld.dylib   0x7fff7161e3d5  start + 1
>> > 13  ???             0x0003          0x0 + 3
>> >
>>
>> Thank you for reporting the issue!
>>
>> I've attached the latest version patch that incorporated all comments
>> I got so far. I've removed the patch adding the 'prefer' mode of
>> foreign_twophase_commit to keep the patch set simple.
>
> I have started to review the patchset. Just a quick comment.
>
> Patch v24-0002-Support-atomic-commit-among-multiple-foreign-ser.patch
> contains changes (adding fdwxact includes) for
> src/backend/executor/nodeForeignscan.c,
> src/backend/executor/nodeModifyTable.c
> and src/backend/executor/execPartition.c files that don't seem to be
> required with the latest version.

Thanks for your comment. Right. I've removed these changes on the local branch.

Regards,

--
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks
On Fri, Jul 24, 2020 at 8:07 PM Robert Haas wrote:
>
> On Fri, Jul 24, 2020 at 7:10 AM Bharath Rupireddy wrote:
> > I looked at what llvm_shutdown() actually does: it frees up JIT stacks,
> > and also any perf-related resources if they exist, using
> > LLVMOrcDisposeInstance() and LLVMOrcUnregisterPerf(). These were
> > dynamically allocated in llvm_session_initialize through a JIT library
> > function, LLVMOrcCreateInstance() [1].
> >
> > It looks like there is no problem in moving llvm_shutdown to either
> > on_shmem_exit() or on_proc_exit().
>
> If it doesn't involve shared memory, I guess it can be on_proc_exit()
> rather than on_shmem_exit().
>
> I guess the other question is why we're doing it at all. What
> resources are being allocated that wouldn't be freed up by process
> exit anyway?

LLVMOrcCreateInstance() and LLVMOrcDisposeInstance() do new and delete respectively; I just found these functions at the link [1]. But I don't know exactly whether there are any other resources being allocated that can't be freed up by proc_exit(). Tagging @Andres Freund for inputs on whether we have any problem making llvm_shutdown() an on_proc_exit() callback instead of a before_shmem_exit() callback.

And as suggested in the previous mails, we wanted to make it on_proc_exit() to avoid the abort issue reported in this mail chain. However, if we take the abort-issue fix commit 303640199d0436c5e7acdf50b837a027b5726594 mentioned in the previous response [2], that may not be necessary right now; but to be safer and to avoid similar kinds of issues in the future, we can consider this change as well.
[1] - https://llvm.org/doxygen/OrcCBindings_8cpp_source.html

LLVMOrcJITStackRef LLVMOrcCreateInstance(LLVMTargetMachineRef TM) {
  TargetMachine *TM2(unwrap(TM));
  Triple T(TM2->getTargetTriple());
  auto IndirectStubsMgrBuilder =
      orc::createLocalIndirectStubsManagerBuilder(T);
  OrcCBindingsStack *JITStack =
      new OrcCBindingsStack(*TM2, std::move(IndirectStubsMgrBuilder));
  return wrap(JITStack);
}

LLVMErrorRef LLVMOrcDisposeInstance(LLVMOrcJITStackRef JITStack) {
  auto *J = unwrap(JITStack);
  auto Err = J->shutdown();
  delete J;
  return wrap(std::move(Err));
}

[2] - https://www.postgresql.org/message-id/CALj2ACVwOKZ8qYUsZrU2y2efnYZOLRxPC6k52FQcB3oriH9Kcg%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Re: autovac issue with large number of tables
On Mon, 27 Jul 2020 at 06:43, Nasby, Jim wrote:
>
> A database with a very large number of tables eligible for autovacuum can
> result in autovacuum workers "stuck" in a tight loop of
> table_recheck_autovac() constantly reporting nothing to do on the table.
> This is because a database with a very large number of tables means it
> takes a while to search the statistics hash to verify that the table
> still needs to be processed [1]. If a worker spends some time processing
> a table, when it's done it can spend a significant amount of time
> rechecking each table that it identified at launch (I've seen a worker in
> this state for over an hour). A simple work-around in this scenario is to
> kill the worker; the launcher will quickly fire up a new worker on the
> same database, and that worker will build a new list of tables.
>
> That's not a complete solution though... if the database contains a large
> number of very small tables you can end up in a state where 1 or 2
> workers are busy chugging through those small tables so quickly that any
> additional workers spend all their time in table_recheck_autovac(),
> because that takes long enough that the additional workers are never able
> to "leapfrog" the workers that are doing useful work.

As another solution, I've been considering adding a queue of table OIDs that need to be vacuumed/analyzed to shared memory (e.g. on DSA). Since all autovacuum workers running on the same database would see a consistent queue, the issue explained above won't happen, and it probably also makes it easier to implement prioritization of the tables being vacuumed, which is sometimes discussed on pgsql-hackers. I think it might be worth discussing this idea.

Regards,

--
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Reigning in ExecParallelHashRepartitionFirst
On Thu, Jul 9, 2020 at 8:17 AM Melanie Plageman wrote:
> Last week as I was working on adaptive hash join [1] and trying to get
> parallel adaptive hash join batch 0 to spill correctly, I noticed what
> seemed like a problem with the code to repartition batch 0.
>
> If we run out of space while inserting tuples into the hashtable during
> the build phase of parallel hash join and proceed to increase the number
> of batches, we need to repartition all of the tuples from the old
> generation (when nbatch was x) and move them to their new homes in the
> new generation (when nbatch is 2x). Before we do this repartitioning we
> disable growth in the number of batches.
>
> Then we repartition the tuples from the hashtable, inserting them either
> back into the hashtable or into a batch file. While inserting them into
> the hashtable, we call ExecParallelHashTupleAlloc(), and, if there is no
> space for the current tuple in the current chunk and growth in the
> number of batches is disabled, we go ahead and allocate a new chunk of
> memory -- regardless of whether or not we will exceed the space limit.

Hmm. It shouldn't really be possible for ExecParallelHashRepartitionFirst() to run out of memory anyway, considering that the input of that operation previously fit (just... I mean we started repartitioning because one more chunk would have pushed us over the edge, but the tuples so far fit, and we'll insert them in the same order for each input chunk, possibly filtering some out). Perhaps you reached this condition because batches[0].shared->size finishes up accounting for the memory used by the bucket array in PHJ_GROW_BUCKETS_ELECTING, but didn't originally account for it in generation 0, so what previously appeared to fit no longer does :-(. I'll look into that.
Re: Global snapshots
On 7/27/20 11:22 AM, tsunakawa.ta...@fujitsu.com wrote:
> Hi Andrey san, Movead san,
>
> From: tsunakawa.ta...@fujitsu.com
> > While Clock-SI seems to be considered the most promising for global
> > serializability here,
> >
> > * Why does Clock-SI get so much attention? How did Clock-SI become the
> > only choice?
> >
> > * Clock-SI was devised in Microsoft Research. Does Microsoft or some
> > other organization use Clock-SI?
>
> Could you take a look at this patent? I'm afraid this is the Clock-SI for
> MVCC. Microsoft holds this until 2031. I couldn't find this with the
> keyword "Clock-SI."
>
> US8356007B2 - Distributed transaction management for database systems
> with multiversioning - Google Patents
> https://patents.google.com/patent/US8356007
>
> If it is, can we circumvent this patent?
>
> Regards
> Takayuki Tsunakawa

Thank you for the research (and previous links too). I haven't seen this patent before. This should be carefully studied.

--
regards,
Andrey Lepikhov
Postgres Professional
RE: Global snapshots
Hi Andrey san, Movead san,

From: tsunakawa.ta...@fujitsu.com
> While Clock-SI seems to be considered the most promising for global
> serializability here,
>
> * Why does Clock-SI get so much attention? How did Clock-SI become the
> only choice?
>
> * Clock-SI was devised in Microsoft Research. Does Microsoft or some other
> organization use Clock-SI?

Could you take a look at this patent? I'm afraid this is the Clock-SI for MVCC. Microsoft holds this until 2031. I couldn't find this with the keyword "Clock-SI."

US8356007B2 - Distributed transaction management for database systems with multiversioning - Google Patents
https://patents.google.com/patent/US8356007

If it is, can we circumvent this patent?

Regards
Takayuki Tsunakawa