Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On Tue, Jul 4, 2017 at 12:21 PM, Peter Eisentraut wrote: > On 6/25/17 06:35, Petr Jelinek wrote: >> - Do LockSharedObject in ALTER SUBSCRIPTION, DROP SUBSCRIPTION (this one >> is preexisting but mentioning it for context), SetSubscriptionRelState, >> AddSubscriptionRelState, and in the logicalrep_worker_launch. This means >> we use granular per object locks to deal with concurrency. > > I have committed those locking changes, as we had already discussed them > previously. This should address the open item. Thank you for committing the patch! > > Maybe we can start new threads for the other parts of the patch and > maybe split the patch up a bit. Yeah, let's discuss about reworking the locking and management on new thread. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On 6/25/17 06:35, Petr Jelinek wrote: > - Do LockSharedObject in ALTER SUBSCRIPTION, DROP SUBSCRIPTION (this one > is preexisting but mentioning it for context), SetSubscriptionRelState, > AddSubscriptionRelState, and in the logicalrep_worker_launch. This means > we use granular per object locks to deal with concurrency. I have committed those locking changes, as we had already discussed them previously. This should address the open item. Maybe we can start new threads for the other parts of the patch and maybe split the patch up a bit. At this point I don't want to commit major code rearrangements without specific reasons and detailed analysis. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On Wed, Jun 28, 2017 at 2:13 PM, Masahiko Sawada wrote: > On Wed, Jun 28, 2017 at 1:47 AM, Petr Jelinek > wrote: >> >> On 27/06/17 10:51, Masahiko Sawada wrote: >>> On Mon, Jun 26, 2017 at 12:12 PM, Masahiko Sawada >>> wrote: >>> >>> I've reviewed this patch briefly. >> >> Thanks! >> >>> >>> @@ -515,6 +533,31 @@ logicalrep_worker_stop(Oid subid, Oid relid) >>> } >>> >>> /* >>> + * Request worker to be stopped on commit. >>> + */ >>> +void >>> +logicalrep_worker_stop_at_commit(Oid subid, Oid relid) >>> +{ >>> + LogicalRepWorkerId *wid; >>> + MemoryContext old; >>> + >>> + old = MemoryContextSwitchTo(TopTransactionContext); >>> + >>> + wid = (LogicalRepWorkerId *) palloc(sizeof(LogicalRepWorkerId)); >>> + >>> + /* >>> + wid = MemoryContextAlloc(TopTransactionContext, >>> + >>> sizeof(LogicalRepWorkerId)); >>> + */ >>> + wid->subid = subid; >>> + wid->relid = relid; >>> + >>> + on_commit_stop_workers = lappend(on_commit_stop_workers, wid); >>> + >>> + MemoryContextSwitchTo(old); >>> +} >>> >>> logicalrep_worker_stop_at_commit() has a problem that new_list() >>> called by lappend() allocates the memory from current memory context. >>> It should switch the memory context and then allocate the memory for >>> wid and append it to the list. >>> > > Thank you for updating the patch! > >> >> Right, fixed (I see you did that locally as well based on the above >> excerpt ;) ). > > Oops, yeah that's my test code. > >>> >>> @@ -754,9 +773,25 @@ ApplyLauncherShmemInit(void) >>> void >>> AtEOXact_ApplyLauncher(bool isCommit) >>> { >>> - if (isCommit && on_commit_launcher_wakeup) >>> - ApplyLauncherWakeup(); >>> + ListCell *lc; >>> + >>> + if (isCommit) >>> + { >>> + foreach (lc, on_commit_stop_workers) >>> + { >>> + LogicalRepWorkerId *wid = lfirst(lc); >>> + logicalrep_worker_stop(wid->subid, wid->relid); >>> + } >>> + >>> + if (on_commit_launcher_wakeup) >>> + ApplyLauncherWakeup(); >>> >>> Stopping the logical rep worker in AtEOXact_ApplyLauncher doesn't >>> support the prepared transaction. Since we allocate the list >>> on_commit_stop_workers in TopTransactionContext the postgres crashes >>> if we execute any query after prepared transaction that removes >>> subscription relation state. Also after fixed this issue, we still >>> need to something: the list of on_commit_stop_workers is not >>> associated the prepared transaction. A query next to "preapre >>> transaction" tries to stop workers at the commit. There was similar >>> discussion before. >>> >> >> Hmm, good point. I think for now it makes sense to simply don't allow >> PREPARE for transactions that manipulate workers similar to what we do >> when there are exported snapshots. Done it that way in attached. > > Agreed. Should we note that in docs? > >> >>> >>> + >>> + ensure_transaction(); >>> + >>> UpdateSubscriptionRelState(MyLogicalRepWorker->subid, >>> + >>> rstate->relid, rstate->state, >>> + >>> rstate->lsn); >>> >>> >>> Should we commit the transaction if we started a new transaction >>> before update the subscription relation state, or it could be deadlock >>> risk. >> >> We only lock the whole subscription (and only conflicting things are >> DROP and ALTER SUBSCRIPTION), not individual subscription-relation >> mapping so it doesn't seem to me like there is any higher risk of >> deadlocks than anything else which works with multiple tables (or >> compared to previous behavior). >> > > I'm concerned that a lock for whole subscription could be conflicted > between ALTER SUBSCRIPTION, table sync worker and apply worker: > > Please imagine the following case. > > 1. The apply worker updates a subscription relation state R1 of > subscription S1. > -> It acquires AccessShareLock on S1, and keep holding. > 2. ALTER SUBSCRIPTION SET PUBLICATION tries to acquire > AccessExclusiveLock on S1. > -> But it waits for the apply worker to release the lock. > 3. The apply worker calls wait_for_relation_state_change(relid, > SUBREL_STATE_SYNCDONE) and waits for the table sync worker > for R2 to change its status. > -> Note that the apply worker is still holding AccessShareLock on S1 > 4. The table sync worker tries to update its status to SYNCDONE > -> In UpdateSubscriptionRelState(), it tires to acquire AccessShareLock >on S1 but waits for it. *deadlock* > > To summary, because the apply worker keeps holding AccessShareLock on > S1, the acquiring AccessExclusiveLock by ALTER SUBSCRIPTION waits for > the apply worker, and then the table sync worker also waits for the > ALTER SUBSCRIPTION in order to change its status. And the apply worker > waits for the table sync worker to change its status. > > I encountered the similar case o
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On 6/29/17 23:39, Noah Misch wrote: > IMMEDIATE ATTENTION REQUIRED. This PostgreSQL 10 open item is long past due > for your status update. Please reacquaint yourself with the policy on open > item ownership[1] and then reply immediately. If I do not hear from you by > 2017-07-01 04:00 UTC, I will transfer this item to release management team > ownership without further notice. I will work on this over the weekend and report back on Monday. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On Wed, Jun 28, 2017 at 03:22:18AM +, Noah Misch wrote: > On Fri, Jun 23, 2017 at 09:42:10PM -0400, Peter Eisentraut wrote: > > On 6/21/17 22:47, Peter Eisentraut wrote: > > > On 6/20/17 22:44, Noah Misch wrote: > > >>> A patch has been posted, and it's being reviewed. Next update Monday. > > >> > > >> This PostgreSQL 10 open item is past due for your status update. Kindly > > >> send > > >> a status update within 24 hours, and include a date for your subsequent > > >> status > > >> update. Refer to the policy on open item ownership: > > >> https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com > > > > > > I'm not sure how to proceed here. Nobody is else seems terribly > > > interested, and this is really a minor issue, so I don't want to muck > > > around with the locking endlessly. Let's say, if there are no new > > > insights by Friday, I'll pick one of the proposed patches or just close > > > it without any patch. > > > > After some comments, a new patch has been posted, and I'll give until > > Monday for review. > > This PostgreSQL 10 open item is past due for your status update. Kindly send > a status update within 24 hours, and include a date for your subsequent status > update. Refer to the policy on open item ownership: > https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com IMMEDIATE ATTENTION REQUIRED. This PostgreSQL 10 open item is long past due for your status update. Please reacquaint yourself with the policy on open item ownership[1] and then reply immediately. If I do not hear from you by 2017-07-01 04:00 UTC, I will transfer this item to release management team ownership without further notice. [1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On Wed, Jun 28, 2017 at 1:47 AM, Petr Jelinek wrote: > > On 27/06/17 10:51, Masahiko Sawada wrote: >> On Mon, Jun 26, 2017 at 12:12 PM, Masahiko Sawada >> wrote: >> >> I've reviewed this patch briefly. > > Thanks! > >> >> @@ -515,6 +533,31 @@ logicalrep_worker_stop(Oid subid, Oid relid) >> } >> >> /* >> + * Request worker to be stopped on commit. >> + */ >> +void >> +logicalrep_worker_stop_at_commit(Oid subid, Oid relid) >> +{ >> + LogicalRepWorkerId *wid; >> + MemoryContext old; >> + >> + old = MemoryContextSwitchTo(TopTransactionContext); >> + >> + wid = (LogicalRepWorkerId *) palloc(sizeof(LogicalRepWorkerId)); >> + >> + /* >> + wid = MemoryContextAlloc(TopTransactionContext, >> + >> sizeof(LogicalRepWorkerId)); >> + */ >> + wid->subid = subid; >> + wid->relid = relid; >> + >> + on_commit_stop_workers = lappend(on_commit_stop_workers, wid); >> + >> + MemoryContextSwitchTo(old); >> +} >> >> logicalrep_worker_stop_at_commit() has a problem that new_list() >> called by lappend() allocates the memory from current memory context. >> It should switch the memory context and then allocate the memory for >> wid and append it to the list. >> Thank you for updating the patch! > > Right, fixed (I see you did that locally as well based on the above > excerpt ;) ). Oops, yeah that's my test code. >> >> @@ -754,9 +773,25 @@ ApplyLauncherShmemInit(void) >> void >> AtEOXact_ApplyLauncher(bool isCommit) >> { >> - if (isCommit && on_commit_launcher_wakeup) >> - ApplyLauncherWakeup(); >> + ListCell *lc; >> + >> + if (isCommit) >> + { >> + foreach (lc, on_commit_stop_workers) >> + { >> + LogicalRepWorkerId *wid = lfirst(lc); >> + logicalrep_worker_stop(wid->subid, wid->relid); >> + } >> + >> + if (on_commit_launcher_wakeup) >> + ApplyLauncherWakeup(); >> >> Stopping the logical rep worker in AtEOXact_ApplyLauncher doesn't >> support the prepared transaction. Since we allocate the list >> on_commit_stop_workers in TopTransactionContext the postgres crashes >> if we execute any query after prepared transaction that removes >> subscription relation state. Also after fixed this issue, we still >> need to something: the list of on_commit_stop_workers is not >> associated the prepared transaction. A query next to "preapre >> transaction" tries to stop workers at the commit. There was similar >> discussion before. >> > > Hmm, good point. I think for now it makes sense to simply don't allow > PREPARE for transactions that manipulate workers similar to what we do > when there are exported snapshots. Done it that way in attached. Agreed. Should we note that in docs? > >> >> + >> + ensure_transaction(); >> + >> UpdateSubscriptionRelState(MyLogicalRepWorker->subid, >> + >> rstate->relid, rstate->state, >> + >> rstate->lsn); >> >> >> Should we commit the transaction if we started a new transaction >> before update the subscription relation state, or it could be deadlock >> risk. > > We only lock the whole subscription (and only conflicting things are > DROP and ALTER SUBSCRIPTION), not individual subscription-relation > mapping so it doesn't seem to me like there is any higher risk of > deadlocks than anything else which works with multiple tables (or > compared to previous behavior). > I'm concerned that a lock for whole subscription could be conflicted between ALTER SUBSCRIPTION, table sync worker and apply worker: Please imagine the following case. 1. The apply worker updates a subscription relation state R1 of subscription S1. -> It acquires AccessShareLock on S1, and keep holding. 2. ALTER SUBSCRIPTION SET PUBLICATION tries to acquire AccessExclusiveLock on S1. -> But it waits for the apply worker to release the lock. 3. The apply worker calls wait_for_relation_state_change(relid, SUBREL_STATE_SYNCDONE) and waits for the table sync worker for R2 to change its status. -> Note that the apply worker is still holding AccessShareLock on S1 4. The table sync worker tries to update its status to SYNCDONE -> In UpdateSubscriptionRelState(), it tires to acquire AccessShareLock on S1 but waits for it. *deadlock* To summary, because the apply worker keeps holding AccessShareLock on S1, the acquiring AccessExclusiveLock by ALTER SUBSCRIPTION waits for the apply worker, and then the table sync worker also waits for the ALTER SUBSCRIPTION in order to change its status. And the apply worker waits for the table sync worker to change its status. I encountered the similar case once on my environment. But since it completely depends on timing I don't have the concrete reproducing steps. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On Fri, Jun 23, 2017 at 09:42:10PM -0400, Peter Eisentraut wrote: > On 6/21/17 22:47, Peter Eisentraut wrote: > > On 6/20/17 22:44, Noah Misch wrote: > >>> A patch has been posted, and it's being reviewed. Next update Monday. > >> > >> This PostgreSQL 10 open item is past due for your status update. Kindly > >> send > >> a status update within 24 hours, and include a date for your subsequent > >> status > >> update. Refer to the policy on open item ownership: > >> https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com > > > > I'm not sure how to proceed here. Nobody is else seems terribly > > interested, and this is really a minor issue, so I don't want to muck > > around with the locking endlessly. Let's say, if there are no new > > insights by Friday, I'll pick one of the proposed patches or just close > > it without any patch. > > After some comments, a new patch has been posted, and I'll give until > Monday for review. This PostgreSQL 10 open item is past due for your status update. Kindly send a status update within 24 hours, and include a date for your subsequent status update. Refer to the policy on open item ownership: https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On 27/06/17 10:51, Masahiko Sawada wrote: > On Mon, Jun 26, 2017 at 12:12 PM, Masahiko Sawada > wrote: > > I've reviewed this patch briefly. Thanks! > > @@ -515,6 +533,31 @@ logicalrep_worker_stop(Oid subid, Oid relid) > } > > /* > + * Request worker to be stopped on commit. > + */ > +void > +logicalrep_worker_stop_at_commit(Oid subid, Oid relid) > +{ > + LogicalRepWorkerId *wid; > + MemoryContext old; > + > + old = MemoryContextSwitchTo(TopTransactionContext); > + > + wid = (LogicalRepWorkerId *) palloc(sizeof(LogicalRepWorkerId)); > + > + /* > + wid = MemoryContextAlloc(TopTransactionContext, > + > sizeof(LogicalRepWorkerId)); > + */ > + wid->subid = subid; > + wid->relid = relid; > + > + on_commit_stop_workers = lappend(on_commit_stop_workers, wid); > + > + MemoryContextSwitchTo(old); > +} > > logicalrep_worker_stop_at_commit() has a problem that new_list() > called by lappend() allocates the memory from current memory context. > It should switch the memory context and then allocate the memory for > wid and append it to the list. > Right, fixed (I see you did that locally as well based on the above excerpt ;) ). > > @@ -754,9 +773,25 @@ ApplyLauncherShmemInit(void) > void > AtEOXact_ApplyLauncher(bool isCommit) > { > - if (isCommit && on_commit_launcher_wakeup) > - ApplyLauncherWakeup(); > + ListCell *lc; > + > + if (isCommit) > + { > + foreach (lc, on_commit_stop_workers) > + { > + LogicalRepWorkerId *wid = lfirst(lc); > + logicalrep_worker_stop(wid->subid, wid->relid); > + } > + > + if (on_commit_launcher_wakeup) > + ApplyLauncherWakeup(); > > Stopping the logical rep worker in AtEOXact_ApplyLauncher doesn't > support the prepared transaction. Since we allocate the list > on_commit_stop_workers in TopTransactionContext the postgres crashes > if we execute any query after prepared transaction that removes > subscription relation state. Also after fixed this issue, we still > need to something: the list of on_commit_stop_workers is not > associated the prepared transaction. A query next to "preapre > transaction" tries to stop workers at the commit. There was similar > discussion before. > Hmm, good point. I think for now it makes sense to simply don't allow PREPARE for transactions that manipulate workers similar to what we do when there are exported snapshots. Done it that way in attached. > > + > + ensure_transaction(); > + > UpdateSubscriptionRelState(MyLogicalRepWorker->subid, > + > rstate->relid, rstate->state, > + > rstate->lsn); > > > Should we commit the transaction if we started a new transaction > before update the subscription relation state, or it could be deadlock > risk. We only lock the whole subscription (and only conflicting things are DROP and ALTER SUBSCRIPTION), not individual subscription-relation mapping so it doesn't seem to me like there is any higher risk of deadlocks than anything else which works with multiple tables (or compared to previous behavior). -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services From 88da433110aa3bde3dcce33ebf62d41a08c191b9 Mon Sep 17 00:00:00 2001 From: Petr Jelinek Date: Sat, 24 Jun 2017 19:38:21 +0200 Subject: [PATCH] Rework subscription worker and relation status handling --- src/backend/access/transam/xact.c | 9 + src/backend/catalog/pg_subscription.c | 137 ++-- src/backend/commands/subscriptioncmds.c | 98 - src/backend/replication/logical/launcher.c | 309 src/backend/replication/logical/tablesync.c | 97 + src/backend/replication/logical/worker.c| 23 ++- src/backend/utils/cache/catcache.c | 6 +- src/include/catalog/pg_subscription_rel.h | 6 +- src/include/replication/logicallauncher.h | 1 + src/include/replication/worker_internal.h | 6 +- 10 files changed, 393 insertions(+), 299 deletions(-) diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index b0aa69f..322502d 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -2277,6 +2277,15 @@ PrepareTransaction(void) (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("cannot PREPARE a transaction that has exported snapshots"))); + /* + * Similar to above, don't allow PREPARE but for transaction that kill + * logical replication, workers. + */ + if (XactManipulatesLogicalReplicationWorkers()) + ereport(ERROR, +(errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("cannot PREPARE a transaction that has manipulated logical replication workers"))); + /* Prevent cancel/die interrupt while cleaning up */ HOLD_INTE
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On Mon, Jun 26, 2017 at 12:12 PM, Masahiko Sawada wrote: > On Sun, Jun 25, 2017 at 7:35 PM, Petr Jelinek > wrote: >> (was away for a while, got some time now for this again) >> >> On 22/06/17 04:43, Peter Eisentraut wrote: >>> The alternative is that we use the LockSharedObject() approach that was >>> already alluded to, like in the attached patch. Both approaches would >>> work equally fine AFAICT. >> >> I agree, but I think we need bigger overhaul of the locking/management >> in general. So here is patch which does much more changes. >> >> The patch does several important things (in no particular order): >> - Split SetSubscriptionRelState into AddSubscriptionRelState and >> UpdateSubscriptionRelState for the reasons said upstream, it's cleaner, >> there is no half-broken upsert logic and it has proper error checking >> for each action. >> >> - Do LockSharedObject in ALTER SUBSCRIPTION, DROP SUBSCRIPTION (this one >> is preexisting but mentioning it for context), SetSubscriptionRelState, >> AddSubscriptionRelState, and in the logicalrep_worker_launch. This means >> we use granular per object locks to deal with concurrency. >> >> - Because of above, the AccessExclusiveLock on pg_subscription is no >> longer needed, just normal RowExlusiveLock is used now. >> >> - logicalrep_worker_stop is also simplified due to the proper locking >> >> - There is new interface logicalrep_worker_stop_at_commit which is used >> by ALTER SUBSCRIPTION ... REFRESH PUBLICATION and by transactional >> variant of DROP SUBSCRIPTION to only kill workers at the end of transaction. >> >> - Locking/reading of subscription info is unified between DROP and ALTER >> SUBSCRIPTION commands. >> >> - DROP SUBSCRIPTION will kill all workers associated with subscription, >> not just apply. >> >> - The sync worker checks during startup if the relation is still subscribed. >> >> - The sync worker will exit when waiting for apply and apply has shut-down. >> >> - The log messages around workers and removed or disabled subscription >> are now more consistent between startup and normal runtime of the worker. >> >> - Some code deduplication and stylistic changes/simplification in >> related areas. >> >> - Fixed catcache's IndexScanOK() handling of the subscription catalog. >> >> It's bit bigger patch but solves issues from multiple threads around >> handling of ALTER/DROP subscription. >> >> A lot of the locking that I added is normally done transparently by >> dependency handling, but subscriptions and subscription relation status >> do not use that much as it was deemed to bloat pg_depend needlessly >> during the original patch review (it's also probably why this has >> slipped through). >> > I've reviewed this patch briefly. @@ -515,6 +533,31 @@ logicalrep_worker_stop(Oid subid, Oid relid) } /* + * Request worker to be stopped on commit. + */ +void +logicalrep_worker_stop_at_commit(Oid subid, Oid relid) +{ + LogicalRepWorkerId *wid; + MemoryContext old; + + old = MemoryContextSwitchTo(TopTransactionContext); + + wid = (LogicalRepWorkerId *) palloc(sizeof(LogicalRepWorkerId)); + + /* + wid = MemoryContextAlloc(TopTransactionContext, + sizeof(LogicalRepWorkerId)); + */ + wid->subid = subid; + wid->relid = relid; + + on_commit_stop_workers = lappend(on_commit_stop_workers, wid); + + MemoryContextSwitchTo(old); +} logicalrep_worker_stop_at_commit() has a problem that new_list() called by lappend() allocates the memory from current memory context. It should switch the memory context and then allocate the memory for wid and append it to the list. @@ -754,9 +773,25 @@ ApplyLauncherShmemInit(void) void AtEOXact_ApplyLauncher(bool isCommit) { - if (isCommit && on_commit_launcher_wakeup) - ApplyLauncherWakeup(); + ListCell *lc; + + if (isCommit) + { + foreach (lc, on_commit_stop_workers) + { + LogicalRepWorkerId *wid = lfirst(lc); + logicalrep_worker_stop(wid->subid, wid->relid); + } + + if (on_commit_launcher_wakeup) + ApplyLauncherWakeup(); Stopping the logical rep worker in AtEOXact_ApplyLauncher doesn't support the prepared transaction. Since we allocate the list on_commit_stop_workers in TopTransactionContext the postgres crashes if we execute any query after prepared transaction that removes subscription relation state. Also after fixed this issue, we still need to something: the list of on_commit_stop_workers is not associated the prepared transaction. A query next to "preapre transaction" tries to stop workers at the commit. There was similar discussion before. + + ensure_transaction(); + UpdateSubscriptionRelState(MyLogicalRepWorker->subid, + rstate->relid, rstate->state, + rstate->lsn); Should we commit the transaction if we started a new
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On Sun, Jun 25, 2017 at 7:35 PM, Petr Jelinek wrote: > (was away for a while, got some time now for this again) > > On 22/06/17 04:43, Peter Eisentraut wrote: >> The alternative is that we use the LockSharedObject() approach that was >> already alluded to, like in the attached patch. Both approaches would >> work equally fine AFAICT. > > I agree, but I think we need bigger overhaul of the locking/management > in general. So here is patch which does much more changes. > > The patch does several important things (in no particular order): > - Split SetSubscriptionRelState into AddSubscriptionRelState and > UpdateSubscriptionRelState for the reasons said upstream, it's cleaner, > there is no half-broken upsert logic and it has proper error checking > for each action. > > - Do LockSharedObject in ALTER SUBSCRIPTION, DROP SUBSCRIPTION (this one > is preexisting but mentioning it for context), SetSubscriptionRelState, > AddSubscriptionRelState, and in the logicalrep_worker_launch. This means > we use granular per object locks to deal with concurrency. > > - Because of above, the AccessExclusiveLock on pg_subscription is no > longer needed, just normal RowExlusiveLock is used now. > > - logicalrep_worker_stop is also simplified due to the proper locking > > - There is new interface logicalrep_worker_stop_at_commit which is used > by ALTER SUBSCRIPTION ... REFRESH PUBLICATION and by transactional > variant of DROP SUBSCRIPTION to only kill workers at the end of transaction. > > - Locking/reading of subscription info is unified between DROP and ALTER > SUBSCRIPTION commands. > > - DROP SUBSCRIPTION will kill all workers associated with subscription, > not just apply. > > - The sync worker checks during startup if the relation is still subscribed. > > - The sync worker will exit when waiting for apply and apply has shut-down. > > - The log messages around workers and removed or disabled subscription > are now more consistent between startup and normal runtime of the worker. > > - Some code deduplication and stylistic changes/simplification in > related areas. > > - Fixed catcache's IndexScanOK() handling of the subscription catalog. > > It's bit bigger patch but solves issues from multiple threads around > handling of ALTER/DROP subscription. > > A lot of the locking that I added is normally done transparently by > dependency handling, but subscriptions and subscription relation status > do not use that much as it was deemed to bloat pg_depend needlessly > during the original patch review (it's also probably why this has > slipped through). > Thank you for reworking on this! I'll review this patch on Tuesday. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
(was away for a while, got some time now for this again) On 22/06/17 04:43, Peter Eisentraut wrote: > The alternative is that we use the LockSharedObject() approach that was > already alluded to, like in the attached patch. Both approaches would > work equally fine AFAICT. I agree, but I think we need bigger overhaul of the locking/management in general. So here is patch which does much more changes. The patch does several important things (in no particular order): - Split SetSubscriptionRelState into AddSubscriptionRelState and UpdateSubscriptionRelState for the reasons said upstream, it's cleaner, there is no half-broken upsert logic and it has proper error checking for each action. - Do LockSharedObject in ALTER SUBSCRIPTION, DROP SUBSCRIPTION (this one is preexisting but mentioning it for context), SetSubscriptionRelState, AddSubscriptionRelState, and in the logicalrep_worker_launch. This means we use granular per object locks to deal with concurrency. - Because of above, the AccessExclusiveLock on pg_subscription is no longer needed, just normal RowExlusiveLock is used now. - logicalrep_worker_stop is also simplified due to the proper locking - There is new interface logicalrep_worker_stop_at_commit which is used by ALTER SUBSCRIPTION ... REFRESH PUBLICATION and by transactional variant of DROP SUBSCRIPTION to only kill workers at the end of transaction. - Locking/reading of subscription info is unified between DROP and ALTER SUBSCRIPTION commands. - DROP SUBSCRIPTION will kill all workers associated with subscription, not just apply. - The sync worker checks during startup if the relation is still subscribed. - The sync worker will exit when waiting for apply and apply has shut-down. - The log messages around workers and removed or disabled subscription are now more consistent between startup and normal runtime of the worker. - Some code deduplication and stylistic changes/simplification in related areas. - Fixed catcache's IndexScanOK() handling of the subscription catalog. It's bit bigger patch but solves issues from multiple threads around handling of ALTER/DROP subscription. A lot of the locking that I added is normally done transparently by dependency handling, but subscriptions and subscription relation status do not use that much as it was deemed to bloat pg_depend needlessly during the original patch review (it's also probably why this has slipped through). -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services From d7038474012769c9c3b50231af76dd7796fe593f Mon Sep 17 00:00:00 2001 From: Petr Jelinek Date: Sat, 24 Jun 2017 19:38:21 +0200 Subject: [PATCH] Rework subscription worker and relation status handling --- src/backend/catalog/pg_subscription.c | 137 +++-- src/backend/commands/subscriptioncmds.c | 98 +- src/backend/replication/logical/launcher.c | 293 +++- src/backend/replication/logical/tablesync.c | 97 + src/backend/replication/logical/worker.c| 23 ++- src/backend/utils/cache/catcache.c | 6 +- src/include/catalog/pg_subscription_rel.h | 6 +- src/include/replication/worker_internal.h | 6 +- 8 files changed, 367 insertions(+), 299 deletions(-) diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c index c69c461..b643e54 100644 --- a/src/backend/catalog/pg_subscription.c +++ b/src/backend/catalog/pg_subscription.c @@ -28,6 +28,8 @@ #include "nodes/makefuncs.h" +#include "storage/lmgr.h" + #include "utils/array.h" #include "utils/builtins.h" #include "utils/fmgroids.h" @@ -225,84 +227,101 @@ textarray_to_stringlist(ArrayType *textarray) } /* - * Set the state of a subscription table. - * - * If update_only is true and the record for given table doesn't exist, do - * nothing. This can be used to avoid inserting a new record that was deleted - * by someone else. Generally, subscription DDL commands should use false, - * workers should use true. - * - * The insert-or-update logic in this function is not concurrency safe so it - * might raise an error in rare circumstances. But if we took a stronger lock - * such as ShareRowExclusiveLock, we would risk more deadlocks. + * Add new state record for a subscription table. */ Oid -SetSubscriptionRelState(Oid subid, Oid relid, char state, - XLogRecPtr sublsn, bool update_only) +AddSubscriptionRelState(Oid subid, Oid relid, char state, + XLogRecPtr sublsn) { Relation rel; HeapTuple tup; - Oid subrelid = InvalidOid; + Oid subrelid; bool nulls[Natts_pg_subscription_rel]; Datum values[Natts_pg_subscription_rel]; + LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock); + rel = heap_open(SubscriptionRelRelationId, RowExclusiveLock); /* Try finding existing mapping. */ tup = SearchSysCacheCopy2(SUBSCRIPTIONRELMAP, ObjectIdGetDatum(relid), ObjectIdGetDatum
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On 6/21/17 22:47, Peter Eisentraut wrote: > On 6/20/17 22:44, Noah Misch wrote: >>> A patch has been posted, and it's being reviewed. Next update Monday. >> >> This PostgreSQL 10 open item is past due for your status update. Kindly send >> a status update within 24 hours, and include a date for your subsequent >> status >> update. Refer to the policy on open item ownership: >> https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com > > I'm not sure how to proceed here. Nobody is else seems terribly > interested, and this is really a minor issue, so I don't want to muck > around with the locking endlessly. Let's say, if there are no new > insights by Friday, I'll pick one of the proposed patches or just close > it without any patch. After some comments, a new patch has been posted, and I'll give until Monday for review. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On 6/22/17 03:26, Masahiko Sawada wrote: > Thank you for the patch. Some comment and question. > DropSubscriptionRelState requests ExclusiveLock but why is it not > ShareRowExclusiveLock? fixed > I test DROP SUBSCIPTION case but even with this patch, "tuple > concurrently updated" is still occurred. > > @@ -405,7 +405,7 @@ RemoveSubscriptionRel(Oid subid, Oid relid) > } > heap_endscan(scan); > > - heap_close(rel, RowExclusiveLock); > + heap_close(rel, lockmode); > } > > I think we should not release the table lock here, should be > heap_close(rel, NoLock) instead? After changed it so on my > environment, DROP SUBSCRIPTION seems to work fine. fixed > Also, ALTER SUBSCRIPTION SET PUBLICATION seems the same. Even with > Petr's patch, the same error still occurred during ALTER SUBSCRIPTION > SET PUBLICATION. Currently it acquires RowExclusiveLock on > pg_subscription_rel but as long as both DDL and table sync worker > could modify the same record on pg_subscription this error can > happen. fixed > On the other hand if we take a strong lock on pg_subscription > during DDL the deadlock risk will be increased. The original problem was that DROP TABLE locked things in the order 1) user table, 2) pg_subscription_rel, whereas a full-database VACUUM would lock things in the opposite order. I think this was a pretty wide window if you have many tables. In this case, however, we are only dealing with pg_subscription and pg_subscription_rel, and I think even VACUUM would always processes them in the same order. Could please try the attached patch so see if it addresses your test scenarios? -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services From fc50b9e287e8c3c29be21d82c07c39023cbf3882 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Fri, 23 Jun 2017 21:26:46 -0400 Subject: [PATCH v2] Parametrize locking in RemoveSubscriptionRel Revising the change in 521fd4795e3ec3d0b263b62e5eb58e1557be9c86, we need to take a stronger lock when removing entries from pg_subscription_rel when ALTER/DROP SUBSCRIPTION. Otherwise, users might see "tuple concurrently updated" when table sync workers make updates at the same time as DDL commands are run. For DROP TABLE, we keep the lower lock level that was put in place by the above commit, so that DROP TABLE does not deadlock with whole-database VACUUM. --- src/backend/catalog/heap.c| 2 +- src/backend/catalog/pg_subscription.c | 6 +++--- src/backend/commands/subscriptioncmds.c | 4 ++-- src/include/catalog/pg_subscription_rel.h | 2 +- 4 files changed, 7 insertions(+), 7 deletions(-) diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c index a376b99f1e..8a96bbece6 100644 --- a/src/backend/catalog/heap.c +++ b/src/backend/catalog/heap.c @@ -1844,7 +1844,7 @@ heap_drop_with_catalog(Oid relid) /* * Remove any associated relation synchronization states. */ - RemoveSubscriptionRel(InvalidOid, relid); + RemoveSubscriptionRel(InvalidOid, relid, RowExclusiveLock); /* * Forget any ON COMMIT action for the rel diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c index c69c461b62..1d80191ca8 100644 --- a/src/backend/catalog/pg_subscription.c +++ b/src/backend/catalog/pg_subscription.c @@ -369,7 +369,7 @@ GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn, * subscription, or for a particular relation, or both. */ void -RemoveSubscriptionRel(Oid subid, Oid relid) +RemoveSubscriptionRel(Oid subid, Oid relid, LOCKMODE lockmode) { Relationrel; HeapScanDesc scan; @@ -377,7 +377,7 @@ RemoveSubscriptionRel(Oid subid, Oid relid) HeapTuple tup; int nkeys = 0; - rel = heap_open(SubscriptionRelRelationId, RowExclusiveLock); + rel = heap_open(SubscriptionRelRelationId, lockmode); if (OidIsValid(subid)) { @@ -405,7 +405,7 @@ RemoveSubscriptionRel(Oid subid, Oid relid) } heap_endscan(scan); - heap_close(rel, RowExclusiveLock); + heap_close(rel, NoLock); } diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c index 9cbd36f646..57954c978b 100644 --- a/src/backend/commands/subscriptioncmds.c +++ b/src/backend/commands/subscriptioncmds.c @@ -595,7 +595,7 @@ AlterSubscription_refresh(Subscription *sub, bool copy_data) { char *namespace; - RemoveSubscriptionRel(sub->oid, relid); + RemoveSubscriptionRel(sub->oid, relid, ShareRowExclusiveLock); logicalrep_worker_stop(sub->oid, relid); @@ -910,7 +910,7 @@ DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel) deleteSharedDependencyRecordsFor(SubscriptionRelationId, subid, 0); /* Remove any associa
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On Wed, Jun 21, 2017 at 8:10 AM, Peter Eisentraut wrote: > On 6/19/17 22:54, Masahiko Sawada wrote: >>> It seems to me we could just take a stronger lock around >>> RemoveSubscriptionRel(), so that workers can't write in there concurrently. >> >> Since we reduced the lock level of updating pg_subscription_rel by >> commit 521fd4795e3e the same deadlock issue will appear if we just >> take a stronger lock level. > > I was thinking about a more refined approach, like in the attached > patch. It just changes the locking when in DropSubscription(), so that > that doesn't fail if workers are doing stuff concurrently. Everything > else stays the same. > Thank you for the patch. Some comment and question. DropSubscriptionRelState requests ExclusiveLock but why is it not ShareRowExclusiveLock? I test DROP SUBSCIPTION case but even with this patch, "tuple concurrently updated" is still occurred. @@ -405,7 +405,7 @@ RemoveSubscriptionRel(Oid subid, Oid relid) } heap_endscan(scan); - heap_close(rel, RowExclusiveLock); + heap_close(rel, lockmode); } I think we should not release the table lock here, should be heap_close(rel, NoLock) instead? After changed it so on my environment, DROP SUBSCRIPTION seems to work fine. Also, ALTER SUBSCRIPTION SET PUBLICATION seems the same. Even with Petr's patch, the same error still occurred during ALTER SUBSCRIPTION SET PUBLICATION. Currently it acquires RowExclusiveLock on pg_subscription_rel but as long as both DDL and table sync worker could modify the same record on pg_subscription this error can happen. On the other hand if we take a strong lock on pg_subscription during DDL the deadlock risk will be increased. One solution I came up with is that we use other than simple_heap_update/delete so that we accept the return value HeapTupleUpdated of heap_update/delete. It makes pg_subscription_rel something like heap table rather than system catalog, though. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
Hello, At Wed, 21 Jun 2017 22:43:32 -0400, Peter Eisentraut wrote in <501f75c9-c5d6-d023-add0-3b670ac86...@2ndquadrant.com> > On 6/20/17 19:10, Peter Eisentraut wrote: > > On 6/19/17 22:54, Masahiko Sawada wrote: > >>> It seems to me we could just take a stronger lock around > >>> RemoveSubscriptionRel(), so that workers can't write in there > >>> concurrently. > >> > >> Since we reduced the lock level of updating pg_subscription_rel by > >> commit 521fd4795e3e the same deadlock issue will appear if we just > >> take a stronger lock level. > > > > I was thinking about a more refined approach, like in the attached > > patch. It just changes the locking when in DropSubscription(), so that > > that doesn't fail if workers are doing stuff concurrently. Everything > > else stays the same. > > The alternative is that we use the LockSharedObject() approach that was > already alluded to, like in the attached patch. Both approaches would > work equally fine AFAICT. However I haven't seen this deeply, just making SetSubscriptionRelState exlusive seems to have a chance to create a false record on pg_subscription_rel. Am I misunderstanding? -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On 6/20/17 22:44, Noah Misch wrote: >> A patch has been posted, and it's being reviewed. Next update Monday. > > This PostgreSQL 10 open item is past due for your status update. Kindly send > a status update within 24 hours, and include a date for your subsequent status > update. Refer to the policy on open item ownership: > https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com I'm not sure how to proceed here. Nobody is else seems terribly interested, and this is really a minor issue, so I don't want to muck around with the locking endlessly. Let's say, if there are no new insights by Friday, I'll pick one of the proposed patches or just close it without any patch. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On 6/19/17 22:41, Masahiko Sawada wrote: > On Tue, Jun 20, 2017 at 10:47 AM, Peter Eisentraut > wrote: >> On 6/16/17 04:16, Masahiko Sawada wrote: >>> A subscription relation state may have been removed already when we >>> try to update it. SetSubscriptionRelState didn't emit an error in such >>> case but with this patch we end up with an error. Since we shouldn't >>> ignore such error in UpdateSubscriptionRelState I'd say we can let the >>> user know about that possibility in the error message. >> >> So are you saying it's good to have the error message? >> > > Yes. UpdateSubscriptionRelState failure means that the subscription > relation state has disappeared or also means something wrong. So I > think it's good to have it as perhaps errdetail. Thought? I think this could lead to errors in the tablesync workers if they are trying to write while the entries have already been deleted as part of the subscription or the table being deleted. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On 6/20/17 19:10, Peter Eisentraut wrote: > On 6/19/17 22:54, Masahiko Sawada wrote: >>> It seems to me we could just take a stronger lock around >>> RemoveSubscriptionRel(), so that workers can't write in there concurrently. >> >> Since we reduced the lock level of updating pg_subscription_rel by >> commit 521fd4795e3e the same deadlock issue will appear if we just >> take a stronger lock level. > > I was thinking about a more refined approach, like in the attached > patch. It just changes the locking when in DropSubscription(), so that > that doesn't fail if workers are doing stuff concurrently. Everything > else stays the same. The alternative is that we use the LockSharedObject() approach that was already alluded to, like in the attached patch. Both approaches would work equally fine AFAICT. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services From 10afa9807014e14596cb05d70ba302c86bf30dd3 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Wed, 21 Jun 2017 22:40:22 -0400 Subject: [PATCH] WIP Add locking SetSubscriptionRelState() --- src/backend/catalog/pg_subscription.c | 4 1 file changed, 4 insertions(+) diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c index c69c461b62..f794d4c526 100644 --- a/src/backend/catalog/pg_subscription.c +++ b/src/backend/catalog/pg_subscription.c @@ -28,6 +28,8 @@ #include "nodes/makefuncs.h" +#include "storage/lmgr.h" + #include "utils/array.h" #include "utils/builtins.h" #include "utils/fmgroids.h" @@ -246,6 +248,8 @@ SetSubscriptionRelState(Oid subid, Oid relid, char state, boolnulls[Natts_pg_subscription_rel]; Datum values[Natts_pg_subscription_rel]; + LockSharedObject(SubscriptionRelationId, subid, 0, AccessExclusiveLock); + rel = heap_open(SubscriptionRelRelationId, RowExclusiveLock); /* Try finding existing mapping. */ -- 2.13.1 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On Thu, Jun 15, 2017 at 11:40:52PM -0400, Peter Eisentraut wrote: > On 6/13/17 15:49, Peter Eisentraut wrote: > > On 6/13/17 02:33, Noah Misch wrote: > >>> Steps to reproduce - > >>> X cluster -> create 100 tables , publish all tables (create publication > >>> pub > >>> for all tables); > >>> Y Cluster -> create 100 tables ,create subscription(create subscription > >>> sub > >>> connection 'user=centos host=localhost' publication pub; > >>> Y cluster ->drop subscription - drop subscription sub; > >>> > >>> check the log file on Y cluster. > >>> > >>> Sometime , i have seen this error on psql prompt and drop subscription > >>> operation got failed at first attempt. > >>> > >>> postgres=# drop subscription sub; > >>> ERROR: tuple concurrently updated > >>> postgres=# drop subscription sub; > >>> NOTICE: dropped replication slot "sub" on publisher > >>> DROP SUBSCRIPTION > >> > >> [Action required within three days. This is a generic notification.] > > > > It's being worked on. Let's see by Thursday. > > A patch has been posted, and it's being reviewed. Next update Monday. This PostgreSQL 10 open item is past due for your status update. Kindly send a status update within 24 hours, and include a date for your subsequent status update. Refer to the policy on open item ownership: https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On 6/19/17 22:54, Masahiko Sawada wrote: >> It seems to me we could just take a stronger lock around >> RemoveSubscriptionRel(), so that workers can't write in there concurrently. > > Since we reduced the lock level of updating pg_subscription_rel by > commit 521fd4795e3e the same deadlock issue will appear if we just > take a stronger lock level. I was thinking about a more refined approach, like in the attached patch. It just changes the locking when in DropSubscription(), so that that doesn't fail if workers are doing stuff concurrently. Everything else stays the same. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services From c101b6da52dd4f9f041390a937b337f20d212e5c Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Tue, 20 Jun 2017 19:06:42 -0400 Subject: [PATCH] WIP Parametrize locking in RemoveSubscriptionRel --- src/backend/catalog/heap.c| 2 +- src/backend/catalog/pg_subscription.c | 6 +++--- src/backend/commands/subscriptioncmds.c | 4 ++-- src/include/catalog/pg_subscription_rel.h | 2 +- 4 files changed, 7 insertions(+), 7 deletions(-) diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c index 4e5b79ef94..643e4bb1d5 100644 --- a/src/backend/catalog/heap.c +++ b/src/backend/catalog/heap.c @@ -1844,7 +1844,7 @@ heap_drop_with_catalog(Oid relid) /* * Remove any associated relation synchronization states. */ - RemoveSubscriptionRel(InvalidOid, relid); + RemoveSubscriptionRel(InvalidOid, relid, RowExclusiveLock); /* * Forget any ON COMMIT action for the rel diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c index c69c461b62..88b81920c8 100644 --- a/src/backend/catalog/pg_subscription.c +++ b/src/backend/catalog/pg_subscription.c @@ -369,7 +369,7 @@ GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn, * subscription, or for a particular relation, or both. */ void -RemoveSubscriptionRel(Oid subid, Oid relid) +RemoveSubscriptionRel(Oid subid, Oid relid, LOCKMODE lockmode) { Relationrel; HeapScanDesc scan; @@ -377,7 +377,7 @@ RemoveSubscriptionRel(Oid subid, Oid relid) HeapTuple tup; int nkeys = 0; - rel = heap_open(SubscriptionRelRelationId, RowExclusiveLock); + rel = heap_open(SubscriptionRelRelationId, lockmode); if (OidIsValid(subid)) { @@ -405,7 +405,7 @@ RemoveSubscriptionRel(Oid subid, Oid relid) } heap_endscan(scan); - heap_close(rel, RowExclusiveLock); + heap_close(rel, lockmode); } diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c index 5aae7b6f91..02cb7c47b8 100644 --- a/src/backend/commands/subscriptioncmds.c +++ b/src/backend/commands/subscriptioncmds.c @@ -595,7 +595,7 @@ AlterSubscription_refresh(Subscription *sub, bool copy_data) { char *namespace; - RemoveSubscriptionRel(sub->oid, relid); + RemoveSubscriptionRel(sub->oid, relid, RowExclusiveLock); logicalrep_worker_stop(sub->oid, relid); @@ -910,7 +910,7 @@ DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel) deleteSharedDependencyRecordsFor(SubscriptionRelationId, subid, 0); /* Remove any associated relation synchronization states. */ - RemoveSubscriptionRel(subid, InvalidOid); + RemoveSubscriptionRel(subid, InvalidOid, ExclusiveLock); /* Kill the apply worker so that the slot becomes accessible. */ logicalrep_worker_stop(subid, InvalidOid); diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h index f5f6191676..b2cadb4b13 100644 --- a/src/include/catalog/pg_subscription_rel.h +++ b/src/include/catalog/pg_subscription_rel.h @@ -74,7 +74,7 @@ extern Oid SetSubscriptionRelState(Oid subid, Oid relid, char state, XLogRecPtr sublsn, bool update_only); extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn, bool missing_ok); -extern void RemoveSubscriptionRel(Oid subid, Oid relid); +extern void RemoveSubscriptionRel(Oid subid, Oid relid, LOCKMODE lockmode); extern List *GetSubscriptionRelations(Oid subid); extern List *GetSubscriptionNotReadyRelations(Oid subid); -- 2.13.1 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On Tue, Jun 20, 2017 at 10:55 AM, Peter Eisentraut wrote: > On 6/1/17 13:37, Petr Jelinek wrote: >> On 01/06/17 17:32, Masahiko Sawada wrote: >>> On Thu, May 25, 2017 at 5:29 PM, tushar >>> wrote: After applying all your patches, drop subscription no more hangs while dropping subscription but there is an error "ERROR: tuple concurrently updated" in the log file of standby. >>> >>> I tried to reproduce this issue with latest four patches I submit but >>> it didn't happen. I guess this issue doesn't related to the issues >>> reported on this thread and another factor might be cause of this >>> issue. It would be good to test the same again with latest four >>> patches or after solved current some open items. >> >> That's because your additional patch kills the workers that do the >> concurrent update. While that's probably okay, I still plan to look into >> making the subscription and state locking more robust. > > It seems we have gotten off track here a bit. What is the current > proposal to fix "tuple concurrently updated" during DROP SUBSCRiPTION? I think there is no proposal for it so far. The current proposal is to fix this problem during ALTER SUBSCRIPTION. > It seems to me we could just take a stronger lock around > RemoveSubscriptionRel(), so that workers can't write in there concurrently. > Since we reduced the lock level of updating pg_subscription_rel by commit 521fd4795e3e the same deadlock issue will appear if we just take a stronger lock level. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On Tue, Jun 20, 2017 at 10:47 AM, Peter Eisentraut wrote: > On 6/16/17 04:16, Masahiko Sawada wrote: >> A subscription relation state may have been removed already when we >> try to update it. SetSubscriptionRelState didn't emit an error in such >> case but with this patch we end up with an error. Since we shouldn't >> ignore such error in UpdateSubscriptionRelState I'd say we can let the >> user know about that possibility in the error message. > > So are you saying it's good to have the error message? > Yes. UpdateSubscriptionRelState failure means that the subscription relation state has disappeared or also means something wrong. So I think it's good to have it as perhaps errdetail. Thought? Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On 6/1/17 13:37, Petr Jelinek wrote: > On 01/06/17 17:32, Masahiko Sawada wrote: >> On Thu, May 25, 2017 at 5:29 PM, tushar >> wrote: >>> After applying all your patches, drop subscription no more hangs while >>> dropping subscription but there is an error "ERROR: tuple concurrently >>> updated" in the log file of >>> standby. >> >> I tried to reproduce this issue with latest four patches I submit but >> it didn't happen. I guess this issue doesn't related to the issues >> reported on this thread and another factor might be cause of this >> issue. It would be good to test the same again with latest four >> patches or after solved current some open items. > > That's because your additional patch kills the workers that do the > concurrent update. While that's probably okay, I still plan to look into > making the subscription and state locking more robust. It seems we have gotten off track here a bit. What is the current proposal to fix "tuple concurrently updated" during DROP SUBSCRiPTION? It seems to me we could just take a stronger lock around RemoveSubscriptionRel(), so that workers can't write in there concurrently. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On 6/16/17 04:16, Masahiko Sawada wrote: > A subscription relation state may have been removed already when we > try to update it. SetSubscriptionRelState didn't emit an error in such > case but with this patch we end up with an error. Since we shouldn't > ignore such error in UpdateSubscriptionRelState I'd say we can let the > user know about that possibility in the error message. So are you saying it's good to have the error message? -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On Thu, Jun 15, 2017 at 10:22 AM, Masahiko Sawada wrote: > On Thu, Jun 15, 2017 at 7:35 AM, Petr Jelinek > wrote: >> On 13/06/17 21:49, Peter Eisentraut wrote: >>> On 6/13/17 02:33, Noah Misch wrote: > Steps to reproduce - > X cluster -> create 100 tables , publish all tables (create publication > pub > for all tables); > Y Cluster -> create 100 tables ,create subscription(create subscription > sub > connection 'user=centos host=localhost' publication pub; > Y cluster ->drop subscription - drop subscription sub; > > check the log file on Y cluster. > > Sometime , i have seen this error on psql prompt and drop subscription > operation got failed at first attempt. > > postgres=# drop subscription sub; > ERROR: tuple concurrently updated > postgres=# drop subscription sub; > NOTICE: dropped replication slot "sub" on publisher > DROP SUBSCRIPTION [Action required within three days. This is a generic notification.] >>> >>> It's being worked on. Let's see by Thursday. >>> >> I've reviewed these patches. 0001 patch conflicts with commit a571c7f661a7b601aafcb12196d004cdb8b8cb23. >> Attached fixes it (it was mostly about order of calls). I also split the >> SetSubscriptionRelState into 2 separate interface while I was changing >> it, because now that the update_only bool was added it has become quite >> strange to have single interface for what is basically two separate >> functions. +1 from me, too. A subscription relation state may have been removed already when we try to update it. SetSubscriptionRelState didn't emit an error in such case but with this patch we end up with an error. Since we shouldn't ignore such error in UpdateSubscriptionRelState I'd say we can let the user know about that possibility in the error message. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On 6/13/17 15:49, Peter Eisentraut wrote: > On 6/13/17 02:33, Noah Misch wrote: >>> Steps to reproduce - >>> X cluster -> create 100 tables , publish all tables (create publication pub >>> for all tables); >>> Y Cluster -> create 100 tables ,create subscription(create subscription sub >>> connection 'user=centos host=localhost' publication pub; >>> Y cluster ->drop subscription - drop subscription sub; >>> >>> check the log file on Y cluster. >>> >>> Sometime , i have seen this error on psql prompt and drop subscription >>> operation got failed at first attempt. >>> >>> postgres=# drop subscription sub; >>> ERROR: tuple concurrently updated >>> postgres=# drop subscription sub; >>> NOTICE: dropped replication slot "sub" on publisher >>> DROP SUBSCRIPTION >> >> [Action required within three days. This is a generic notification.] > > It's being worked on. Let's see by Thursday. A patch has been posted, and it's being reviewed. Next update Monday. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On 6/15/17 12:22, Petr Jelinek wrote: > On 15/06/17 17:53, Peter Eisentraut wrote: >> On 6/14/17 18:35, Petr Jelinek wrote: >>> Attached fixes it (it was mostly about order of calls). >> So do I understand this right that the actual fix is just moving up the >> logicalrep_worker_stop() call in DropSubscription(). >> > No the fix is heap_open before SearchSysCache(). The existing code already does that. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On 15/06/17 18:36, Peter Eisentraut wrote: > On 6/15/17 12:22, Petr Jelinek wrote: >> On 15/06/17 17:53, Peter Eisentraut wrote: >>> On 6/14/17 18:35, Petr Jelinek wrote: Attached fixes it (it was mostly about order of calls). >>> >>> So do I understand this right that the actual fix is just moving up the >>> logicalrep_worker_stop() call in DropSubscription(). >>> >> >> No the fix is heap_open before SearchSysCache(). > > Right. Is there a reason for moving the _stop() call then? > Nothing specific, just felt it's better there when I was messing with the function. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On 6/15/17 12:22, Petr Jelinek wrote: > On 15/06/17 17:53, Peter Eisentraut wrote: >> On 6/14/17 18:35, Petr Jelinek wrote: >>> Attached fixes it (it was mostly about order of calls). >> >> So do I understand this right that the actual fix is just moving up the >> logicalrep_worker_stop() call in DropSubscription(). >> > > No the fix is heap_open before SearchSysCache(). Right. Is there a reason for moving the _stop() call then? -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On 15/06/17 17:53, Peter Eisentraut wrote: > On 6/14/17 18:35, Petr Jelinek wrote: >> Attached fixes it (it was mostly about order of calls). > > So do I understand this right that the actual fix is just moving up the > logicalrep_worker_stop() call in DropSubscription(). > No the fix is heap_open before SearchSysCache(). >> I also split the >> SetSubscriptionRelState into 2 separate interface while I was changing >> it, because now that the update_only bool was added it has become quite >> strange to have single interface for what is basically two separate >> functions. > > makes sense > >> Other related problem is locking of subscriptions during operations on >> them, especially AlterSubscription seems like it should lock the >> subscription itself. I did that in 0002. > > More detail here please. AlterSubscription() does locking via > heap_open(). This introduces a new locking method. What are the > implications? > I don't think heap_open will be enough once we remove the AccessExclusiveLock of the catalog in DropSubscription because concurrent AlterSubscription might happily add tables to the subscription that has been dropped if we don't lock it. But you made me realize that even my patch is not enough because we then reread the subscription info only from syscache without any kind of invalidation attempt. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On 6/14/17 18:35, Petr Jelinek wrote: > Attached fixes it (it was mostly about order of calls). So do I understand this right that the actual fix is just moving up the logicalrep_worker_stop() call in DropSubscription(). > I also split the > SetSubscriptionRelState into 2 separate interface while I was changing > it, because now that the update_only bool was added it has become quite > strange to have single interface for what is basically two separate > functions. makes sense > Other related problem is locking of subscriptions during operations on > them, especially AlterSubscription seems like it should lock the > subscription itself. I did that in 0002. More detail here please. AlterSubscription() does locking via heap_open(). This introduces a new locking method. What are the implications? -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On Thu, Jun 15, 2017 at 7:35 AM, Petr Jelinek wrote: > On 13/06/17 21:49, Peter Eisentraut wrote: >> On 6/13/17 02:33, Noah Misch wrote: Steps to reproduce - X cluster -> create 100 tables , publish all tables (create publication pub for all tables); Y Cluster -> create 100 tables ,create subscription(create subscription sub connection 'user=centos host=localhost' publication pub; Y cluster ->drop subscription - drop subscription sub; check the log file on Y cluster. Sometime , i have seen this error on psql prompt and drop subscription operation got failed at first attempt. postgres=# drop subscription sub; ERROR: tuple concurrently updated postgres=# drop subscription sub; NOTICE: dropped replication slot "sub" on publisher DROP SUBSCRIPTION >>> >>> [Action required within three days. This is a generic notification.] >> >> It's being worked on. Let's see by Thursday. >> > > Attached fixes it (it was mostly about order of calls). I also split the > SetSubscriptionRelState into 2 separate interface while I was changing > it, because now that the update_only bool was added it has become quite > strange to have single interface for what is basically two separate > functions. > > There are still couple of remaining issues from this thread though. > Namely the AccessExclusiveLock of the pg_subscription catalog which is > not very pretty, but we need a way to block launcher from accessing the > subscription which is being dropped and make sure it will not start new > workers for it afterwards. Question is how however as by the time > launcher can lock individual subscription it is already processing it. > So it looks to me like we'd need to reread the catalog with new snapshot > after the lock was acquired which seems bit wasteful (I wonder if we > could just AcceptInvalidationMessages and refetch from syscache). Any > better ideas? > > Other related problem is locking of subscriptions during operations on > them, especially AlterSubscription seems like it should lock the > subscription itself. I did that in 0002. > Thank you for the patch! Sorry I don't have a time for it today but I'll review these patches tomorrow. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On 13/06/17 21:49, Peter Eisentraut wrote: > On 6/13/17 02:33, Noah Misch wrote: >>> Steps to reproduce - >>> X cluster -> create 100 tables , publish all tables (create publication pub >>> for all tables); >>> Y Cluster -> create 100 tables ,create subscription(create subscription sub >>> connection 'user=centos host=localhost' publication pub; >>> Y cluster ->drop subscription - drop subscription sub; >>> >>> check the log file on Y cluster. >>> >>> Sometime , i have seen this error on psql prompt and drop subscription >>> operation got failed at first attempt. >>> >>> postgres=# drop subscription sub; >>> ERROR: tuple concurrently updated >>> postgres=# drop subscription sub; >>> NOTICE: dropped replication slot "sub" on publisher >>> DROP SUBSCRIPTION >> >> [Action required within three days. This is a generic notification.] > > It's being worked on. Let's see by Thursday. > Attached fixes it (it was mostly about order of calls). I also split the SetSubscriptionRelState into 2 separate interface while I was changing it, because now that the update_only bool was added it has become quite strange to have single interface for what is basically two separate functions. There are still couple of remaining issues from this thread though. Namely the AccessExclusiveLock of the pg_subscription catalog which is not very pretty, but we need a way to block launcher from accessing the subscription which is being dropped and make sure it will not start new workers for it afterwards. Question is how however as by the time launcher can lock individual subscription it is already processing it. So it looks to me like we'd need to reread the catalog with new snapshot after the lock was acquired which seems bit wasteful (I wonder if we could just AcceptInvalidationMessages and refetch from syscache). Any better ideas? Other related problem is locking of subscriptions during operations on them, especially AlterSubscription seems like it should lock the subscription itself. I did that in 0002. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services From 005eeb820f4e0528513744136582c4489e2429e3 Mon Sep 17 00:00:00 2001 From: Petr Jelinek Date: Wed, 14 Jun 2017 08:14:20 +0200 Subject: [PATCH 2/2] Lock subscription when altering it --- src/backend/commands/subscriptioncmds.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c index 14c8f3f..e0ec8ea 100644 --- a/src/backend/commands/subscriptioncmds.c +++ b/src/backend/commands/subscriptioncmds.c @@ -642,6 +642,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt) stmt->subname); subid = HeapTupleGetOid(tup); + + /* + * Lock the subscription so nobody else can do anything with it (including + * the replication workers). + */ + LockSharedObject(SubscriptionRelationId, subid, 0, AccessExclusiveLock); + sub = GetSubscription(subid, false); /* Form a new tuple. */ -- 2.7.4 From 9011698ae800e0f45f960e91f6b16eab15634fac Mon Sep 17 00:00:00 2001 From: Petr Jelinek Date: Tue, 13 Jun 2017 19:26:51 +0200 Subject: [PATCH 1/2] Improve the pg_subscription_rel handling Split the SetSubscriptionRelState into separate Add and Update functions, removing the unsafe upsert logic as callers are supposed to know whether they are updating or adding new row. Reorder the code in the above mentioned functions to avoid "tuple updated concurrently" warnings. --- src/backend/catalog/pg_subscription.c | 131 +++- src/backend/commands/subscriptioncmds.c | 14 +-- src/backend/replication/logical/tablesync.c | 33 --- src/include/catalog/pg_subscription_rel.h | 6 +- 4 files changed, 98 insertions(+), 86 deletions(-) diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c index c5b2541..fd19675 100644 --- a/src/backend/catalog/pg_subscription.c +++ b/src/backend/catalog/pg_subscription.c @@ -225,24 +225,15 @@ textarray_to_stringlist(ArrayType *textarray) } /* - * Set the state of a subscription table. - * - * If update_only is true and the record for given table doesn't exist, do - * nothing. This can be used to avoid inserting a new record that was deleted - * by someone else. Generally, subscription DDL commands should use false, - * workers should use true. - * - * The insert-or-update logic in this function is not concurrency safe so it - * might raise an error in rare circumstances. But if we took a stronger lock - * such as ShareRowExclusiveLock, we would risk more deadlocks. + * Add new state record for a subscription table. */ Oid -SetSubscriptionRelState(Oid subid, Oid relid, char state, - XLogRecPtr sublsn, bool update_only) +AddSubscriptionRelState(Oid subid, Oid relid, char state, + XLogRecPtr sublsn) { Relation rel; HeapTuple tup; - Oid subrelid = InvalidOid; + Oid subrelid; bool nulls[Natts_p
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On 6/13/17 02:33, Noah Misch wrote: >> Steps to reproduce - >> X cluster -> create 100 tables , publish all tables (create publication pub >> for all tables); >> Y Cluster -> create 100 tables ,create subscription(create subscription sub >> connection 'user=centos host=localhost' publication pub; >> Y cluster ->drop subscription - drop subscription sub; >> >> check the log file on Y cluster. >> >> Sometime , i have seen this error on psql prompt and drop subscription >> operation got failed at first attempt. >> >> postgres=# drop subscription sub; >> ERROR: tuple concurrently updated >> postgres=# drop subscription sub; >> NOTICE: dropped replication slot "sub" on publisher >> DROP SUBSCRIPTION > > [Action required within three days. This is a generic notification.] It's being worked on. Let's see by Thursday. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On Thu, May 25, 2017 at 01:59:51PM +0530, tushar wrote: > After applying all your patches, drop subscription no more hangs while > dropping subscription but there is an error "ERROR: tuple concurrently > updated" in the log file of > standby. > > --- > logical replication synchronization worker finished processing > 2017-05-25 09:15:52.654 BST [18575] LOG: logical replication > synchronization worker finished processing > 2017-05-25 09:15:52.656 BST [18563] LOG: starting logical replication > worker for subscription "sub" > 2017-05-25 09:15:52.662 BST [18577] LOG: logical replication sync for > subscription sub, table t14 started > 2017-05-25 09:15:53.657 BST [18563] LOG: starting logical replication > worker for subscription "sub" > 2017-05-25 09:15:53.663 BST [18579] LOG: logical replication sync for > subscription sub, table t15 started > 2017-05-25 09:15:53.724 BST [18563] FATAL: terminating logical replication > worker due to administrator command > 2017-05-25 09:15:53.725 BST [18521] LOG: worker process: logical > replication worker for subscription 16684 (PID 18563) exited with exit code > 1 > 2017-05-25 09:15:54.734 BST [18579] ERROR: tuple concurrently updated > 2017-05-25 09:15:54.735 BST [18577] ERROR: tuple concurrently updated > 2017-05-25 09:15:54.736 BST [18521] LOG: worker process: logical > replication worker for subscription 16684 sync 16426 (PID 18579) exited with > exit code 1 > 2017-05-25 09:15:54.736 BST [18521] LOG: worker process: logical > replication worker for subscription 16684 sync 16423 (PID 18577) exited with > exit code 1 > ~ > ~ > ~ > > Steps to reproduce - > X cluster -> create 100 tables , publish all tables (create publication pub > for all tables); > Y Cluster -> create 100 tables ,create subscription(create subscription sub > connection 'user=centos host=localhost' publication pub; > Y cluster ->drop subscription - drop subscription sub; > > check the log file on Y cluster. > > Sometime , i have seen this error on psql prompt and drop subscription > operation got failed at first attempt. > > postgres=# drop subscription sub; > ERROR: tuple concurrently updated > postgres=# drop subscription sub; > NOTICE: dropped replication slot "sub" on publisher > DROP SUBSCRIPTION [Action required within three days. This is a generic notification.] The above-described topic is currently a PostgreSQL 10 open item. Peter, since you committed the patch believed to have created it, you own this open item. If some other commit is more relevant or if this does not belong as a v10 open item, please let us know. Otherwise, please observe the policy on open item ownership[1] and send a status update within three calendar days of this message. Include a date for your subsequent status update. Testers may discover new open items at any time, and I want to plan to get them all fixed well in advance of shipping v10. Consequently, I will appreciate your efforts toward speedy resolution. Thanks. [1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On 2017-06-07 11:51:12 +0200, Petr Jelinek wrote: > On 07/06/17 03:00, Andres Freund wrote: > > On 2017-06-06 19:36:13 +0200, Petr Jelinek wrote: > > > >> As a side note, we are starting to have several IsSomeTypeOfProcess > >> functions for these kind of things. I wonder if bgworker infrastructure > >> should somehow provide this type of stuff (the proposed bgw_type might > >> help there) as well as maybe being able to register interrupt handler > >> for the worker (it's really hard to do it via custom SIGTERM handler as > >> visible on worker, launcher and walsender issues we are fixing). > >> Obviously this is PG11 related thought. > > > > Maybe it's also a sign that the bgworker infrastructure isn't really the > > best thing to use for internal processes like parallel workers and lrep > > processes - it's really built so core code *doesn't* know anything > > specific about them, which isn't really what we want in some of these > > cases. That's not to say that bgworkers and parallelism/lrep shouldn't > > share infrastructure, don't get me wrong. > > > > Well the nice thing about bgworkers is that it provides the basic > infrastructure. Right. > Main reason lrep stuff is using it is that I didn't want to add > another special backend kind to 20 different places, but turns out it > still needs to be in some. So either we need to add more support for > these kind of things to bgworkers or write something for internal > workers that shares the infrastructure with bgworkers as you say. Yea, I think we should radically unify a lot of the related infrastructure between all processes. We've grown a lot of duplication. > >> @@ -2848,6 +2849,14 @@ ProcessInterrupts(void) > >>ereport(FATAL, > >>(errcode(ERRCODE_ADMIN_SHUTDOWN), > >> errmsg("terminating logical > >> replication worker due to administrator command"))); > >> + else if (IsLogicalLauncher()) > >> + { > >> + ereport(DEBUG1, > >> + (errmsg("logical replication launcher > >> shutting down"))); > >> + > >> + /* The logical replication launcher can be stopped at > >> any time. */ > >> + proc_exit(0); > >> + } > > > > We could use PgBackendStatus->st_backendType for these, merging these > > different paths. > > > > Hmm, that's not that easy, st_backendType will be generic type for > bgworker as the bgw_type patch didn't land yet (which is discussed in > yet another thread [1]). It seems like an argument for going forward > with it (the bgw_type patch) in PG10. Yea. I left it as is for now. > I don't mind, it has some overlap with your proposed fixes for latching > so if you are willing go ahead. Done. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On 07/06/17 03:00, Andres Freund wrote: > On 2017-06-06 19:36:13 +0200, Petr Jelinek wrote: > >> As a side note, we are starting to have several IsSomeTypeOfProcess >> functions for these kind of things. I wonder if bgworker infrastructure >> should somehow provide this type of stuff (the proposed bgw_type might >> help there) as well as maybe being able to register interrupt handler >> for the worker (it's really hard to do it via custom SIGTERM handler as >> visible on worker, launcher and walsender issues we are fixing). >> Obviously this is PG11 related thought. > > Maybe it's also a sign that the bgworker infrastructure isn't really the > best thing to use for internal processes like parallel workers and lrep > processes - it's really built so core code *doesn't* know anything > specific about them, which isn't really what we want in some of these > cases. That's not to say that bgworkers and parallelism/lrep shouldn't > share infrastructure, don't get me wrong. > Well the nice thing about bgworkers is that it provides the basic infrastructure. Main reason lrep stuff is using it is that I didn't want to add another special backend kind to 20 different places, but turns out it still needs to be in some. So either we need to add more support for these kind of things to bgworkers or write something for internal workers that shares the infrastructure with bgworkers as you say. > >> >> -LogicalRepCtx->launcher_pid = 0; >> - >> -/* ... and if it returns, we're done */ >> -ereport(DEBUG1, >> -(errmsg("logical replication launcher shutting down"))); >> +/* Not reachable */ >> +} > > Maybe put a pg_unreachable() there? > Ah didn't know it existed. > >> @@ -2848,6 +2849,14 @@ ProcessInterrupts(void) >> ereport(FATAL, >> (errcode(ERRCODE_ADMIN_SHUTDOWN), >> errmsg("terminating logical >> replication worker due to administrator command"))); >> +else if (IsLogicalLauncher()) >> +{ >> +ereport(DEBUG1, >> +(errmsg("logical replication launcher >> shutting down"))); >> + >> +/* The logical replication launcher can be stopped at >> any time. */ >> +proc_exit(0); >> +} > > We could use PgBackendStatus->st_backendType for these, merging these > different paths. > Hmm, that's not that easy, st_backendType will be generic type for bgworker as the bgw_type patch didn't land yet (which is discussed in yet another thread [1]). It seems like an argument for going forward with it (the bgw_type patch) in PG10. > I can take care of this one, if you/Peter want. > I don't mind, it has some overlap with your proposed fixes for latching so if you are willing go ahead. [1] https://www.postgresql.org/message-id/flat/0d795703-a885-2193-2331-f00d7a3a4...@2ndquadrant.com#0d795703-a885-2193-2331-f00d7a3a4...@2ndquadrant.com -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On 2017-06-06 19:36:13 +0200, Petr Jelinek wrote: > So the fact that we moved workers to standard interrupt handling broke > launcher in subtle ways because it still uses it's own SIGTERM handling > but some function it calls are using CHECK_FOR_INTERRUPTS (they are used > by worker as well). I think we need to move launcher to standard > interrupt handling as well. Sounds like a good plan. > As a side note, we are starting to have several IsSomeTypeOfProcess > functions for these kind of things. I wonder if bgworker infrastructure > should somehow provide this type of stuff (the proposed bgw_type might > help there) as well as maybe being able to register interrupt handler > for the worker (it's really hard to do it via custom SIGTERM handler as > visible on worker, launcher and walsender issues we are fixing). > Obviously this is PG11 related thought. Maybe it's also a sign that the bgworker infrastructure isn't really the best thing to use for internal processes like parallel workers and lrep processes - it's really built so core code *doesn't* know anything specific about them, which isn't really what we want in some of these cases. That's not to say that bgworkers and parallelism/lrep shouldn't share infrastructure, don't get me wrong. > > - LogicalRepCtx->launcher_pid = 0; > - > - /* ... and if it returns, we're done */ > - ereport(DEBUG1, > - (errmsg("logical replication launcher shutting down"))); > + /* Not reachable */ > +} Maybe put a pg_unreachable() there? > @@ -2848,6 +2849,14 @@ ProcessInterrupts(void) > ereport(FATAL, > (errcode(ERRCODE_ADMIN_SHUTDOWN), >errmsg("terminating logical > replication worker due to administrator command"))); > + else if (IsLogicalLauncher()) > + { > + ereport(DEBUG1, > + (errmsg("logical replication launcher > shutting down"))); > + > + /* The logical replication launcher can be stopped at > any time. */ > + proc_exit(0); > + } We could use PgBackendStatus->st_backendType for these, merging these different paths. I can take care of this one, if you/Peter want. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On 03/06/17 18:53, Peter Eisentraut wrote: > On 6/2/17 14:52, Peter Eisentraut wrote: >> On 5/24/17 15:14, Petr Jelinek wrote: >>> All the locking works just fine the way it is in master. The issue with >>> deadlock with apply comes from the wrong handling of the SIGTERM in the >>> apply (we didn't set InterruptPending). I changed the SIGTERM handler in >>> patch 0001 just to die which is actually the correct behavior for apply >>> workers. I also moved the connection cleanup code to the >>> before_shmem_exit callback (similar to walreceiver) and now that part >>> works correctly. >> >> I have committed this, in two separate parts. This should fix the >> originally reported issue. >> >> I will continue to work through your other patches. I notice there is >> still a bit of discussion about another patch, so please let me know if >> there is anything else I should be looking for. > > I have committed the remaining two patches. I believe this fixes the > originally reported issue. > So the fact that we moved workers to standard interrupt handling broke launcher in subtle ways because it still uses it's own SIGTERM handling but some function it calls are using CHECK_FOR_INTERRUPTS (they are used by worker as well). I think we need to move launcher to standard interrupt handling as well. It's not same as other processes though as it's allowed to be terminated any time (just like autovacuum launcher) so we just proc_exit(0) instead of FATALing out. This is related to the nightjar failures btw. As a side note, we are starting to have several IsSomeTypeOfProcess functions for these kind of things. I wonder if bgworker infrastructure should somehow provide this type of stuff (the proposed bgw_type might help there) as well as maybe being able to register interrupt handler for the worker (it's really hard to do it via custom SIGTERM handler as visible on worker, launcher and walsender issues we are fixing). Obviously this is PG11 related thought. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services From 3fb7ed57d669beba3feba894a11c9dcff8e60414 Mon Sep 17 00:00:00 2001 From: Petr Jelinek Date: Tue, 6 Jun 2017 19:26:12 +0200 Subject: [PATCH] Use standard interrupt handling in logical replication launcher --- src/backend/replication/logical/launcher.c | 41 -- src/backend/tcop/postgres.c| 9 +++ src/include/replication/logicallauncher.h | 2 ++ 3 files changed, 27 insertions(+), 25 deletions(-) diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c index 5aaf24b..52169f1 100644 --- a/src/backend/replication/logical/launcher.c +++ b/src/backend/replication/logical/launcher.c @@ -81,7 +81,6 @@ static void logicalrep_worker_cleanup(LogicalRepWorker *worker); /* Flags set by signal handlers */ static volatile sig_atomic_t got_SIGHUP = false; -static volatile sig_atomic_t got_SIGTERM = false; static bool on_commit_launcher_wakeup = false; @@ -623,20 +622,6 @@ logicalrep_worker_onexit(int code, Datum arg) ApplyLauncherWakeup(); } -/* SIGTERM: set flag to exit at next convenient time */ -static void -logicalrep_launcher_sigterm(SIGNAL_ARGS) -{ - int save_errno = errno; - - got_SIGTERM = true; - - /* Waken anything waiting on the process latch */ - SetLatch(MyLatch); - - errno = save_errno; -} - /* SIGHUP: set flag to reload configuration at next convenient time */ static void logicalrep_launcher_sighup(SIGNAL_ARGS) @@ -798,13 +783,14 @@ ApplyLauncherMain(Datum main_arg) before_shmem_exit(logicalrep_launcher_onexit, (Datum) 0); + Assert(LogicalRepCtx->launcher_pid == 0); + LogicalRepCtx->launcher_pid = MyProcPid; + /* Establish signal handlers. */ pqsignal(SIGHUP, logicalrep_launcher_sighup); - pqsignal(SIGTERM, logicalrep_launcher_sigterm); + pqsignal(SIGTERM, die); BackgroundWorkerUnblockSignals(); - LogicalRepCtx->launcher_pid = MyProcPid; - /* * Establish connection to nailed catalogs (we only ever access * pg_subscription). @@ -812,7 +798,7 @@ ApplyLauncherMain(Datum main_arg) BackgroundWorkerInitializeConnection(NULL, NULL); /* Enter main loop */ - while (!got_SIGTERM) + for (;;) { int rc; List *sublist; @@ -822,6 +808,8 @@ ApplyLauncherMain(Datum main_arg) TimestampTz now; long wait_time = DEFAULT_NAPTIME_PER_CYCLE; + CHECK_FOR_INTERRUPTS(); + now = GetCurrentTimestamp(); /* Limit the start retry to once a wal_retrieve_retry_interval */ @@ -894,13 +882,16 @@ ApplyLauncherMain(Datum main_arg) ResetLatch(&MyProc->procLatch); } - LogicalRepCtx->launcher_pid = 0; - - /* ... and if it returns, we're done */ - ereport(DEBUG1, - (errmsg("logical replication launcher shutting down"))); + /* Not reachable */ +} - proc_exit(0); +/* + * Is current process the logical replication launcher? + */ +bool +IsLogicalLauncher(void) +{ + return LogicalRepCtx->launcher_pid
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On Sun, Jun 4, 2017 at 1:53 AM, Peter Eisentraut wrote: > On 6/2/17 14:52, Peter Eisentraut wrote: >> On 5/24/17 15:14, Petr Jelinek wrote: >>> All the locking works just fine the way it is in master. The issue with >>> deadlock with apply comes from the wrong handling of the SIGTERM in the >>> apply (we didn't set InterruptPending). I changed the SIGTERM handler in >>> patch 0001 just to die which is actually the correct behavior for apply >>> workers. I also moved the connection cleanup code to the >>> before_shmem_exit callback (similar to walreceiver) and now that part >>> works correctly. >> >> I have committed this, in two separate parts. This should fix the >> originally reported issue. >> >> I will continue to work through your other patches. I notice there is >> still a bit of discussion about another patch, so please let me know if >> there is anything else I should be looking for. > > I have committed the remaining two patches. I believe this fixes the > originally reported issue. > IIUC the issue that sync worker could be orphaned and keep running inside the long COPY is not fixed yet by commit 3c9bc2157a4f465b3c070d1250597568d2dc285f, and should be fixed. Am I missing something? Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On 6/2/17 14:52, Peter Eisentraut wrote: > On 5/24/17 15:14, Petr Jelinek wrote: >> All the locking works just fine the way it is in master. The issue with >> deadlock with apply comes from the wrong handling of the SIGTERM in the >> apply (we didn't set InterruptPending). I changed the SIGTERM handler in >> patch 0001 just to die which is actually the correct behavior for apply >> workers. I also moved the connection cleanup code to the >> before_shmem_exit callback (similar to walreceiver) and now that part >> works correctly. > > I have committed this, in two separate parts. This should fix the > originally reported issue. > > I will continue to work through your other patches. I notice there is > still a bit of discussion about another patch, so please let me know if > there is anything else I should be looking for. I have committed the remaining two patches. I believe this fixes the originally reported issue. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On 5/24/17 15:14, Petr Jelinek wrote: > All the locking works just fine the way it is in master. The issue with > deadlock with apply comes from the wrong handling of the SIGTERM in the > apply (we didn't set InterruptPending). I changed the SIGTERM handler in > patch 0001 just to die which is actually the correct behavior for apply > workers. I also moved the connection cleanup code to the > before_shmem_exit callback (similar to walreceiver) and now that part > works correctly. I have committed this, in two separate parts. This should fix the originally reported issue. I will continue to work through your other patches. I notice there is still a bit of discussion about another patch, so please let me know if there is anything else I should be looking for. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On Thu, Jun 01, 2017 at 12:00:33AM -0400, Peter Eisentraut wrote: > On 5/31/17 02:51, Noah Misch wrote: > > On Tue, May 30, 2017 at 01:30:35AM +, Noah Misch wrote: > >> On Thu, May 18, 2017 at 10:27:51PM -0400, Peter Eisentraut wrote: > >>> On 5/18/17 11:11, Noah Misch wrote: > IMMEDIATE ATTENTION REQUIRED. This PostgreSQL 10 open item is past due > for > your status update. Please reacquaint yourself with the policy on open > item > ownership[1] and then reply immediately. If I do not hear from you by > 2017-05-19 16:00 UTC, I will transfer this item to release management > team > ownership without further notice. > >>> > >>> There is no progress on this issue at the moment. I will report again > >>> next Wednesday. > >> > >> IMMEDIATE ATTENTION REQUIRED. This PostgreSQL 10 open item is long past > >> due > >> for your status update. If I do not hear one from you by 2017-05-31 02:00 > >> UTC, I will transfer this item to release management team ownership without > >> further notice. > > > > This PostgreSQL 10 open item now needs a permanent owner. Would any other > > committer like to take ownership? If this role interests you, please read > > this thread and the policy linked above, then send an initial status update > > bearing a date for your subsequent status update. If the item does not > > have a > > permanent owner by 2017-06-03 07:00 UTC, I will investigate feature removals > > that would resolve the item. > > It seems I lost track of this item between all the other ones. I will > continue to work on this item. We have patches proposed and I will work > on committing them until Friday. If any other committer cares about logical replication features in v10, I'd recommend he take ownership of this open item despite your plan to work on it. Otherwise, if you miss fixing this on Friday, it will be too late for others to volunteer. > I think I now have updates posted on all my items. As of your writing that, two of your open items had no conforming status update, and they still don't: - Background worker display in pg_stat_activity (logical replication especially) - ALTER SUBSCRIPTION REFRESH and table sync worker -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On 01/06/17 17:32, Masahiko Sawada wrote: > On Thu, May 25, 2017 at 5:29 PM, tushar wrote: >> On 05/25/2017 12:44 AM, Petr Jelinek wrote: >>> >>> There is still outstanding issue that sync worker will keep running >>> inside the long COPY because the invalidation messages are also not >>> processed until it finishes but all the original issues reported here >>> disappear for me with the attached patches applied. >> >> After applying all your patches, drop subscription no more hangs while >> dropping subscription but there is an error "ERROR: tuple concurrently >> updated" in the log file of >> standby. >> > > I tried to reproduce this issue with latest four patches I submit but > it didn't happen. I guess this issue doesn't related to the issues > reported on this thread and another factor might be cause of this > issue. It would be good to test the same again with latest four > patches or after solved current some open items. > That's because your additional patch kills the workers that do the concurrent update. While that's probably okay, I still plan to look into making the subscription and state locking more robust. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On Thu, May 25, 2017 at 5:29 PM, tushar wrote: > On 05/25/2017 12:44 AM, Petr Jelinek wrote: >> >> There is still outstanding issue that sync worker will keep running >> inside the long COPY because the invalidation messages are also not >> processed until it finishes but all the original issues reported here >> disappear for me with the attached patches applied. > > After applying all your patches, drop subscription no more hangs while > dropping subscription but there is an error "ERROR: tuple concurrently > updated" in the log file of > standby. > I tried to reproduce this issue with latest four patches I submit but it didn't happen. I guess this issue doesn't related to the issues reported on this thread and another factor might be cause of this issue. It would be good to test the same again with latest four patches or after solved current some open items. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On 5/31/17 02:51, Noah Misch wrote: > On Tue, May 30, 2017 at 01:30:35AM +, Noah Misch wrote: >> On Thu, May 18, 2017 at 10:27:51PM -0400, Peter Eisentraut wrote: >>> On 5/18/17 11:11, Noah Misch wrote: IMMEDIATE ATTENTION REQUIRED. This PostgreSQL 10 open item is past due for your status update. Please reacquaint yourself with the policy on open item ownership[1] and then reply immediately. If I do not hear from you by 2017-05-19 16:00 UTC, I will transfer this item to release management team ownership without further notice. >>> >>> There is no progress on this issue at the moment. I will report again >>> next Wednesday. >> >> IMMEDIATE ATTENTION REQUIRED. This PostgreSQL 10 open item is long past due >> for your status update. If I do not hear one from you by 2017-05-31 02:00 >> UTC, I will transfer this item to release management team ownership without >> further notice. > > This PostgreSQL 10 open item now needs a permanent owner. Would any other > committer like to take ownership? If this role interests you, please read > this thread and the policy linked above, then send an initial status update > bearing a date for your subsequent status update. If the item does not have a > permanent owner by 2017-06-03 07:00 UTC, I will investigate feature removals > that would resolve the item. It seems I lost track of this item between all the other ones. I will continue to work on this item. We have patches proposed and I will work on committing them until Friday. I think I now have updates posted on all my items. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On Thu, May 25, 2017 at 4:14 AM, Petr Jelinek wrote: > Hi, > > I finally had time to properly analyze this, and turns out we've been > all just trying to fix symptoms and the actual problems. > > All the locking works just fine the way it is in master. The issue with > deadlock with apply comes from the wrong handling of the SIGTERM in the > apply (we didn't set InterruptPending). I changed the SIGTERM handler in > patch 0001 just to die which is actually the correct behavior for apply > workers. I also moved the connection cleanup code to the > before_shmem_exit callback (similar to walreceiver) and now that part > works correctly. > > The issue with orphaned sync workers is actually two separate issues. > First, due to thinko we always searched for sync worker in > wait_for_sync_status_change instead of searching for opposite worker as > was the intention (i.e. sync worker should search for apply and apply > should search for sync). Thats fixed by 0002. And second, we didn't > accept any invalidation messages until the whole sync process finished > (because it flattens all the remote transactions in the single one) so > sync worker didn't learn about subscription changes/drop until it has > finished, which I now fixed in 0003. > > There is still outstanding issue that sync worker will keep running > inside the long COPY because the invalidation messages are also not > processed until it finishes but all the original issues reported here > disappear for me with the attached patches applied. > These patches conflict with current HEAD, I attached updated version patches. Also, the issue that sync worker will keep running inside the long COPY can lead the another problem that the user could not create new subscription with some workers due to not enough free logical replication worker slots until the long COPY finishes. Attached 0004 patch is the updated version patch I submitted before. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center 0001-Fix-signal-handling-in-logical-workers.patch Description: Binary data 0002-Make-tablesync-worker-exit-when-apply-dies-while-it-.patch Description: Binary data 0003-Receive-invalidation-messages-correctly-in-tablesync.patch Description: Binary data 0004-Wait-for-table-sync-worker-to-finish-when-apply-work.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On Tue, May 30, 2017 at 01:30:35AM +, Noah Misch wrote: > On Thu, May 18, 2017 at 10:27:51PM -0400, Peter Eisentraut wrote: > > On 5/18/17 11:11, Noah Misch wrote: > > > IMMEDIATE ATTENTION REQUIRED. This PostgreSQL 10 open item is past due > > > for > > > your status update. Please reacquaint yourself with the policy on open > > > item > > > ownership[1] and then reply immediately. If I do not hear from you by > > > 2017-05-19 16:00 UTC, I will transfer this item to release management team > > > ownership without further notice. > > > > There is no progress on this issue at the moment. I will report again > > next Wednesday. > > IMMEDIATE ATTENTION REQUIRED. This PostgreSQL 10 open item is long past due > for your status update. If I do not hear one from you by 2017-05-31 02:00 > UTC, I will transfer this item to release management team ownership without > further notice. This PostgreSQL 10 open item now needs a permanent owner. Would any other committer like to take ownership? If this role interests you, please read this thread and the policy linked above, then send an initial status update bearing a date for your subsequent status update. If the item does not have a permanent owner by 2017-06-03 07:00 UTC, I will investigate feature removals that would resolve the item. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On Thu, May 18, 2017 at 10:27:51PM -0400, Peter Eisentraut wrote: > On 5/18/17 11:11, Noah Misch wrote: > > IMMEDIATE ATTENTION REQUIRED. This PostgreSQL 10 open item is past due for > > your status update. Please reacquaint yourself with the policy on open item > > ownership[1] and then reply immediately. If I do not hear from you by > > 2017-05-19 16:00 UTC, I will transfer this item to release management team > > ownership without further notice. > > There is no progress on this issue at the moment. I will report again > next Wednesday. IMMEDIATE ATTENTION REQUIRED. This PostgreSQL 10 open item is long past due for your status update. If I do not hear one from you by 2017-05-31 02:00 UTC, I will transfer this item to release management team ownership without further notice. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On 05/25/2017 12:44 AM, Petr Jelinek wrote: There is still outstanding issue that sync worker will keep running inside the long COPY because the invalidation messages are also not processed until it finishes but all the original issues reported here disappear for me with the attached patches applied. After applying all your patches, drop subscription no more hangs while dropping subscription but there is an error "ERROR: tuple concurrently updated" in the log file of standby. --- logical replication synchronization worker finished processing 2017-05-25 09:15:52.654 BST [18575] LOG: logical replication synchronization worker finished processing 2017-05-25 09:15:52.656 BST [18563] LOG: starting logical replication worker for subscription "sub" 2017-05-25 09:15:52.662 BST [18577] LOG: logical replication sync for subscription sub, table t14 started 2017-05-25 09:15:53.657 BST [18563] LOG: starting logical replication worker for subscription "sub" 2017-05-25 09:15:53.663 BST [18579] LOG: logical replication sync for subscription sub, table t15 started 2017-05-25 09:15:53.724 BST [18563] FATAL: terminating logical replication worker due to administrator command 2017-05-25 09:15:53.725 BST [18521] LOG: worker process: logical replication worker for subscription 16684 (PID 18563) exited with exit code 1 2017-05-25 09:15:54.734 BST [18579] ERROR: tuple concurrently updated 2017-05-25 09:15:54.735 BST [18577] ERROR: tuple concurrently updated 2017-05-25 09:15:54.736 BST [18521] LOG: worker process: logical replication worker for subscription 16684 sync 16426 (PID 18579) exited with exit code 1 2017-05-25 09:15:54.736 BST [18521] LOG: worker process: logical replication worker for subscription 16684 sync 16423 (PID 18577) exited with exit code 1 ~ ~ ~ Steps to reproduce - X cluster -> create 100 tables , publish all tables (create publication pub for all tables); Y Cluster -> create 100 tables ,create subscription(create subscription sub connection 'user=centos host=localhost' publication pub; Y cluster ->drop subscription - drop subscription sub; check the log file on Y cluster. Sometime , i have seen this error on psql prompt and drop subscription operation got failed at first attempt. postgres=# drop subscription sub; ERROR: tuple concurrently updated postgres=# drop subscription sub; NOTICE: dropped replication slot "sub" on publisher DROP SUBSCRIPTION -- regards,tushar EnterpriseDB https://www.enterprisedb.com/ The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On Wed, May 24, 2017 at 3:14 PM, Petr Jelinek wrote: > Hi, > > I finally had time to properly analyze this, and turns out we've been > all just trying to fix symptoms and the actual problems. Thank you for the patches! > All the locking works just fine the way it is in master. > The issue with > deadlock with apply comes from the wrong handling of the SIGTERM in the > apply (we didn't set InterruptPending). I changed the SIGTERM handler in > patch 0001 just to die which is actually the correct behavior for apply > workers. I also moved the connection cleanup code to the > before_shmem_exit callback (similar to walreceiver) and now that part > works correctly. > > The issue with orphaned sync workers is actually two separate issues. > First, due to thinko we always searched for sync worker in > wait_for_sync_status_change instead of searching for opposite worker as > was the intention (i.e. sync worker should search for apply and apply > should search for sync). Thats fixed by 0002. And second, we didn't > accept any invalidation messages until the whole sync process finished > (because it flattens all the remote transactions in the single one) so > sync worker didn't learn about subscription changes/drop until it has > finished, which I now fixed in 0003. > > There is still outstanding issue that sync worker will keep running > inside the long COPY because the invalidation messages are also not > processed until it finishes but all the original issues reported here > disappear for me with the attached patches applied. > The issue reported on this thread seems to be solved with your patch. But because DROP SUBSCRIPTION is only one DDL command that acquires lock on system catalog with AccessExclusiveLock and the logical replication mechanism is complex, I'm concerned that there might be another potential deadlock issue due to acquire lock on system catalog with strong lock level. It might be good idea to do either fine-grained locking or reducing lock level or both. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
Hi, I finally had time to properly analyze this, and turns out we've been all just trying to fix symptoms and the actual problems. All the locking works just fine the way it is in master. The issue with deadlock with apply comes from the wrong handling of the SIGTERM in the apply (we didn't set InterruptPending). I changed the SIGTERM handler in patch 0001 just to die which is actually the correct behavior for apply workers. I also moved the connection cleanup code to the before_shmem_exit callback (similar to walreceiver) and now that part works correctly. The issue with orphaned sync workers is actually two separate issues. First, due to thinko we always searched for sync worker in wait_for_sync_status_change instead of searching for opposite worker as was the intention (i.e. sync worker should search for apply and apply should search for sync). Thats fixed by 0002. And second, we didn't accept any invalidation messages until the whole sync process finished (because it flattens all the remote transactions in the single one) so sync worker didn't learn about subscription changes/drop until it has finished, which I now fixed in 0003. There is still outstanding issue that sync worker will keep running inside the long COPY because the invalidation messages are also not processed until it finishes but all the original issues reported here disappear for me with the attached patches applied. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services 0003-Receive-invalidation-messages-correctly-in-tablesync.patch Description: binary/octet-stream 0002-Make-tablesync-worker-exit-when-apply-dies-while-it-.patch Description: binary/octet-stream 0001-Fix-signal-handling-in-logical-workers.patch Description: binary/octet-stream -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On 18/05/17 16:16, Robert Haas wrote: > On Wed, May 17, 2017 at 6:58 AM, Masahiko Sawada > wrote: >> I think the above changes can solve this issue but It seems to me that >> holding AccessExclusiveLock on pg_subscription by DROP SUBSCRIPTION >> until commit could lead another deadlock problem in the future. So I'd >> to contrive ways to reduce lock level somehow if possible. For >> example, if we change the apply launcher so that it gets the >> subscription list only when pg_subscription gets invalid, apply >> launcher cannot try to launch the apply worker being stopped. We >> invalidate pg_subscription at commit of DROP SUBSCRIPTION and the >> apply launcher can get new subscription list which doesn't include the >> entry we removed. That way we can reduce lock level to >> ShareUpdateExclusiveLock and solve this issue. >> Also in your patch, we need to change DROP SUBSCRIPTION as well to >> resolve another case I encountered, where DROP SUBSCRIPTION waits for >> apply worker while holding a tuple lock on pg_subscription_rel and the >> apply worker waits for same tuple on pg_subscription_rel in >> SetSubscriptionRelState(). > > I don't really understand the issue being discussed here in any > detail, but as a general point I'd say that it might be more > productive to make the locks finer-grained rather than struggling to > reduce the lock level. For example, instead of locking all of > pg_subscription, use LockSharedObject() to lock the individual > subscription, still with AccessExclusiveLock. That means that other > accesses to that subscription also need to take a lock so that you > actually get a conflict when there should be one, but that should be > doable. I expect that trying to manage locking conflicts using only > catalog-wide locks is a doomed strategy. We do LockSharedObject() but it's rather useless the way it's done now as no other access locks it. We can't block all other accesses however, the workers need to be able to access the catalog during clean shutdown in some situations. What we need is to block starting of new workers for that subscription so only those code paths would need to block. So I think we might want to do both finer-grained locking and decreasing lock level. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
Hello, At Thu, 18 May 2017 10:16:35 -0400, Robert Haas wrote in > On Wed, May 17, 2017 at 6:58 AM, Masahiko Sawada > wrote: > > I think the above changes can solve this issue but It seems to me that > > holding AccessExclusiveLock on pg_subscription by DROP SUBSCRIPTION > > until commit could lead another deadlock problem in the future. So I'd > > to contrive ways to reduce lock level somehow if possible. For > > example, if we change the apply launcher so that it gets the > > subscription list only when pg_subscription gets invalid, apply > > launcher cannot try to launch the apply worker being stopped. We > > invalidate pg_subscription at commit of DROP SUBSCRIPTION and the > > apply launcher can get new subscription list which doesn't include the > > entry we removed. That way we can reduce lock level to > > ShareUpdateExclusiveLock and solve this issue. > > Also in your patch, we need to change DROP SUBSCRIPTION as well to > > resolve another case I encountered, where DROP SUBSCRIPTION waits for > > apply worker while holding a tuple lock on pg_subscription_rel and the > > apply worker waits for same tuple on pg_subscription_rel in > > SetSubscriptionRelState(). Sorry, I don't have enough time to consider this profoundly. Perhaps will return later. > I don't really understand the issue being discussed here in any > detail, but as a general point I'd say that it might be more > productive to make the locks finer-grained rather than struggling to > reduce the lock level. For example, instead of locking all of > pg_subscription, use LockSharedObject() to lock the individual > subscription, still with AccessExclusiveLock. That means that other > accesses to that subscription also need to take a lock so that you > actually get a conflict when there should be one, but that should be > doable. I expect that trying to manage locking conflicts using only > catalog-wide locks is a doomed strategy. Thank you for the suggestion. I think it is a bit differnt from that. The problem here is that a replication worker may try reading exactly the tuple for the subscription being deleted just before responding to a received termination request. So the finer-graind lock doesn't help. The focus of resolving this is preventing blocking of workers caused by DROP SUBSCRIPTION. So Sadasan's patch immediately released the lock on pg_subscrption and uses another lock for exclusion. My patch just give up to read the catalog when not available. regards, -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On 5/18/17 11:11, Noah Misch wrote: > IMMEDIATE ATTENTION REQUIRED. This PostgreSQL 10 open item is past due for > your status update. Please reacquaint yourself with the policy on open item > ownership[1] and then reply immediately. If I do not hear from you by > 2017-05-19 16:00 UTC, I will transfer this item to release management team > ownership without further notice. There is no progress on this issue at the moment. I will report again next Wednesday. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On Mon, May 15, 2017 at 03:28:14AM +, Noah Misch wrote: > On Mon, May 08, 2017 at 06:27:30PM +0900, Masahiko Sawada wrote: > > I encountered a situation where DROP SUBSCRIPTION got stuck when > > initial table sync is in progress. In my environment, I created > > several tables with some data on publisher. I created subscription on > > subscriber and drop subscription immediately after that. It doesn't > > always happen but I often encountered it on my environment. > > > > ps -x command shows the following. > > > > 96796 ?Ss 0:00 postgres: masahiko postgres [local] DROP > > SUBSCRIPTION > > 96801 ?Ts 0:00 postgres: bgworker: logical replication > > worker for subscription 40993waiting > > 96805 ?Ss 0:07 postgres: bgworker: logical replication > > worker for subscription 40993 sync 16418 > > 96806 ?Ss 0:01 postgres: wal sender process masahiko [local] > > idle > > 96807 ?Ss 0:00 postgres: bgworker: logical replication > > worker for subscription 40993 sync 16421 > > 96808 ?Ss 0:00 postgres: wal sender process masahiko [local] > > idle > > > > The DROP SUBSCRIPTION process (pid 96796) is waiting for the apply > > worker process (pid 96801) to stop while holding a lock on > > pg_subscription_rel. On the other hand the apply worker is waiting for > > acquiring a tuple lock on pg_subscription_rel needed for heap_update. > > Also table sync workers (pid 96805 and 96807) are waiting for the > > apply worker process to change their status. > > > > Also, even when DROP SUBSCRIPTION is done successfully, the table sync > > worker can be orphaned because I guess that the apply worker can exit > > before change status of table sync worker. > > > > I'm using 1f30295. > > [Action required within three days. This is a generic notification.] > > The above-described topic is currently a PostgreSQL 10 open item. Peter, > since you committed the patch believed to have created it, you own this open > item. If some other commit is more relevant or if this does not belong as a > v10 open item, please let us know. Otherwise, please observe the policy on > open item ownership[1] and send a status update within three calendar days of > this message. Include a date for your subsequent status update. Testers may > discover new open items at any time, and I want to plan to get them all fixed > well in advance of shipping v10. Consequently, I will appreciate your efforts > toward speedy resolution. Thanks. > > [1] > https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com IMMEDIATE ATTENTION REQUIRED. This PostgreSQL 10 open item is past due for your status update. Please reacquaint yourself with the policy on open item ownership[1] and then reply immediately. If I do not hear from you by 2017-05-19 16:00 UTC, I will transfer this item to release management team ownership without further notice. [1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On Wed, May 17, 2017 at 6:58 AM, Masahiko Sawada wrote: > I think the above changes can solve this issue but It seems to me that > holding AccessExclusiveLock on pg_subscription by DROP SUBSCRIPTION > until commit could lead another deadlock problem in the future. So I'd > to contrive ways to reduce lock level somehow if possible. For > example, if we change the apply launcher so that it gets the > subscription list only when pg_subscription gets invalid, apply > launcher cannot try to launch the apply worker being stopped. We > invalidate pg_subscription at commit of DROP SUBSCRIPTION and the > apply launcher can get new subscription list which doesn't include the > entry we removed. That way we can reduce lock level to > ShareUpdateExclusiveLock and solve this issue. > Also in your patch, we need to change DROP SUBSCRIPTION as well to > resolve another case I encountered, where DROP SUBSCRIPTION waits for > apply worker while holding a tuple lock on pg_subscription_rel and the > apply worker waits for same tuple on pg_subscription_rel in > SetSubscriptionRelState(). I don't really understand the issue being discussed here in any detail, but as a general point I'd say that it might be more productive to make the locks finer-grained rather than struggling to reduce the lock level. For example, instead of locking all of pg_subscription, use LockSharedObject() to lock the individual subscription, still with AccessExclusiveLock. That means that other accesses to that subscription also need to take a lock so that you actually get a conflict when there should be one, but that should be doable. I expect that trying to manage locking conflicts using only catalog-wide locks is a doomed strategy. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On Mon, May 15, 2017 at 8:02 PM, Kyotaro HORIGUCHI wrote: > Hello, > > At Fri, 12 May 2017 17:24:07 +0900, Masahiko Sawada > wrote in >> On Fri, May 12, 2017 at 11:24 AM, Masahiko Sawada >> wrote: >> > On Thu, May 11, 2017 at 6:16 PM, Petr Jelinek >> > wrote: >> >> On 11/05/17 10:10, Masahiko Sawada wrote: >> >>> On Thu, May 11, 2017 at 4:06 PM, Michael Paquier >> >>> wrote: >> On Wed, May 10, 2017 at 11:57 AM, Masahiko Sawada >> wrote: >> > Barring any objections, I'll add these two issues to open item. >> >> It seems to me that those open items have not been added yet to the >> list. If I am following correctly, they could be defined as follows: >> - Dropping subscription may stuck if done during tablesync. >> -- Analyze deadlock issues with DROP SUBSCRIPTION and apply worker >> process. >> >> >> >> I think the solution to this is to reintroduce the LWLock that was >> >> removed and replaced with the exclusive lock on catalog [1]. I am afraid >> >> that correct way of handling this is to do both LWLock and catalog lock >> >> (first LWLock under which we kill the workers and then catalog lock so >> >> that something that prevents launcher from restarting them is held till >> >> the end of transaction). >> > >> > I agree to reintroduce LWLock and to stop logical rep worker first and >> > then modify catalog. That way we can reduce catalog lock level (maybe >> > to RowExclusiveLock) so that apply worker can see it. Also I think >> > that we need to do more things like in order to prevent that we keep >> > to hold LWLock until end of transaction, because holding LWLock until >> > end of transaction is not good idea and could be cause of deadlock. So >> > for example we can commit the transaction in DropSubscription after >> > cleaned pg_subscription record and all its dependencies and then start >> > new transaction for the remaining work. Of course we also need to >> > disallow DROP SUBSCRIPTION being executed in a user transaction >> > though. >> >> Attached two draft patches to solve these issues. >> >> Attached 0001 patch reintroduces LogicalRepLauncherLock and makes DROP >> SUBSCRIPTION keep holding it until commit. To prevent from deadlock >> possibility, I disallowed DROP SUBSCRIPTION being called in a >> transaction block. But there might be more sensible solution for this. >> please give me feedback. > > +* Protect against launcher restarting the worker. This lock will > +* be released at commit. > > This is wrong. COMMIT doesn't release left-over LWLocks, only > ABORT does (precisely, it seems intended to fire on ERRORs). So > with this patch, the second DROP SUBSCRIPTION is stuck on the > LWLock acquired at the first time. And as Petr said, LWLock with > such a duration seems bad. Oh I understood. Thank you for pointing out. > > The cause seems to be that workers ignore sigterm on certain > conditions. One of the choke points is GetSubscription, the other > is get_subscription_list. I think we can treat the both cases > without LWLocks. > > The attached patch does that. > > - heap_close + UnlockRelationOid in get_subscription_list() is > equivalent to one heap_close or relation_close but I took seeming > symmetricity. > > - 0.5 seconds for the sleep in ApplyWorkerMain is quite > arbitrary. NAPTIME_PER_CYCLE * 1000 could be used instead. > > - NULL MySubscription without SIGTERM might not need to be an > ERROR. > > Any more thoughts? I think the above changes can solve this issue but It seems to me that holding AccessExclusiveLock on pg_subscription by DROP SUBSCRIPTION until commit could lead another deadlock problem in the future. So I'd to contrive ways to reduce lock level somehow if possible. For example, if we change the apply launcher so that it gets the subscription list only when pg_subscription gets invalid, apply launcher cannot try to launch the apply worker being stopped. We invalidate pg_subscription at commit of DROP SUBSCRIPTION and the apply launcher can get new subscription list which doesn't include the entry we removed. That way we can reduce lock level to ShareUpdateExclusiveLock and solve this issue. Also in your patch, we need to change DROP SUBSCRIPTION as well to resolve another case I encountered, where DROP SUBSCRIPTION waits for apply worker while holding a tuple lock on pg_subscription_rel and the apply worker waits for same tuple on pg_subscription_rel in SetSubscriptionRelState(). > > > FYI, I reproduced the situation by the following steps. This > effectively reproduced the situation without delay insertion for > me. > > # Creating 5 tables with 10 rows on the publisher > create table t1 (a int); > ... > create table t5 (a int); > insert into t1 (select * from generate_series(0, 9) a); > ... > insert into t5 (select * from generate_series(0, 9) a); > create publication p1 for table t1, t2, t3, t4, t5; > > > # Subscribe them, wait 1sec, then unsbscribe. > create table t1 (a
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
Hello, At Fri, 12 May 2017 17:24:07 +0900, Masahiko Sawada wrote in > On Fri, May 12, 2017 at 11:24 AM, Masahiko Sawada > wrote: > > On Thu, May 11, 2017 at 6:16 PM, Petr Jelinek > > wrote: > >> On 11/05/17 10:10, Masahiko Sawada wrote: > >>> On Thu, May 11, 2017 at 4:06 PM, Michael Paquier > >>> wrote: > On Wed, May 10, 2017 at 11:57 AM, Masahiko Sawada > wrote: > > Barring any objections, I'll add these two issues to open item. > > It seems to me that those open items have not been added yet to the > list. If I am following correctly, they could be defined as follows: > - Dropping subscription may stuck if done during tablesync. > -- Analyze deadlock issues with DROP SUBSCRIPTION and apply worker > process. > >> > >> I think the solution to this is to reintroduce the LWLock that was > >> removed and replaced with the exclusive lock on catalog [1]. I am afraid > >> that correct way of handling this is to do both LWLock and catalog lock > >> (first LWLock under which we kill the workers and then catalog lock so > >> that something that prevents launcher from restarting them is held till > >> the end of transaction). > > > > I agree to reintroduce LWLock and to stop logical rep worker first and > > then modify catalog. That way we can reduce catalog lock level (maybe > > to RowExclusiveLock) so that apply worker can see it. Also I think > > that we need to do more things like in order to prevent that we keep > > to hold LWLock until end of transaction, because holding LWLock until > > end of transaction is not good idea and could be cause of deadlock. So > > for example we can commit the transaction in DropSubscription after > > cleaned pg_subscription record and all its dependencies and then start > > new transaction for the remaining work. Of course we also need to > > disallow DROP SUBSCRIPTION being executed in a user transaction > > though. > > Attached two draft patches to solve these issues. > > Attached 0001 patch reintroduces LogicalRepLauncherLock and makes DROP > SUBSCRIPTION keep holding it until commit. To prevent from deadlock > possibility, I disallowed DROP SUBSCRIPTION being called in a > transaction block. But there might be more sensible solution for this. > please give me feedback. +* Protect against launcher restarting the worker. This lock will +* be released at commit. This is wrong. COMMIT doesn't release left-over LWLocks, only ABORT does (precisely, it seems intended to fire on ERRORs). So with this patch, the second DROP SUBSCRIPTION is stuck on the LWLock acquired at the first time. And as Petr said, LWLock with such a duration seems bad. The cause seems to be that workers ignore sigterm on certain conditions. One of the choke points is GetSubscription, the other is get_subscription_list. I think we can treat the both cases without LWLocks. The attached patch does that. - heap_close + UnlockRelationOid in get_subscription_list() is equivalent to one heap_close or relation_close but I took seeming symmetricity. - 0.5 seconds for the sleep in ApplyWorkerMain is quite arbitrary. NAPTIME_PER_CYCLE * 1000 could be used instead. - NULL MySubscription without SIGTERM might not need to be an ERROR. Any more thoughts? FYI, I reproduced the situation by the following steps. This effectively reproduced the situation without delay insertion for me. # Creating 5 tables with 10 rows on the publisher create table t1 (a int); ... create table t5 (a int); insert into t1 (select * from generate_series(0, 9) a); ... insert into t5 (select * from generate_series(0, 9) a); create publication p1 for table t1, t2, t3, t4, t5; # Subscribe them, wait 1sec, then unsbscribe. create table t1 (a int); ... create table t5 (a int); truncate t1, t2, t3, t4, t5; create subscription s1 CONNECTION 'host=/tmp port=5432 dbname=postgres' publication p1; select pg_sleep(1); drop subscription s1; Repeated test can be performed by repeatedly enter the last line. > -- Avoid orphaned tablesync worker if apply worker exits before > changing its status. > >>> > >> > >> The behavior question I have about this is if sync workers should die > >> when apply worker dies (ie they are tied to apply worker) or if they > >> should be tied to the subscription. > >> > >> I guess taking down all the sync workers when apply worker has exited is > >> easier to solve. Of course it means that if apply worker restarts in > >> middle of table synchronization, the table synchronization will have to > >> start from scratch. That being said, in normal operation apply worker > >> should only exit/restart if subscription has changed or has been > >> dropped/disabled and I think sync workers want to exit/restart in that > >> situation as well. > > > > I agree that sync workers are tied to the apply worker. > > > >> > >> So for example having shmem detach hook for an apply worker (or reusing > >> the existing one) that searches fo
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On Mon, May 08, 2017 at 06:27:30PM +0900, Masahiko Sawada wrote: > I encountered a situation where DROP SUBSCRIPTION got stuck when > initial table sync is in progress. In my environment, I created > several tables with some data on publisher. I created subscription on > subscriber and drop subscription immediately after that. It doesn't > always happen but I often encountered it on my environment. > > ps -x command shows the following. > > 96796 ?Ss 0:00 postgres: masahiko postgres [local] DROP > SUBSCRIPTION > 96801 ?Ts 0:00 postgres: bgworker: logical replication > worker for subscription 40993waiting > 96805 ?Ss 0:07 postgres: bgworker: logical replication > worker for subscription 40993 sync 16418 > 96806 ?Ss 0:01 postgres: wal sender process masahiko [local] idle > 96807 ?Ss 0:00 postgres: bgworker: logical replication > worker for subscription 40993 sync 16421 > 96808 ?Ss 0:00 postgres: wal sender process masahiko [local] idle > > The DROP SUBSCRIPTION process (pid 96796) is waiting for the apply > worker process (pid 96801) to stop while holding a lock on > pg_subscription_rel. On the other hand the apply worker is waiting for > acquiring a tuple lock on pg_subscription_rel needed for heap_update. > Also table sync workers (pid 96805 and 96807) are waiting for the > apply worker process to change their status. > > Also, even when DROP SUBSCRIPTION is done successfully, the table sync > worker can be orphaned because I guess that the apply worker can exit > before change status of table sync worker. > > I'm using 1f30295. [Action required within three days. This is a generic notification.] The above-described topic is currently a PostgreSQL 10 open item. Peter, since you committed the patch believed to have created it, you own this open item. If some other commit is more relevant or if this does not belong as a v10 open item, please let us know. Otherwise, please observe the policy on open item ownership[1] and send a status update within three calendar days of this message. Include a date for your subsequent status update. Testers may discover new open items at any time, and I want to plan to get them all fixed well in advance of shipping v10. Consequently, I will appreciate your efforts toward speedy resolution. Thanks. [1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On Fri, May 12, 2017 at 11:24 AM, Masahiko Sawada wrote: > On Thu, May 11, 2017 at 6:16 PM, Petr Jelinek > wrote: >> On 11/05/17 10:10, Masahiko Sawada wrote: >>> On Thu, May 11, 2017 at 4:06 PM, Michael Paquier >>> wrote: On Wed, May 10, 2017 at 11:57 AM, Masahiko Sawada wrote: > Barring any objections, I'll add these two issues to open item. It seems to me that those open items have not been added yet to the list. If I am following correctly, they could be defined as follows: - Dropping subscription may stuck if done during tablesync. -- Analyze deadlock issues with DROP SUBSCRIPTION and apply worker process. >> >> I think the solution to this is to reintroduce the LWLock that was >> removed and replaced with the exclusive lock on catalog [1]. I am afraid >> that correct way of handling this is to do both LWLock and catalog lock >> (first LWLock under which we kill the workers and then catalog lock so >> that something that prevents launcher from restarting them is held till >> the end of transaction). > > I agree to reintroduce LWLock and to stop logical rep worker first and > then modify catalog. That way we can reduce catalog lock level (maybe > to RowExclusiveLock) so that apply worker can see it. Also I think > that we need to do more things like in order to prevent that we keep > to hold LWLock until end of transaction, because holding LWLock until > end of transaction is not good idea and could be cause of deadlock. So > for example we can commit the transaction in DropSubscription after > cleaned pg_subscription record and all its dependencies and then start > new transaction for the remaining work. Of course we also need to > disallow DROP SUBSCRIPTION being executed in a user transaction > though. Attached two draft patches to solve these issues. Attached 0001 patch reintroduces LogicalRepLauncherLock and makes DROP SUBSCRIPTION keep holding it until commit. To prevent from deadlock possibility, I disallowed DROP SUBSCRIPTION being called in a transaction block. But there might be more sensible solution for this. please give me feedback. > >> -- Avoid orphaned tablesync worker if apply worker exits before changing its status. >>> >> >> The behavior question I have about this is if sync workers should die >> when apply worker dies (ie they are tied to apply worker) or if they >> should be tied to the subscription. >> >> I guess taking down all the sync workers when apply worker has exited is >> easier to solve. Of course it means that if apply worker restarts in >> middle of table synchronization, the table synchronization will have to >> start from scratch. That being said, in normal operation apply worker >> should only exit/restart if subscription has changed or has been >> dropped/disabled and I think sync workers want to exit/restart in that >> situation as well. > > I agree that sync workers are tied to the apply worker. > >> >> So for example having shmem detach hook for an apply worker (or reusing >> the existing one) that searches for all the other workers for same >> subscription and shuts them down as well sounds like solution to this. > > Seems reasonable solution. > Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center 0001-Fix-a-deadlock-bug-between-DROP-SUBSCRIPTION-and-app.patch Description: Binary data 0002-Wait-for-table-sync-worker-to-finish-when-apply-work.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On Thu, May 11, 2017 at 6:16 PM, Petr Jelinek wrote: > On 11/05/17 10:10, Masahiko Sawada wrote: >> On Thu, May 11, 2017 at 4:06 PM, Michael Paquier >> wrote: >>> On Wed, May 10, 2017 at 11:57 AM, Masahiko Sawada >>> wrote: Barring any objections, I'll add these two issues to open item. >>> >>> It seems to me that those open items have not been added yet to the >>> list. If I am following correctly, they could be defined as follows: >>> - Dropping subscription may stuck if done during tablesync. >>> -- Analyze deadlock issues with DROP SUBSCRIPTION and apply worker process. > > I think the solution to this is to reintroduce the LWLock that was > removed and replaced with the exclusive lock on catalog [1]. I am afraid > that correct way of handling this is to do both LWLock and catalog lock > (first LWLock under which we kill the workers and then catalog lock so > that something that prevents launcher from restarting them is held till > the end of transaction). I agree to reintroduce LWLock and to stop logical rep worker first and then modify catalog. That way we can reduce catalog lock level (maybe to RowExclusiveLock) so that apply worker can see it. Also I think that we need to do more things like in order to prevent that we keep to hold LWLock until end of transaction, because holding LWLock until end of transaction is not good idea and could be cause of deadlock. So for example we can commit the transaction in DropSubscription after cleaned pg_subscription record and all its dependencies and then start new transaction for the remaining work. Of course we also need to disallow DROP SUBSCRIPTION being executed in a user transaction though. > >>> -- Avoid orphaned tablesync worker if apply worker exits before >>> changing its status. >> > > The behavior question I have about this is if sync workers should die > when apply worker dies (ie they are tied to apply worker) or if they > should be tied to the subscription. > > I guess taking down all the sync workers when apply worker has exited is > easier to solve. Of course it means that if apply worker restarts in > middle of table synchronization, the table synchronization will have to > start from scratch. That being said, in normal operation apply worker > should only exit/restart if subscription has changed or has been > dropped/disabled and I think sync workers want to exit/restart in that > situation as well. I agree that sync workers are tied to the apply worker. > > So for example having shmem detach hook for an apply worker (or reusing > the existing one) that searches for all the other workers for same > subscription and shuts them down as well sounds like solution to this. Seems reasonable solution. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On 11/05/17 10:10, Masahiko Sawada wrote: > On Thu, May 11, 2017 at 4:06 PM, Michael Paquier > wrote: >> On Wed, May 10, 2017 at 11:57 AM, Masahiko Sawada >> wrote: >>> Barring any objections, I'll add these two issues to open item. >> >> It seems to me that those open items have not been added yet to the >> list. If I am following correctly, they could be defined as follows: >> - Dropping subscription may stuck if done during tablesync. >> -- Analyze deadlock issues with DROP SUBSCRIPTION and apply worker process. I think the solution to this is to reintroduce the LWLock that was removed and replaced with the exclusive lock on catalog [1]. I am afraid that correct way of handling this is to do both LWLock and catalog lock (first LWLock under which we kill the workers and then catalog lock so that something that prevents launcher from restarting them is held till the end of transaction). >> -- Avoid orphaned tablesync worker if apply worker exits before >> changing its status. > The behavior question I have about this is if sync workers should die when apply worker dies (ie they are tied to apply worker) or if they should be tied to the subscription. I guess taking down all the sync workers when apply worker has exited is easier to solve. Of course it means that if apply worker restarts in middle of table synchronization, the table synchronization will have to start from scratch. That being said, in normal operation apply worker should only exit/restart if subscription has changed or has been dropped/disabled and I think sync workers want to exit/restart in that situation as well. So for example having shmem detach hook for an apply worker (or reusing the existing one) that searches for all the other workers for same subscription and shuts them down as well sounds like solution to this. [1] https://www.postgresql.org/message-id/cahgqgwhpi8ky-yanffe0sgmhktsycqltnkx07bw9s7-rn17...@mail.gmail.com -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On Thu, May 11, 2017 at 4:06 PM, Michael Paquier wrote: > On Wed, May 10, 2017 at 11:57 AM, Masahiko Sawada > wrote: >> Barring any objections, I'll add these two issues to open item. > > It seems to me that those open items have not been added yet to the > list. If I am following correctly, they could be defined as follows: > - Dropping subscription may stuck if done during tablesync. > -- Analyze deadlock issues with DROP SUBSCRIPTION and apply worker process. > -- Avoid orphaned tablesync worker if apply worker exits before > changing its status. Thanks, I think correct. Added it to open item. > > I am playing with the code to look at both of them... But feel free to > update this thread if I don't show up. There are no test cases, but > some well-placed pg_usleep calls should make both issues easily > reproducible. I have the gut feeling that other things are hidden > behind though. I'm also working on this, so will update it if there is. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On Wed, May 10, 2017 at 11:57 AM, Masahiko Sawada wrote: > Barring any objections, I'll add these two issues to open item. It seems to me that those open items have not been added yet to the list. If I am following correctly, they could be defined as follows: - Dropping subscription may stuck if done during tablesync. -- Analyze deadlock issues with DROP SUBSCRIPTION and apply worker process. -- Avoid orphaned tablesync worker if apply worker exits before changing its status. I am playing with the code to look at both of them... But feel free to update this thread if I don't show up. There are no test cases, but some well-placed pg_usleep calls should make both issues easily reproducible. I have the gut feeling that other things are hidden behind though. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On Wed, May 10, 2017 at 2:46 AM, Masahiko Sawada wrote: > On Mon, May 8, 2017 at 8:42 PM, Petr Jelinek > wrote: >> On 08/05/17 11:27, Masahiko Sawada wrote: >>> Hi, >>> >>> I encountered a situation where DROP SUBSCRIPTION got stuck when >>> initial table sync is in progress. In my environment, I created >>> several tables with some data on publisher. I created subscription on >>> subscriber and drop subscription immediately after that. It doesn't >>> always happen but I often encountered it on my environment. >>> >>> ps -x command shows the following. >>> >>> 96796 ?Ss 0:00 postgres: masahiko postgres [local] DROP >>> SUBSCRIPTION >>> 96801 ?Ts 0:00 postgres: bgworker: logical replication >>> worker for subscription 40993waiting >>> 96805 ?Ss 0:07 postgres: bgworker: logical replication >>> worker for subscription 40993 sync 16418 >>> 96806 ?Ss 0:01 postgres: wal sender process masahiko [local] >>> idle >>> 96807 ?Ss 0:00 postgres: bgworker: logical replication >>> worker for subscription 40993 sync 16421 >>> 96808 ?Ss 0:00 postgres: wal sender process masahiko [local] >>> idle >>> >>> The DROP SUBSCRIPTION process (pid 96796) is waiting for the apply >>> worker process (pid 96801) to stop while holding a lock on >>> pg_subscription_rel. On the other hand the apply worker is waiting for >>> acquiring a tuple lock on pg_subscription_rel needed for heap_update. >>> Also table sync workers (pid 96805 and 96807) are waiting for the >>> apply worker process to change their status. >>> >> >> Looks like we should kill apply before dropping dependencies. > > Sorry, after investigated I found out that DROP SUBSCRIPTION process > is holding AccessExclusiveLock on pg_subscription (, not > pg_subscription_rel) and apply worker is waiting for acquiring a lock > on it. Hmm it seems there are two cases. One is that the apply worker waits to acquire AccessShareLock on pg_subscription but DropSubscription already acquired AcessExclusiveLock on it and waits for the apply worker to finish. Another case is that the apply worker waits to acquire a tuple lock on pg_subscrption_rel but DropSubscription (maybe droppoing dependencies) already acquired it. > So I guess that the dropping dependencies are not relevant with > this. It seems to me that the main cause is that DROP SUBSCRIPTION > waits for apply worker to finish while keeping to hold > AccessExclusiveLock on pg_subscription. Perhaps we need to contrive > ways to reduce lock level somehow. > >> >>> Also, even when DROP SUBSCRIPTION is done successfully, the table sync >>> worker can be orphaned because I guess that the apply worker can exit >>> before change status of table sync worker. >> >> Well the tablesync worker should stop itself if the subscription got >> removed, but of course again the dependencies are an issue, so we should >> probably kill those explicitly as well. > > Yeah, I think that we should ensure that the apply worker exits after > killed all involved table sync workers. > Barring any objections, I'll add these two issues to open item. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On Mon, May 8, 2017 at 8:42 PM, Petr Jelinek wrote: > On 08/05/17 11:27, Masahiko Sawada wrote: >> Hi, >> >> I encountered a situation where DROP SUBSCRIPTION got stuck when >> initial table sync is in progress. In my environment, I created >> several tables with some data on publisher. I created subscription on >> subscriber and drop subscription immediately after that. It doesn't >> always happen but I often encountered it on my environment. >> >> ps -x command shows the following. >> >> 96796 ?Ss 0:00 postgres: masahiko postgres [local] DROP >> SUBSCRIPTION >> 96801 ?Ts 0:00 postgres: bgworker: logical replication >> worker for subscription 40993waiting >> 96805 ?Ss 0:07 postgres: bgworker: logical replication >> worker for subscription 40993 sync 16418 >> 96806 ?Ss 0:01 postgres: wal sender process masahiko [local] >> idle >> 96807 ?Ss 0:00 postgres: bgworker: logical replication >> worker for subscription 40993 sync 16421 >> 96808 ?Ss 0:00 postgres: wal sender process masahiko [local] >> idle >> >> The DROP SUBSCRIPTION process (pid 96796) is waiting for the apply >> worker process (pid 96801) to stop while holding a lock on >> pg_subscription_rel. On the other hand the apply worker is waiting for >> acquiring a tuple lock on pg_subscription_rel needed for heap_update. >> Also table sync workers (pid 96805 and 96807) are waiting for the >> apply worker process to change their status. >> > > Looks like we should kill apply before dropping dependencies. Sorry, after investigated I found out that DROP SUBSCRIPTION process is holding AccessExclusiveLock on pg_subscription (, not pg_subscription_rel) and apply worker is waiting for acquiring a lock on it. So I guess that the dropping dependencies are not relevant with this. It seems to me that the main cause is that DROP SUBSCRIPTION waits for apply worker to finish while keeping to hold AccessExclusiveLock on pg_subscription. Perhaps we need to contrive ways to reduce lock level somehow. > >> Also, even when DROP SUBSCRIPTION is done successfully, the table sync >> worker can be orphaned because I guess that the apply worker can exit >> before change status of table sync worker. > > Well the tablesync worker should stop itself if the subscription got > removed, but of course again the dependencies are an issue, so we should > probably kill those explicitly as well. Yeah, I think that we should ensure that the apply worker exits after killed all involved table sync workers. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On Mon, May 8, 2017 at 8:53 PM, Erik Rijkers wrote: > On 2017-05-08 13:13, Masahiko Sawada wrote: >> >> On Mon, May 8, 2017 at 7:14 PM, Erik Rijkers wrote: >>> >>> On 2017-05-08 11:27, Masahiko Sawada wrote: > >>> >>> FWIW, running >>> 0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch+ 0002-WIP-Possibly-more-robust-snapbuild-approach.patch + fix-statistics-reporting-in-logical-replication-work.patch >>> >>> >>> (on top of 44c528810) >> >> >> Thanks, which thread are these patches attached on? >> > > The first two patches are here: > https://www.postgresql.org/message-id/20170505004237.edtahvrwb3uwd5rs%40alap3.anarazel.de > > and last one: > https://www.postgresql.org/message-id/22cc402c-88eb-fa35-217f-0060db2c72f0%402ndquadrant.com > > ( I have to include that last one or my tests fail within minutes. ) Thank you! I will look at these patches. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On 2017-05-08 13:13, Masahiko Sawada wrote: On Mon, May 8, 2017 at 7:14 PM, Erik Rijkers wrote: On 2017-05-08 11:27, Masahiko Sawada wrote: FWIW, running 0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch+ 0002-WIP-Possibly-more-robust-snapbuild-approach.patch + fix-statistics-reporting-in-logical-replication-work.patch (on top of 44c528810) Thanks, which thread are these patches attached on? The first two patches are here: https://www.postgresql.org/message-id/20170505004237.edtahvrwb3uwd5rs%40alap3.anarazel.de and last one: https://www.postgresql.org/message-id/22cc402c-88eb-fa35-217f-0060db2c72f0%402ndquadrant.com ( I have to include that last one or my tests fail within minutes. ) Erik Rijkers -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On 08/05/17 11:27, Masahiko Sawada wrote: > Hi, > > I encountered a situation where DROP SUBSCRIPTION got stuck when > initial table sync is in progress. In my environment, I created > several tables with some data on publisher. I created subscription on > subscriber and drop subscription immediately after that. It doesn't > always happen but I often encountered it on my environment. > > ps -x command shows the following. > > 96796 ?Ss 0:00 postgres: masahiko postgres [local] DROP > SUBSCRIPTION > 96801 ?Ts 0:00 postgres: bgworker: logical replication > worker for subscription 40993waiting > 96805 ?Ss 0:07 postgres: bgworker: logical replication > worker for subscription 40993 sync 16418 > 96806 ?Ss 0:01 postgres: wal sender process masahiko [local] idle > 96807 ?Ss 0:00 postgres: bgworker: logical replication > worker for subscription 40993 sync 16421 > 96808 ?Ss 0:00 postgres: wal sender process masahiko [local] idle > > The DROP SUBSCRIPTION process (pid 96796) is waiting for the apply > worker process (pid 96801) to stop while holding a lock on > pg_subscription_rel. On the other hand the apply worker is waiting for > acquiring a tuple lock on pg_subscription_rel needed for heap_update. > Also table sync workers (pid 96805 and 96807) are waiting for the > apply worker process to change their status. > Looks like we should kill apply before dropping dependencies. > Also, even when DROP SUBSCRIPTION is done successfully, the table sync > worker can be orphaned because I guess that the apply worker can exit > before change status of table sync worker. Well the tablesync worker should stop itself if the subscription got removed, but of course again the dependencies are an issue, so we should probably kill those explicitly as well. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On Mon, May 8, 2017 at 7:14 PM, Erik Rijkers wrote: > On 2017-05-08 11:27, Masahiko Sawada wrote: >> >> Hi, >> >> I encountered a situation where DROP SUBSCRIPTION got stuck when >> initial table sync is in progress. In my environment, I created >> several tables with some data on publisher. I created subscription on >> subscriber and drop subscription immediately after that. It doesn't >> always happen but I often encountered it on my environment. >> >> ps -x command shows the following. >> >> 96796 ?Ss 0:00 postgres: masahiko postgres [local] DROP >> SUBSCRIPTION >> 96801 ?Ts 0:00 postgres: bgworker: logical replication >> worker for subscription 40993waiting >> 96805 ?Ss 0:07 postgres: bgworker: logical replication >> worker for subscription 40993 sync 16418 >> 96806 ?Ss 0:01 postgres: wal sender process masahiko [local] >> idle >> 96807 ?Ss 0:00 postgres: bgworker: logical replication >> worker for subscription 40993 sync 16421 >> 96808 ?Ss 0:00 postgres: wal sender process masahiko [local] >> idle >> > > FWIW, running > >> 0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch+ >> 0002-WIP-Possibly-more-robust-snapbuild-approach.patch + >> fix-statistics-reporting-in-logical-replication-work.patch > > (on top of 44c528810) Thanks, which thread are these patches attached on? > > I have encountered the same condition as well in the last few days, a few > times (I think 2 or 3 times). IIUC there are two issues; one is that the deadlock can happen between the DROP SUBSCRIPTION and the apply worker process, another one is the table sync worker can be orphaned if the apply worker exits before changing status. The latter might relate to another issue reported by Jeff[1]. [1] https://www.postgresql.org/message-id/CAMkU%3D1xUJKs%3D2etq2K7bmbY51Q7g853HLxJ7qEB2Snog9oRvDw%40mail.gmail.com Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On 2017-05-08 11:27, Masahiko Sawada wrote: Hi, I encountered a situation where DROP SUBSCRIPTION got stuck when initial table sync is in progress. In my environment, I created several tables with some data on publisher. I created subscription on subscriber and drop subscription immediately after that. It doesn't always happen but I often encountered it on my environment. ps -x command shows the following. 96796 ?Ss 0:00 postgres: masahiko postgres [local] DROP SUBSCRIPTION 96801 ?Ts 0:00 postgres: bgworker: logical replication worker for subscription 40993waiting 96805 ?Ss 0:07 postgres: bgworker: logical replication worker for subscription 40993 sync 16418 96806 ?Ss 0:01 postgres: wal sender process masahiko [local] idle 96807 ?Ss 0:00 postgres: bgworker: logical replication worker for subscription 40993 sync 16421 96808 ?Ss 0:00 postgres: wal sender process masahiko [local] idle FWIW, running 0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch+ 0002-WIP-Possibly-more-robust-snapbuild-approach.patch + fix-statistics-reporting-in-logical-replication-work.patch (on top of 44c528810) I have encountered the same condition as well in the last few days, a few times (I think 2 or 3 times). Erik Rijkers -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Get stuck when dropping a subscription during synchronizing table
Hi, I encountered a situation where DROP SUBSCRIPTION got stuck when initial table sync is in progress. In my environment, I created several tables with some data on publisher. I created subscription on subscriber and drop subscription immediately after that. It doesn't always happen but I often encountered it on my environment. ps -x command shows the following. 96796 ?Ss 0:00 postgres: masahiko postgres [local] DROP SUBSCRIPTION 96801 ?Ts 0:00 postgres: bgworker: logical replication worker for subscription 40993waiting 96805 ?Ss 0:07 postgres: bgworker: logical replication worker for subscription 40993 sync 16418 96806 ?Ss 0:01 postgres: wal sender process masahiko [local] idle 96807 ?Ss 0:00 postgres: bgworker: logical replication worker for subscription 40993 sync 16421 96808 ?Ss 0:00 postgres: wal sender process masahiko [local] idle The DROP SUBSCRIPTION process (pid 96796) is waiting for the apply worker process (pid 96801) to stop while holding a lock on pg_subscription_rel. On the other hand the apply worker is waiting for acquiring a tuple lock on pg_subscription_rel needed for heap_update. Also table sync workers (pid 96805 and 96807) are waiting for the apply worker process to change their status. Also, even when DROP SUBSCRIPTION is done successfully, the table sync worker can be orphaned because I guess that the apply worker can exit before change status of table sync worker. I'm using 1f30295. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers