subject:"\[HACKERS\] Get stuck when dropping a subscription during synchronizing table"

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-07-03 Thread Masahiko Sawada

On Tue, Jul 4, 2017 at 12:21 PM, Peter Eisentraut
 wrote:
> On 6/25/17 06:35, Petr Jelinek wrote:
>> - Do LockSharedObject in ALTER SUBSCRIPTION, DROP SUBSCRIPTION (this one
>> is preexisting but mentioning it for context), SetSubscriptionRelState,
>> AddSubscriptionRelState, and in the logicalrep_worker_launch. This means
>> we use granular per object locks to deal with concurrency.
>
> I have committed those locking changes, as we had already discussed them
> previously.  This should address the open item.

Thank you for committing the patch!

>
> Maybe we can start new threads for the other parts of the patch and
> maybe split the patch up a bit.

Yeah, let's discuss about reworking the locking and management on new thread.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-07-03 Thread Peter Eisentraut

On 6/25/17 06:35, Petr Jelinek wrote:
> - Do LockSharedObject in ALTER SUBSCRIPTION, DROP SUBSCRIPTION (this one
> is preexisting but mentioning it for context), SetSubscriptionRelState,
> AddSubscriptionRelState, and in the logicalrep_worker_launch. This means
> we use granular per object locks to deal with concurrency.

I have committed those locking changes, as we had already discussed them
previously.  This should address the open item.

Maybe we can start new threads for the other parts of the patch and
maybe split the patch up a bit.  At this point I don't want to commit
major code rearrangements without specific reasons and detailed analysis.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-07-02 Thread Masahiko Sawada

On Wed, Jun 28, 2017 at 2:13 PM, Masahiko Sawada  wrote:
> On Wed, Jun 28, 2017 at 1:47 AM, Petr Jelinek
>  wrote:
>>
>> On 27/06/17 10:51, Masahiko Sawada wrote:
>>> On Mon, Jun 26, 2017 at 12:12 PM, Masahiko Sawada  
>>> wrote:
>>>
>>> I've reviewed this patch briefly.
>>
>> Thanks!
>>
>>>
>>> @@ -515,6 +533,31 @@ logicalrep_worker_stop(Oid subid, Oid relid)
>>>  }
>>>
>>>  /*
>>> + * Request worker to be stopped on commit.
>>> + */
>>> +void
>>> +logicalrep_worker_stop_at_commit(Oid subid, Oid relid)
>>> +{
>>> +   LogicalRepWorkerId *wid;
>>> +   MemoryContext old;
>>> +
>>> +   old = MemoryContextSwitchTo(TopTransactionContext);
>>> +
>>> +   wid = (LogicalRepWorkerId *) palloc(sizeof(LogicalRepWorkerId));
>>> +
>>> +   /*
>>> +   wid = MemoryContextAlloc(TopTransactionContext,
>>> +
>>> sizeof(LogicalRepWorkerId));
>>> +   */
>>> +   wid->subid = subid;
>>> +   wid->relid = relid;
>>> +
>>> +   on_commit_stop_workers = lappend(on_commit_stop_workers, wid);
>>> +
>>> +   MemoryContextSwitchTo(old);
>>> +}
>>>
>>> logicalrep_worker_stop_at_commit() has a problem that new_list()
>>> called by lappend() allocates the memory from current memory context.
>>> It should switch the memory context and then allocate the memory for
>>> wid and append it to the list.
>>>
>
> Thank you for updating the patch!
>
>>
>> Right, fixed (I see you did that locally as well based on the above
>> excerpt ;) ).
>
> Oops, yeah that's my test code.
>
>>> 
>>> @@ -754,9 +773,25 @@ ApplyLauncherShmemInit(void)
>>>  void
>>>  AtEOXact_ApplyLauncher(bool isCommit)
>>>  {
>>> -   if (isCommit && on_commit_launcher_wakeup)
>>> -   ApplyLauncherWakeup();
>>> +   ListCell *lc;
>>> +
>>> +   if (isCommit)
>>> +   {
>>> +   foreach (lc, on_commit_stop_workers)
>>> +   {
>>> +   LogicalRepWorkerId *wid = lfirst(lc);
>>> +   logicalrep_worker_stop(wid->subid, wid->relid);
>>> +   }
>>> +
>>> +   if (on_commit_launcher_wakeup)
>>> +   ApplyLauncherWakeup();
>>>
>>> Stopping the logical rep worker in AtEOXact_ApplyLauncher doesn't
>>> support the prepared transaction. Since we allocate the list
>>> on_commit_stop_workers in TopTransactionContext the postgres crashes
>>> if we execute any query after prepared transaction that removes
>>> subscription relation state. Also after fixed this issue, we still
>>> need to something: the list of on_commit_stop_workers is not
>>> associated the prepared transaction.  A query next to "preapre
>>> transaction" tries to stop workers at the commit. There was similar
>>> discussion before.
>>>
>>
>> Hmm, good point. I think for now it makes sense to simply don't allow
>> PREPARE for transactions that manipulate workers similar to what we do
>> when there are exported snapshots. Done it that way in attached.
>
> Agreed. Should we note that in docs?
>
>>
>>> 
>>> +
>>> +   ensure_transaction();
>>> +
>>> UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
>>> +
>>> rstate->relid, rstate->state,
>>> +
>>> rstate->lsn);
>>>
>>>
>>> Should we commit the transaction if we started a new transaction
>>> before update the subscription relation state, or it could be deadlock
>>> risk.
>>
>> We only lock the whole subscription (and only conflicting things are
>> DROP and ALTER SUBSCRIPTION), not individual subscription-relation
>> mapping so it doesn't seem to me like there is any higher risk of
>> deadlocks than anything else which works with multiple tables (or
>> compared to previous behavior).
>>
>
> I'm concerned that a lock for whole subscription could be conflicted
> between ALTER SUBSCRIPTION, table sync worker and apply worker:
>
> Please imagine the following case.
>
> 1. The apply worker updates a subscription relation state R1 of
> subscription S1.
> -> It acquires AccessShareLock on S1, and keep holding.
> 2. ALTER SUBSCRIPTION SET PUBLICATION tries to acquire
> AccessExclusiveLock on S1.
> -> But it waits for the apply worker to release the lock.
> 3. The apply worker calls wait_for_relation_state_change(relid,
> SUBREL_STATE_SYNCDONE) and waits for the table sync worker
> for R2 to change its status.
> -> Note that the apply worker is still holding AccessShareLock on S1
> 4. The table sync worker tries to update its status to SYNCDONE
> -> In UpdateSubscriptionRelState(), it tires to acquire AccessShareLock
>on S1 but waits for it. *deadlock*
>
> To summary, because the apply worker keeps holding AccessShareLock on
> S1, the acquiring AccessExclusiveLock by ALTER SUBSCRIPTION waits for
> the apply worker, and then the table sync worker also waits for the
> ALTER SUBSCRIPTION in order to change its status. And the apply worker
> waits for the

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-30 Thread Peter Eisentraut

On 6/29/17 23:39, Noah Misch wrote:
> IMMEDIATE ATTENTION REQUIRED.  This PostgreSQL 10 open item is long past due
> for your status update.  Please reacquaint yourself with the policy on open
> item ownership[1] and then reply immediately.  If I do not hear from you by
> 2017-07-01 04:00 UTC, I will transfer this item to release management team
> ownership without further notice.

I will work on this over the weekend and report back on Monday.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-29 Thread Noah Misch

On Wed, Jun 28, 2017 at 03:22:18AM +, Noah Misch wrote:
> On Fri, Jun 23, 2017 at 09:42:10PM -0400, Peter Eisentraut wrote:
> > On 6/21/17 22:47, Peter Eisentraut wrote:
> > > On 6/20/17 22:44, Noah Misch wrote:
> > >>> A patch has been posted, and it's being reviewed.  Next update Monday.
> > >>
> > >> This PostgreSQL 10 open item is past due for your status update.  Kindly 
> > >> send
> > >> a status update within 24 hours, and include a date for your subsequent 
> > >> status
> > >> update.  Refer to the policy on open item ownership:
> > >> https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
> > > 
> > > I'm not sure how to proceed here.  Nobody is else seems terribly
> > > interested, and this is really a minor issue, so I don't want to muck
> > > around with the locking endlessly.  Let's say, if there are no new
> > > insights by Friday, I'll pick one of the proposed patches or just close
> > > it without any patch.
> > 
> > After some comments, a new patch has been posted, and I'll give until
> > Monday for review.
> 
> This PostgreSQL 10 open item is past due for your status update.  Kindly send
> a status update within 24 hours, and include a date for your subsequent status
> update.  Refer to the policy on open item ownership:
> https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

IMMEDIATE ATTENTION REQUIRED.  This PostgreSQL 10 open item is long past due
for your status update.  Please reacquaint yourself with the policy on open
item ownership[1] and then reply immediately.  If I do not hear from you by
2017-07-01 04:00 UTC, I will transfer this item to release management team
ownership without further notice.

[1] 
https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-27 Thread Masahiko Sawada

On Wed, Jun 28, 2017 at 1:47 AM, Petr Jelinek
 wrote:
>
> On 27/06/17 10:51, Masahiko Sawada wrote:
>> On Mon, Jun 26, 2017 at 12:12 PM, Masahiko Sawada  
>> wrote:
>>
>> I've reviewed this patch briefly.
>
> Thanks!
>
>>
>> @@ -515,6 +533,31 @@ logicalrep_worker_stop(Oid subid, Oid relid)
>>  }
>>
>>  /*
>> + * Request worker to be stopped on commit.
>> + */
>> +void
>> +logicalrep_worker_stop_at_commit(Oid subid, Oid relid)
>> +{
>> +   LogicalRepWorkerId *wid;
>> +   MemoryContext old;
>> +
>> +   old = MemoryContextSwitchTo(TopTransactionContext);
>> +
>> +   wid = (LogicalRepWorkerId *) palloc(sizeof(LogicalRepWorkerId));
>> +
>> +   /*
>> +   wid = MemoryContextAlloc(TopTransactionContext,
>> +
>> sizeof(LogicalRepWorkerId));
>> +   */
>> +   wid->subid = subid;
>> +   wid->relid = relid;
>> +
>> +   on_commit_stop_workers = lappend(on_commit_stop_workers, wid);
>> +
>> +   MemoryContextSwitchTo(old);
>> +}
>>
>> logicalrep_worker_stop_at_commit() has a problem that new_list()
>> called by lappend() allocates the memory from current memory context.
>> It should switch the memory context and then allocate the memory for
>> wid and append it to the list.
>>

Thank you for updating the patch!

>
> Right, fixed (I see you did that locally as well based on the above
> excerpt ;) ).

Oops, yeah that's my test code.

>> 
>> @@ -754,9 +773,25 @@ ApplyLauncherShmemInit(void)
>>  void
>>  AtEOXact_ApplyLauncher(bool isCommit)
>>  {
>> -   if (isCommit && on_commit_launcher_wakeup)
>> -   ApplyLauncherWakeup();
>> +   ListCell *lc;
>> +
>> +   if (isCommit)
>> +   {
>> +   foreach (lc, on_commit_stop_workers)
>> +   {
>> +   LogicalRepWorkerId *wid = lfirst(lc);
>> +   logicalrep_worker_stop(wid->subid, wid->relid);
>> +   }
>> +
>> +   if (on_commit_launcher_wakeup)
>> +   ApplyLauncherWakeup();
>>
>> Stopping the logical rep worker in AtEOXact_ApplyLauncher doesn't
>> support the prepared transaction. Since we allocate the list
>> on_commit_stop_workers in TopTransactionContext the postgres crashes
>> if we execute any query after prepared transaction that removes
>> subscription relation state. Also after fixed this issue, we still
>> need to something: the list of on_commit_stop_workers is not
>> associated the prepared transaction.  A query next to "preapre
>> transaction" tries to stop workers at the commit. There was similar
>> discussion before.
>>
>
> Hmm, good point. I think for now it makes sense to simply don't allow
> PREPARE for transactions that manipulate workers similar to what we do
> when there are exported snapshots. Done it that way in attached.

Agreed. Should we note that in docs?

>
>> 
>> +
>> +   ensure_transaction();
>> +
>> UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
>> +
>> rstate->relid, rstate->state,
>> +
>> rstate->lsn);
>>
>>
>> Should we commit the transaction if we started a new transaction
>> before update the subscription relation state, or it could be deadlock
>> risk.
>
> We only lock the whole subscription (and only conflicting things are
> DROP and ALTER SUBSCRIPTION), not individual subscription-relation
> mapping so it doesn't seem to me like there is any higher risk of
> deadlocks than anything else which works with multiple tables (or
> compared to previous behavior).
>

I'm concerned that a lock for whole subscription could be conflicted
between ALTER SUBSCRIPTION, table sync worker and apply worker:

Please imagine the following case.

1. The apply worker updates a subscription relation state R1 of
subscription S1.
-> It acquires AccessShareLock on S1, and keep holding.
2. ALTER SUBSCRIPTION SET PUBLICATION tries to acquire
AccessExclusiveLock on S1.
-> But it waits for the apply worker to release the lock.
3. The apply worker calls wait_for_relation_state_change(relid,
SUBREL_STATE_SYNCDONE) and waits for the table sync worker
for R2 to change its status.
-> Note that the apply worker is still holding AccessShareLock on S1
4. The table sync worker tries to update its status to SYNCDONE
-> In UpdateSubscriptionRelState(), it tires to acquire AccessShareLock
   on S1 but waits for it. *deadlock*

To summary, because the apply worker keeps holding AccessShareLock on
S1, the acquiring AccessExclusiveLock by ALTER SUBSCRIPTION waits for
the apply worker, and then the table sync worker also waits for the
ALTER SUBSCRIPTION in order to change its status. And the apply worker
waits for the table sync worker to change its status.

I encountered the similar case once on my environment. But since it
completely depends on timing I don't have the concrete reproducing
steps.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-27 Thread Noah Misch

On Fri, Jun 23, 2017 at 09:42:10PM -0400, Peter Eisentraut wrote:
> On 6/21/17 22:47, Peter Eisentraut wrote:
> > On 6/20/17 22:44, Noah Misch wrote:
> >>> A patch has been posted, and it's being reviewed.  Next update Monday.
> >>
> >> This PostgreSQL 10 open item is past due for your status update.  Kindly 
> >> send
> >> a status update within 24 hours, and include a date for your subsequent 
> >> status
> >> update.  Refer to the policy on open item ownership:
> >> https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
> > 
> > I'm not sure how to proceed here.  Nobody is else seems terribly
> > interested, and this is really a minor issue, so I don't want to muck
> > around with the locking endlessly.  Let's say, if there are no new
> > insights by Friday, I'll pick one of the proposed patches or just close
> > it without any patch.
> 
> After some comments, a new patch has been posted, and I'll give until
> Monday for review.

This PostgreSQL 10 open item is past due for your status update.  Kindly send
a status update within 24 hours, and include a date for your subsequent status
update.  Refer to the policy on open item ownership:
https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-27 Thread Petr Jelinek


On 27/06/17 10:51, Masahiko Sawada wrote:
> On Mon, Jun 26, 2017 at 12:12 PM, Masahiko Sawada  
> wrote:
> 
> I've reviewed this patch briefly.

Thanks!

> 
> @@ -515,6 +533,31 @@ logicalrep_worker_stop(Oid subid, Oid relid)
>  }
> 
>  /*
> + * Request worker to be stopped on commit.
> + */
> +void
> +logicalrep_worker_stop_at_commit(Oid subid, Oid relid)
> +{
> +   LogicalRepWorkerId *wid;
> +   MemoryContext old;
> +
> +   old = MemoryContextSwitchTo(TopTransactionContext);
> +
> +   wid = (LogicalRepWorkerId *) palloc(sizeof(LogicalRepWorkerId));
> +
> +   /*
> +   wid = MemoryContextAlloc(TopTransactionContext,
> +
> sizeof(LogicalRepWorkerId));
> +   */
> +   wid->subid = subid;
> +   wid->relid = relid;
> +
> +   on_commit_stop_workers = lappend(on_commit_stop_workers, wid);
> +
> +   MemoryContextSwitchTo(old);
> +}
> 
> logicalrep_worker_stop_at_commit() has a problem that new_list()
> called by lappend() allocates the memory from current memory context.
> It should switch the memory context and then allocate the memory for
> wid and append it to the list.
> 

Right, fixed (I see you did that locally as well based on the above
excerpt ;) ).

> 
> @@ -754,9 +773,25 @@ ApplyLauncherShmemInit(void)
>  void
>  AtEOXact_ApplyLauncher(bool isCommit)
>  {
> -   if (isCommit && on_commit_launcher_wakeup)
> -   ApplyLauncherWakeup();
> +   ListCell *lc;
> +
> +   if (isCommit)
> +   {
> +   foreach (lc, on_commit_stop_workers)
> +   {
> +   LogicalRepWorkerId *wid = lfirst(lc);
> +   logicalrep_worker_stop(wid->subid, wid->relid);
> +   }
> +
> +   if (on_commit_launcher_wakeup)
> +   ApplyLauncherWakeup();
> 
> Stopping the logical rep worker in AtEOXact_ApplyLauncher doesn't
> support the prepared transaction. Since we allocate the list
> on_commit_stop_workers in TopTransactionContext the postgres crashes
> if we execute any query after prepared transaction that removes
> subscription relation state. Also after fixed this issue, we still
> need to something: the list of on_commit_stop_workers is not
> associated the prepared transaction.  A query next to "preapre
> transaction" tries to stop workers at the commit. There was similar
> discussion before.
> 

Hmm, good point. I think for now it makes sense to simply don't allow
PREPARE for transactions that manipulate workers similar to what we do
when there are exported snapshots. Done it that way in attached.

> 
> +
> +   ensure_transaction();
> +
> UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
> +
> rstate->relid, rstate->state,
> +
> rstate->lsn);
> 
> 
> Should we commit the transaction if we started a new transaction
> before update the subscription relation state, or it could be deadlock
> risk.

We only lock the whole subscription (and only conflicting things are
DROP and ALTER SUBSCRIPTION), not individual subscription-relation
mapping so it doesn't seem to me like there is any higher risk of
deadlocks than anything else which works with multiple tables (or
compared to previous behavior).

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services
From 88da433110aa3bde3dcce33ebf62d41a08c191b9 Mon Sep 17 00:00:00 2001
From: Petr Jelinek 
Date: Sat, 24 Jun 2017 19:38:21 +0200
Subject: [PATCH] Rework subscription worker and relation status handling

---
 src/backend/access/transam/xact.c   |   9 +
 src/backend/catalog/pg_subscription.c   | 137 ++--
 src/backend/commands/subscriptioncmds.c |  98 -
 src/backend/replication/logical/launcher.c  | 309 
 src/backend/replication/logical/tablesync.c |  97 +
 src/backend/replication/logical/worker.c|  23 ++-
 src/backend/utils/cache/catcache.c  |   6 +-
 src/include/catalog/pg_subscription_rel.h   |   6 +-
 src/include/replication/logicallauncher.h   |   1 +
 src/include/replication/worker_internal.h   |   6 +-
 10 files changed, 393 insertions(+), 299 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b0aa69f..322502d 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2277,6 +2277,15 @@ PrepareTransaction(void)
 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
  errmsg("cannot PREPARE a transaction that has exported snapshots")));
 
+	/*
+	 * Similar to above, don't allow PREPARE but for transaction that kill
+	 * logical replication, workers.
+	 */
+	if (XactManipulatesLogicalReplicationWorkers())
+		ereport(ERROR,
+(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot PREPARE a transaction that has manipulated logical replication workers")));
+
 	/* Prevent

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-27 Thread Masahiko Sawada

On Mon, Jun 26, 2017 at 12:12 PM, Masahiko Sawada  wrote:
> On Sun, Jun 25, 2017 at 7:35 PM, Petr Jelinek
>  wrote:
>> (was away for a while, got some time now for this again)
>>
>> On 22/06/17 04:43, Peter Eisentraut wrote:
>>> The alternative is that we use the LockSharedObject() approach that was
>>> already alluded to, like in the attached patch.  Both approaches would
>>> work equally fine AFAICT.
>>
>> I agree, but I think we need bigger overhaul of the locking/management
>> in general. So here is patch which does much more changes.
>>
>> The patch does several important things (in no particular order):
>> - Split SetSubscriptionRelState into AddSubscriptionRelState and
>> UpdateSubscriptionRelState for the reasons said upstream, it's cleaner,
>> there is no half-broken upsert logic and it has proper error checking
>> for each action.
>>
>> - Do LockSharedObject in ALTER SUBSCRIPTION, DROP SUBSCRIPTION (this one
>> is preexisting but mentioning it for context), SetSubscriptionRelState,
>> AddSubscriptionRelState, and in the logicalrep_worker_launch. This means
>> we use granular per object locks to deal with concurrency.
>>
>> - Because of above, the AccessExclusiveLock on pg_subscription is no
>> longer needed, just normal RowExlusiveLock is used now.
>>
>> - logicalrep_worker_stop is also simplified due to the proper locking
>>
>> - There is new interface logicalrep_worker_stop_at_commit which is used
>> by ALTER SUBSCRIPTION ... REFRESH PUBLICATION and by transactional
>> variant of DROP SUBSCRIPTION to only kill workers at the end of transaction.
>>
>> - Locking/reading of subscription info is unified between DROP and ALTER
>> SUBSCRIPTION commands.
>>
>> - DROP SUBSCRIPTION will kill all workers associated with subscription,
>> not just apply.
>>
>> - The sync worker checks during startup if the relation is still subscribed.
>>
>> - The sync worker will exit when waiting for apply and apply has shut-down.
>>
>> - The log messages around workers and removed or disabled subscription
>> are now more consistent between startup and normal runtime of the worker.
>>
>> - Some code deduplication and stylistic changes/simplification in
>> related areas.
>>
>> - Fixed catcache's IndexScanOK() handling of the subscription catalog.
>>
>> It's bit bigger patch but solves issues from multiple threads around
>> handling of ALTER/DROP subscription.
>>
>> A lot of the locking that I added is normally done transparently by
>> dependency handling, but subscriptions and subscription relation status
>> do not use that much as it was deemed to bloat pg_depend needlessly
>> during the original patch review (it's also probably why this has
>> slipped through).
>>
>

I've reviewed this patch briefly.

@@ -515,6 +533,31 @@ logicalrep_worker_stop(Oid subid, Oid relid)
 }

 /*
+ * Request worker to be stopped on commit.
+ */
+void
+logicalrep_worker_stop_at_commit(Oid subid, Oid relid)
+{
+   LogicalRepWorkerId *wid;
+   MemoryContext old;
+
+   old = MemoryContextSwitchTo(TopTransactionContext);
+
+   wid = (LogicalRepWorkerId *) palloc(sizeof(LogicalRepWorkerId));
+
+   /*
+   wid = MemoryContextAlloc(TopTransactionContext,
+
sizeof(LogicalRepWorkerId));
+   */
+   wid->subid = subid;
+   wid->relid = relid;
+
+   on_commit_stop_workers = lappend(on_commit_stop_workers, wid);
+
+   MemoryContextSwitchTo(old);
+}

logicalrep_worker_stop_at_commit() has a problem that new_list()
called by lappend() allocates the memory from current memory context.
It should switch the memory context and then allocate the memory for
wid and append it to the list.


@@ -754,9 +773,25 @@ ApplyLauncherShmemInit(void)
 void
 AtEOXact_ApplyLauncher(bool isCommit)
 {
-   if (isCommit && on_commit_launcher_wakeup)
-   ApplyLauncherWakeup();
+   ListCell *lc;
+
+   if (isCommit)
+   {
+   foreach (lc, on_commit_stop_workers)
+   {
+   LogicalRepWorkerId *wid = lfirst(lc);
+   logicalrep_worker_stop(wid->subid, wid->relid);
+   }
+
+   if (on_commit_launcher_wakeup)
+   ApplyLauncherWakeup();

Stopping the logical rep worker in AtEOXact_ApplyLauncher doesn't
support the prepared transaction. Since we allocate the list
on_commit_stop_workers in TopTransactionContext the postgres crashes
if we execute any query after prepared transaction that removes
subscription relation state. Also after fixed this issue, we still
need to something: the list of on_commit_stop_workers is not
associated the prepared transaction.  A query next to "preapre
transaction" tries to stop workers at the commit. There was similar
discussion before.


+
+   ensure_transaction();
+
UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
+
rstate->relid, rstate->state,
+
rstate->lsn);

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-25 Thread Masahiko Sawada

On Sun, Jun 25, 2017 at 7:35 PM, Petr Jelinek
 wrote:
> (was away for a while, got some time now for this again)
>
> On 22/06/17 04:43, Peter Eisentraut wrote:
>> The alternative is that we use the LockSharedObject() approach that was
>> already alluded to, like in the attached patch.  Both approaches would
>> work equally fine AFAICT.
>
> I agree, but I think we need bigger overhaul of the locking/management
> in general. So here is patch which does much more changes.
>
> The patch does several important things (in no particular order):
> - Split SetSubscriptionRelState into AddSubscriptionRelState and
> UpdateSubscriptionRelState for the reasons said upstream, it's cleaner,
> there is no half-broken upsert logic and it has proper error checking
> for each action.
>
> - Do LockSharedObject in ALTER SUBSCRIPTION, DROP SUBSCRIPTION (this one
> is preexisting but mentioning it for context), SetSubscriptionRelState,
> AddSubscriptionRelState, and in the logicalrep_worker_launch. This means
> we use granular per object locks to deal with concurrency.
>
> - Because of above, the AccessExclusiveLock on pg_subscription is no
> longer needed, just normal RowExlusiveLock is used now.
>
> - logicalrep_worker_stop is also simplified due to the proper locking
>
> - There is new interface logicalrep_worker_stop_at_commit which is used
> by ALTER SUBSCRIPTION ... REFRESH PUBLICATION and by transactional
> variant of DROP SUBSCRIPTION to only kill workers at the end of transaction.
>
> - Locking/reading of subscription info is unified between DROP and ALTER
> SUBSCRIPTION commands.
>
> - DROP SUBSCRIPTION will kill all workers associated with subscription,
> not just apply.
>
> - The sync worker checks during startup if the relation is still subscribed.
>
> - The sync worker will exit when waiting for apply and apply has shut-down.
>
> - The log messages around workers and removed or disabled subscription
> are now more consistent between startup and normal runtime of the worker.
>
> - Some code deduplication and stylistic changes/simplification in
> related areas.
>
> - Fixed catcache's IndexScanOK() handling of the subscription catalog.
>
> It's bit bigger patch but solves issues from multiple threads around
> handling of ALTER/DROP subscription.
>
> A lot of the locking that I added is normally done transparently by
> dependency handling, but subscriptions and subscription relation status
> do not use that much as it was deemed to bloat pg_depend needlessly
> during the original patch review (it's also probably why this has
> slipped through).
>

Thank you for reworking on this! I'll review this patch on Tuesday.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-25 Thread Petr Jelinek

(was away for a while, got some time now for this again)

On 22/06/17 04:43, Peter Eisentraut wrote:
> The alternative is that we use the LockSharedObject() approach that was
> already alluded to, like in the attached patch.  Both approaches would
> work equally fine AFAICT.

I agree, but I think we need bigger overhaul of the locking/management
in general. So here is patch which does much more changes.

The patch does several important things (in no particular order):
- Split SetSubscriptionRelState into AddSubscriptionRelState and
UpdateSubscriptionRelState for the reasons said upstream, it's cleaner,
there is no half-broken upsert logic and it has proper error checking
for each action.

- Do LockSharedObject in ALTER SUBSCRIPTION, DROP SUBSCRIPTION (this one
is preexisting but mentioning it for context), SetSubscriptionRelState,
AddSubscriptionRelState, and in the logicalrep_worker_launch. This means
we use granular per object locks to deal with concurrency.

- Because of above, the AccessExclusiveLock on pg_subscription is no
longer needed, just normal RowExlusiveLock is used now.

- logicalrep_worker_stop is also simplified due to the proper locking

- There is new interface logicalrep_worker_stop_at_commit which is used
by ALTER SUBSCRIPTION ... REFRESH PUBLICATION and by transactional
variant of DROP SUBSCRIPTION to only kill workers at the end of transaction.

- Locking/reading of subscription info is unified between DROP and ALTER
SUBSCRIPTION commands.

- DROP SUBSCRIPTION will kill all workers associated with subscription,
not just apply.

- The sync worker checks during startup if the relation is still subscribed.

- The sync worker will exit when waiting for apply and apply has shut-down.

- The log messages around workers and removed or disabled subscription
are now more consistent between startup and normal runtime of the worker.

- Some code deduplication and stylistic changes/simplification in
related areas.

- Fixed catcache's IndexScanOK() handling of the subscription catalog.

It's bit bigger patch but solves issues from multiple threads around
handling of ALTER/DROP subscription.

A lot of the locking that I added is normally done transparently by
dependency handling, but subscriptions and subscription relation status
do not use that much as it was deemed to bloat pg_depend needlessly
during the original patch review (it's also probably why this has
slipped through).

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services
From d7038474012769c9c3b50231af76dd7796fe593f Mon Sep 17 00:00:00 2001
From: Petr Jelinek 
Date: Sat, 24 Jun 2017 19:38:21 +0200
Subject: [PATCH] Rework subscription worker and relation status handling

---
 src/backend/catalog/pg_subscription.c   | 137 +++--
 src/backend/commands/subscriptioncmds.c |  98 +-
 src/backend/replication/logical/launcher.c  | 293 +++-
 src/backend/replication/logical/tablesync.c |  97 +
 src/backend/replication/logical/worker.c|  23 ++-
 src/backend/utils/cache/catcache.c  |   6 +-
 src/include/catalog/pg_subscription_rel.h   |   6 +-
 src/include/replication/worker_internal.h   |   6 +-
 8 files changed, 367 insertions(+), 299 deletions(-)

diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index c69c461..b643e54 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -28,6 +28,8 @@
 
 #include "nodes/makefuncs.h"
 
+#include "storage/lmgr.h"
+
 #include "utils/array.h"
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
@@ -225,84 +227,101 @@ textarray_to_stringlist(ArrayType *textarray)
 }
 
 /*
- * Set the state of a subscription table.
- *
- * If update_only is true and the record for given table doesn't exist, do
- * nothing.  This can be used to avoid inserting a new record that was deleted
- * by someone else.  Generally, subscription DDL commands should use false,
- * workers should use true.
- *
- * The insert-or-update logic in this function is not concurrency safe so it
- * might raise an error in rare circumstances.  But if we took a stronger lock
- * such as ShareRowExclusiveLock, we would risk more deadlocks.
+ * Add new state record for a subscription table.
  */
 Oid
-SetSubscriptionRelState(Oid subid, Oid relid, char state,
-		XLogRecPtr sublsn, bool update_only)
+AddSubscriptionRelState(Oid subid, Oid relid, char state,
+		XLogRecPtr sublsn)
 {
 	Relation	rel;
 	HeapTuple	tup;
-	Oid			subrelid = InvalidOid;
+	Oid			subrelid;
 	bool		nulls[Natts_pg_subscription_rel];
 	Datum		values[Natts_pg_subscription_rel];
 
+	LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
+
 	rel = heap_open(SubscriptionRelRelationId, RowExclusiveLock);
 
 	/* Try finding existing mapping. */
 	tup = SearchSysCacheCopy2(SUBSCRIPTIONRELMAP,
 			  ObjectIdGetDatum(relid),

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-23 Thread Peter Eisentraut

On 6/21/17 22:47, Peter Eisentraut wrote:
> On 6/20/17 22:44, Noah Misch wrote:
>>> A patch has been posted, and it's being reviewed.  Next update Monday.
>>
>> This PostgreSQL 10 open item is past due for your status update.  Kindly send
>> a status update within 24 hours, and include a date for your subsequent 
>> status
>> update.  Refer to the policy on open item ownership:
>> https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
> 
> I'm not sure how to proceed here.  Nobody is else seems terribly
> interested, and this is really a minor issue, so I don't want to muck
> around with the locking endlessly.  Let's say, if there are no new
> insights by Friday, I'll pick one of the proposed patches or just close
> it without any patch.

After some comments, a new patch has been posted, and I'll give until
Monday for review.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-23 Thread Peter Eisentraut

On 6/22/17 03:26, Masahiko Sawada wrote:
> Thank you for the patch. Some comment and question.
> DropSubscriptionRelState requests ExclusiveLock but why is it not
> ShareRowExclusiveLock?

fixed

> I test DROP SUBSCIPTION case but even with this patch, "tuple
> concurrently updated" is still occurred.
> 
> @@ -405,7 +405,7 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
>   }
>   heap_endscan(scan);
> 
> - heap_close(rel, RowExclusiveLock);
> + heap_close(rel, lockmode);
>  }
> 
> I think we should not release the table lock here, should be
> heap_close(rel, NoLock) instead? After changed it so on my
> environment, DROP SUBSCRIPTION seems to work fine.

fixed

> Also, ALTER SUBSCRIPTION SET PUBLICATION seems the same. Even with
> Petr's patch, the same error still occurred during ALTER SUBSCRIPTION
> SET PUBLICATION. Currently it acquires RowExclusiveLock on
> pg_subscription_rel but as long as both DDL and table sync worker
> could modify the same record on  pg_subscription this error can
> happen.

fixed

> On the other hand if we take a strong lock on pg_subscription
> during DDL the deadlock risk will be increased.

The original problem was that DROP TABLE locked things in the order 1)
user table, 2) pg_subscription_rel, whereas a full-database VACUUM would
lock things in the opposite order.  I think this was a pretty wide
window if you have many tables.  In this case, however, we are only
dealing with pg_subscription and pg_subscription_rel, and I think even
VACUUM would always processes them in the same order.

Could please try the attached patch so see if it addresses your test
scenarios?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From fc50b9e287e8c3c29be21d82c07c39023cbf3882 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Fri, 23 Jun 2017 21:26:46 -0400
Subject: [PATCH v2] Parametrize locking in RemoveSubscriptionRel

Revising the change in 521fd4795e3ec3d0b263b62e5eb58e1557be9c86, we need
to take a stronger lock when removing entries from pg_subscription_rel
when ALTER/DROP SUBSCRIPTION.  Otherwise, users might see "tuple
concurrently updated" when table sync workers make updates at the same
time as DDL commands are run.  For DROP TABLE, we keep the lower lock
level that was put in place by the above commit, so that DROP TABLE does
not deadlock with whole-database VACUUM.
---
 src/backend/catalog/heap.c| 2 +-
 src/backend/catalog/pg_subscription.c | 6 +++---
 src/backend/commands/subscriptioncmds.c   | 4 ++--
 src/include/catalog/pg_subscription_rel.h | 2 +-
 4 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index a376b99f1e..8a96bbece6 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -1844,7 +1844,7 @@ heap_drop_with_catalog(Oid relid)
/*
 * Remove any associated relation synchronization states.
 */
-   RemoveSubscriptionRel(InvalidOid, relid);
+   RemoveSubscriptionRel(InvalidOid, relid, RowExclusiveLock);
 
/*
 * Forget any ON COMMIT action for the rel
diff --git a/src/backend/catalog/pg_subscription.c 
b/src/backend/catalog/pg_subscription.c
index c69c461b62..1d80191ca8 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -369,7 +369,7 @@ GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr 
*sublsn,
  * subscription, or for a particular relation, or both.
  */
 void
-RemoveSubscriptionRel(Oid subid, Oid relid)
+RemoveSubscriptionRel(Oid subid, Oid relid, LOCKMODE lockmode)
 {
Relationrel;
HeapScanDesc scan;
@@ -377,7 +377,7 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
HeapTuple   tup;
int nkeys = 0;
 
-   rel = heap_open(SubscriptionRelRelationId, RowExclusiveLock);
+   rel = heap_open(SubscriptionRelRelationId, lockmode);
 
if (OidIsValid(subid))
{
@@ -405,7 +405,7 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
}
heap_endscan(scan);
 
-   heap_close(rel, RowExclusiveLock);
+   heap_close(rel, NoLock);
 }
 
 
diff --git a/src/backend/commands/subscriptioncmds.c 
b/src/backend/commands/subscriptioncmds.c
index 9cbd36f646..57954c978b 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -595,7 +595,7 @@ AlterSubscription_refresh(Subscription *sub, bool copy_data)
{
char   *namespace;
 
-   RemoveSubscriptionRel(sub->oid, relid);
+   RemoveSubscriptionRel(sub->oid, relid, 
ShareRowExclusiveLock);
 
logicalrep_worker_stop(sub->oid, relid);
 
@@ -910,7 +910,7 @@ DropSubscription(DropSubscriptionStmt *stmt, bool 
isTopLevel)
deleteSharedDependencyRecordsFor(SubscriptionRelationId, subid, 0);
 
/*

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-22 Thread Masahiko Sawada

On Wed, Jun 21, 2017 at 8:10 AM, Peter Eisentraut
 wrote:
> On 6/19/17 22:54, Masahiko Sawada wrote:
>>> It seems to me we could just take a stronger lock around
>>> RemoveSubscriptionRel(), so that workers can't write in there concurrently.
>>
>> Since we reduced the lock level of updating pg_subscription_rel by
>> commit 521fd4795e3e the same deadlock issue will appear if we just
>> take a stronger lock level.
>
> I was thinking about a more refined approach, like in the attached
> patch.  It just changes the locking when in DropSubscription(), so that
> that doesn't fail if workers are doing stuff concurrently.  Everything
> else stays the same.
>

Thank you for the patch. Some comment and question.
DropSubscriptionRelState requests ExclusiveLock but why is it not
ShareRowExclusiveLock?

I test DROP SUBSCIPTION case but even with this patch, "tuple
concurrently updated" is still occurred.

@@ -405,7 +405,7 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
  }
  heap_endscan(scan);

- heap_close(rel, RowExclusiveLock);
+ heap_close(rel, lockmode);
 }

I think we should not release the table lock here, should be
heap_close(rel, NoLock) instead? After changed it so on my
environment, DROP SUBSCRIPTION seems to work fine.

Also, ALTER SUBSCRIPTION SET PUBLICATION seems the same. Even with
Petr's patch, the same error still occurred during ALTER SUBSCRIPTION
SET PUBLICATION. Currently it acquires RowExclusiveLock on
pg_subscription_rel but as long as both DDL and table sync worker
could modify the same record on  pg_subscription this error can
happen. On the other hand if we take a strong lock on pg_subscription
during DDL the deadlock risk will be increased.

One solution I came up with is that we use other than
simple_heap_update/delete so that we accept the return value
HeapTupleUpdated of heap_update/delete. It makes pg_subscription_rel
something like heap table rather than system catalog, though.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-22 Thread Kyotaro HORIGUCHI

Hello,

At Wed, 21 Jun 2017 22:43:32 -0400, Peter Eisentraut 
 wrote in 
<501f75c9-c5d6-d023-add0-3b670ac86...@2ndquadrant.com>
> On 6/20/17 19:10, Peter Eisentraut wrote:
> > On 6/19/17 22:54, Masahiko Sawada wrote:
> >>> It seems to me we could just take a stronger lock around
> >>> RemoveSubscriptionRel(), so that workers can't write in there 
> >>> concurrently.
> >>
> >> Since we reduced the lock level of updating pg_subscription_rel by
> >> commit 521fd4795e3e the same deadlock issue will appear if we just
> >> take a stronger lock level.
> > 
> > I was thinking about a more refined approach, like in the attached
> > patch.  It just changes the locking when in DropSubscription(), so that
> > that doesn't fail if workers are doing stuff concurrently.  Everything
> > else stays the same.
> 
> The alternative is that we use the LockSharedObject() approach that was
> already alluded to, like in the attached patch.  Both approaches would
> work equally fine AFAICT.

However I haven't seen this deeply, just making
SetSubscriptionRelState exlusive seems to have a chance to create
a false record on pg_subscription_rel. Am I misunderstanding?

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-21 Thread Peter Eisentraut

On 6/20/17 22:44, Noah Misch wrote:
>> A patch has been posted, and it's being reviewed.  Next update Monday.
> 
> This PostgreSQL 10 open item is past due for your status update.  Kindly send
> a status update within 24 hours, and include a date for your subsequent status
> update.  Refer to the policy on open item ownership:
> https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

I'm not sure how to proceed here.  Nobody is else seems terribly
interested, and this is really a minor issue, so I don't want to muck
around with the locking endlessly.  Let's say, if there are no new
insights by Friday, I'll pick one of the proposed patches or just close
it without any patch.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-21 Thread Peter Eisentraut

On 6/19/17 22:41, Masahiko Sawada wrote:
> On Tue, Jun 20, 2017 at 10:47 AM, Peter Eisentraut
>  wrote:
>> On 6/16/17 04:16, Masahiko Sawada wrote:
>>> A subscription relation state may have been removed already when we
>>> try to update it. SetSubscriptionRelState didn't emit an error in such
>>> case but with this patch we end up with an error. Since we shouldn't
>>> ignore such error in UpdateSubscriptionRelState I'd say we can let the
>>> user know about that possibility in the error message.
>>
>> So are you saying it's good to have the error message?
>>
> 
> Yes. UpdateSubscriptionRelState failure means that the subscription
> relation state has disappeared or also means something wrong. So I
> think it's good to have it as perhaps errdetail. Thought?

I think this could lead to errors in the tablesync workers if they are
trying to write while the entries have already been deleted as part of
the subscription or the table being deleted.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-21 Thread Peter Eisentraut

On 6/20/17 19:10, Peter Eisentraut wrote:
> On 6/19/17 22:54, Masahiko Sawada wrote:
>>> It seems to me we could just take a stronger lock around
>>> RemoveSubscriptionRel(), so that workers can't write in there concurrently.
>>
>> Since we reduced the lock level of updating pg_subscription_rel by
>> commit 521fd4795e3e the same deadlock issue will appear if we just
>> take a stronger lock level.
> 
> I was thinking about a more refined approach, like in the attached
> patch.  It just changes the locking when in DropSubscription(), so that
> that doesn't fail if workers are doing stuff concurrently.  Everything
> else stays the same.

The alternative is that we use the LockSharedObject() approach that was
already alluded to, like in the attached patch.  Both approaches would
work equally fine AFAICT.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 10afa9807014e14596cb05d70ba302c86bf30dd3 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 21 Jun 2017 22:40:22 -0400
Subject: [PATCH] WIP Add locking SetSubscriptionRelState()

---
 src/backend/catalog/pg_subscription.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/src/backend/catalog/pg_subscription.c 
b/src/backend/catalog/pg_subscription.c
index c69c461b62..f794d4c526 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -28,6 +28,8 @@
 
 #include "nodes/makefuncs.h"
 
+#include "storage/lmgr.h"
+
 #include "utils/array.h"
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
@@ -246,6 +248,8 @@ SetSubscriptionRelState(Oid subid, Oid relid, char state,
boolnulls[Natts_pg_subscription_rel];
Datum   values[Natts_pg_subscription_rel];
 
+   LockSharedObject(SubscriptionRelationId, subid, 0, AccessExclusiveLock);
+
rel = heap_open(SubscriptionRelRelationId, RowExclusiveLock);
 
/* Try finding existing mapping. */
-- 
2.13.1


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-20 Thread Noah Misch

On Thu, Jun 15, 2017 at 11:40:52PM -0400, Peter Eisentraut wrote:
> On 6/13/17 15:49, Peter Eisentraut wrote:
> > On 6/13/17 02:33, Noah Misch wrote:
> >>> Steps to reproduce -
> >>> X cluster -> create 100 tables , publish all tables  (create publication 
> >>> pub
> >>> for all tables);
> >>> Y Cluster -> create 100 tables ,create subscription(create subscription 
> >>> sub
> >>> connection 'user=centos host=localhost' publication pub;
> >>> Y cluster ->drop subscription - drop subscription sub;
> >>>
> >>> check the log file on Y cluster.
> >>>
> >>> Sometime , i have seen this error on psql prompt and drop subscription
> >>> operation got failed at first attempt.
> >>>
> >>> postgres=# drop subscription sub;
> >>> ERROR:  tuple concurrently updated
> >>> postgres=# drop subscription sub;
> >>> NOTICE:  dropped replication slot "sub" on publisher
> >>> DROP SUBSCRIPTION
> >>
> >> [Action required within three days.  This is a generic notification.]
> > 
> > It's being worked on.  Let's see by Thursday.
> 
> A patch has been posted, and it's being reviewed.  Next update Monday.

This PostgreSQL 10 open item is past due for your status update.  Kindly send
a status update within 24 hours, and include a date for your subsequent status
update.  Refer to the policy on open item ownership:
https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-20 Thread Peter Eisentraut

On 6/19/17 22:54, Masahiko Sawada wrote:
>> It seems to me we could just take a stronger lock around
>> RemoveSubscriptionRel(), so that workers can't write in there concurrently.
> 
> Since we reduced the lock level of updating pg_subscription_rel by
> commit 521fd4795e3e the same deadlock issue will appear if we just
> take a stronger lock level.

I was thinking about a more refined approach, like in the attached
patch.  It just changes the locking when in DropSubscription(), so that
that doesn't fail if workers are doing stuff concurrently.  Everything
else stays the same.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From c101b6da52dd4f9f041390a937b337f20d212e5c Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 20 Jun 2017 19:06:42 -0400
Subject: [PATCH] WIP Parametrize locking in RemoveSubscriptionRel

---
 src/backend/catalog/heap.c| 2 +-
 src/backend/catalog/pg_subscription.c | 6 +++---
 src/backend/commands/subscriptioncmds.c   | 4 ++--
 src/include/catalog/pg_subscription_rel.h | 2 +-
 4 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 4e5b79ef94..643e4bb1d5 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -1844,7 +1844,7 @@ heap_drop_with_catalog(Oid relid)
/*
 * Remove any associated relation synchronization states.
 */
-   RemoveSubscriptionRel(InvalidOid, relid);
+   RemoveSubscriptionRel(InvalidOid, relid, RowExclusiveLock);
 
/*
 * Forget any ON COMMIT action for the rel
diff --git a/src/backend/catalog/pg_subscription.c 
b/src/backend/catalog/pg_subscription.c
index c69c461b62..88b81920c8 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -369,7 +369,7 @@ GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr 
*sublsn,
  * subscription, or for a particular relation, or both.
  */
 void
-RemoveSubscriptionRel(Oid subid, Oid relid)
+RemoveSubscriptionRel(Oid subid, Oid relid, LOCKMODE lockmode)
 {
Relationrel;
HeapScanDesc scan;
@@ -377,7 +377,7 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
HeapTuple   tup;
int nkeys = 0;
 
-   rel = heap_open(SubscriptionRelRelationId, RowExclusiveLock);
+   rel = heap_open(SubscriptionRelRelationId, lockmode);
 
if (OidIsValid(subid))
{
@@ -405,7 +405,7 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
}
heap_endscan(scan);
 
-   heap_close(rel, RowExclusiveLock);
+   heap_close(rel, lockmode);
 }
 
 
diff --git a/src/backend/commands/subscriptioncmds.c 
b/src/backend/commands/subscriptioncmds.c
index 5aae7b6f91..02cb7c47b8 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -595,7 +595,7 @@ AlterSubscription_refresh(Subscription *sub, bool copy_data)
{
char   *namespace;
 
-   RemoveSubscriptionRel(sub->oid, relid);
+   RemoveSubscriptionRel(sub->oid, relid, 
RowExclusiveLock);
 
logicalrep_worker_stop(sub->oid, relid);
 
@@ -910,7 +910,7 @@ DropSubscription(DropSubscriptionStmt *stmt, bool 
isTopLevel)
deleteSharedDependencyRecordsFor(SubscriptionRelationId, subid, 0);
 
/* Remove any associated relation synchronization states. */
-   RemoveSubscriptionRel(subid, InvalidOid);
+   RemoveSubscriptionRel(subid, InvalidOid, ExclusiveLock);
 
/* Kill the apply worker so that the slot becomes accessible. */
logicalrep_worker_stop(subid, InvalidOid);
diff --git a/src/include/catalog/pg_subscription_rel.h 
b/src/include/catalog/pg_subscription_rel.h
index f5f6191676..b2cadb4b13 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -74,7 +74,7 @@ extern Oid SetSubscriptionRelState(Oid subid, Oid relid, char 
state,
XLogRecPtr sublsn, bool 
update_only);
 extern char GetSubscriptionRelState(Oid subid, Oid relid,
XLogRecPtr *sublsn, bool 
missing_ok);
-extern void RemoveSubscriptionRel(Oid subid, Oid relid);
+extern void RemoveSubscriptionRel(Oid subid, Oid relid, LOCKMODE lockmode);
 
 extern List *GetSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionNotReadyRelations(Oid subid);
-- 
2.13.1


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-19 Thread Masahiko Sawada

On Tue, Jun 20, 2017 at 10:55 AM, Peter Eisentraut
 wrote:
> On 6/1/17 13:37, Petr Jelinek wrote:
>> On 01/06/17 17:32, Masahiko Sawada wrote:
>>> On Thu, May 25, 2017 at 5:29 PM, tushar  
>>> wrote:
 After applying all your patches, drop subscription no  more hangs while
 dropping  subscription but there is an error   "ERROR:  tuple concurrently
 updated" in the log file of
 standby.
>>>
>>> I tried to reproduce this issue with latest four patches I submit but
>>> it didn't happen. I guess this issue doesn't related to the issues
>>> reported on this thread and another factor  might be cause of this
>>> issue. It would be good to test the same again with latest four
>>> patches or after solved current some open items.
>>
>> That's because your additional patch kills the workers that do the
>> concurrent update. While that's probably okay, I still plan to look into
>> making the subscription and state locking more robust.
>
> It seems we have gotten off track here a bit.  What is the current
> proposal to fix "tuple concurrently updated" during DROP SUBSCRiPTION?

I think there is no proposal for it so far. The current proposal is to
fix this problem during ALTER SUBSCRIPTION.

> It seems to me we could just take a stronger lock around
> RemoveSubscriptionRel(), so that workers can't write in there concurrently.
>

Since we reduced the lock level of updating pg_subscription_rel by
commit 521fd4795e3e the same deadlock issue will appear if we just
take a stronger lock level.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-19 Thread Masahiko Sawada

On Tue, Jun 20, 2017 at 10:47 AM, Peter Eisentraut
 wrote:
> On 6/16/17 04:16, Masahiko Sawada wrote:
>> A subscription relation state may have been removed already when we
>> try to update it. SetSubscriptionRelState didn't emit an error in such
>> case but with this patch we end up with an error. Since we shouldn't
>> ignore such error in UpdateSubscriptionRelState I'd say we can let the
>> user know about that possibility in the error message.
>
> So are you saying it's good to have the error message?
>

Yes. UpdateSubscriptionRelState failure means that the subscription
relation state has disappeared or also means something wrong. So I
think it's good to have it as perhaps errdetail. Thought?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-19 Thread Peter Eisentraut

On 6/1/17 13:37, Petr Jelinek wrote:
> On 01/06/17 17:32, Masahiko Sawada wrote:
>> On Thu, May 25, 2017 at 5:29 PM, tushar  
>> wrote:
>>> After applying all your patches, drop subscription no  more hangs while
>>> dropping  subscription but there is an error   "ERROR:  tuple concurrently
>>> updated" in the log file of
>>> standby.
>>
>> I tried to reproduce this issue with latest four patches I submit but
>> it didn't happen. I guess this issue doesn't related to the issues
>> reported on this thread and another factor  might be cause of this
>> issue. It would be good to test the same again with latest four
>> patches or after solved current some open items.
> 
> That's because your additional patch kills the workers that do the
> concurrent update. While that's probably okay, I still plan to look into
> making the subscription and state locking more robust.

It seems we have gotten off track here a bit.  What is the current
proposal to fix "tuple concurrently updated" during DROP SUBSCRiPTION?

It seems to me we could just take a stronger lock around
RemoveSubscriptionRel(), so that workers can't write in there concurrently.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-19 Thread Peter Eisentraut

On 6/16/17 04:16, Masahiko Sawada wrote:
> A subscription relation state may have been removed already when we
> try to update it. SetSubscriptionRelState didn't emit an error in such
> case but with this patch we end up with an error. Since we shouldn't
> ignore such error in UpdateSubscriptionRelState I'd say we can let the
> user know about that possibility in the error message.

So are you saying it's good to have the error message?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-16 Thread Masahiko Sawada

On Thu, Jun 15, 2017 at 10:22 AM, Masahiko Sawada  wrote:
> On Thu, Jun 15, 2017 at 7:35 AM, Petr Jelinek
>  wrote:
>> On 13/06/17 21:49, Peter Eisentraut wrote:
>>> On 6/13/17 02:33, Noah Misch wrote:
> Steps to reproduce -
> X cluster -> create 100 tables , publish all tables  (create publication 
> pub
> for all tables);
> Y Cluster -> create 100 tables ,create subscription(create subscription 
> sub
> connection 'user=centos host=localhost' publication pub;
> Y cluster ->drop subscription - drop subscription sub;
>
> check the log file on Y cluster.
>
> Sometime , i have seen this error on psql prompt and drop subscription
> operation got failed at first attempt.
>
> postgres=# drop subscription sub;
> ERROR:  tuple concurrently updated
> postgres=# drop subscription sub;
> NOTICE:  dropped replication slot "sub" on publisher
> DROP SUBSCRIPTION

 [Action required within three days.  This is a generic notification.]
>>>
>>> It's being worked on.  Let's see by Thursday.
>>>
>>

I've reviewed these patches. 0001 patch conflicts with commit
a571c7f661a7b601aafcb12196d004cdb8b8cb23.

>> Attached fixes it (it was mostly about order of calls). I also split the
>> SetSubscriptionRelState into 2 separate interface while I was changing
>> it, because now that the update_only bool was added it has become quite
>> strange to have single interface for what is basically two separate
>> functions.

+1 from me, too.
A subscription relation state may have been removed already when we
try to update it. SetSubscriptionRelState didn't emit an error in such
case but with this patch we end up with an error. Since we shouldn't
ignore such error in UpdateSubscriptionRelState I'd say we can let the
user know about that possibility in the error message.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-15 Thread Peter Eisentraut

On 6/13/17 15:49, Peter Eisentraut wrote:
> On 6/13/17 02:33, Noah Misch wrote:
>>> Steps to reproduce -
>>> X cluster -> create 100 tables , publish all tables  (create publication pub
>>> for all tables);
>>> Y Cluster -> create 100 tables ,create subscription(create subscription sub
>>> connection 'user=centos host=localhost' publication pub;
>>> Y cluster ->drop subscription - drop subscription sub;
>>>
>>> check the log file on Y cluster.
>>>
>>> Sometime , i have seen this error on psql prompt and drop subscription
>>> operation got failed at first attempt.
>>>
>>> postgres=# drop subscription sub;
>>> ERROR:  tuple concurrently updated
>>> postgres=# drop subscription sub;
>>> NOTICE:  dropped replication slot "sub" on publisher
>>> DROP SUBSCRIPTION
>>
>> [Action required within three days.  This is a generic notification.]
> 
> It's being worked on.  Let's see by Thursday.

A patch has been posted, and it's being reviewed.  Next update Monday.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-15 Thread Peter Eisentraut

On 6/15/17 12:22, Petr Jelinek wrote:
> On 15/06/17 17:53, Peter Eisentraut wrote:
>> On 6/14/17 18:35, Petr Jelinek wrote:
>>> Attached fixes it (it was mostly about order of calls).
>> So do I understand this right that the actual fix is just moving up the
>> logicalrep_worker_stop() call in DropSubscription().
>>
> No the fix is heap_open before SearchSysCache().

The existing code already does that.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-15 Thread Petr Jelinek

On 15/06/17 18:36, Peter Eisentraut wrote:
> On 6/15/17 12:22, Petr Jelinek wrote:
>> On 15/06/17 17:53, Peter Eisentraut wrote:
>>> On 6/14/17 18:35, Petr Jelinek wrote:
 Attached fixes it (it was mostly about order of calls).
>>>
>>> So do I understand this right that the actual fix is just moving up the
>>> logicalrep_worker_stop() call in DropSubscription().
>>>
>>
>> No the fix is heap_open before SearchSysCache().
> 
> Right.  Is there a reason for moving the _stop() call then?
> 

Nothing specific, just felt it's better there when I was messing with
the function.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-15 Thread Peter Eisentraut

On 6/15/17 12:22, Petr Jelinek wrote:
> On 15/06/17 17:53, Peter Eisentraut wrote:
>> On 6/14/17 18:35, Petr Jelinek wrote:
>>> Attached fixes it (it was mostly about order of calls).
>>
>> So do I understand this right that the actual fix is just moving up the
>> logicalrep_worker_stop() call in DropSubscription().
>>
> 
> No the fix is heap_open before SearchSysCache().

Right.  Is there a reason for moving the _stop() call then?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-15 Thread Petr Jelinek

On 15/06/17 17:53, Peter Eisentraut wrote:
> On 6/14/17 18:35, Petr Jelinek wrote:
>> Attached fixes it (it was mostly about order of calls).
> 
> So do I understand this right that the actual fix is just moving up the
> logicalrep_worker_stop() call in DropSubscription().
>

No the fix is heap_open before SearchSysCache().

>> I also split the
>> SetSubscriptionRelState into 2 separate interface while I was changing
>> it, because now that the update_only bool was added it has become quite
>> strange to have single interface for what is basically two separate
>> functions.
> 
> makes sense
> 
>> Other related problem is locking of subscriptions during operations on
>> them, especially AlterSubscription seems like it should lock the
>> subscription itself. I did that in 0002.
> 
> More detail here please.  AlterSubscription() does locking via
> heap_open().  This introduces a new locking method.  What are the
> implications?
> 

I don't think heap_open will be enough once we remove the
AccessExclusiveLock of the catalog in DropSubscription because
concurrent AlterSubscription might happily add tables to the
subscription that has been dropped if we don't lock it. But you made me
realize that even my patch is not enough because we then reread the
subscription info only from syscache without any kind of invalidation
attempt.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-15 Thread Peter Eisentraut

On 6/14/17 18:35, Petr Jelinek wrote:
> Attached fixes it (it was mostly about order of calls).

So do I understand this right that the actual fix is just moving up the
logicalrep_worker_stop() call in DropSubscription().

> I also split the
> SetSubscriptionRelState into 2 separate interface while I was changing
> it, because now that the update_only bool was added it has become quite
> strange to have single interface for what is basically two separate
> functions.

makes sense

> Other related problem is locking of subscriptions during operations on
> them, especially AlterSubscription seems like it should lock the
> subscription itself. I did that in 0002.

More detail here please.  AlterSubscription() does locking via
heap_open().  This introduces a new locking method.  What are the
implications?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-14 Thread Masahiko Sawada

On Thu, Jun 15, 2017 at 7:35 AM, Petr Jelinek
 wrote:
> On 13/06/17 21:49, Peter Eisentraut wrote:
>> On 6/13/17 02:33, Noah Misch wrote:
 Steps to reproduce -
 X cluster -> create 100 tables , publish all tables  (create publication 
 pub
 for all tables);
 Y Cluster -> create 100 tables ,create subscription(create subscription sub
 connection 'user=centos host=localhost' publication pub;
 Y cluster ->drop subscription - drop subscription sub;

 check the log file on Y cluster.

 Sometime , i have seen this error on psql prompt and drop subscription
 operation got failed at first attempt.

 postgres=# drop subscription sub;
 ERROR:  tuple concurrently updated
 postgres=# drop subscription sub;
 NOTICE:  dropped replication slot "sub" on publisher
 DROP SUBSCRIPTION
>>>
>>> [Action required within three days.  This is a generic notification.]
>>
>> It's being worked on.  Let's see by Thursday.
>>
>
> Attached fixes it (it was mostly about order of calls). I also split the
> SetSubscriptionRelState into 2 separate interface while I was changing
> it, because now that the update_only bool was added it has become quite
> strange to have single interface for what is basically two separate
> functions.
>
> There are still couple of remaining issues from this thread though.
> Namely the AccessExclusiveLock of the pg_subscription catalog which is
> not very pretty, but we need a way to block launcher from accessing the
> subscription which is being dropped and make sure it will not start new
> workers for it afterwards. Question is how however as by the time
> launcher can lock individual subscription it is already processing it.
> So it looks to me like we'd need to reread the catalog with new snapshot
> after the lock was acquired which seems bit wasteful (I wonder if we
> could just AcceptInvalidationMessages and refetch from syscache). Any
> better ideas?
>
> Other related problem is locking of subscriptions during operations on
> them, especially AlterSubscription seems like it should lock the
> subscription itself. I did that in 0002.
>

Thank you for the patch! Sorry I don't have a time for it today but
I'll review these patches tomorrow.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-14 Thread Petr Jelinek

On 13/06/17 21:49, Peter Eisentraut wrote:
> On 6/13/17 02:33, Noah Misch wrote:
>>> Steps to reproduce -
>>> X cluster -> create 100 tables , publish all tables  (create publication pub
>>> for all tables);
>>> Y Cluster -> create 100 tables ,create subscription(create subscription sub
>>> connection 'user=centos host=localhost' publication pub;
>>> Y cluster ->drop subscription - drop subscription sub;
>>>
>>> check the log file on Y cluster.
>>>
>>> Sometime , i have seen this error on psql prompt and drop subscription
>>> operation got failed at first attempt.
>>>
>>> postgres=# drop subscription sub;
>>> ERROR:  tuple concurrently updated
>>> postgres=# drop subscription sub;
>>> NOTICE:  dropped replication slot "sub" on publisher
>>> DROP SUBSCRIPTION
>>
>> [Action required within three days.  This is a generic notification.]
> 
> It's being worked on.  Let's see by Thursday.
> 

Attached fixes it (it was mostly about order of calls). I also split the
SetSubscriptionRelState into 2 separate interface while I was changing
it, because now that the update_only bool was added it has become quite
strange to have single interface for what is basically two separate
functions.

There are still couple of remaining issues from this thread though.
Namely the AccessExclusiveLock of the pg_subscription catalog which is
not very pretty, but we need a way to block launcher from accessing the
subscription which is being dropped and make sure it will not start new
workers for it afterwards. Question is how however as by the time
launcher can lock individual subscription it is already processing it.
So it looks to me like we'd need to reread the catalog with new snapshot
after the lock was acquired which seems bit wasteful (I wonder if we
could just AcceptInvalidationMessages and refetch from syscache). Any
better ideas?

Other related problem is locking of subscriptions during operations on
them, especially AlterSubscription seems like it should lock the
subscription itself. I did that in 0002.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services
From 005eeb820f4e0528513744136582c4489e2429e3 Mon Sep 17 00:00:00 2001
From: Petr Jelinek 
Date: Wed, 14 Jun 2017 08:14:20 +0200
Subject: [PATCH 2/2] Lock subscription when altering it

---
 src/backend/commands/subscriptioncmds.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 14c8f3f..e0ec8ea 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -642,6 +642,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 	   stmt->subname);
 
 	subid = HeapTupleGetOid(tup);
+
+	/*
+	 * Lock the subscription so nobody else can do anything with it (including
+	 * the replication workers).
+	 */
+	LockSharedObject(SubscriptionRelationId, subid, 0, AccessExclusiveLock);
+
 	sub = GetSubscription(subid, false);
 
 	/* Form a new tuple. */
-- 
2.7.4

From 9011698ae800e0f45f960e91f6b16eab15634fac Mon Sep 17 00:00:00 2001
From: Petr Jelinek 
Date: Tue, 13 Jun 2017 19:26:51 +0200
Subject: [PATCH 1/2] Improve the pg_subscription_rel handling

Split the SetSubscriptionRelState into separate Add and Update
functions, removing the unsafe upsert logic as callers are supposed to
know whether they are updating or adding new row.

Reorder the code in the above mentioned functions to avoid "tuple
updated concurrently" warnings.
---
 src/backend/catalog/pg_subscription.c   | 131 +++-
 src/backend/commands/subscriptioncmds.c |  14 +--
 src/backend/replication/logical/tablesync.c |  33 ---
 src/include/catalog/pg_subscription_rel.h   |   6 +-
 4 files changed, 98 insertions(+), 86 deletions(-)

diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index c5b2541..fd19675 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -225,24 +225,15 @@ textarray_to_stringlist(ArrayType *textarray)
 }
 
 /*
- * Set the state of a subscription table.
- *
- * If update_only is true and the record for given table doesn't exist, do
- * nothing.  This can be used to avoid inserting a new record that was deleted
- * by someone else.  Generally, subscription DDL commands should use false,
- * workers should use true.
- *
- * The insert-or-update logic in this function is not concurrency safe so it
- * might raise an error in rare circumstances.  But if we took a stronger lock
- * such as ShareRowExclusiveLock, we would risk more deadlocks.
+ * Add new state record for a subscription table.
  */
 Oid
-SetSubscriptionRelState(Oid subid, Oid relid, char state,
-		XLogRecPtr sublsn, bool update_only)
+AddSubscriptionRelState(Oid subid, Oid relid, char state,
+		XLogRecPtr sublsn)
 {
 	Relation	rel;
 	HeapTuple	tup;
-	Oid			subrelid =

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-13 Thread Peter Eisentraut

On 6/13/17 02:33, Noah Misch wrote:
>> Steps to reproduce -
>> X cluster -> create 100 tables , publish all tables  (create publication pub
>> for all tables);
>> Y Cluster -> create 100 tables ,create subscription(create subscription sub
>> connection 'user=centos host=localhost' publication pub;
>> Y cluster ->drop subscription - drop subscription sub;
>>
>> check the log file on Y cluster.
>>
>> Sometime , i have seen this error on psql prompt and drop subscription
>> operation got failed at first attempt.
>>
>> postgres=# drop subscription sub;
>> ERROR:  tuple concurrently updated
>> postgres=# drop subscription sub;
>> NOTICE:  dropped replication slot "sub" on publisher
>> DROP SUBSCRIPTION
> 
> [Action required within three days.  This is a generic notification.]

It's being worked on.  Let's see by Thursday.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-13 Thread Noah Misch

On Thu, May 25, 2017 at 01:59:51PM +0530, tushar wrote:
> After applying all your patches, drop subscription no  more hangs while
> dropping  subscription but there is an error   "ERROR:  tuple concurrently
> updated" in the log file of
> standby.
> 
> ---
> logical replication synchronization worker finished processing
> 2017-05-25 09:15:52.654 BST [18575] LOG:  logical replication
> synchronization worker finished processing
> 2017-05-25 09:15:52.656 BST [18563] LOG:  starting logical replication
> worker for subscription "sub"
> 2017-05-25 09:15:52.662 BST [18577] LOG:  logical replication sync for
> subscription sub, table t14 started
> 2017-05-25 09:15:53.657 BST [18563] LOG:  starting logical replication
> worker for subscription "sub"
> 2017-05-25 09:15:53.663 BST [18579] LOG:  logical replication sync for
> subscription sub, table t15 started
> 2017-05-25 09:15:53.724 BST [18563] FATAL:  terminating logical replication
> worker due to administrator command
> 2017-05-25 09:15:53.725 BST [18521] LOG:  worker process: logical
> replication worker for subscription 16684 (PID 18563) exited with exit code
> 1
> 2017-05-25 09:15:54.734 BST [18579] ERROR:  tuple concurrently updated
> 2017-05-25 09:15:54.735 BST [18577] ERROR:  tuple concurrently updated
> 2017-05-25 09:15:54.736 BST [18521] LOG:  worker process: logical
> replication worker for subscription 16684 sync 16426 (PID 18579) exited with
> exit code 1
> 2017-05-25 09:15:54.736 BST [18521] LOG:  worker process: logical
> replication worker for subscription 16684 sync 16423 (PID 18577) exited with
> exit code 1
> ~
> ~
> ~
> 
> Steps to reproduce -
> X cluster -> create 100 tables , publish all tables  (create publication pub
> for all tables);
> Y Cluster -> create 100 tables ,create subscription(create subscription sub
> connection 'user=centos host=localhost' publication pub;
> Y cluster ->drop subscription - drop subscription sub;
> 
> check the log file on Y cluster.
> 
> Sometime , i have seen this error on psql prompt and drop subscription
> operation got failed at first attempt.
> 
> postgres=# drop subscription sub;
> ERROR:  tuple concurrently updated
> postgres=# drop subscription sub;
> NOTICE:  dropped replication slot "sub" on publisher
> DROP SUBSCRIPTION

[Action required within three days.  This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item.  Peter,
since you committed the patch believed to have created it, you own this open
item.  If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know.  Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message.  Include a date for your subsequent status update.  Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10.  Consequently, I will appreciate your efforts
toward speedy resolution.  Thanks.

[1] 
https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-08 Thread Andres Freund

On 2017-06-07 11:51:12 +0200, Petr Jelinek wrote:
> On 07/06/17 03:00, Andres Freund wrote:
> > On 2017-06-06 19:36:13 +0200, Petr Jelinek wrote:
> > 
> >> As a side note, we are starting to have several IsSomeTypeOfProcess
> >> functions for these kind of things. I wonder if bgworker infrastructure
> >> should somehow provide this type of stuff (the proposed bgw_type might
> >> help there) as well as maybe being able to register interrupt handler
> >> for the worker (it's really hard to do it via custom SIGTERM handler as
> >> visible on worker, launcher and walsender issues we are fixing).
> >> Obviously this is PG11 related thought.
> > 
> > Maybe it's also a sign that the bgworker infrastructure isn't really the
> > best thing to use for internal processes like parallel workers and lrep
> > processes - it's really built so core code *doesn't* know anything
> > specific about them, which isn't really what we want in some of these
> > cases.  That's not to say that bgworkers and parallelism/lrep shouldn't
> > share infrastructure, don't get me wrong.
> >
> 
> Well the nice thing about bgworkers is that it provides the basic
> infrastructure.

Right.


> Main reason lrep stuff is using it is that I didn't want to add
> another special backend kind to 20 different places, but turns out it
> still needs to be in some. So either we need to add more support for
> these kind of things to bgworkers or write something for internal
> workers that shares the infrastructure with bgworkers as you say.

Yea, I think we should radically unify a lot of the related
infrastructure between all processes.  We've grown a lot of duplication.


> >> @@ -2848,6 +2849,14 @@ ProcessInterrupts(void)
> >>ereport(FATAL,
> >>(errcode(ERRCODE_ADMIN_SHUTDOWN),
> >> errmsg("terminating logical 
> >> replication worker due to administrator command")));
> >> +  else if (IsLogicalLauncher())
> >> +  {
> >> +  ereport(DEBUG1,
> >> +  (errmsg("logical replication launcher 
> >> shutting down")));
> >> +
> >> +  /* The logical replication launcher can be stopped at 
> >> any time. */
> >> +  proc_exit(0);
> >> +  }
> > 
> > We could use PgBackendStatus->st_backendType for these, merging these
> > different paths.
> > 
> 
> Hmm, that's not that easy, st_backendType will be generic type for
> bgworker as the bgw_type patch didn't land yet (which is discussed in
> yet another thread [1]). It seems like an argument for going forward
> with it (the bgw_type patch) in PG10.

Yea.  I left it as is for now.


> I don't mind, it has some overlap with your proposed fixes for latching
> so if you are willing go ahead.

Done.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-07 Thread Petr Jelinek

On 07/06/17 03:00, Andres Freund wrote:
> On 2017-06-06 19:36:13 +0200, Petr Jelinek wrote:
> 
>> As a side note, we are starting to have several IsSomeTypeOfProcess
>> functions for these kind of things. I wonder if bgworker infrastructure
>> should somehow provide this type of stuff (the proposed bgw_type might
>> help there) as well as maybe being able to register interrupt handler
>> for the worker (it's really hard to do it via custom SIGTERM handler as
>> visible on worker, launcher and walsender issues we are fixing).
>> Obviously this is PG11 related thought.
> 
> Maybe it's also a sign that the bgworker infrastructure isn't really the
> best thing to use for internal processes like parallel workers and lrep
> processes - it's really built so core code *doesn't* know anything
> specific about them, which isn't really what we want in some of these
> cases.  That's not to say that bgworkers and parallelism/lrep shouldn't
> share infrastructure, don't get me wrong.
>

Well the nice thing about bgworkers is that it provides the basic
infrastructure. Main reason lrep stuff is using it is that I didn't want
to add another special backend kind to 20 different places, but turns
out it still needs to be in some. So either we need to add more support
for these kind of things to bgworkers or write something for internal
workers that shares the infrastructure with bgworkers as you say.

> 
>>  
>> -LogicalRepCtx->launcher_pid = 0;
>> -
>> -/* ... and if it returns, we're done */
>> -ereport(DEBUG1,
>> -(errmsg("logical replication launcher shutting down")));
>> +/* Not reachable */
>> +}
> 
> Maybe put a pg_unreachable() there?
> 

Ah didn't know it existed.

> 
>> @@ -2848,6 +2849,14 @@ ProcessInterrupts(void)
>>  ereport(FATAL,
>>  (errcode(ERRCODE_ADMIN_SHUTDOWN),
>>   errmsg("terminating logical 
>> replication worker due to administrator command")));
>> +else if (IsLogicalLauncher())
>> +{
>> +ereport(DEBUG1,
>> +(errmsg("logical replication launcher 
>> shutting down")));
>> +
>> +/* The logical replication launcher can be stopped at 
>> any time. */
>> +proc_exit(0);
>> +}
> 
> We could use PgBackendStatus->st_backendType for these, merging these
> different paths.
> 

Hmm, that's not that easy, st_backendType will be generic type for
bgworker as the bgw_type patch didn't land yet (which is discussed in
yet another thread [1]). It seems like an argument for going forward
with it (the bgw_type patch) in PG10.

> I can take care of this one, if you/Peter want.
> 

I don't mind, it has some overlap with your proposed fixes for latching
so if you are willing go ahead.

[1]
https://www.postgresql.org/message-id/flat/0d795703-a885-2193-2331-f00d7a3a4...@2ndquadrant.com#0d795703-a885-2193-2331-f00d7a3a4...@2ndquadrant.com

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-06 Thread Andres Freund

On 2017-06-06 19:36:13 +0200, Petr Jelinek wrote:
> So the fact that we moved workers to standard interrupt handling broke
> launcher in subtle ways because it still uses it's own SIGTERM handling
> but some function it calls are using CHECK_FOR_INTERRUPTS (they are used
> by worker as well). I think we need to move launcher to standard
> interrupt handling as well.

Sounds like a good plan.


> As a side note, we are starting to have several IsSomeTypeOfProcess
> functions for these kind of things. I wonder if bgworker infrastructure
> should somehow provide this type of stuff (the proposed bgw_type might
> help there) as well as maybe being able to register interrupt handler
> for the worker (it's really hard to do it via custom SIGTERM handler as
> visible on worker, launcher and walsender issues we are fixing).
> Obviously this is PG11 related thought.

Maybe it's also a sign that the bgworker infrastructure isn't really the
best thing to use for internal processes like parallel workers and lrep
processes - it's really built so core code *doesn't* know anything
specific about them, which isn't really what we want in some of these
cases.  That's not to say that bgworkers and parallelism/lrep shouldn't
share infrastructure, don't get me wrong.


>  
> - LogicalRepCtx->launcher_pid = 0;
> -
> - /* ... and if it returns, we're done */
> - ereport(DEBUG1,
> - (errmsg("logical replication launcher shutting down")));
> + /* Not reachable */
> +}

Maybe put a pg_unreachable() there?


> @@ -2848,6 +2849,14 @@ ProcessInterrupts(void)
>   ereport(FATAL,
>   (errcode(ERRCODE_ADMIN_SHUTDOWN),
>errmsg("terminating logical 
> replication worker due to administrator command")));
> + else if (IsLogicalLauncher())
> + {
> + ereport(DEBUG1,
> + (errmsg("logical replication launcher 
> shutting down")));
> +
> + /* The logical replication launcher can be stopped at 
> any time. */
> + proc_exit(0);
> + }

We could use PgBackendStatus->st_backendType for these, merging these
different paths.

I can take care of this one, if you/Peter want.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-06 Thread Petr Jelinek

On 03/06/17 18:53, Peter Eisentraut wrote:
> On 6/2/17 14:52, Peter Eisentraut wrote:
>> On 5/24/17 15:14, Petr Jelinek wrote:
>>> All the locking works just fine the way it is in master. The issue with
>>> deadlock with apply comes from the wrong handling of the SIGTERM in the
>>> apply (we didn't set InterruptPending). I changed the SIGTERM handler in
>>> patch 0001 just to die which is actually the correct behavior for apply
>>> workers. I also moved the connection cleanup code to the
>>> before_shmem_exit callback (similar to walreceiver) and now that part
>>> works correctly.
>>
>> I have committed this, in two separate parts.  This should fix the
>> originally reported issue.
>>
>> I will continue to work through your other patches.  I notice there is
>> still a bit of discussion about another patch, so please let me know if
>> there is anything else I should be looking for.
> 
> I have committed the remaining two patches.  I believe this fixes the
> originally reported issue.
> 

So the fact that we moved workers to standard interrupt handling broke
launcher in subtle ways because it still uses it's own SIGTERM handling
but some function it calls are using CHECK_FOR_INTERRUPTS (they are used
by worker as well). I think we need to move launcher to standard
interrupt handling as well. It's not same as other processes though as
it's allowed to be terminated any time (just like autovacuum launcher)
so we just proc_exit(0) instead of FATALing out.

This is related to the nightjar failures btw.

As a side note, we are starting to have several IsSomeTypeOfProcess
functions for these kind of things. I wonder if bgworker infrastructure
should somehow provide this type of stuff (the proposed bgw_type might
help there) as well as maybe being able to register interrupt handler
for the worker (it's really hard to do it via custom SIGTERM handler as
visible on worker, launcher and walsender issues we are fixing).
Obviously this is PG11 related thought.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services
From 3fb7ed57d669beba3feba894a11c9dcff8e60414 Mon Sep 17 00:00:00 2001
From: Petr Jelinek 
Date: Tue, 6 Jun 2017 19:26:12 +0200
Subject: [PATCH] Use standard interrupt handling in logical replication
 launcher

---
 src/backend/replication/logical/launcher.c | 41 --
 src/backend/tcop/postgres.c|  9 +++
 src/include/replication/logicallauncher.h  |  2 ++
 3 files changed, 27 insertions(+), 25 deletions(-)

diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 5aaf24b..52169f1 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -81,7 +81,6 @@ static void logicalrep_worker_cleanup(LogicalRepWorker *worker);
 
 /* Flags set by signal handlers */
 static volatile sig_atomic_t got_SIGHUP = false;
-static volatile sig_atomic_t got_SIGTERM = false;
 
 static bool on_commit_launcher_wakeup = false;
 
@@ -623,20 +622,6 @@ logicalrep_worker_onexit(int code, Datum arg)
 	ApplyLauncherWakeup();
 }
 
-/* SIGTERM: set flag to exit at next convenient time */
-static void
-logicalrep_launcher_sigterm(SIGNAL_ARGS)
-{
-	int			save_errno = errno;
-
-	got_SIGTERM = true;
-
-	/* Waken anything waiting on the process latch */
-	SetLatch(MyLatch);
-
-	errno = save_errno;
-}
-
 /* SIGHUP: set flag to reload configuration at next convenient time */
 static void
 logicalrep_launcher_sighup(SIGNAL_ARGS)
@@ -798,13 +783,14 @@ ApplyLauncherMain(Datum main_arg)
 
 	before_shmem_exit(logicalrep_launcher_onexit, (Datum) 0);
 
+	Assert(LogicalRepCtx->launcher_pid == 0);
+	LogicalRepCtx->launcher_pid = MyProcPid;
+
 	/* Establish signal handlers. */
 	pqsignal(SIGHUP, logicalrep_launcher_sighup);
-	pqsignal(SIGTERM, logicalrep_launcher_sigterm);
+	pqsignal(SIGTERM, die);
 	BackgroundWorkerUnblockSignals();
 
-	LogicalRepCtx->launcher_pid = MyProcPid;
-
 	/*
 	 * Establish connection to nailed catalogs (we only ever access
 	 * pg_subscription).
@@ -812,7 +798,7 @@ ApplyLauncherMain(Datum main_arg)
 	BackgroundWorkerInitializeConnection(NULL, NULL);
 
 	/* Enter main loop */
-	while (!got_SIGTERM)
+	for (;;)
 	{
 		int			rc;
 		List	   *sublist;
@@ -822,6 +808,8 @@ ApplyLauncherMain(Datum main_arg)
 		TimestampTz now;
 		long		wait_time = DEFAULT_NAPTIME_PER_CYCLE;
 
+		CHECK_FOR_INTERRUPTS();
+
 		now = GetCurrentTimestamp();
 
 		/* Limit the start retry to once a wal_retrieve_retry_interval */
@@ -894,13 +882,16 @@ ApplyLauncherMain(Datum main_arg)
 		ResetLatch(>procLatch);
 	}
 
-	LogicalRepCtx->launcher_pid = 0;
-
-	/* ... and if it returns, we're done */
-	ereport(DEBUG1,
-			(errmsg("logical replication launcher shutting down")));
+	/* Not reachable */
+}
 
-	proc_exit(0);
+/*
+ * Is current process the logical replication launcher?
+ */
+bool
+IsLogicalLauncher(void)
+{
+	return

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-06 Thread Masahiko Sawada

On Sun, Jun 4, 2017 at 1:53 AM, Peter Eisentraut
 wrote:
> On 6/2/17 14:52, Peter Eisentraut wrote:
>> On 5/24/17 15:14, Petr Jelinek wrote:
>>> All the locking works just fine the way it is in master. The issue with
>>> deadlock with apply comes from the wrong handling of the SIGTERM in the
>>> apply (we didn't set InterruptPending). I changed the SIGTERM handler in
>>> patch 0001 just to die which is actually the correct behavior for apply
>>> workers. I also moved the connection cleanup code to the
>>> before_shmem_exit callback (similar to walreceiver) and now that part
>>> works correctly.
>>
>> I have committed this, in two separate parts.  This should fix the
>> originally reported issue.
>>
>> I will continue to work through your other patches.  I notice there is
>> still a bit of discussion about another patch, so please let me know if
>> there is anything else I should be looking for.
>
> I have committed the remaining two patches.  I believe this fixes the
> originally reported issue.
>

IIUC the issue that sync worker could be orphaned and keep running
inside the long COPY is not fixed yet by commit
3c9bc2157a4f465b3c070d1250597568d2dc285f, and should be fixed. Am I
missing something?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-03 Thread Peter Eisentraut

On 6/2/17 14:52, Peter Eisentraut wrote:
> On 5/24/17 15:14, Petr Jelinek wrote:
>> All the locking works just fine the way it is in master. The issue with
>> deadlock with apply comes from the wrong handling of the SIGTERM in the
>> apply (we didn't set InterruptPending). I changed the SIGTERM handler in
>> patch 0001 just to die which is actually the correct behavior for apply
>> workers. I also moved the connection cleanup code to the
>> before_shmem_exit callback (similar to walreceiver) and now that part
>> works correctly.
> 
> I have committed this, in two separate parts.  This should fix the
> originally reported issue.
> 
> I will continue to work through your other patches.  I notice there is
> still a bit of discussion about another patch, so please let me know if
> there is anything else I should be looking for.

I have committed the remaining two patches.  I believe this fixes the
originally reported issue.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-02 Thread Peter Eisentraut

On 5/24/17 15:14, Petr Jelinek wrote:
> All the locking works just fine the way it is in master. The issue with
> deadlock with apply comes from the wrong handling of the SIGTERM in the
> apply (we didn't set InterruptPending). I changed the SIGTERM handler in
> patch 0001 just to die which is actually the correct behavior for apply
> workers. I also moved the connection cleanup code to the
> before_shmem_exit callback (similar to walreceiver) and now that part
> works correctly.

I have committed this, in two separate parts.  This should fix the
originally reported issue.

I will continue to work through your other patches.  I notice there is
still a bit of discussion about another patch, so please let me know if
there is anything else I should be looking for.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-01 Thread Noah Misch

On Thu, Jun 01, 2017 at 12:00:33AM -0400, Peter Eisentraut wrote:
> On 5/31/17 02:51, Noah Misch wrote:
> > On Tue, May 30, 2017 at 01:30:35AM +, Noah Misch wrote:
> >> On Thu, May 18, 2017 at 10:27:51PM -0400, Peter Eisentraut wrote:
> >>> On 5/18/17 11:11, Noah Misch wrote:
>  IMMEDIATE ATTENTION REQUIRED.  This PostgreSQL 10 open item is past due 
>  for
>  your status update.  Please reacquaint yourself with the policy on open 
>  item
>  ownership[1] and then reply immediately.  If I do not hear from you by
>  2017-05-19 16:00 UTC, I will transfer this item to release management 
>  team
>  ownership without further notice.
> >>>
> >>> There is no progress on this issue at the moment.  I will report again
> >>> next Wednesday.
> >>
> >> IMMEDIATE ATTENTION REQUIRED.  This PostgreSQL 10 open item is long past 
> >> due
> >> for your status update.  If I do not hear one from you by 2017-05-31 02:00
> >> UTC, I will transfer this item to release management team ownership without
> >> further notice.
> > 
> > This PostgreSQL 10 open item now needs a permanent owner.  Would any other
> > committer like to take ownership?  If this role interests you, please read
> > this thread and the policy linked above, then send an initial status update
> > bearing a date for your subsequent status update.  If the item does not 
> > have a
> > permanent owner by 2017-06-03 07:00 UTC, I will investigate feature removals
> > that would resolve the item.
> 
> It seems I lost track of this item between all the other ones.  I will
> continue to work on this item.  We have patches proposed and I will work
> on committing them until Friday.

If any other committer cares about logical replication features in v10, I'd
recommend he take ownership of this open item despite your plan to work on it.
Otherwise, if you miss fixing this on Friday, it will be too late for others
to volunteer.

> I think I now have updates posted on all my items.

As of your writing that, two of your open items had no conforming status
update, and they still don't:

- Background worker display in pg_stat_activity (logical replication especially)
- ALTER SUBSCRIPTION REFRESH and table sync worker


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-01 Thread Petr Jelinek

On 01/06/17 17:32, Masahiko Sawada wrote:
> On Thu, May 25, 2017 at 5:29 PM, tushar  wrote:
>> On 05/25/2017 12:44 AM, Petr Jelinek wrote:
>>>
>>> There is still outstanding issue that sync worker will keep running
>>> inside the long COPY because the invalidation messages are also not
>>> processed until it finishes but all the original issues reported here
>>> disappear for me with the attached patches applied.
>>
>> After applying all your patches, drop subscription no  more hangs while
>> dropping  subscription but there is an error   "ERROR:  tuple concurrently
>> updated" in the log file of
>> standby.
>>
> 
> I tried to reproduce this issue with latest four patches I submit but
> it didn't happen. I guess this issue doesn't related to the issues
> reported on this thread and another factor  might be cause of this
> issue. It would be good to test the same again with latest four
> patches or after solved current some open items.
> 

That's because your additional patch kills the workers that do the
concurrent update. While that's probably okay, I still plan to look into
making the subscription and state locking more robust.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-01 Thread Masahiko Sawada

On Thu, May 25, 2017 at 5:29 PM, tushar  wrote:
> On 05/25/2017 12:44 AM, Petr Jelinek wrote:
>>
>> There is still outstanding issue that sync worker will keep running
>> inside the long COPY because the invalidation messages are also not
>> processed until it finishes but all the original issues reported here
>> disappear for me with the attached patches applied.
>
> After applying all your patches, drop subscription no  more hangs while
> dropping  subscription but there is an error   "ERROR:  tuple concurrently
> updated" in the log file of
> standby.
>

I tried to reproduce this issue with latest four patches I submit but
it didn't happen. I guess this issue doesn't related to the issues
reported on this thread and another factor  might be cause of this
issue. It would be good to test the same again with latest four
patches or after solved current some open items.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-05-31 Thread Peter Eisentraut

On 5/31/17 02:51, Noah Misch wrote:
> On Tue, May 30, 2017 at 01:30:35AM +, Noah Misch wrote:
>> On Thu, May 18, 2017 at 10:27:51PM -0400, Peter Eisentraut wrote:
>>> On 5/18/17 11:11, Noah Misch wrote:
 IMMEDIATE ATTENTION REQUIRED.  This PostgreSQL 10 open item is past due for
 your status update.  Please reacquaint yourself with the policy on open 
 item
 ownership[1] and then reply immediately.  If I do not hear from you by
 2017-05-19 16:00 UTC, I will transfer this item to release management team
 ownership without further notice.
>>>
>>> There is no progress on this issue at the moment.  I will report again
>>> next Wednesday.
>>
>> IMMEDIATE ATTENTION REQUIRED.  This PostgreSQL 10 open item is long past due
>> for your status update.  If I do not hear one from you by 2017-05-31 02:00
>> UTC, I will transfer this item to release management team ownership without
>> further notice.
> 
> This PostgreSQL 10 open item now needs a permanent owner.  Would any other
> committer like to take ownership?  If this role interests you, please read
> this thread and the policy linked above, then send an initial status update
> bearing a date for your subsequent status update.  If the item does not have a
> permanent owner by 2017-06-03 07:00 UTC, I will investigate feature removals
> that would resolve the item.

It seems I lost track of this item between all the other ones.  I will
continue to work on this item.  We have patches proposed and I will work
on committing them until Friday.

I think I now have updates posted on all my items.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-05-31 Thread Masahiko Sawada

On Thu, May 25, 2017 at 4:14 AM, Petr Jelinek
 wrote:
> Hi,
>
> I finally had time to properly analyze this, and turns out we've been
> all just trying to fix symptoms and the actual problems.
>
> All the locking works just fine the way it is in master. The issue with
> deadlock with apply comes from the wrong handling of the SIGTERM in the
> apply (we didn't set InterruptPending). I changed the SIGTERM handler in
> patch 0001 just to die which is actually the correct behavior for apply
> workers. I also moved the connection cleanup code to the
> before_shmem_exit callback (similar to walreceiver) and now that part
> works correctly.
>
> The issue with orphaned sync workers is actually two separate issues.
> First, due to thinko we always searched for sync worker in
> wait_for_sync_status_change instead of searching for opposite worker as
> was the intention (i.e. sync worker should search for apply and apply
> should search for sync). Thats fixed by 0002. And second, we didn't
> accept any invalidation messages until the whole sync process finished
> (because it flattens all the remote transactions in the single one) so
> sync worker didn't learn about subscription changes/drop until it has
> finished, which I now fixed in 0003.
>
> There is still outstanding issue that sync worker will keep running
> inside the long COPY because the invalidation messages are also not
> processed until it finishes but all the original issues reported here
> disappear for me with the attached patches applied.
>

These patches conflict with current HEAD, I attached updated version patches.

Also, the issue that sync worker will keep running inside the long
COPY can lead the another problem that the user could not create new
subscription with some workers due to not enough free logical
replication worker slots until the long COPY finishes. Attached 0004
patch is the updated version patch I submitted before.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


0001-Fix-signal-handling-in-logical-workers.patch
Description: Binary data


0002-Make-tablesync-worker-exit-when-apply-dies-while-it-.patch
Description: Binary data


0003-Receive-invalidation-messages-correctly-in-tablesync.patch
Description: Binary data


0004-Wait-for-table-sync-worker-to-finish-when-apply-work.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-05-31 Thread Noah Misch

On Tue, May 30, 2017 at 01:30:35AM +, Noah Misch wrote:
> On Thu, May 18, 2017 at 10:27:51PM -0400, Peter Eisentraut wrote:
> > On 5/18/17 11:11, Noah Misch wrote:
> > > IMMEDIATE ATTENTION REQUIRED.  This PostgreSQL 10 open item is past due 
> > > for
> > > your status update.  Please reacquaint yourself with the policy on open 
> > > item
> > > ownership[1] and then reply immediately.  If I do not hear from you by
> > > 2017-05-19 16:00 UTC, I will transfer this item to release management team
> > > ownership without further notice.
> > 
> > There is no progress on this issue at the moment.  I will report again
> > next Wednesday.
> 
> IMMEDIATE ATTENTION REQUIRED.  This PostgreSQL 10 open item is long past due
> for your status update.  If I do not hear one from you by 2017-05-31 02:00
> UTC, I will transfer this item to release management team ownership without
> further notice.

This PostgreSQL 10 open item now needs a permanent owner.  Would any other
committer like to take ownership?  If this role interests you, please read
this thread and the policy linked above, then send an initial status update
bearing a date for your subsequent status update.  If the item does not have a
permanent owner by 2017-06-03 07:00 UTC, I will investigate feature removals
that would resolve the item.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-05-29 Thread Noah Misch

On Thu, May 18, 2017 at 10:27:51PM -0400, Peter Eisentraut wrote:
> On 5/18/17 11:11, Noah Misch wrote:
> > IMMEDIATE ATTENTION REQUIRED.  This PostgreSQL 10 open item is past due for
> > your status update.  Please reacquaint yourself with the policy on open item
> > ownership[1] and then reply immediately.  If I do not hear from you by
> > 2017-05-19 16:00 UTC, I will transfer this item to release management team
> > ownership without further notice.
> 
> There is no progress on this issue at the moment.  I will report again
> next Wednesday.

IMMEDIATE ATTENTION REQUIRED.  This PostgreSQL 10 open item is long past due
for your status update.  If I do not hear one from you by 2017-05-31 02:00
UTC, I will transfer this item to release management team ownership without
further notice.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-05-25 Thread tushar


On 05/25/2017 12:44 AM, Petr Jelinek wrote:

There is still outstanding issue that sync worker will keep running
inside the long COPY because the invalidation messages are also not
processed until it finishes but all the original issues reported here
disappear for me with the attached patches applied.
After applying all your patches, drop subscription no  more hangs while 
dropping  subscription but there is an error   "ERROR:  tuple 
concurrently updated" in the log file of

standby.

---
logical replication synchronization worker finished processing
2017-05-25 09:15:52.654 BST [18575] LOG:  logical replication 
synchronization worker finished processing
2017-05-25 09:15:52.656 BST [18563] LOG:  starting logical replication 
worker for subscription "sub"
2017-05-25 09:15:52.662 BST [18577] LOG:  logical replication sync for 
subscription sub, table t14 started
2017-05-25 09:15:53.657 BST [18563] LOG:  starting logical replication 
worker for subscription "sub"
2017-05-25 09:15:53.663 BST [18579] LOG:  logical replication sync for 
subscription sub, table t15 started
2017-05-25 09:15:53.724 BST [18563] FATAL:  terminating logical 
replication worker due to administrator command
2017-05-25 09:15:53.725 BST [18521] LOG:  worker process: logical 
replication worker for subscription 16684 (PID 18563) exited with exit 
code 1

2017-05-25 09:15:54.734 BST [18579] ERROR:  tuple concurrently updated
2017-05-25 09:15:54.735 BST [18577] ERROR:  tuple concurrently updated
2017-05-25 09:15:54.736 BST [18521] LOG:  worker process: logical 
replication worker for subscription 16684 sync 16426 (PID 18579) exited 
with exit code 1
2017-05-25 09:15:54.736 BST [18521] LOG:  worker process: logical 
replication worker for subscription 16684 sync 16423 (PID 18577) exited 
with exit code 1

~
~
~

Steps to reproduce -
X cluster -> create 100 tables , publish all tables  (create publication 
pub for all tables);
Y Cluster -> create 100 tables ,create subscription(create subscription 
sub connection 'user=centos host=localhost' publication pub;

Y cluster ->drop subscription - drop subscription sub;

check the log file on Y cluster.

Sometime , i have seen this error on psql prompt and drop subscription 
operation got failed at first attempt.


postgres=# drop subscription sub;
ERROR:  tuple concurrently updated
postgres=# drop subscription sub;
NOTICE:  dropped replication slot "sub" on publisher
DROP SUBSCRIPTION

--
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-05-24 Thread Masahiko Sawada

On Wed, May 24, 2017 at 3:14 PM, Petr Jelinek
 wrote:
> Hi,
>
> I finally had time to properly analyze this, and turns out we've been
> all just trying to fix symptoms and the actual problems.

Thank you for the patches!

> All the locking works just fine the way it is in master.
> The issue with
> deadlock with apply comes from the wrong handling of the SIGTERM in the
> apply (we didn't set InterruptPending). I changed the SIGTERM handler in
> patch 0001 just to die which is actually the correct behavior for apply
> workers. I also moved the connection cleanup code to the
> before_shmem_exit callback (similar to walreceiver) and now that part
> works correctly.
>
> The issue with orphaned sync workers is actually two separate issues.
> First, due to thinko we always searched for sync worker in
> wait_for_sync_status_change instead of searching for opposite worker as
> was the intention (i.e. sync worker should search for apply and apply
> should search for sync). Thats fixed by 0002. And second, we didn't
> accept any invalidation messages until the whole sync process finished
> (because it flattens all the remote transactions in the single one) so
> sync worker didn't learn about subscription changes/drop until it has
> finished, which I now fixed in 0003.
>
> There is still outstanding issue that sync worker will keep running
> inside the long COPY because the invalidation messages are also not
> processed until it finishes but all the original issues reported here
> disappear for me with the attached patches applied.
>

The issue reported on this thread seems to be solved with your patch.
But because DROP SUBSCRIPTION is only one DDL command that acquires
lock on system catalog with AccessExclusiveLock and the logical
replication mechanism is complex,  I'm concerned that there might be
another potential deadlock issue due to acquire lock on system catalog
with strong lock level. It might be good idea to do either
fine-grained locking or reducing lock level or both.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-05-24 Thread Petr Jelinek

Hi,

I finally had time to properly analyze this, and turns out we've been
all just trying to fix symptoms and the actual problems.

All the locking works just fine the way it is in master. The issue with
deadlock with apply comes from the wrong handling of the SIGTERM in the
apply (we didn't set InterruptPending). I changed the SIGTERM handler in
patch 0001 just to die which is actually the correct behavior for apply
workers. I also moved the connection cleanup code to the
before_shmem_exit callback (similar to walreceiver) and now that part
works correctly.

The issue with orphaned sync workers is actually two separate issues.
First, due to thinko we always searched for sync worker in
wait_for_sync_status_change instead of searching for opposite worker as
was the intention (i.e. sync worker should search for apply and apply
should search for sync). Thats fixed by 0002. And second, we didn't
accept any invalidation messages until the whole sync process finished
(because it flattens all the remote transactions in the single one) so
sync worker didn't learn about subscription changes/drop until it has
finished, which I now fixed in 0003.

There is still outstanding issue that sync worker will keep running
inside the long COPY because the invalidation messages are also not
processed until it finishes but all the original issues reported here
disappear for me with the attached patches applied.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


0003-Receive-invalidation-messages-correctly-in-tablesync.patch
Description: binary/octet-stream


0002-Make-tablesync-worker-exit-when-apply-dies-while-it-.patch
Description: binary/octet-stream


0001-Fix-signal-handling-in-logical-workers.patch
Description: binary/octet-stream

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-05-19 Thread Petr Jelinek

On 18/05/17 16:16, Robert Haas wrote:
> On Wed, May 17, 2017 at 6:58 AM, Masahiko Sawada  
> wrote:
>> I think the above changes can solve this issue but It seems to me that
>> holding AccessExclusiveLock on pg_subscription by DROP SUBSCRIPTION
>> until commit could lead another deadlock problem in the future. So I'd
>> to contrive ways to reduce lock level somehow if possible. For
>> example, if we change the apply launcher so that it gets the
>> subscription list only when pg_subscription gets invalid, apply
>> launcher cannot try to launch the apply worker being stopped. We
>> invalidate pg_subscription at commit of DROP SUBSCRIPTION and the
>> apply launcher can get new subscription list which doesn't include the
>> entry we removed. That way we can reduce lock level to
>> ShareUpdateExclusiveLock and solve this issue.
>> Also in your patch, we need to change DROP SUBSCRIPTION as well to
>> resolve another case I encountered, where DROP SUBSCRIPTION waits for
>> apply worker while holding a tuple lock on pg_subscription_rel and the
>> apply worker waits for same tuple on pg_subscription_rel in
>> SetSubscriptionRelState().
> 
> I don't really understand the issue being discussed here in any
> detail, but as a general point I'd say that it might be more
> productive to make the locks finer-grained rather than struggling to
> reduce the lock level.  For example, instead of locking all of
> pg_subscription, use LockSharedObject() to lock the individual
> subscription, still with AccessExclusiveLock.  That means that other
> accesses to that subscription also need to take a lock so that you
> actually get a conflict when there should be one, but that should be
> doable.  I expect that trying to manage locking conflicts using only
> catalog-wide locks is a doomed strategy.

We do LockSharedObject() but it's rather useless the way it's done now
as no other access locks it. We can't block all other accesses however,
the workers need to be able to access the catalog during clean shutdown
in some situations. What we need is to block starting of new workers for
that subscription so only those code paths would need to block. So I
think we might want to do both finer-grained locking and decreasing lock
level.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-05-18 Thread Kyotaro HORIGUCHI

Hello,

At Thu, 18 May 2017 10:16:35 -0400, Robert Haas  wrote 
in 
> On Wed, May 17, 2017 at 6:58 AM, Masahiko Sawada  
> wrote:
> > I think the above changes can solve this issue but It seems to me that
> > holding AccessExclusiveLock on pg_subscription by DROP SUBSCRIPTION
> > until commit could lead another deadlock problem in the future. So I'd
> > to contrive ways to reduce lock level somehow if possible. For
> > example, if we change the apply launcher so that it gets the
> > subscription list only when pg_subscription gets invalid, apply
> > launcher cannot try to launch the apply worker being stopped. We
> > invalidate pg_subscription at commit of DROP SUBSCRIPTION and the
> > apply launcher can get new subscription list which doesn't include the
> > entry we removed. That way we can reduce lock level to
> > ShareUpdateExclusiveLock and solve this issue.
> > Also in your patch, we need to change DROP SUBSCRIPTION as well to
> > resolve another case I encountered, where DROP SUBSCRIPTION waits for
> > apply worker while holding a tuple lock on pg_subscription_rel and the
> > apply worker waits for same tuple on pg_subscription_rel in
> > SetSubscriptionRelState().

Sorry, I don't have enough time to consider this
profoundly. Perhaps will return later.

> I don't really understand the issue being discussed here in any
> detail, but as a general point I'd say that it might be more
> productive to make the locks finer-grained rather than struggling to
> reduce the lock level.  For example, instead of locking all of
> pg_subscription, use LockSharedObject() to lock the individual
> subscription, still with AccessExclusiveLock.  That means that other
> accesses to that subscription also need to take a lock so that you
> actually get a conflict when there should be one, but that should be
> doable.  I expect that trying to manage locking conflicts using only
> catalog-wide locks is a doomed strategy.

Thank you for the suggestion. I think it is a bit differnt from
that. The problem here is that a replication worker may try
reading exactly the tuple for the subscription being deleted just
before responding to a received termination request. So the
finer-graind lock doesn't help.

The focus of resolving this is preventing blocking of workers
caused by DROP SUBSCRIPTION. So Sadasan's patch immediately
released the lock on pg_subscrption and uses another lock for
exclusion. My patch just give up to read the catalog when not
available.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-05-18 Thread Peter Eisentraut

On 5/18/17 11:11, Noah Misch wrote:
> IMMEDIATE ATTENTION REQUIRED.  This PostgreSQL 10 open item is past due for
> your status update.  Please reacquaint yourself with the policy on open item
> ownership[1] and then reply immediately.  If I do not hear from you by
> 2017-05-19 16:00 UTC, I will transfer this item to release management team
> ownership without further notice.

There is no progress on this issue at the moment.  I will report again
next Wednesday.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-05-18 Thread Noah Misch

On Mon, May 15, 2017 at 03:28:14AM +, Noah Misch wrote:
> On Mon, May 08, 2017 at 06:27:30PM +0900, Masahiko Sawada wrote:
> > I encountered a situation where DROP SUBSCRIPTION got stuck when
> > initial table sync is in progress. In my environment, I created
> > several tables with some data on publisher. I created subscription on
> > subscriber and drop subscription immediately after that. It doesn't
> > always happen but I often encountered it on my environment.
> > 
> > ps -x command shows the following.
> > 
> >  96796 ?Ss 0:00 postgres: masahiko postgres [local] DROP
> > SUBSCRIPTION
> >  96801 ?Ts 0:00 postgres: bgworker: logical replication
> > worker for subscription 40993waiting
> >  96805 ?Ss 0:07 postgres: bgworker: logical replication
> > worker for subscription 40993 sync 16418
> >  96806 ?Ss 0:01 postgres: wal sender process masahiko [local] 
> > idle
> >  96807 ?Ss 0:00 postgres: bgworker: logical replication
> > worker for subscription 40993 sync 16421
> >  96808 ?Ss 0:00 postgres: wal sender process masahiko [local] 
> > idle
> > 
> > The DROP SUBSCRIPTION process (pid 96796) is waiting for the apply
> > worker process (pid 96801) to stop while holding a lock on
> > pg_subscription_rel. On the other hand the apply worker is waiting for
> > acquiring a tuple lock on pg_subscription_rel needed for heap_update.
> > Also table sync workers (pid 96805 and 96807) are waiting for the
> > apply worker process to change their status.
> > 
> > Also, even when DROP SUBSCRIPTION is done successfully, the table sync
> > worker can be orphaned because I guess that the apply worker can exit
> > before change status of table sync worker.
> > 
> > I'm using 1f30295.
> 
> [Action required within three days.  This is a generic notification.]
> 
> The above-described topic is currently a PostgreSQL 10 open item.  Peter,
> since you committed the patch believed to have created it, you own this open
> item.  If some other commit is more relevant or if this does not belong as a
> v10 open item, please let us know.  Otherwise, please observe the policy on
> open item ownership[1] and send a status update within three calendar days of
> this message.  Include a date for your subsequent status update.  Testers may
> discover new open items at any time, and I want to plan to get them all fixed
> well in advance of shipping v10.  Consequently, I will appreciate your efforts
> toward speedy resolution.  Thanks.
> 
> [1] 
> https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

IMMEDIATE ATTENTION REQUIRED.  This PostgreSQL 10 open item is past due for
your status update.  Please reacquaint yourself with the policy on open item
ownership[1] and then reply immediately.  If I do not hear from you by
2017-05-19 16:00 UTC, I will transfer this item to release management team
ownership without further notice.

[1] 
https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-05-18 Thread Robert Haas

On Wed, May 17, 2017 at 6:58 AM, Masahiko Sawada  wrote:
> I think the above changes can solve this issue but It seems to me that
> holding AccessExclusiveLock on pg_subscription by DROP SUBSCRIPTION
> until commit could lead another deadlock problem in the future. So I'd
> to contrive ways to reduce lock level somehow if possible. For
> example, if we change the apply launcher so that it gets the
> subscription list only when pg_subscription gets invalid, apply
> launcher cannot try to launch the apply worker being stopped. We
> invalidate pg_subscription at commit of DROP SUBSCRIPTION and the
> apply launcher can get new subscription list which doesn't include the
> entry we removed. That way we can reduce lock level to
> ShareUpdateExclusiveLock and solve this issue.
> Also in your patch, we need to change DROP SUBSCRIPTION as well to
> resolve another case I encountered, where DROP SUBSCRIPTION waits for
> apply worker while holding a tuple lock on pg_subscription_rel and the
> apply worker waits for same tuple on pg_subscription_rel in
> SetSubscriptionRelState().

I don't really understand the issue being discussed here in any
detail, but as a general point I'd say that it might be more
productive to make the locks finer-grained rather than struggling to
reduce the lock level.  For example, instead of locking all of
pg_subscription, use LockSharedObject() to lock the individual
subscription, still with AccessExclusiveLock.  That means that other
accesses to that subscription also need to take a lock so that you
actually get a conflict when there should be one, but that should be
doable.  I expect that trying to manage locking conflicts using only
catalog-wide locks is a doomed strategy.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-05-17 Thread Masahiko Sawada

On Mon, May 15, 2017 at 8:02 PM, Kyotaro HORIGUCHI
 wrote:
> Hello,
>
> At Fri, 12 May 2017 17:24:07 +0900, Masahiko Sawada  
> wrote in 
>> On Fri, May 12, 2017 at 11:24 AM, Masahiko Sawada  
>> wrote:
>> > On Thu, May 11, 2017 at 6:16 PM, Petr Jelinek
>> >  wrote:
>> >> On 11/05/17 10:10, Masahiko Sawada wrote:
>> >>> On Thu, May 11, 2017 at 4:06 PM, Michael Paquier
>> >>>  wrote:
>>  On Wed, May 10, 2017 at 11:57 AM, Masahiko Sawada 
>>   wrote:
>> > Barring any objections, I'll add these two issues to open item.
>> 
>>  It seems to me that those open items have not been added yet to the
>>  list. If I am following correctly, they could be defined as follows:
>>  - Dropping subscription may stuck if done during tablesync.
>>  -- Analyze deadlock issues with DROP SUBSCRIPTION and apply worker 
>>  process.
>> >>
>> >> I think the solution to this is to reintroduce the LWLock that was
>> >> removed and replaced with the exclusive lock on catalog [1]. I am afraid
>> >> that correct way of handling this is to do both LWLock and catalog lock
>> >> (first LWLock under which we kill the workers and then catalog lock so
>> >> that something that prevents launcher from restarting them is held till
>> >> the end of transaction).
>> >
>> > I agree to reintroduce LWLock and to stop logical rep worker first and
>> > then modify catalog. That way we can reduce catalog lock level (maybe
>> > to RowExclusiveLock) so that apply worker can see it. Also I think
>> > that we need to do more things like in order to prevent that we keep
>> > to hold LWLock until end of transaction, because holding LWLock until
>> > end of transaction is not good idea and could be cause of deadlock. So
>> > for example we can commit the transaction in DropSubscription after
>> > cleaned pg_subscription record and all its dependencies and then start
>> > new transaction for the remaining work. Of course we also need to
>> > disallow DROP SUBSCRIPTION being executed in a user transaction
>> > though.
>>
>> Attached two draft patches to solve these issues.
>>
>> Attached 0001 patch reintroduces LogicalRepLauncherLock and makes DROP
>> SUBSCRIPTION keep holding it until commit. To prevent from deadlock
>> possibility, I disallowed DROP SUBSCRIPTION being called in a
>> transaction block. But there might be more sensible solution for this.
>> please give me feedback.
>
> +* Protect against launcher restarting the worker. This lock will
> +* be released at commit.
>
> This is wrong. COMMIT doesn't release left-over LWLocks, only
> ABORT does (precisely, it seems intended to fire on ERRORs). So
> with this patch, the second DROP SUBSCRIPTION is stuck on the
> LWLock acquired at the first time. And as Petr said, LWLock with
> such a duration seems bad.

Oh I understood. Thank you for pointing out.

>
> The cause seems to be that workers ignore sigterm on certain
> conditions. One of the choke points is GetSubscription, the other
> is get_subscription_list. I think we can treat the both cases
> without LWLocks.
>
> The attached patch does that.
>
> - heap_close + UnlockRelationOid in get_subscription_list() is
>   equivalent to one heap_close or relation_close but I took seeming
>   symmetricity.
>
> - 0.5 seconds for the sleep in ApplyWorkerMain is quite
>   arbitrary. NAPTIME_PER_CYCLE * 1000 could be used instead.
>
> - NULL MySubscription without SIGTERM might not need to be an
>   ERROR.
>
> Any more thoughts?

I think the above changes can solve this issue but It seems to me that
holding AccessExclusiveLock on pg_subscription by DROP SUBSCRIPTION
until commit could lead another deadlock problem in the future. So I'd
to contrive ways to reduce lock level somehow if possible. For
example, if we change the apply launcher so that it gets the
subscription list only when pg_subscription gets invalid, apply
launcher cannot try to launch the apply worker being stopped. We
invalidate pg_subscription at commit of DROP SUBSCRIPTION and the
apply launcher can get new subscription list which doesn't include the
entry we removed. That way we can reduce lock level to
ShareUpdateExclusiveLock and solve this issue.
Also in your patch, we need to change DROP SUBSCRIPTION as well to
resolve another case I encountered, where DROP SUBSCRIPTION waits for
apply worker while holding a tuple lock on pg_subscription_rel and the
apply worker waits for same tuple on pg_subscription_rel in
SetSubscriptionRelState().

>
>
> FYI, I reproduced the situation by the following steps. This
> effectively reproduced the situation without delay insertion for
> me.
>
> # Creating 5 tables with 10 rows on the publisher
> create table t1 (a int);
> ...
> create table t5 (a int);
> insert into t1 (select *

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-05-15 Thread Kyotaro HORIGUCHI

Hello,

At Fri, 12 May 2017 17:24:07 +0900, Masahiko Sawada  
wrote in 
> On Fri, May 12, 2017 at 11:24 AM, Masahiko Sawada  
> wrote:
> > On Thu, May 11, 2017 at 6:16 PM, Petr Jelinek
> >  wrote:
> >> On 11/05/17 10:10, Masahiko Sawada wrote:
> >>> On Thu, May 11, 2017 at 4:06 PM, Michael Paquier
> >>>  wrote:
>  On Wed, May 10, 2017 at 11:57 AM, Masahiko Sawada 
>   wrote:
> > Barring any objections, I'll add these two issues to open item.
> 
>  It seems to me that those open items have not been added yet to the
>  list. If I am following correctly, they could be defined as follows:
>  - Dropping subscription may stuck if done during tablesync.
>  -- Analyze deadlock issues with DROP SUBSCRIPTION and apply worker 
>  process.
> >>
> >> I think the solution to this is to reintroduce the LWLock that was
> >> removed and replaced with the exclusive lock on catalog [1]. I am afraid
> >> that correct way of handling this is to do both LWLock and catalog lock
> >> (first LWLock under which we kill the workers and then catalog lock so
> >> that something that prevents launcher from restarting them is held till
> >> the end of transaction).
> >
> > I agree to reintroduce LWLock and to stop logical rep worker first and
> > then modify catalog. That way we can reduce catalog lock level (maybe
> > to RowExclusiveLock) so that apply worker can see it. Also I think
> > that we need to do more things like in order to prevent that we keep
> > to hold LWLock until end of transaction, because holding LWLock until
> > end of transaction is not good idea and could be cause of deadlock. So
> > for example we can commit the transaction in DropSubscription after
> > cleaned pg_subscription record and all its dependencies and then start
> > new transaction for the remaining work. Of course we also need to
> > disallow DROP SUBSCRIPTION being executed in a user transaction
> > though.
> 
> Attached two draft patches to solve these issues.
> 
> Attached 0001 patch reintroduces LogicalRepLauncherLock and makes DROP
> SUBSCRIPTION keep holding it until commit. To prevent from deadlock
> possibility, I disallowed DROP SUBSCRIPTION being called in a
> transaction block. But there might be more sensible solution for this.
> please give me feedback.

+* Protect against launcher restarting the worker. This lock will
+* be released at commit.

This is wrong. COMMIT doesn't release left-over LWLocks, only
ABORT does (precisely, it seems intended to fire on ERRORs). So
with this patch, the second DROP SUBSCRIPTION is stuck on the
LWLock acquired at the first time. And as Petr said, LWLock with
such a duration seems bad.

The cause seems to be that workers ignore sigterm on certain
conditions. One of the choke points is GetSubscription, the other
is get_subscription_list. I think we can treat the both cases
without LWLocks.

The attached patch does that.

- heap_close + UnlockRelationOid in get_subscription_list() is
  equivalent to one heap_close or relation_close but I took seeming
  symmetricity.

- 0.5 seconds for the sleep in ApplyWorkerMain is quite
  arbitrary. NAPTIME_PER_CYCLE * 1000 could be used instead.

- NULL MySubscription without SIGTERM might not need to be an
  ERROR.

Any more thoughts?


FYI, I reproduced the situation by the following steps. This
effectively reproduced the situation without delay insertion for
me.

# Creating 5 tables with 10 rows on the publisher
create table t1 (a int);
...
create table t5 (a int);
insert into t1 (select * from generate_series(0, 9) a);
...
insert into t5 (select * from generate_series(0, 9) a);
create publication p1 for table t1, t2, t3, t4, t5;


# Subscribe them, wait 1sec, then unsbscribe.
create table t1 (a int);
...
create table t5 (a int);
truncate t1, t2, t3, t4, t5; create subscription s1 CONNECTION 'host=/tmp 
port=5432 dbname=postgres' publication p1; select pg_sleep(1); drop 
subscription s1;

Repeated test can be performed by repeatedly enter the last line.

>  -- Avoid orphaned tablesync worker if apply worker exits before
>  changing its status.
> >>>
> >>
> >> The behavior question I have about this is if sync workers should die
> >> when apply worker dies (ie they are tied to apply worker) or if they
> >> should be tied to the subscription.
> >>
> >> I guess taking down all the sync workers when apply worker has exited is
> >> easier to solve. Of course it means that if apply worker restarts in
> >> middle of table synchronization, the table synchronization will have to
> >> start from scratch. That being said, in normal operation apply worker
> >> should only exit/restart if subscription has changed or has been
> >> dropped/disabled and I think sync workers want to exit/restart in that
> >> situation as

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-05-14 Thread Noah Misch

On Mon, May 08, 2017 at 06:27:30PM +0900, Masahiko Sawada wrote:
> I encountered a situation where DROP SUBSCRIPTION got stuck when
> initial table sync is in progress. In my environment, I created
> several tables with some data on publisher. I created subscription on
> subscriber and drop subscription immediately after that. It doesn't
> always happen but I often encountered it on my environment.
> 
> ps -x command shows the following.
> 
>  96796 ?Ss 0:00 postgres: masahiko postgres [local] DROP
> SUBSCRIPTION
>  96801 ?Ts 0:00 postgres: bgworker: logical replication
> worker for subscription 40993waiting
>  96805 ?Ss 0:07 postgres: bgworker: logical replication
> worker for subscription 40993 sync 16418
>  96806 ?Ss 0:01 postgres: wal sender process masahiko [local] idle
>  96807 ?Ss 0:00 postgres: bgworker: logical replication
> worker for subscription 40993 sync 16421
>  96808 ?Ss 0:00 postgres: wal sender process masahiko [local] idle
> 
> The DROP SUBSCRIPTION process (pid 96796) is waiting for the apply
> worker process (pid 96801) to stop while holding a lock on
> pg_subscription_rel. On the other hand the apply worker is waiting for
> acquiring a tuple lock on pg_subscription_rel needed for heap_update.
> Also table sync workers (pid 96805 and 96807) are waiting for the
> apply worker process to change their status.
> 
> Also, even when DROP SUBSCRIPTION is done successfully, the table sync
> worker can be orphaned because I guess that the apply worker can exit
> before change status of table sync worker.
> 
> I'm using 1f30295.

[Action required within three days.  This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item.  Peter,
since you committed the patch believed to have created it, you own this open
item.  If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know.  Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message.  Include a date for your subsequent status update.  Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10.  Consequently, I will appreciate your efforts
toward speedy resolution.  Thanks.

[1] 
https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-05-12 Thread Masahiko Sawada

On Fri, May 12, 2017 at 11:24 AM, Masahiko Sawada  wrote:
> On Thu, May 11, 2017 at 6:16 PM, Petr Jelinek
>  wrote:
>> On 11/05/17 10:10, Masahiko Sawada wrote:
>>> On Thu, May 11, 2017 at 4:06 PM, Michael Paquier
>>>  wrote:
 On Wed, May 10, 2017 at 11:57 AM, Masahiko Sawada  
 wrote:
> Barring any objections, I'll add these two issues to open item.

 It seems to me that those open items have not been added yet to the
 list. If I am following correctly, they could be defined as follows:
 - Dropping subscription may stuck if done during tablesync.
 -- Analyze deadlock issues with DROP SUBSCRIPTION and apply worker process.
>>
>> I think the solution to this is to reintroduce the LWLock that was
>> removed and replaced with the exclusive lock on catalog [1]. I am afraid
>> that correct way of handling this is to do both LWLock and catalog lock
>> (first LWLock under which we kill the workers and then catalog lock so
>> that something that prevents launcher from restarting them is held till
>> the end of transaction).
>
> I agree to reintroduce LWLock and to stop logical rep worker first and
> then modify catalog. That way we can reduce catalog lock level (maybe
> to RowExclusiveLock) so that apply worker can see it. Also I think
> that we need to do more things like in order to prevent that we keep
> to hold LWLock until end of transaction, because holding LWLock until
> end of transaction is not good idea and could be cause of deadlock. So
> for example we can commit the transaction in DropSubscription after
> cleaned pg_subscription record and all its dependencies and then start
> new transaction for the remaining work. Of course we also need to
> disallow DROP SUBSCRIPTION being executed in a user transaction
> though.

Attached two draft patches to solve these issues.

Attached 0001 patch reintroduces LogicalRepLauncherLock and makes DROP
SUBSCRIPTION keep holding it until commit. To prevent from deadlock
possibility, I disallowed DROP SUBSCRIPTION being called in a
transaction block. But there might be more sensible solution for this.
please give me feedback.

>
>>
 -- Avoid orphaned tablesync worker if apply worker exits before
 changing its status.
>>>
>>
>> The behavior question I have about this is if sync workers should die
>> when apply worker dies (ie they are tied to apply worker) or if they
>> should be tied to the subscription.
>>
>> I guess taking down all the sync workers when apply worker has exited is
>> easier to solve. Of course it means that if apply worker restarts in
>> middle of table synchronization, the table synchronization will have to
>> start from scratch. That being said, in normal operation apply worker
>> should only exit/restart if subscription has changed or has been
>> dropped/disabled and I think sync workers want to exit/restart in that
>> situation as well.
>
> I agree that sync workers are tied to the apply worker.
>
>>
>> So for example having shmem detach hook for an apply worker (or reusing
>> the existing one) that searches for all the other workers for same
>> subscription and shuts them down as well sounds like solution to this.
>
> Seems reasonable solution.
>

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


0001-Fix-a-deadlock-bug-between-DROP-SUBSCRIPTION-and-app.patch
Description: Binary data


0002-Wait-for-table-sync-worker-to-finish-when-apply-work.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-05-11 Thread Masahiko Sawada

On Thu, May 11, 2017 at 6:16 PM, Petr Jelinek
 wrote:
> On 11/05/17 10:10, Masahiko Sawada wrote:
>> On Thu, May 11, 2017 at 4:06 PM, Michael Paquier
>>  wrote:
>>> On Wed, May 10, 2017 at 11:57 AM, Masahiko Sawada  
>>> wrote:
 Barring any objections, I'll add these two issues to open item.
>>>
>>> It seems to me that those open items have not been added yet to the
>>> list. If I am following correctly, they could be defined as follows:
>>> - Dropping subscription may stuck if done during tablesync.
>>> -- Analyze deadlock issues with DROP SUBSCRIPTION and apply worker process.
>
> I think the solution to this is to reintroduce the LWLock that was
> removed and replaced with the exclusive lock on catalog [1]. I am afraid
> that correct way of handling this is to do both LWLock and catalog lock
> (first LWLock under which we kill the workers and then catalog lock so
> that something that prevents launcher from restarting them is held till
> the end of transaction).

I agree to reintroduce LWLock and to stop logical rep worker first and
then modify catalog. That way we can reduce catalog lock level (maybe
to RowExclusiveLock) so that apply worker can see it. Also I think
that we need to do more things like in order to prevent that we keep
to hold LWLock until end of transaction, because holding LWLock until
end of transaction is not good idea and could be cause of deadlock. So
for example we can commit the transaction in DropSubscription after
cleaned pg_subscription record and all its dependencies and then start
new transaction for the remaining work. Of course we also need to
disallow DROP SUBSCRIPTION being executed in a user transaction
though.

>
>>> -- Avoid orphaned tablesync worker if apply worker exits before
>>> changing its status.
>>
>
> The behavior question I have about this is if sync workers should die
> when apply worker dies (ie they are tied to apply worker) or if they
> should be tied to the subscription.
>
> I guess taking down all the sync workers when apply worker has exited is
> easier to solve. Of course it means that if apply worker restarts in
> middle of table synchronization, the table synchronization will have to
> start from scratch. That being said, in normal operation apply worker
> should only exit/restart if subscription has changed or has been
> dropped/disabled and I think sync workers want to exit/restart in that
> situation as well.

I agree that sync workers are tied to the apply worker.

>
> So for example having shmem detach hook for an apply worker (or reusing
> the existing one) that searches for all the other workers for same
> subscription and shuts them down as well sounds like solution to this.

Seems reasonable solution.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-05-11 Thread Petr Jelinek

On 11/05/17 10:10, Masahiko Sawada wrote:
> On Thu, May 11, 2017 at 4:06 PM, Michael Paquier
>  wrote:
>> On Wed, May 10, 2017 at 11:57 AM, Masahiko Sawada  
>> wrote:
>>> Barring any objections, I'll add these two issues to open item.
>>
>> It seems to me that those open items have not been added yet to the
>> list. If I am following correctly, they could be defined as follows:
>> - Dropping subscription may stuck if done during tablesync.
>> -- Analyze deadlock issues with DROP SUBSCRIPTION and apply worker process.

I think the solution to this is to reintroduce the LWLock that was
removed and replaced with the exclusive lock on catalog [1]. I am afraid
that correct way of handling this is to do both LWLock and catalog lock
(first LWLock under which we kill the workers and then catalog lock so
that something that prevents launcher from restarting them is held till
the end of transaction).

>> -- Avoid orphaned tablesync worker if apply worker exits before
>> changing its status.
> 

The behavior question I have about this is if sync workers should die
when apply worker dies (ie they are tied to apply worker) or if they
should be tied to the subscription.

I guess taking down all the sync workers when apply worker has exited is
easier to solve. Of course it means that if apply worker restarts in
middle of table synchronization, the table synchronization will have to
start from scratch. That being said, in normal operation apply worker
should only exit/restart if subscription has changed or has been
dropped/disabled and I think sync workers want to exit/restart in that
situation as well.

So for example having shmem detach hook for an apply worker (or reusing
the existing one) that searches for all the other workers for same
subscription and shuts them down as well sounds like solution to this.

[1]
https://www.postgresql.org/message-id/cahgqgwhpi8ky-yanffe0sgmhktsycqltnkx07bw9s7-rn17...@mail.gmail.com

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-05-11 Thread Masahiko Sawada

On Thu, May 11, 2017 at 4:06 PM, Michael Paquier
 wrote:
> On Wed, May 10, 2017 at 11:57 AM, Masahiko Sawada  
> wrote:
>> Barring any objections, I'll add these two issues to open item.
>
> It seems to me that those open items have not been added yet to the
> list. If I am following correctly, they could be defined as follows:
> - Dropping subscription may stuck if done during tablesync.
> -- Analyze deadlock issues with DROP SUBSCRIPTION and apply worker process.
> -- Avoid orphaned tablesync worker if apply worker exits before
> changing its status.

Thanks, I think correct. Added it to open item.

>
> I am playing with the code to look at both of them... But feel free to
> update this thread if I don't show up. There are no test cases, but
> some well-placed pg_usleep calls should make both issues easily
> reproducible. I have the gut feeling that other things are hidden
> behind though.

I'm also working on this, so will update it if there is.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-05-11 Thread Michael Paquier

On Wed, May 10, 2017 at 11:57 AM, Masahiko Sawada  wrote:
> Barring any objections, I'll add these two issues to open item.

It seems to me that those open items have not been added yet to the
list. If I am following correctly, they could be defined as follows:
- Dropping subscription may stuck if done during tablesync.
-- Analyze deadlock issues with DROP SUBSCRIPTION and apply worker process.
-- Avoid orphaned tablesync worker if apply worker exits before
changing its status.

I am playing with the code to look at both of them... But feel free to
update this thread if I don't show up. There are no test cases, but
some well-placed pg_usleep calls should make both issues easily
reproducible. I have the gut feeling that other things are hidden
behind though.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-05-09 Thread Masahiko Sawada

On Wed, May 10, 2017 at 2:46 AM, Masahiko Sawada  wrote:
> On Mon, May 8, 2017 at 8:42 PM, Petr Jelinek
>  wrote:
>> On 08/05/17 11:27, Masahiko Sawada wrote:
>>> Hi,
>>>
>>> I encountered a situation where DROP SUBSCRIPTION got stuck when
>>> initial table sync is in progress. In my environment, I created
>>> several tables with some data on publisher. I created subscription on
>>> subscriber and drop subscription immediately after that. It doesn't
>>> always happen but I often encountered it on my environment.
>>>
>>> ps -x command shows the following.
>>>
>>>  96796 ?Ss 0:00 postgres: masahiko postgres [local] DROP
>>> SUBSCRIPTION
>>>  96801 ?Ts 0:00 postgres: bgworker: logical replication
>>> worker for subscription 40993waiting
>>>  96805 ?Ss 0:07 postgres: bgworker: logical replication
>>> worker for subscription 40993 sync 16418
>>>  96806 ?Ss 0:01 postgres: wal sender process masahiko [local] 
>>> idle
>>>  96807 ?Ss 0:00 postgres: bgworker: logical replication
>>> worker for subscription 40993 sync 16421
>>>  96808 ?Ss 0:00 postgres: wal sender process masahiko [local] 
>>> idle
>>>
>>> The DROP SUBSCRIPTION process (pid 96796) is waiting for the apply
>>> worker process (pid 96801) to stop while holding a lock on
>>> pg_subscription_rel. On the other hand the apply worker is waiting for
>>> acquiring a tuple lock on pg_subscription_rel needed for heap_update.
>>> Also table sync workers (pid 96805 and 96807) are waiting for the
>>> apply worker process to change their status.
>>>
>>
>> Looks like we should kill apply before dropping dependencies.
>
> Sorry, after investigated I found out that DROP SUBSCRIPTION process
> is holding AccessExclusiveLock on pg_subscription (, not
> pg_subscription_rel) and apply worker is waiting for acquiring a lock
> on it.

Hmm it seems there are two cases. One is that the apply worker waits
to acquire AccessShareLock on pg_subscription but DropSubscription
already acquired AcessExclusiveLock on it and waits for the apply
worker to finish. Another case is that the apply worker waits to
acquire a tuple lock on pg_subscrption_rel but DropSubscription (maybe
droppoing dependencies) already acquired it.

> So I guess that the dropping dependencies are not relevant with
> this.  It seems to me that the main cause is that DROP SUBSCRIPTION
> waits for apply worker to finish while keeping to hold
> AccessExclusiveLock on pg_subscription. Perhaps we need to contrive
> ways to reduce lock level somehow.
>
>>
>>> Also, even when DROP SUBSCRIPTION is done successfully, the table sync
>>> worker can be orphaned because I guess that the apply worker can exit
>>> before change status of table sync worker.
>>
>> Well the tablesync worker should stop itself if the subscription got
>> removed, but of course again the dependencies are an issue, so we should
>> probably kill those explicitly as well.
>
> Yeah, I think that we should ensure that the apply worker exits after
> killed all involved table sync workers.
>

Barring any objections, I'll add these two issues to open item.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-05-09 Thread Masahiko Sawada

On Mon, May 8, 2017 at 8:42 PM, Petr Jelinek
 wrote:
> On 08/05/17 11:27, Masahiko Sawada wrote:
>> Hi,
>>
>> I encountered a situation where DROP SUBSCRIPTION got stuck when
>> initial table sync is in progress. In my environment, I created
>> several tables with some data on publisher. I created subscription on
>> subscriber and drop subscription immediately after that. It doesn't
>> always happen but I often encountered it on my environment.
>>
>> ps -x command shows the following.
>>
>>  96796 ?Ss 0:00 postgres: masahiko postgres [local] DROP
>> SUBSCRIPTION
>>  96801 ?Ts 0:00 postgres: bgworker: logical replication
>> worker for subscription 40993waiting
>>  96805 ?Ss 0:07 postgres: bgworker: logical replication
>> worker for subscription 40993 sync 16418
>>  96806 ?Ss 0:01 postgres: wal sender process masahiko [local] 
>> idle
>>  96807 ?Ss 0:00 postgres: bgworker: logical replication
>> worker for subscription 40993 sync 16421
>>  96808 ?Ss 0:00 postgres: wal sender process masahiko [local] 
>> idle
>>
>> The DROP SUBSCRIPTION process (pid 96796) is waiting for the apply
>> worker process (pid 96801) to stop while holding a lock on
>> pg_subscription_rel. On the other hand the apply worker is waiting for
>> acquiring a tuple lock on pg_subscription_rel needed for heap_update.
>> Also table sync workers (pid 96805 and 96807) are waiting for the
>> apply worker process to change their status.
>>
>
> Looks like we should kill apply before dropping dependencies.

Sorry, after investigated I found out that DROP SUBSCRIPTION process
is holding AccessExclusiveLock on pg_subscription (, not
pg_subscription_rel) and apply worker is waiting for acquiring a lock
on it. So I guess that the dropping dependencies are not relevant with
this.  It seems to me that the main cause is that DROP SUBSCRIPTION
waits for apply worker to finish while keeping to hold
AccessExclusiveLock on pg_subscription. Perhaps we need to contrive
ways to reduce lock level somehow.

>
>> Also, even when DROP SUBSCRIPTION is done successfully, the table sync
>> worker can be orphaned because I guess that the apply worker can exit
>> before change status of table sync worker.
>
> Well the tablesync worker should stop itself if the subscription got
> removed, but of course again the dependencies are an issue, so we should
> probably kill those explicitly as well.

Yeah, I think that we should ensure that the apply worker exits after
killed all involved table sync workers.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-05-08 Thread Masahiko Sawada

On Mon, May 8, 2017 at 8:53 PM, Erik Rijkers  wrote:
> On 2017-05-08 13:13, Masahiko Sawada wrote:
>>
>> On Mon, May 8, 2017 at 7:14 PM, Erik Rijkers  wrote:
>>>
>>> On 2017-05-08 11:27, Masahiko Sawada wrote:


>
>>>
>>> FWIW, running
>>>
 0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch+
 0002-WIP-Possibly-more-robust-snapbuild-approach.patch +
 fix-statistics-reporting-in-logical-replication-work.patch
>>>
>>>
>>> (on top of 44c528810)
>>
>>
>> Thanks, which thread are these patches attached on?
>>
>
> The first two patches are here:
> https://www.postgresql.org/message-id/20170505004237.edtahvrwb3uwd5rs%40alap3.anarazel.de
>
> and last one:
> https://www.postgresql.org/message-id/22cc402c-88eb-fa35-217f-0060db2c72f0%402ndquadrant.com
>
> ( I have to include that last one or my tests fail within minutes. )

Thank you! I will look at these patches.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-05-08 Thread Erik Rijkers


On 2017-05-08 13:13, Masahiko Sawada wrote:

On Mon, May 8, 2017 at 7:14 PM, Erik Rijkers  wrote:

On 2017-05-08 11:27, Masahiko Sawada wrote:






FWIW, running


0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch+
0002-WIP-Possibly-more-robust-snapbuild-approach.patch +
fix-statistics-reporting-in-logical-replication-work.patch


(on top of 44c528810)


Thanks, which thread are these patches attached on?



The first two patches are here:
https://www.postgresql.org/message-id/20170505004237.edtahvrwb3uwd5rs%40alap3.anarazel.de

and last one:
https://www.postgresql.org/message-id/22cc402c-88eb-fa35-217f-0060db2c72f0%402ndquadrant.com

( I have to include that last one or my tests fail within minutes. )


Erik Rijkers


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-05-08 Thread Petr Jelinek

On 08/05/17 11:27, Masahiko Sawada wrote:
> Hi,
> 
> I encountered a situation where DROP SUBSCRIPTION got stuck when
> initial table sync is in progress. In my environment, I created
> several tables with some data on publisher. I created subscription on
> subscriber and drop subscription immediately after that. It doesn't
> always happen but I often encountered it on my environment.
> 
> ps -x command shows the following.
> 
>  96796 ?Ss 0:00 postgres: masahiko postgres [local] DROP
> SUBSCRIPTION
>  96801 ?Ts 0:00 postgres: bgworker: logical replication
> worker for subscription 40993waiting
>  96805 ?Ss 0:07 postgres: bgworker: logical replication
> worker for subscription 40993 sync 16418
>  96806 ?Ss 0:01 postgres: wal sender process masahiko [local] idle
>  96807 ?Ss 0:00 postgres: bgworker: logical replication
> worker for subscription 40993 sync 16421
>  96808 ?Ss 0:00 postgres: wal sender process masahiko [local] idle
> 
> The DROP SUBSCRIPTION process (pid 96796) is waiting for the apply
> worker process (pid 96801) to stop while holding a lock on
> pg_subscription_rel. On the other hand the apply worker is waiting for
> acquiring a tuple lock on pg_subscription_rel needed for heap_update.
> Also table sync workers (pid 96805 and 96807) are waiting for the
> apply worker process to change their status.
> 

Looks like we should kill apply before dropping dependencies.

> Also, even when DROP SUBSCRIPTION is done successfully, the table sync
> worker can be orphaned because I guess that the apply worker can exit
> before change status of table sync worker.

Well the tablesync worker should stop itself if the subscription got
removed, but of course again the dependencies are an issue, so we should
probably kill those explicitly as well.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-05-08 Thread Masahiko Sawada

On Mon, May 8, 2017 at 7:14 PM, Erik Rijkers  wrote:
> On 2017-05-08 11:27, Masahiko Sawada wrote:
>>
>> Hi,
>>
>> I encountered a situation where DROP SUBSCRIPTION got stuck when
>> initial table sync is in progress. In my environment, I created
>> several tables with some data on publisher. I created subscription on
>> subscriber and drop subscription immediately after that. It doesn't
>> always happen but I often encountered it on my environment.
>>
>> ps -x command shows the following.
>>
>>  96796 ?Ss 0:00 postgres: masahiko postgres [local] DROP
>> SUBSCRIPTION
>>  96801 ?Ts 0:00 postgres: bgworker: logical replication
>> worker for subscription 40993waiting
>>  96805 ?Ss 0:07 postgres: bgworker: logical replication
>> worker for subscription 40993 sync 16418
>>  96806 ?Ss 0:01 postgres: wal sender process masahiko [local]
>> idle
>>  96807 ?Ss 0:00 postgres: bgworker: logical replication
>> worker for subscription 40993 sync 16421
>>  96808 ?Ss 0:00 postgres: wal sender process masahiko [local]
>> idle
>>
>
> FWIW, running
>
>> 0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch+
>> 0002-WIP-Possibly-more-robust-snapbuild-approach.patch +
>> fix-statistics-reporting-in-logical-replication-work.patch
>
> (on top of 44c528810)

Thanks, which thread are these patches attached on?

>
> I have encountered the same condition as well in the last few days, a few
> times (I think 2 or 3 times).

IIUC there are two issues; one is that the deadlock can happen between
the DROP SUBSCRIPTION and the apply worker process, another one is the
table sync worker can be orphaned if the apply worker exits before
changing status. The latter might relate to another issue reported by
Jeff[1].

[1] 
https://www.postgresql.org/message-id/CAMkU%3D1xUJKs%3D2etq2K7bmbY51Q7g853HLxJ7qEB2Snog9oRvDw%40mail.gmail.com

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-05-08 Thread Erik Rijkers


On 2017-05-08 11:27, Masahiko Sawada wrote:

Hi,

I encountered a situation where DROP SUBSCRIPTION got stuck when
initial table sync is in progress. In my environment, I created
several tables with some data on publisher. I created subscription on
subscriber and drop subscription immediately after that. It doesn't
always happen but I often encountered it on my environment.

ps -x command shows the following.

 96796 ?Ss 0:00 postgres: masahiko postgres [local] DROP
SUBSCRIPTION
 96801 ?Ts 0:00 postgres: bgworker: logical replication
worker for subscription 40993waiting
 96805 ?Ss 0:07 postgres: bgworker: logical replication
worker for subscription 40993 sync 16418
 96806 ?Ss 0:01 postgres: wal sender process masahiko 
[local] idle

 96807 ?Ss 0:00 postgres: bgworker: logical replication
worker for subscription 40993 sync 16421
 96808 ?Ss 0:00 postgres: wal sender process masahiko 
[local] idle




FWIW, running


0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch+
0002-WIP-Possibly-more-robust-snapbuild-approach.patch +
fix-statistics-reporting-in-logical-replication-work.patch

(on top of 44c528810)

I have encountered the same condition as well in the last few days, a 
few times (I think 2 or 3 times).


Erik Rijkers


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-05-08 Thread Masahiko Sawada

Hi,

I encountered a situation where DROP SUBSCRIPTION got stuck when
initial table sync is in progress. In my environment, I created
several tables with some data on publisher. I created subscription on
subscriber and drop subscription immediately after that. It doesn't
always happen but I often encountered it on my environment.

ps -x command shows the following.

 96796 ?Ss 0:00 postgres: masahiko postgres [local] DROP
SUBSCRIPTION
 96801 ?Ts 0:00 postgres: bgworker: logical replication
worker for subscription 40993waiting
 96805 ?Ss 0:07 postgres: bgworker: logical replication
worker for subscription 40993 sync 16418
 96806 ?Ss 0:01 postgres: wal sender process masahiko [local] idle
 96807 ?Ss 0:00 postgres: bgworker: logical replication
worker for subscription 40993 sync 16421
 96808 ?Ss 0:00 postgres: wal sender process masahiko [local] idle

The DROP SUBSCRIPTION process (pid 96796) is waiting for the apply
worker process (pid 96801) to stop while holding a lock on
pg_subscription_rel. On the other hand the apply worker is waiting for
acquiring a tuple lock on pg_subscription_rel needed for heap_update.
Also table sync workers (pid 96805 and 96807) are waiting for the
apply worker process to change their status.

Also, even when DROP SUBSCRIPTION is done successfully, the table sync
worker can be orphaned because I guess that the apply worker can exit
before change status of table sync worker.

I'm using 1f30295.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

73 matches

Mail list logo