Re: [HACKERS] tablesync patch broke the assumption that logical rep depends on?

2017-04-27 Thread Neha Khatri
On Thu, Apr 27, 2017 at 4:01 AM, Tom Lane  wrote:

> Jeff Janes  writes:
> > This gives me compiler warning:
> > launcher.c: In function 'logicalrep_worker_launch':
> > launcher.c:257: warning: 'slot' may be used uninitialized in this
> function
>
> Yeah, me too.  Fix pushed.


Somewhat off the mark, but related to the warning above: would you get a
similar warning for "address" in ProcessUtilitySlow() in utility.c?

Regards,
Neha


Re: [HACKERS] tablesync patch broke the assumption that logical rep depends on?

2017-04-26 Thread Tom Lane
Jeff Janes  writes:
> This gives me compiler warning:
> launcher.c: In function 'logicalrep_worker_launch':
> launcher.c:257: warning: 'slot' may be used uninitialized in this function

Yeah, me too.  Fix pushed.

regards, tom lane




Re: [HACKERS] tablesync patch broke the assumption that logical rep depends on?

2017-04-26 Thread Jeff Janes
On Wed, Apr 26, 2017 at 8:00 AM, Peter Eisentraut <peter.eisentr...@2ndquadrant.com> wrote:

> On 4/25/17 15:42, Petr Jelinek wrote:
> >> Here is the patch doing just that.
> >
> > And one more revision, which also checks in_use when attaching to shared
> > memory. This is mainly to improve the user-visible behavior in the
> > theoretical corner case where the worker is supposed to be cleaned up but
> > actually still manages to start (it would still fail even without this
> > change, but the error message would be more obscure).
>
> Committed that, with some editorializing.
>

This gives me compiler warning:

launcher.c: In function 'logicalrep_worker_launch':
launcher.c:257: warning: 'slot' may be used uninitialized in this function

gcc version 4.4.7 20120313 (Red Hat 4.4.7-18) (GCC)
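
For reference, this class of warning comes from a pattern like the
following, where the variable is only assigned inside a conditional in a
loop, and gcc 4.4 cannot prove it is set on every path that later reads it
(a simplified sketch, not the actual launcher.c code; initializing the
variable up front, e.g. slot = -1, is the usual way to silence it):

    static int
    find_free_slot(bool *in_use, int nslots)
    {
        int     slot;       /* gcc: "may be used uninitialized" */
        bool    found = false;
        int     i;

        for (i = 0; i < nslots; i++)
        {
            if (!in_use[i])
            {
                slot = i;   /* the only assignment is conditional */
                found = true;
            }
        }

        if (!found)
            return -1;

        return slot;        /* safe, but the compiler cannot see that */
    }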


Cheers,

Jeff


Re: [HACKERS] tablesync patch broke the assumption that logical rep depends on?

2017-04-26 Thread Peter Eisentraut
On 4/25/17 15:42, Petr Jelinek wrote:
>> Here is the patch doing just that.
> 
> And one more revision, which also checks in_use when attaching to shared
> memory. This is mainly to improve the user-visible behavior in the
> theoretical corner case where the worker is supposed to be cleaned up but
> actually still manages to start (it would still fail even without this
> change, but the error message would be more obscure).

Committed that, with some editorializing.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] tablesync patch broke the assumption that logical rep depends on?

2017-04-25 Thread Petr Jelinek
On 22/04/17 22:09, Petr Jelinek wrote:
> On 21/04/17 16:31, Petr Jelinek wrote:
>> On 21/04/17 16:23, Peter Eisentraut wrote:
>>> On 4/21/17 10:11, Petr Jelinek wrote:
 On 21/04/17 16:09, Peter Eisentraut wrote:
> On 4/20/17 14:29, Petr Jelinek wrote:
>> +/* Find unused worker slot. */
>> +if (!w->in_use)
>>  {
>> -worker = &LogicalRepCtx->workers[slot];
>> -break;
>> +worker = w;
>> +slot = i;
>> +}
>
> Doesn't this still need a break?  Otherwise it always picks the last slot.
>

 Yes, it will pick the last slot. Does that matter, though? Is the first
 one better somehow?

 We can't break because we also need to continue the counter (I think the
 issue that the counter solves is probably just theoretical, but still).
>>>
>>> I see.  I think the code would be less confusing if we break the loop
>>> like before and call logicalrep_sync_worker_count() separately.
>>>
 Hmm actually, maybe the if (!w->in_use) should be if (worker == NULL &&
 !w->in_use)?
>>>
>>> That would also do it.  But it's getting a bit fiddly.
>>>
>>
>> I just wanted to avoid looping twice, especially since the
>> garbage-collecting code also has to do the loop. I guess I'll go with my
>> original coding for this, then: put a retry label above the loop, try to
>> find a worker slot, call logicalrep_sync_worker_count if one is found, and
>> otherwise do the garbage collection and goto retry if we cleaned something
>> up.
>>
> 
> Here is the patch doing just that.
> 

And one more revision, which also checks in_use when attaching to shared
memory. This is mainly to improve the user-visible behavior in the
theoretical corner case where the worker is supposed to be cleaned up but
actually still manages to start (it would still fail even without this
change, but the error message would be more obscure).
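
For illustration, the check on the attach side presumably looks something
like this (a rough sketch; the exact error code and message are guesses,
not necessarily what was committed):

    /* Sketch: worker attaching to its designated slot. */
    LWLockAcquire(LogicalRepWorkerLock, LW_EXCLUSIVE);

    MyLogicalRepWorker = &LogicalRepCtx->workers[slot];

    if (!MyLogicalRepWorker->in_use)
    {
        LWLockRelease(LogicalRepWorkerLock);
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("logical replication worker slot %d is empty, cannot attach",
                        slot)));
    }

    MyLogicalRepWorker->proc = MyProc;

    LWLockRelease(LogicalRepWorkerLock);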

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


Fix-various-concurrency-issues-in-logical-replicatiov4.patch
Description: binary/octet-stream



Re: [HACKERS] tablesync patch broke the assumption that logical rep depends on?

2017-04-22 Thread Petr Jelinek
On 21/04/17 16:31, Petr Jelinek wrote:
> On 21/04/17 16:23, Peter Eisentraut wrote:
>> On 4/21/17 10:11, Petr Jelinek wrote:
>>> On 21/04/17 16:09, Peter Eisentraut wrote:
 On 4/20/17 14:29, Petr Jelinek wrote:
> + /* Find unused worker slot. */
> + if (!w->in_use)
>   {
> - worker = &LogicalRepCtx->workers[slot];
> - break;
> + worker = w;
> + slot = i;
> + }

 Doesn't this still need a break?  Otherwise it always picks the last slot.

>>>
>>> Yes, it will pick the last slot. Does that matter, though? Is the first
>>> one better somehow?
>>>
>>> We can't break because we also need to continue the counter (I think the
>>> issue that the counter solves is probably just theoretical, but still).
>>
>> I see.  I think the code would be less confusing if we break the loop
>> like before and call logicalrep_sync_worker_count() separately.
>>
>>> Hmm actually, maybe the if (!w->in_use) should be if (worker == NULL &&
>>> !w->in_use)?
>>
>> That would also do it.  But it's getting a bit fiddly.
>>
> 
> I just wanted to avoid looping twice, especially since the
> garbage-collecting code also has to do the loop. I guess I'll go with my
> original coding for this, then: put a retry label above the loop, try to
> find a worker slot, call logicalrep_sync_worker_count if one is found, and
> otherwise do the garbage collection and goto retry if we cleaned something
> up.
> 

Here is the patch doing just that.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services
From fd16e40c870fc08b76ba823e75d6126361bb0ca6 Mon Sep 17 00:00:00 2001
From: Petr Jelinek 
Date: Wed, 19 Apr 2017 03:48:53 +0200
Subject: [PATCH] Fix various concurrency issues in logical replication worker
 launching

The code was originally written with the assumption that the launcher is
the only process starting workers. However, that hasn't been true since
commit 7c4f52409, which failed to modify the worker management code
adequately.

This patch adds an in_use field to the LogicalRepWorker struct to indicate
whether the worker slot is being used, and uses proper locking everywhere
this flag is set or read.

However, if the parent process dies while the new worker is starting and
the new worker fails to attach to shared memory, this flag would never get
cleared. We solve this rare corner case by adding a sort of garbage
collector for in_use slots. This uses another field in the
LogicalRepWorker struct, launch_time, which contains the timestamp of when
the worker was started. If a request to start a new worker does not find a
free slot, we check for workers that were supposed to start but took too
long to actually do so, and reuse their slots.

In passing, also fix possible race conditions when stopping a worker that
hasn't finished starting yet.
---
 src/backend/replication/logical/launcher.c | 162 +++--
 src/include/replication/worker_internal.h  |   9 ++
 2 files changed, 138 insertions(+), 33 deletions(-)

diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index f6ae610..b55ac2a 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -38,6 +38,7 @@
 #include "replication/logicallauncher.h"
 #include "replication/logicalworker.h"
 #include "replication/slot.h"
+#include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 
 #include "storage/ipc.h"
@@ -76,6 +77,7 @@ static void ApplyLauncherWakeup(void);
 static void logicalrep_launcher_onexit(int code, Datum arg);
 static void logicalrep_worker_onexit(int code, Datum arg);
 static void logicalrep_worker_detach(void);
+static void logicalrep_worker_cleanup(LogicalRepWorker *worker);
 
 /* Flags set by signal handlers */
 volatile sig_atomic_t got_SIGHUP = false;
@@ -154,15 +156,19 @@ get_subscription_list(void)
 /*
  * Wait for a background worker to start up and attach to the shmem context.
  *
- * This is like WaitForBackgroundWorkerStartup(), except that we wait for
- * attaching, not just start and we also just exit if postmaster died.
+ * This is only needed for cleaning up the shared memory in case the worker
+ * fails to attach.
  */
-static bool
+static void
 WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
                                BackgroundWorkerHandle *handle)
 {
BgwHandleStatus status;
int rc;
+   uint16  generation;
+
+   /* Remember generation for future identification. */
+   generation = worker->generation;
 
for (;;)
{
@@ -170,18 +176,29 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
 
CHECK_FOR_INTERRUPTS();
 
+   LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
+
+

Re: [HACKERS] tablesync patch broke the assumption that logical rep depends on?

2017-04-21 Thread Petr Jelinek
On 21/04/17 16:23, Peter Eisentraut wrote:
> On 4/21/17 10:11, Petr Jelinek wrote:
>> On 21/04/17 16:09, Peter Eisentraut wrote:
>>> On 4/20/17 14:29, Petr Jelinek wrote:
 +  /* Find unused worker slot. */
 +  if (!w->in_use)
{
 -  worker = &LogicalRepCtx->workers[slot];
 -  break;
 +  worker = w;
 +  slot = i;
 +  }
>>>
>>> Doesn't this still need a break?  Otherwise it always picks the last slot.
>>>
>>
>> Yes, it will pick the last slot. Does that matter, though? Is the first
>> one better somehow?
>>
>> We can't break because we also need to continue the counter (I think the
>> issue that the counter solves is probably just theoretical, but still).
> 
> I see.  I think the code would be less confusing if we break the loop
> like before and call logicalrep_sync_worker_count() separately.
> 
>> Hmm actually, maybe the if (!w->in_use) should be if (worker == NULL &&
>> !w->in_use)?
> 
> That would also do it.  But it's getting a bit fiddly.
> 

I just wanted to avoid looping twice, especially since the
garbage-collecting code also has to do the loop. I guess I'll go with my
original coding for this, then: put a retry label above the loop, try to
find a worker slot, call logicalrep_sync_worker_count if one is found, and
otherwise do the garbage collection and goto retry if we cleaned something
up.
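
A rough sketch of that control flow, for illustration only (the helper name
is made up; this is not the committed patch):

    LWLockAcquire(LogicalRepWorkerLock, LW_EXCLUSIVE);

retry:
    /* Try to find a free worker slot. */
    worker = NULL;
    for (i = 0; i < max_logical_replication_workers; i++)
    {
        LogicalRepWorker *w = &LogicalRepCtx->workers[i];

        if (!w->in_use)
        {
            worker = w;
            slot = i;
            break;
        }
    }

    if (worker != NULL)
    {
        /* Found a slot: enforce the per-subscription sync worker limit. */
        nsyncworkers = logicalrep_sync_worker_count(subid);
    }
    else
    {
        /* No free slot: reap slots whose worker never attached, then retry. */
        if (cleanup_stale_worker_slots())   /* hypothetical helper */
            goto retry;
    }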

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] tablesync patch broke the assumption that logical rep depends on?

2017-04-21 Thread Peter Eisentraut
On 4/21/17 10:11, Petr Jelinek wrote:
> On 21/04/17 16:09, Peter Eisentraut wrote:
>> On 4/20/17 14:29, Petr Jelinek wrote:
>>> +   /* Find unused worker slot. */
>>> +   if (!w->in_use)
>>> {
>>> -   worker = &LogicalRepCtx->workers[slot];
>>> -   break;
>>> +   worker = w;
>>> +   slot = i;
>>> +   }
>>
>> Doesn't this still need a break?  Otherwise it always picks the last slot.
>>
> 
> Yes, it will pick the last slot. Does that matter, though? Is the first
> one better somehow?
> 
> We can't break because we also need to continue the counter (I think the
> issue that the counter solves is probably just theoretical, but still).

I see.  I think the code would be less confusing if we break the loop
like before and call logicalrep_sync_worker_count() separately.

> Hmm actually, maybe the if (!w->in_use) should be if (worker == NULL &&
> !w->in_use)?

That would also do it.  But it's getting a bit fiddly.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] tablesync patch broke the assumption that logical rep depends on?

2017-04-21 Thread Petr Jelinek
On 21/04/17 16:09, Peter Eisentraut wrote:
> On 4/20/17 14:29, Petr Jelinek wrote:
>> +/* Find unused worker slot. */
>> +if (!w->in_use)
>>  {
>> -worker = &LogicalRepCtx->workers[slot];
>> -break;
>> +worker = w;
>> +slot = i;
>> +}
> 
> Doesn't this still need a break?  Otherwise it always picks the last slot.
> 

Yes, it will pick the last slot. Does that matter, though? Is the first
one better somehow?

We can't break because we also need to continue the counter (I think the
issue that the counter solves is probably just theoretical, but still).

Hmm actually, maybe the if (!w->in_use) should be if (worker == NULL &&
!w->in_use)?
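
Concretely, that alternative would make the loop remember only the first
free slot while still visiting every entry for the counter, roughly (a
sketch only; the inline counting is an assumption about the surrounding
loop):

    for (i = 0; i < max_logical_replication_workers; i++)
    {
        LogicalRepWorker *w = &LogicalRepCtx->workers[i];

        /* Remember the first unused slot, but do not break out... */
        if (worker == NULL && !w->in_use)
        {
            worker = w;
            slot = i;
        }

        /* ...so that every entry is still visited for the counter. */
        if (w->in_use && w->subid == subid && OidIsValid(w->relid))
            nsyncworkers++;
    }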

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] tablesync patch broke the assumption that logical rep depends on?

2017-04-21 Thread Peter Eisentraut
On 4/20/17 22:24, Noah Misch wrote:
> On Mon, Apr 17, 2017 at 02:09:54PM -0400, Peter Eisentraut wrote:
>> I think we're not really sure yet what to do about this.  Discussion is
>> ongoing.  I'll report back on Wednesday.
> 
> This PostgreSQL 10 open item is past due for your status update, and this is
> overall the seventh time you have allowed one of your open items to go out
> of compliance.  Kindly start complying with the policy on open item ownership:
> https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

We are actively working on it.  It should be resolved within a few days.
Next check-in Monday.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] tablesync patch broke the assumption that logical rep depends on?

2017-04-21 Thread Peter Eisentraut
On 4/20/17 14:29, Petr Jelinek wrote:
> + /* Find unused worker slot. */
> + if (!w->in_use)
>   {
> - worker = &LogicalRepCtx->workers[slot];
> - break;
> + worker = w;
> + slot = i;
> + }

Doesn't this still need a break?  Otherwise it always picks the last slot.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] tablesync patch broke the assumption that logical rep depends on?

2017-04-20 Thread Noah Misch
On Mon, Apr 17, 2017 at 02:09:54PM -0400, Peter Eisentraut wrote:
> I think we're not really sure yet what to do about this.  Discussion is
> ongoing.  I'll report back on Wednesday.

This PostgreSQL 10 open item is past due for your status update, and this is
overall the seventh time you have allowed one of your open items to go out
of compliance.  Kindly start complying with the policy on open item ownership:
https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

At a bare minimum, send a status update within 24 hours, and include a date
for your subsequent status update.




Re: [HACKERS] tablesync patch broke the assumption that logical rep depends on?

2017-04-20 Thread Petr Jelinek
On 20/04/17 18:58, Peter Eisentraut wrote:
> On 4/18/17 22:13, Petr Jelinek wrote:
>> So my idea was to add some kind of in_use flag. This turned out to be a bit
>> more complicated in terms of how to clean it up than I would have hoped,
>> because there is no way to reliably tell whether a worker has failed to
>> start if the parent worker crashed while waiting.
>>
>> My solution to that is to use logic similar to autovacuum's, where we use a
>> timeout for the worker to attach to shmem. We do this only if no free slot
>> is found when the launch of a replication worker is requested.
> 
> It looks like launch_time is never set to the current time in your patch.
> 

Oops, fixed.
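
Presumably the fix just stamps the slot when it is handed out, something
along these lines (a guess at the shape, not the actual commit):

    /* Sketch: in logicalrep_worker_launch(), when claiming a free slot. */
    worker->launch_time = GetCurrentTimestamp();
    worker->in_use = true;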

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services
From 331cf79d58deff37d2697e1e1fe594c55a92cd42 Mon Sep 17 00:00:00 2001
From: Petr Jelinek 
Date: Wed, 19 Apr 2017 03:48:53 +0200
Subject: [PATCH] Fix various concurrency issues in logical replication worker
 launching

The code was originally written with the assumption that the launcher is
the only process starting workers. However, that hasn't been true since
commit 7c4f52409, which failed to modify the worker management code
adequately.

This patch adds an in_use field to the LogicalRepWorker struct to indicate
whether the worker slot is being used, and uses proper locking everywhere
this flag is set or read.

However, if the parent process dies while the new worker is starting and
the new worker fails to attach to shared memory, this flag would never get
cleared. We solve this rare corner case by adding a sort of garbage
collector for in_use slots. This uses another field in the
LogicalRepWorker struct, launch_time, which contains the timestamp of when
the worker was started. If a request to start a new worker does not find a
free slot, we check for workers that were supposed to start but took too
long to actually do so, and reuse their slots.

In passing, also fix possible race conditions when stopping a worker that
hasn't finished starting yet.
---
 src/backend/replication/logical/launcher.c | 166 +++--
 src/include/replication/worker_internal.h  |   9 ++
 2 files changed, 140 insertions(+), 35 deletions(-)

diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index f6ae610..761fbfa 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -38,6 +38,7 @@
 #include "replication/logicallauncher.h"
 #include "replication/logicalworker.h"
 #include "replication/slot.h"
+#include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 
 #include "storage/ipc.h"
@@ -76,6 +77,7 @@ static void ApplyLauncherWakeup(void);
 static void logicalrep_launcher_onexit(int code, Datum arg);
 static void logicalrep_worker_onexit(int code, Datum arg);
 static void logicalrep_worker_detach(void);
+static void logicalrep_worker_cleanup(LogicalRepWorker *worker);
 
 /* Flags set by signal handlers */
 volatile sig_atomic_t got_SIGHUP = false;
@@ -154,15 +156,19 @@ get_subscription_list(void)
 /*
  * Wait for a background worker to start up and attach to the shmem context.
  *
- * This is like WaitForBackgroundWorkerStartup(), except that we wait for
- * attaching, not just start and we also just exit if postmaster died.
+ * This is only needed for cleaning up the shared memory in case the worker
+ * fails to attach.
  */
-static bool
+static void
 WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
                                BackgroundWorkerHandle *handle)
 {
BgwHandleStatus status;
int rc;
+   uint16  generation;
+
+   /* Remember generation for future identification. */
+   generation = worker->generation;
 
for (;;)
{
@@ -170,18 +176,29 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
 
CHECK_FOR_INTERRUPTS();
 
+   LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
+
+   /* Worker either died or started, no need to do anything. */
+   if (!worker->in_use || worker->proc)
+   {
+   LWLockRelease(LogicalRepWorkerLock);
+   return;
+   }
+
+   LWLockRelease(LogicalRepWorkerLock);
+
+   /* Check if worker has died before attaching and clean up after it. */
status = GetBackgroundWorkerPid(handle, &pid);
 
-   /*
-* Worker started and attached to our shmem. This check is safe
-* because only launcher ever starts the workers, so nobody can steal
-* the worker slot.
-*/
-   if (status == BGWH_STARTED && worker->proc)
-   return true;
-   /* Worker didn't start or died before attaching to our shmem. */
  

Re: [HACKERS] tablesync patch broke the assumption that logical rep depends on?

2017-04-20 Thread Peter Eisentraut
On 4/18/17 22:13, Petr Jelinek wrote:
> So my idea was to add some kind of in_use flag. This turned out to be a bit
> more complicated in terms of how to clean it up than I would have hoped,
> because there is no way to reliably tell whether a worker has failed to
> start if the parent worker crashed while waiting.
> 
> My solution to that is to use logic similar to autovacuum's, where we use a
> timeout for the worker to attach to shmem. We do this only if no free slot
> is found when the launch of a replication worker is requested.

It looks like launch_time is never set to the current time in your patch.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] tablesync patch broke the assumption that logical rep depends on?

2017-04-18 Thread Petr Jelinek
On 17/04/17 20:09, Peter Eisentraut wrote:
> On 4/16/17 22:01, Noah Misch wrote:
>> This PostgreSQL 10 open item is past due for your status update.  Kindly send
>> a status update within 24 hours, and include a date for your subsequent 
>> status
>> update.  Refer to the policy on open item ownership:
>> https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
> 
> I think we're not really sure yet what to do about this.  Discussion is
> ongoing.  I'll report back on Wednesday.
> 

So my idea was to add some kind of in_use flag. This turned out to be a bit
more complicated in terms of how to clean it up than I would have hoped,
because there is no way to reliably tell whether a worker has failed to
start if the parent worker crashed while waiting.

My solution to that is to use logic similar to autovacuum's, where we use a
timeout for the worker to attach to shmem. We do this only if no free slot
is found when the launch of a replication worker is requested.
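
A rough sketch of that garbage collection, under the assumption that the
attach timeout reuses wal_receiver_timeout (names and details are
illustrative, not the committed code):

    /* Sketch: reclaim slots that were claimed but whose worker never
     * attached (proc still NULL) within the timeout. */
    TimestampTz now = GetCurrentTimestamp();
    bool        did_cleanup = false;

    for (i = 0; i < max_logical_replication_workers; i++)
    {
        LogicalRepWorker *w = &LogicalRepCtx->workers[i];

        if (w->in_use && w->proc == NULL &&
            TimestampDifferenceExceeds(w->launch_time, now,
                                       wal_receiver_timeout))
        {
            logicalrep_worker_cleanup(w);
            did_cleanup = true;     /* lets the caller retry the slot search */
        }
    }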

While working on this patch I also noticed other subtle concurrency
issues and fixed them as well: stopping a worker that hasn't finished
starting yet wasn't completely concurrency-safe, and limiting sync
workers per subscription theoretically wasn't either (although I don't
think it could happen in practice).

I do wonder now, though, whether it's such a good idea to keep
BackgroundWorkerHandle private to bgworker.c, given that this is the
3rd time (twice before outside of postgres core) I have had to write a
generation mechanism similar to the one it uses to uniquely identify a
bgworker from the process that started it. It would have been much
easier if I could just save the BackgroundWorkerHandle itself in shmem
so it could be used across processes, instead of having to reinvent it
every time.
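
The hand-rolled equivalent amounts to stamping each slot with a generation
counter and remembering it in the launching process, roughly (sketch only):

    /* When handing out a slot (under LogicalRepWorkerLock): */
    worker->generation++;
    worker->in_use = true;
    generation = worker->generation;    /* remembered by the launching process */

    /* Later, e.g. while waiting for the worker to attach: */
    if (worker->generation != generation || !worker->in_use)
        return;     /* slot was reclaimed and possibly reused; stop waiting */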

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services
From 6d32379cda7d9f69c42185b9684d32570554f4e3 Mon Sep 17 00:00:00 2001
From: Petr Jelinek 
Date: Wed, 19 Apr 2017 03:48:53 +0200
Subject: [PATCH] Fix various concurrency issues in logical replication worker
 launching

The code was originally written with the assumption that the launcher is
the only process starting workers. However, that hasn't been true since
commit 7c4f52409, which failed to modify the worker management code
adequately.

This patch adds an in_use field to the LogicalRepWorker struct to indicate
whether the worker slot is being used, and uses proper locking everywhere
this flag is set or read.

However, if the parent process dies while the new worker is starting and
the new worker fails to attach to shared memory, this flag would never get
cleared. We solve this rare corner case by adding a sort of garbage
collector for in_use slots. This uses another field in the
LogicalRepWorker struct, launch_time, which contains the timestamp of when
the worker was started. If a request to start a new worker does not find a
free slot, we check for workers that were supposed to start but took too
long to actually do so, and reuse their slots.

In passing, also fix possible race conditions when stopping a worker that
hasn't finished starting yet.
---
 src/backend/replication/logical/launcher.c | 167 +++--
 src/include/replication/worker_internal.h  |   9 ++
 2 files changed, 141 insertions(+), 35 deletions(-)

diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index f6ae610..41f8cfe 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -38,6 +38,7 @@
 #include "replication/logicallauncher.h"
 #include "replication/logicalworker.h"
 #include "replication/slot.h"
+#include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 
 #include "storage/ipc.h"
@@ -76,6 +77,7 @@ static void ApplyLauncherWakeup(void);
 static void logicalrep_launcher_onexit(int code, Datum arg);
 static void logicalrep_worker_onexit(int code, Datum arg);
 static void logicalrep_worker_detach(void);
+static void logicalrep_worker_cleanup(LogicalRepWorker *worker);
 
 /* Flags set by signal handlers */
 volatile sig_atomic_t got_SIGHUP = false;
@@ -154,15 +156,19 @@ get_subscription_list(void)
 /*
  * Wait for a background worker to start up and attach to the shmem context.
  *
- * This is like WaitForBackgroundWorkerStartup(), except that we wait for
- * attaching, not just start and we also just exit if postmaster died.
+ * This is only needed for cleaning up the shared memory in case the worker
+ * fails to attach.
  */
-static bool
+static void
 WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
                                BackgroundWorkerHandle *handle)
 {
BgwHandleStatus status;
int rc;
+   uint16  generation;
+
+   /* Remember generation for future identification. */
+   generation = worker->generation;

Re: [HACKERS] tablesync patch broke the assumption that logical rep depends on?

2017-04-17 Thread Peter Eisentraut
On 4/16/17 22:01, Noah Misch wrote:
> This PostgreSQL 10 open item is past due for your status update.  Kindly send
> a status update within 24 hours, and include a date for your subsequent status
> update.  Refer to the policy on open item ownership:
> https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

I think we're not really sure yet what to do about this.  Discussion is
ongoing.  I'll report back on Wednesday.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] tablesync patch broke the assumption that logical rep depends on?

2017-04-16 Thread Noah Misch
On Thu, Apr 13, 2017 at 04:56:05AM +, Noah Misch wrote:
> On Tue, Apr 11, 2017 at 02:28:44AM +0900, Fujii Masao wrote:
> >  src/backend/replication/logical/launcher.c
> >  * Worker started and attached to our shmem. This check is safe
> >  * because only launcher ever starts the workers, so nobody can 
> > steal
> >  * the worker slot.
> > 
> > The tablesync patch enabled even a worker to start another worker.
> > So the above assumption is no longer valid.
> > 
> > This issue seems to cause a corner case where the launcher picks up
> > the same worker slot that a previously-started worker has already picked
> > up to start another worker.
> 
> [Action required within three days.  This is a generic notification.]
> 
> The above-described topic is currently a PostgreSQL 10 open item.  Peter,
> since you committed the patch believed to have created it, you own this open
> item.  If some other commit is more relevant or if this does not belong as a
> v10 open item, please let us know.  Otherwise, please observe the policy on
> open item ownership[1] and send a status update within three calendar days of
> this message.  Include a date for your subsequent status update.  Testers may
> discover new open items at any time, and I want to plan to get them all fixed
> well in advance of shipping v10.  Consequently, I will appreciate your efforts
> toward speedy resolution.  Thanks.
> 
> [1] 
> https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

This PostgreSQL 10 open item is past due for your status update.  Kindly send
a status update within 24 hours, and include a date for your subsequent status
update.  Refer to the policy on open item ownership:
https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com




Re: [HACKERS] tablesync patch broke the assumption that logical rep depends on?

2017-04-16 Thread Petr Jelinek
On 16/04/17 21:27, Steve Singer wrote:
> On 04/10/2017 01:28 PM, Fujii Masao wrote:
>> Hi,
>>
>>   src/backend/replication/logical/launcher.c
>>   * Worker started and attached to our shmem. This check is safe
>>   * because only launcher ever starts the workers, so nobody
>> can steal
>>   * the worker slot.
>>
>> The tablesync patch enabled even a worker to start another worker.
>> So the above assumption is no longer valid.
>>
>> This issue seems to cause a corner case where the launcher picks up
>> the same worker slot that a previously-started worker has already picked
>> up to start another worker.
>>
>> Regards,
>>
> 
> I'm not sure if what I am seeing is related to this or another issue.
> 
> If I create a subscription for a publication that doesn't exist (yet), I
> see "publication mypub does not exist" in the subscriber's log.
> 
> If I then create the publication on the publisher, the error doesn't go
> away; the worker process keeps spinning with the same error.
> 
> If I then drop the subscription and recreate it, it works.
> 

No, that's a result of how logical decoding/slots work. We see catalogs
as they looked at a specific point in time. If you create a slot (by
creating a subscription), it will not see a publication that was created
afterwards until decoding on that slot reaches the point in time when the
publication was actually created.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] tablesync patch broke the assumption that logical rep depends on?

2017-04-16 Thread Steve Singer

On 04/10/2017 01:28 PM, Fujii Masao wrote:

Hi,

  src/backend/replication/logical/launcher.c
  * Worker started and attached to our shmem. This check is safe
  * because only launcher ever starts the workers, so nobody can steal
  * the worker slot.

The tablesync patch enabled even a worker to start another worker.
So the above assumption is no longer valid.

This issue seems to cause a corner case where the launcher picks up
the same worker slot that a previously-started worker has already picked
up to start another worker.

Regards,



I'm not sure if what I am seeing is related to this or another issue.

If I create a subscription for a publication that doesn't exist (yet), I
see "publication mypub does not exist" in the subscriber's log.


If I then create the publication on the publisher, the error doesn't go
away; the worker process keeps spinning with the same error.


If I then drop the subscription and recreate it, it works.






Re: [HACKERS] tablesync patch broke the assumption that logical rep depends on?

2017-04-14 Thread Petr Jelinek
On 13/04/17 19:31, Fujii Masao wrote:
> On Fri, Apr 14, 2017 at 1:28 AM, Peter Eisentraut
>  wrote:
>> On 4/10/17 13:28, Fujii Masao wrote:
>>>  src/backend/replication/logical/launcher.c
>>>  * Worker started and attached to our shmem. This check is safe
>>>  * because only launcher ever starts the workers, so nobody can 
>>> steal
>>>  * the worker slot.
>>>
>>> The tablesync patch enabled even a worker to start another worker.
>>> So the above assumption is no longer valid.
>>>
>>> This issue seems to cause a corner case where the launcher picks up
>>> the same worker slot that a previously-started worker has already picked
>>> up to start another worker.
>>
>> I think what the comment should rather say is that workers are always
>> started through logicalrep_worker_launch() and worker slots are always
>> handed out while holding LogicalRepWorkerLock exclusively, so nobody can
>> steal the worker slot.
>>
>> Does that make sense?
> 
> No, unless I'm missing something.
> 
> logicalrep_worker_launch() picks up an unused worker slot (slot's proc ==
> NULL) while holding LogicalRepWorkerLock. But it releases the lock before
> the slot is marked as used (i.e., before proc is set to non-NULL). Then the
> newly launched worker calls logicalrep_worker_attach() and marks the slot
> as used.
> 
> So if another logicalrep_worker_launch() starts after LogicalRepWorkerLock
> is released but before the slot is marked as used, it can pick up the same
> slot, because that slot still looks unused.
> 

Yeah, I think it's less a problem with that comment than with the fact that
logicalrep_worker_launch() isn't concurrency-safe. We need an in_use marker
for the workers, updated as needed, instead of relying on pgproc.
I'll write up something over the weekend.
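
The eventual shape of that marker, as a sketch (the field set follows the
patch versions posted elsewhere in this thread; existing members are
elided):

    typedef struct LogicalRepWorker
    {
        /* Is this slot handed out, even if the worker has not attached yet? */
        bool        in_use;

        /* Bumped each time the slot is reused, to detect stale references. */
        uint16      generation;

        /* Set only once the worker has actually attached; NULL until then. */
        PGPROC     *proc;

        /* When the slot was handed out, for attach-timeout cleanup. */
        TimestampTz launch_time;

        /* ... existing fields (dbid, userid, subid, relid, ...) ... */
    } LogicalRepWorker;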

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] tablesync patch broke the assumption that logical rep depends on?

2017-04-13 Thread Fujii Masao
On Fri, Apr 14, 2017 at 1:28 AM, Peter Eisentraut
 wrote:
> On 4/10/17 13:28, Fujii Masao wrote:
>>  src/backend/replication/logical/launcher.c
>>  * Worker started and attached to our shmem. This check is safe
>>  * because only launcher ever starts the workers, so nobody can steal
>>  * the worker slot.
>>
>> The tablesync patch enabled even a worker to start another worker.
>> So the above assumption is no longer valid.
>>
>> This issue seems to cause a corner case where the launcher picks up
>> the same worker slot that a previously-started worker has already picked
>> up to start another worker.
>
> I think what the comment should rather say is that workers are always
> started through logicalrep_worker_launch() and worker slots are always
> handed out while holding LogicalRepWorkerLock exclusively, so nobody can
> steal the worker slot.
>
> Does that make sense?

No, unless I'm missing something.

logicalrep_worker_launch() picks up an unused worker slot (slot's proc ==
NULL) while holding LogicalRepWorkerLock. But it releases the lock before
the slot is marked as used (i.e., before proc is set to non-NULL). Then the
newly launched worker calls logicalrep_worker_attach() and marks the slot
as used.

So if another logicalrep_worker_launch() starts after LogicalRepWorkerLock
is released but before the slot is marked as used, it can pick up the same
slot, because that slot still looks unused.
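
Spelled out as an interleaving, the pre-patch window looks roughly like
this (sketch only):

    /* Process A (launcher or tablesync worker), logicalrep_worker_launch(): */
    LWLockAcquire(LogicalRepWorkerLock, LW_EXCLUSIVE);
    /* sees workers[3].proc == NULL and decides to use slot 3 */
    LWLockRelease(LogicalRepWorkerLock);
    /* slot 3 still looks free: proc is only set later, once the new
     * worker gets around to running logicalrep_worker_attach() */

    /* Process B, concurrently in logicalrep_worker_launch(): */
    LWLockAcquire(LogicalRepWorkerLock, LW_EXCLUSIVE);
    /* also sees workers[3].proc == NULL and picks the same slot 3 */
    LWLockRelease(LogicalRepWorkerLock);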

Regards,

-- 
Fujii Masao




Re: [HACKERS] tablesync patch broke the assumption that logical rep depends on?

2017-04-13 Thread Peter Eisentraut
On 4/10/17 13:28, Fujii Masao wrote:
>  src/backend/replication/logical/launcher.c
>  * Worker started and attached to our shmem. This check is safe
>  * because only launcher ever starts the workers, so nobody can steal
>  * the worker slot.
> 
> The tablesync patch enabled even a worker to start another worker.
> So the above assumption is no longer valid.
> 
> This issue seems to cause a corner case where the launcher picks up
> the same worker slot that a previously-started worker has already picked
> up to start another worker.

I think what the comment should rather say is that workers are always
started through logicalrep_worker_launch() and worker slots are always
handed out while holding LogicalRepWorkerLock exclusively, so nobody can
steal the worker slot.

Does that make sense?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] tablesync patch broke the assumption that logical rep depends on?

2017-04-12 Thread Noah Misch
On Tue, Apr 11, 2017 at 02:28:44AM +0900, Fujii Masao wrote:
>  src/backend/replication/logical/launcher.c
>  * Worker started and attached to our shmem. This check is safe
>  * because only launcher ever starts the workers, so nobody can steal
>  * the worker slot.
> 
> The tablesync patch enabled even a worker to start another worker.
> So the above assumption is no longer valid.
> 
> This issue seems to cause a corner case where the launcher picks up
> the same worker slot that a previously-started worker has already picked
> up to start another worker.

[Action required within three days.  This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item.  Peter,
since you committed the patch believed to have created it, you own this open
item.  If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know.  Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message.  Include a date for your subsequent status update.  Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10.  Consequently, I will appreciate your efforts
toward speedy resolution.  Thanks.

[1] 
https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

