Re: [HACKERS] src/test/subscription/t/005_encoding.pl is broken

2017-09-20 Thread Tom Lane
Michael Paquier  writes:
> On Wed, Sep 20, 2017 at 12:33 AM, Tom Lane  wrote:
>> That would indicate that something isn't ever retrying the worker
>> start; but if that's the case, how is it that we get through the
>> other subscription tests with my random-failure patch in place?

> I have been able to dig into this issue further, and the problem is
> indeed in the wait logic of 005_encoding.pl. It is important to wait
> for the initial sync of the subscriber to happen. There is no need to
> incorporate the additional wait query in wait_for_caught_up() as well.
> Please see the attached which fixes the stability problems for me even
> after forcing failures in launcher.c.

Ah, that makes sense.  Pushed.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] src/test/subscription/t/005_encoding.pl is broken

2017-09-19 Thread Michael Paquier
On Wed, Sep 20, 2017 at 12:33 AM, Tom Lane  wrote:
> That would indicate that something isn't ever retrying the worker
> start; but if that's the case, how is it that we get through the
> other subscription tests with my random-failure patch in place?

I have been able to dig into this issue further, and the problem is
indeed in the wait logic of 005_encoding.pl. It is important to wait
for the initial sync of the subscriber to happen. There is no need to
incorporate the additional wait query in wait_for_caught_up() as well.
Please see the attached which fixes the stability problems for me even
after forcing failures in launcher.c.
-- 
Michael


tap-subs-encoding.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] src/test/subscription/t/005_encoding.pl is broken

2017-09-19 Thread Tom Lane
Michael Paquier  writes:
> Now, I just had a look at the logs for a failure and a success, and
> one difference can be seen in the subscriber's logs as follows:
> -LOG:  logical replication table synchronization worker for
> subscription "mysub", table "test1" has started
> -LOG:  logical replication table synchronization worker for
> subscription "mysub", table "test1" has finished
> +WARNING:  out of background worker slots
> +HINT:  You might need to increase max_worker_processes.
> The "+" portion is for a failure, and I think that this causes the
> subscription to not consume the changes from the publisher which
> explains the failure in the test as the logical worker applying the
> changes on the subscriber-side is not here.

That would indicate that something isn't ever retrying the worker
start; but if that's the case, how is it that we get through the
other subscription tests with my random-failure patch in place?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] src/test/subscription/t/005_encoding.pl is broken

2017-09-19 Thread Michael Paquier
On Tue, Sep 19, 2017 at 8:51 PM, Robert Haas  wrote:
> On Mon, Sep 18, 2017 at 1:58 PM, Andres Freund  wrote:
>> To my knowledge here's not really any difference between the two in
>> logical replication. Received changes are immediately applied, there's
>> no equivalent to a walreceiver queing up "logical wal" onto disk.
>
> Huh?  Decoding and applying the changes has to take some finite time
> greater than 0.

FWIW, the wait method looks fine to me.

Now, I just had a look at the logs for a failure and a success, and
one difference can be seen in the subscriber's logs as follows:
-LOG:  logical replication table synchronization worker for
subscription "mysub", table "test1" has started
-LOG:  logical replication table synchronization worker for
subscription "mysub", table "test1" has finished
+WARNING:  out of background worker slots
+HINT:  You might need to increase max_worker_processes.
The "+" portion is for a failure, and I think that this causes the
subscription to not consume the changes from the publisher which
explains the failure in the test as the logical worker applying the
changes on the subscriber-side is not here.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] src/test/subscription/t/005_encoding.pl is broken

2017-09-19 Thread Robert Haas
On Mon, Sep 18, 2017 at 1:58 PM, Andres Freund  wrote:
> To my knowledge here's not really any difference between the two in
> logical replication. Received changes are immediately applied, there's
> no equivalent to a walreceiver queing up "logical wal" onto disk.

Huh?  Decoding and applying the changes has to take some finite time
greater than 0.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] src/test/subscription/t/005_encoding.pl is broken

2017-09-18 Thread Tom Lane
Andres Freund  writes:
> On 2017-09-18 11:50:06 -0400, Tom Lane wrote:
>> The reason seems to be that its method of waiting for replication
>> to happen is completely inapropos.  It's watching for the master
>> to say that the slave has received all the WAL, but that does not
>> ensure that the logicalrep apply workers have caught up, does it?

> To my knowledge here's not really any difference between the two in
> logical replication. Received changes are immediately applied, there's
> no equivalent to a walreceiver queing up "logical wal" onto disk.

> So I'm not sure that theory holds.

Well, there's *something* wrong with this test's wait method.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] src/test/subscription/t/005_encoding.pl is broken

2017-09-18 Thread Andres Freund
Hi,

On 2017-09-18 11:50:06 -0400, Tom Lane wrote:
> The reason seems to be that its method of waiting for replication
> to happen is completely inapropos.  It's watching for the master
> to say that the slave has received all the WAL, but that does not
> ensure that the logicalrep apply workers have caught up, does it?

To my knowledge here's not really any difference between the two in
logical replication. Received changes are immediately applied, there's
no equivalent to a walreceiver queing up "logical wal" onto disk.

So I'm not sure that theory holds.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] src/test/subscription/t/005_encoding.pl is broken

2017-09-18 Thread Tom Lane
To reproduce the subscription-startup hang that Thomas Munro observed,
I changed src/backend/replication/logical/launcher.c like this:

@@ -427,7 +427,8 @@ retry:
bgw.bgw_notify_pid = MyProcPid;
bgw.bgw_main_arg = Int32GetDatum(slot);
 
-   if (!RegisterDynamicBackgroundWorker(, _handle))
+   if (random() < 10 ||
+   !RegisterDynamicBackgroundWorker(, _handle))
{
/* Failed to start worker, so clean up the worker slot. */
LWLockAcquire(LogicalRepWorkerLock, LW_EXCLUSIVE);

This causes about 50% of worker launch requests to fail.
With the fix I just committed, 002_types.pl gets through fine,
but 005_encoding.pl does not; it sometimes fails like this:

t/005_encoding.pl . 1/1 
#   Failed test 'data replicated to subscriber'
#   at t/005_encoding.pl line 49.
#  got: ''
# expected: '1'
# Looks like you failed 1 test of 1.
t/005_encoding.pl . Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/1 subtests 

The reason seems to be that its method of waiting for replication
to happen is completely inapropos.  It's watching for the master
to say that the slave has received all the WAL, but that does not
ensure that the logicalrep apply workers have caught up, does it?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers