Re: [HACKERS] logical replication - still unstable after all these months

2017-06-17 Thread Erik Rijkers
On 2017-06-18 00:27, Peter Eisentraut wrote: On 6/17/17 06:48, Erik Rijkers wrote: On 2017-05-28 12:44, Erik Rijkers wrote: re: srsubstate in pg_subscription_rel: No idea what it means. At the very least this value 'w' is missing from the documentation, which only mentions: i = initialize

Re: [HACKERS] logical replication - still unstable after all these months

2017-06-17 Thread Peter Eisentraut
On 6/17/17 06:48, Erik Rijkers wrote: > On 2017-05-28 12:44, Erik Rijkers wrote: > > re: srsubstate in pg_subscription_rel: > >> No idea what it means. At the very least this value 'w' is missing >> from the documentation, which only mentions: >> i = initialize >> d = data copy >> s =

Re: [HACKERS] logical replication - still unstable after all these months

2017-06-17 Thread Erik Rijkers
On 2017-05-28 12:44, Erik Rijkers wrote: re: srsubstate in pg_subscription_rel: No idea what it means. At the very least this value 'w' is missing from the documentation, which only mentions: i = initialize d = data copy s = synchronized r = (normal replication) Shouldn't we add this
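As an illustration of the documentation gap described in this message, here is a minimal Python sketch that maps the four srsubstate codes the message lists and flags anything else, such as 'w', as undocumented. The names `DOCUMENTED_SRSUBSTATE` and `describe_srsubstate` are hypothetical and not part of PostgreSQL; the sketch makes no claim about what 'w' actually means.

```python
# srsubstate codes exactly as listed in the message above.
# Anything outside this set (for example 'w') is treated as undocumented.
DOCUMENTED_SRSUBSTATE = {
    "i": "initialize",
    "d": "data copy",
    "s": "synchronized",
    "r": "normal replication",
}


def describe_srsubstate(code):
    """Return the documented meaning of a code, or flag it as undocumented."""
    return DOCUMENTED_SRSUBSTATE.get(code, f"undocumented state {code!r}")


if __name__ == "__main__":
    for code in ("i", "d", "s", "r", "w"):
        print(code, "=", describe_srsubstate(code))
```

A lookup like this could be applied to the values read from pg_subscription_rel to spot tables sitting in a state that the documentation does not explain.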

Re: [HACKERS] logical replication - still unstable after all these months

2017-06-06 Thread Petr Jelinek
On 06/06/17 21:09, Robert Haas wrote: > On Tue, Jun 6, 2017 at 3:01 PM, Erik Rijkers wrote: >> Belated apologies all round for the somewhat provocative $subject; but I >> felt at that moment that this item needed some extra attention. > > FWIW, it seemed like a pretty fair

Re: [HACKERS] logical replication - still unstable after all these months

2017-06-06 Thread Robert Haas
On Tue, Jun 6, 2017 at 3:01 PM, Erik Rijkers wrote: > Belated apologies all round for the somewhat provocative $subject; but I > felt at that moment that this item needed some extra attention. FWIW, it seemed like a pretty fair subject line to me given your test results. I think

Re: [HACKERS] logical replication - still unstable after all these months

2017-06-06 Thread Erik Rijkers
On 2017-06-06 20:53, Peter Eisentraut wrote: On 6/4/17 22:38, Petr Jelinek wrote: Committed that, with some further updates of comments to reflect the Belated apologies all round for the somewhat provocative $subject; but I felt at that moment that this item needed some extra attention. I

Re: [HACKERS] logical replication - still unstable after all these months

2017-06-06 Thread Peter Eisentraut
On 6/4/17 22:38, Petr Jelinek wrote: > On 03/06/17 16:12, Jeff Janes wrote: >> >> On Fri, Jun 2, 2017 at 4:10 PM, Petr Jelinek >> > wrote: >> >> >> While I was testing something for different thread I noticed that I >>

Re: [HACKERS] logical replication - still unstable after all these months

2017-06-04 Thread Petr Jelinek
On 03/06/17 16:12, Jeff Janes wrote: > > On Fri, Jun 2, 2017 at 4:10 PM, Petr Jelinek > > wrote: > > > While I was testing something for different thread I noticed that I > manage transactions incorrectly in this patch.

Re: [HACKERS] logical replication - still unstable after all these months

2017-06-04 Thread Mark Kirkwood
On 05/06/17 13:08, Mark Kirkwood wrote: On 05/06/17 00:04, Erik Rijkers wrote: On 2017-05-31 16:20, Erik Rijkers wrote: On 2017-05-31 11:16, Petr Jelinek wrote: [...] Thanks to Mark's offer I was able to study the issue as it happened and found the cause of this.

Re: [HACKERS] logical replication - still unstable after all these months

2017-06-04 Thread Mark Kirkwood
On 05/06/17 00:04, Erik Rijkers wrote: On 2017-05-31 16:20, Erik Rijkers wrote: On 2017-05-31 11:16, Petr Jelinek wrote: [...] Thanks to Mark's offer I was able to study the issue as it happened and found the cause of this. [0001-Improve-handover-logic-between-sync-and-apply-worker.patch]

Re: [HACKERS] logical replication - still unstable after all these months

2017-06-04 Thread Erik Rijkers
On 2017-05-31 16:20, Erik Rijkers wrote: On 2017-05-31 11:16, Petr Jelinek wrote: [...] Thanks to Mark's offer I was able to study the issue as it happened and found the cause of this. [0001-Improve-handover-logic-between-sync-and-apply-worker.patch] This looks good: --

Re: [HACKERS] logical replication - still unstable after all these months

2017-06-03 Thread Jeff Janes
On Fri, Jun 2, 2017 at 4:10 PM, Petr Jelinek wrote: > > While I was testing something for different thread I noticed that I > manage transactions incorrectly in this patch. Fixed here, I didn't test > it much yet (it takes a while as you know :) ). Not sure if it's

Re: [HACKERS] logical replication - still unstable after all these months

2017-06-02 Thread Petr Jelinek
On 03/06/17 04:45, Mark Kirkwood wrote: > On 03/06/17 11:10, Petr Jelinek wrote: > >> On 02/06/17 22:29, Petr Jelinek wrote: >>> On 02/06/17 08:55, Mark Kirkwood wrote: On 02/06/17 17:11, Erik Rijkers wrote: > On 2017-06-02 00:46, Mark Kirkwood wrote: >> On 31/05/17 21:16, Petr

Re: [HACKERS] logical replication - still unstable after all these months

2017-06-02 Thread Mark Kirkwood
On 03/06/17 11:10, Petr Jelinek wrote: On 02/06/17 22:29, Petr Jelinek wrote: On 02/06/17 08:55, Mark Kirkwood wrote: On 02/06/17 17:11, Erik Rijkers wrote: On 2017-06-02 00:46, Mark Kirkwood wrote: On 31/05/17 21:16, Petr Jelinek wrote: I'm seeing a new failure with the patch applied -

Re: [HACKERS] logical replication - still unstable after all these months

2017-06-02 Thread Petr Jelinek
On 02/06/17 22:29, Petr Jelinek wrote: > On 02/06/17 08:55, Mark Kirkwood wrote: >> On 02/06/17 17:11, Erik Rijkers wrote: >> >>> On 2017-06-02 00:46, Mark Kirkwood wrote: On 31/05/17 21:16, Petr Jelinek wrote: I'm seeing a new failure with the patch applied - this time the

Re: [HACKERS] logical replication - still unstable after all these months

2017-06-02 Thread Petr Jelinek
On 02/06/17 08:55, Mark Kirkwood wrote: > On 02/06/17 17:11, Erik Rijkers wrote: > >> On 2017-06-02 00:46, Mark Kirkwood wrote: >>> On 31/05/17 21:16, Petr Jelinek wrote: >>> >>> I'm seeing a new failure with the patch applied - this time the >>> history table has missing rows. Petr, I'll put

Re: [HACKERS] logical replication - still unstable after all these months

2017-06-02 Thread Mark Kirkwood
On 02/06/17 17:11, Erik Rijkers wrote: On 2017-06-02 00:46, Mark Kirkwood wrote: On 31/05/17 21:16, Petr Jelinek wrote: I'm seeing a new failure with the patch applied - this time the history table has missing rows. Petr, I'll put back your access :-) Is this error during 1-minute runs?

Re: [HACKERS] logical replication - still unstable after all these months

2017-06-01 Thread Erik Rijkers
On 2017-06-02 00:46, Mark Kirkwood wrote: On 31/05/17 21:16, Petr Jelinek wrote: I'm seeing a new failure with the patch applied - this time the history table has missing rows. Petr, I'll put back your access :-) Is this error during 1-minute runs? I'm asking because I've moved back to

Re: [HACKERS] logical replication - still unstable after all these months

2017-06-01 Thread Mark Kirkwood
On 31/05/17 21:16, Petr Jelinek wrote: On 29/05/17 23:06, Mark Kirkwood wrote: On 29/05/17 23:14, Petr Jelinek wrote: On 29/05/17 03:33, Jeff Janes wrote: What would you want to look at? Would saving the WAL from the master be helpful? Useful info is, logs from provider (mainly the

Re: [HACKERS] logical replication - still unstable after all these months

2017-05-31 Thread Peter Eisentraut
On 5/31/17 05:16, Petr Jelinek wrote: > I've been running tests on this overnight on another machine where I was > able to reproduce the original issue within few runs (once I found what > causes it) and so far looks good. I'll give people another day or so to test this before committing. --

Re: [HACKERS] logical replication - still unstable after all these months

2017-05-31 Thread Erik Rijkers
On 2017-05-31 11:16, Petr Jelinek wrote: [...] Thanks to Mark's offer I was able to study the issue as it happened and found the cause of this. [0001-Improve-handover-logic-between-sync-and-apply-worker.patch] This looks good: -- out_20170531_1141.txt 100 -- pgbench -c 90 -j 8 -T 60 -P

Re: [HACKERS] logical replication - still unstable after all these months

2017-05-31 Thread Petr Jelinek
On 29/05/17 23:06, Mark Kirkwood wrote: > On 29/05/17 23:14, Petr Jelinek wrote: > >> On 29/05/17 03:33, Jeff Janes wrote: >> >>> What would you want to look at? Would saving the WAL from the master be >>> helpful? >>> >> Useful info is, logs from provider (mainly the logical decoding logs >>

Re: [HACKERS] logical replication - still unstable after all these months

2017-05-31 Thread Erik Rijkers
On 2017-05-26 08:10, Erik Rijkers wrote: If you run a pgbench session of 1 minute over a logical replication connection and repeat that 100x this is what you get: At clients 90, 64, 8, scale 25: -- out_20170525_0944.txt 100 -- pgbench -c 90 -j 8 -T 60 -P 12 -n -- scale 25 7 -- Not

Re: [HACKERS] logical replication - still unstable after all these months

2017-05-29 Thread Mark Kirkwood
On 29/05/17 23:14, Petr Jelinek wrote: On 29/05/17 03:33, Jeff Janes wrote: What would you want to look at? Would saving the WAL from the master be helpful? Useful info is, logs from provider (mainly the logical decoding logs that mention LSNs), logs from subscriber (the lines about when

Re: [HACKERS] logical replication - still unstable after all these months

2017-05-29 Thread Petr Jelinek
On 29/05/17 03:33, Jeff Janes wrote: > On Sun, May 28, 2017 at 3:17 PM, Mark Kirkwood > > > wrote: > > The framework ran 600 tests last night, and I see 3 'NOK' results, > i.e 3 failed test runs (all scale 25 and 8

Re: [HACKERS] logical replication - still unstable after all these months

2017-05-28 Thread Mark Kirkwood
On 29/05/17 13:33, Jeff Janes wrote: On Sun, May 28, 2017 at 3:17 PM, Mark Kirkwood wrote: On 28/05/17 19:01, Mark Kirkwood wrote: So running in cloud land now...so far no errors - will update. The

Re: [HACKERS] logical replication - still unstable after all these months

2017-05-28 Thread Mark Kirkwood
On 29/05/17 16:26, Erik Rijkers wrote: On 2017-05-29 00:17, Mark Kirkwood wrote: On 28/05/17 19:01, Mark Kirkwood wrote: So running in cloud land now...so far no errors - will update. The framework ran 600 tests last night, and I see 3 'NOK' results, i.e. 3 failed test runs (all scale 25

Re: [HACKERS] logical replication - still unstable after all these months

2017-05-28 Thread Erik Rijkers
On 2017-05-29 03:33, Jeff Janes wrote: On Sun, May 28, 2017 at 3:17 PM, Mark Kirkwood < mark.kirkw...@catalyst.net.nz> wrote: I also got a failure, after 87 iterations of a similar test case. It [...] repeated the runs, but so far it hasn't failed again in over 800 iterations Could you

Re: [HACKERS] logical replication - still unstable after all these months

2017-05-28 Thread Erik Rijkers
On 2017-05-29 00:17, Mark Kirkwood wrote: On 28/05/17 19:01, Mark Kirkwood wrote: So running in cloud land now...so far no errors - will update. The framework ran 600 tests last night, and I see 3 'NOK' results, i.e. 3 failed test runs (all scale 25 and 8 pgbench clients). Given the way

Re: [HACKERS] logical replication - still unstable after all these months

2017-05-28 Thread Jeff Janes
On Sun, May 28, 2017 at 3:17 PM, Mark Kirkwood < mark.kirkw...@catalyst.net.nz> wrote: > On 28/05/17 19:01, Mark Kirkwood wrote: > > >> So running in cloud land now...so far no errors - will update. >> >> >> >> > The framework ran 600 tests last night, and I see 3 'NOK' results, i.e. 3 > failed

Re: [HACKERS] logical replication - still unstable after all these months

2017-05-28 Thread Mark Kirkwood
On 28/05/17 19:01, Mark Kirkwood wrote: So running in cloud land now...so far no errors - will update. The framework ran 600 tests last night, and I see 3 'NOK' results, i.e. 3 failed test runs (all scale 25 and 8 pgbench clients). Given the way the test decides on failure (gets tired of

Re: [HACKERS] logical replication - still unstable after all these months

2017-05-28 Thread Erik Rijkers
On 2017-05-26 15:59, Petr Jelinek wrote: Hmm, I was under the impression that the changes we proposed in the snapbuild thread fixed your issues, does this mean they didn't? Or the modified versions of those that were eventually committed didn't? Or did issues reappear at some point? Here is

Re: [HACKERS] logical replication - still unstable after all these months

2017-05-28 Thread Mark Kirkwood
On 27/05/17 20:30, Erik Rijkers wrote: Here is what I have: instances.sh: starts up 2 assert enabled sessions instances_fast.sh: alternative to instances.sh starts up 2 assert disabled 'fast' sessions testset.sh loop to call pgbench_derail2.sh with varying params

Re: [HACKERS] logical replication - still unstable after all these months

2017-05-27 Thread Erik Rijkers
On 2017-05-28 01:15, Mark Kirkwood wrote: Also, any idea which rows are different? If you want something out of the box that will do that for you see DBIx::Compare. I used to save the content-diffs too but in the end decided they were useless (to me, anyway).

Re: [HACKERS] logical replication - still unstable after all these months

2017-05-27 Thread Erik Rijkers
On 2017-05-28 01:21, Mark Kirkwood wrote: Sorry - I see you have done this already. On 28/05/17 11:15, Mark Kirkwood wrote: Interesting - might be good to see your test script too (so we can better understand how you are deciding if the runs are successful or not). Yes, in

Re: [HACKERS] logical replication - still unstable after all these months

2017-05-27 Thread Mark Kirkwood
Sorry - I see you have done this already. On 28/05/17 11:15, Mark Kirkwood wrote: Interesting - might be good to see your test script too (so we can better understand how you are deciding if the runs are successful or not).

Re: [HACKERS] logical replication - still unstable after all these months

2017-05-27 Thread Mark Kirkwood
Interesting - might be good to see your test script too (so we can better understand how you are deciding if the runs are successful or not). Also, any idea which rows are different? If you want something out of the box that will do that for you see DBIx::Compare. regards Mark On

Re: [HACKERS] logical replication - still unstable after all these months

2017-05-27 Thread Erik Rijkers
On 2017-05-27 17:11, Andres Freund wrote: On May 27, 2017 6:13:19 AM EDT, Simon Riggs wrote: On 27 May 2017 at 09:44, Erik Rijkers wrote: I am very curious at your results. We take your bug report on good faith, but we still haven't seen details of

Re: [HACKERS] logical replication - still unstable after all these months

2017-05-27 Thread Andres Freund
On May 27, 2017 6:13:19 AM EDT, Simon Riggs wrote: >On 27 May 2017 at 09:44, Erik Rijkers wrote: > >> I am very curious at your results. > >We take your bug report on good faith, but we still haven't seen >details of the problem or how to recreate it. >

Re: [HACKERS] logical replication - still unstable after all these months

2017-05-27 Thread Simon Riggs
On 27 May 2017 at 09:44, Erik Rijkers wrote: > I am very curious at your results. We take your bug report on good faith, but we still haven't seen details of the problem or how to recreate it. Please post some details. Thanks. -- Simon Riggs

Re: [HACKERS] logical replication - still unstable after all these months

2017-05-27 Thread Erik Rijkers
On 2017-05-27 10:30, Erik Rijkers wrote: On 2017-05-27 01:35, Mark Kirkwood wrote: Here is what I have: instances.sh: testset.sh pgbench_derail2.sh pubsub.sh To be clear: ( Apart from that standalone call like ./pgbench_derail2.sh $scale $clients $duration $date_str ) I normally run

Re: [HACKERS] logical replication - still unstable after all these months

2017-05-27 Thread Erik Rijkers
On 2017-05-27 01:35, Mark Kirkwood wrote: On 26/05/17 20:09, Erik Rijkers wrote: The idea is simple enough: startup instance1 startup instance2 (on same machine) primary: init pgbench tables primary: add primary key to pgbench_history copy empty tables to replica by dump/restore primary:

Re: [HACKERS] logical replication - still unstable after all these months

2017-05-26 Thread Erik Rijkers
On 2017-05-27 01:35, Mark Kirkwood wrote: On 26/05/17 20:09, Erik Rijkers wrote: this whole thing 100x Some questions that might help me get it right: - do you think we need to stop and start the instances every time? - do we need to init pgbench each time? - could we just drop the

Re: [HACKERS] logical replication - still unstable after all these months

2017-05-26 Thread Mark Kirkwood
On 26/05/17 20:09, Erik Rijkers wrote: The idea is simple enough: startup instance1 startup instance2 (on same machine) primary: init pgbench tables primary: add primary key to pgbench_history copy empty tables to replica by dump/restore primary: start publication replica: start subscription
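The setup steps quoted in this message can be written out as an ordered plan. The Python sketch below only builds the list of commands and does not run anything. The port numbers, the database name `postgres`, the names `pub1` and `sub1`, the added `hid` key column, and the use of a schema-only dump are all assumptions made for illustration; the original scripts (pgbench_derail2.sh and friends) may do each step differently.

```python
def setup_steps(scale=25, primary_port=6972, replica_port=6973):
    """Return (target, command) pairs mirroring the steps in the message.

    Every concrete value here is a placeholder: ports, database name,
    object names, and the primary-key column are hypothetical.
    """
    conninfo = f"port={primary_port} dbname=postgres"
    return [
        # primary: init pgbench tables
        ("primary", f"pgbench -p {primary_port} -i -s {scale} postgres"),
        # primary: add primary key to pgbench_history (column name assumed)
        ("primary",
         "ALTER TABLE pgbench_history ADD COLUMN hid serial PRIMARY KEY;"),
        # copy empty tables to replica by dump/restore (schema-only assumed)
        ("replica",
         f"pg_dump -p {primary_port} --schema-only -t 'pgbench_*' postgres"
         f" | psql -p {replica_port} postgres"),
        # primary: start publication
        ("primary", "CREATE PUBLICATION pub1 FOR ALL TABLES;"),
        # replica: start subscription
        ("replica",
         f"CREATE SUBSCRIPTION sub1 CONNECTION '{conninfo}' PUBLICATION pub1;"),
    ]


if __name__ == "__main__":
    for target, command in setup_steps():
        print(f"[{target}] {command}")
```

Keeping the plan as data makes it easy to print, review, or repeat 100 times with different scale and client counts before handing the commands to a shell or to psql.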

Re: [HACKERS] logical replication - still unstable after all these months

2017-05-26 Thread Jeff Janes
On Fri, May 26, 2017 at 12:27 AM, Erik Rijkers wrote: > On 2017-05-26 08:58, Simon Riggs wrote: > >> On 26 May 2017 at 07:10, Erik Rijkers wrote: >> >> - Do you agree this number of failures is far too high? >>> - Am I the only one finding so many failures? >>>

Re: [HACKERS] logical replication - still unstable after all these months

2017-05-26 Thread Jeff Janes
On Fri, May 26, 2017 at 5:17 AM, tushar wrote: > > run second time = > ./pgbench -T 20 -c 90 -j 90 -f test.sql postgres > > check the row count on master/standby > Master= > postgres=# select count(*) from pgbench_history ; > count > > 536836 > (1

Re: [HACKERS] logical replication - still unstable after all these months

2017-05-26 Thread Petr Jelinek
On 26/05/17 16:51, Alvaro Herrera wrote: > Erik Rijkers wrote: > >> I wouldn't say that problems (re)appeared at a certain point; my impression >> is rather that logical replication has become better and better. But I kept >> getting the odd failure, without a clear cause, but always

Re: [HACKERS] logical replication - still unstable after all these months

2017-05-26 Thread Alvaro Herrera
Erik Rijkers wrote: > I wouldn't say that problems (re)appeared at a certain point; my impression > is rather that logical replication has become better and better. But I kept > getting the odd failure, without a clear cause, but always (eventually) > repeatable on other machines. I did the

Re: [HACKERS] logical replication - still unstable after all these months

2017-05-26 Thread Erik Rijkers
On 2017-05-26 15:59, Petr Jelinek wrote: Hi, Hmm, I was under the impression that the changes we proposed in the snapbuild thread fixed your issues, does this mean they didn't? Or the modified versions of those that were eventually committed didn't? Or did issues reappear at some point? I

Re: [HACKERS] logical replication - still unstable after all these months

2017-05-26 Thread Petr Jelinek
Hi, Hmm, I was under the impression that the changes we proposed in the snapbuild thread fixed your issues, does this mean they didn't? Or the modified versions of those that were eventually committed didn't? Or did issues reappear at some point? -- Petr Jelinek

Re: [HACKERS] logical replication - still unstable after all these months

2017-05-26 Thread tushar
On 05/26/2017 12:57 PM, Erik Rijkers wrote: The failure is that in the result state the replicated tables differ from the original tables. I am also getting similar behavior. Master= run pgbench with scaling factor = 1 (./pgbench -i -s 1 postgres) delete rows from pgbench_history ( delete
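Since a failed run here means the replicated tables differ from the originals, one way to check is to compare an order-independent digest of each table's rows. The Python sketch below is a hedged illustration of that idea, not the method used in the thread's test scripts; `table_digest` and `tables_match` are hypothetical helper names, and the rows are assumed to have been fetched already (for example with a SELECT on each instance).

```python
import hashlib


def table_digest(rows):
    """Digest of a table's contents that ignores row order."""
    digest = hashlib.md5()
    # Sort a text form of each row so that physical order does not matter.
    for line in sorted(repr(tuple(row)) for row in rows):
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def tables_match(primary_rows, replica_rows):
    """True when both sides hold the same multiset of rows."""
    return table_digest(primary_rows) == table_digest(replica_rows)


if __name__ == "__main__":
    primary = [(1, 10, 500), (2, 11, -250), (3, 12, 75)]
    replica_ok = [(3, 12, 75), (1, 10, 500), (2, 11, -250)]
    replica_bad = [(1, 10, 500), (2, 11, -250)]  # one row missing
    print(tables_match(primary, replica_ok))
    print(tables_match(primary, replica_bad))
```

A digest only says whether the tables differ; finding which rows are missing still needs a row-level diff.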

Re: [HACKERS] logical replication - still unstable after all these months

2017-05-26 Thread Erik Rijkers
On 2017-05-26 10:29, Mark Kirkwood wrote: On 26/05/17 20:09, Erik Rijkers wrote: On 2017-05-26 09:40, Simon Riggs wrote: If we can find out what the bug is with a repeatable test case we can fix it. Could you provide more details? Thanks I will, just need some time to clean things up a

Re: [HACKERS] logical replication - still unstable after all these months

2017-05-26 Thread Mark Kirkwood
On 26/05/17 20:09, Erik Rijkers wrote: On 2017-05-26 09:40, Simon Riggs wrote: If we can find out what the bug is with a repeatable test case we can fix it. Could you provide more details? Thanks I will, just need some time to clean things up a bit. But what I would like is for someone

Re: [HACKERS] logical replication - still unstable after all these months

2017-05-26 Thread Erik Rijkers
On 2017-05-26 09:40, Simon Riggs wrote: If we can find out what the bug is with a repeatable test case we can fix it. Could you provide more details? Thanks I will, just need some time to clean things up a bit. But what I would like is for someone else to repeat my 100x1-minute tests,

Re: [HACKERS] logical replication - still unstable after all these months

2017-05-26 Thread Simon Riggs
On 26 May 2017 at 08:27, Erik Rijkers wrote: > On 2017-05-26 08:58, Simon Riggs wrote: >> >> On 26 May 2017 at 07:10, Erik Rijkers wrote: >> >>> - Do you agree this number of failures is far too high? >>> - Am I the only one finding so many failures? >> >> >> What

Re: [HACKERS] logical replication - still unstable after all these months

2017-05-26 Thread Erik Rijkers
On 2017-05-26 08:58, Simon Riggs wrote: On 26 May 2017 at 07:10, Erik Rijkers wrote: - Do you agree this number of failures is far too high? - Am I the only one finding so many failures? What type of failure are you getting? The failure is that in the result state the

Re: [HACKERS] logical replication - still unstable after all these months

2017-05-26 Thread Simon Riggs
On 26 May 2017 at 07:10, Erik Rijkers wrote: > - Do you agree this number of failures is far too high? > - Am I the only one finding so many failures? What type of failure are you getting? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7

[HACKERS] logical replication - still unstable after all these months

2017-05-26 Thread Erik Rijkers
If you run a pgbench session of 1 minute over a logical replication connection and repeat that 100x this is what you get: At clients 90, 64, 8, scale 25: -- out_20170525_0944.txt 100 -- pgbench -c 90 -j 8 -T 60 -P 12 -n -- scale 25 93 -- All is well. 7 -- Not good. --
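To make the numbers in this report concrete, the Python sketch below rebuilds the pgbench invocation quoted above as an argument list and computes the failure rate for the 93 good and 7 bad runs. The helper names `pgbench_command` and `failure_rate` are hypothetical; only the flags (-c, -j, -T, -P, -n) and the counts come from the message.

```python
def pgbench_command(clients, jobs=8, seconds=60, progress=12):
    """Argument list for the run quoted above, e.g. pgbench -c 90 -j 8 -T 60 -P 12 -n."""
    return [
        "pgbench",
        "-c", str(clients),   # number of concurrent clients
        "-j", str(jobs),      # number of worker threads
        "-T", str(seconds),   # run duration in seconds
        "-P", str(progress),  # progress report interval in seconds
        "-n",                 # skip vacuuming before the run
    ]


def failure_rate(ok, failed):
    """Fraction of runs that ended with differing tables."""
    total = ok + failed
    if total == 0:
        raise ValueError("no runs recorded")
    return failed / total


if __name__ == "__main__":
    print(" ".join(pgbench_command(90)))
    print(f"{failure_rate(93, 7):.0%} of 100 runs failed")
```

The same arithmetic can be applied to any other batch of runs to compare failure rates across client counts or patches.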