Re: [HACKERS] pg_rewind test race condition..?

2016-03-26 Thread Michael Paquier
On Wed, Apr 29, 2015 at 10:03 PM, Stephen Frost  wrote:
> Another thought would be to provide an option to pg_rewind to have it do
> an explicit checkpoint before it reads the control file..  I'm not
> against having it simply always do it as I don't see pg_rewind being a
> commonly run thing, but I know some environments have very heavy
> checkpoints and that might not be ideal.

Waking an old thread. It seems to me that such an option would still
be useful when rewinding using a stream, so that an inconsistent control
file is not read. The default should definitely be not to issue this
checkpoint, but in some environments I believe that this could really
matter, say two failovers followed by two rewinds done consecutively.
Thoughts?
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_rewind test race condition..?

2015-12-08 Thread Oleksii Kliukin
Hi,

On Wed, Apr 29, 2015, at 01:36 AM, Heikki Linnakangas wrote:
> The problem seems to be that when the standby is promoted, it's a 
> so-called "fast promotion", where it writes an end-of-recovery record 
> and starts accepting queries before creating a real checkpoint. 
> pg_rewind looks at the TLI in the latest checkpoint, as it's in the 
> control file, but that isn't updated until the checkpoint completes. I 
> don't see it on my laptop normally, but I can reproduce it if I insert a 
> "sleep(5)" in StartupXLog, just before it requests the checkpoint:
> 
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -7173,7 +7173,10 @@ StartupXLOG(void)
>* than is appropriate now that we're not in standby mode anymore.
>*/
>   if (fast_promoted)
> +   {
> +   sleep(5);
>   RequestCheckpoint(CHECKPOINT_FORCE);
> +   }
>   }
> 
> The simplest fix would be to force a checkpoint in the regression test, 
> before running pg_rewind. It's a bit of a cop out, since you'd still get 
> the same issue when you tried to do the same thing in the real world. It 
> should be rare in practice - you'd not normally run pg_rewind 
> immediately after promoting the standby - but a better error message at 
> least would be nice..

I think we came across this issue in production. We run a Python daemon
called 'Patroni' (https://github.com/zalando/patroni) in order to
automate failovers. It stops the current master before promotion of one
of the replicas, and runs pg_rewind on the former master to make it
"compatible" with the new one. Naturally, since pg_rewind is launched
automatically, it is called right after detecting that the new master is
running.

What we get sometimes is "source and target cluster are on the same
timeline". Per the explanation from Heikki, it happens when pg_rewind
connects to the new master after promotion but before the first
checkpoint with the new timeline. Since "fast promotion" is the only
option for pg_ctl promote, I think it might be beneficial to add an
option for pg_rewind to issue a checkpoint (I worked around this
problem by adding a checkpoint call to Patroni; another way would be to
create the 'fallback_promote' file and do the 'slow' promote without
relying on pg_ctl promote functionality).
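The workaround amounts to forcing a checkpoint on the new master before
invoking pg_rewind. A minimal sketch of that sequence, assuming psql and
pg_rewind are on PATH (hypothetical helper names, not Patroni's actual code):

```python
import subprocess

def build_rewind_command(target_datadir, source_conninfo):
    """Assemble the pg_rewind invocation run after the checkpoint."""
    return [
        "pg_rewind",
        "--target-pgdata", target_datadir,
        "--source-server", source_conninfo,
    ]

def checkpoint_then_rewind(target_datadir, source_conninfo):
    # Force a checkpoint on the new master first, so its control file
    # reflects the new timeline before pg_rewind reads it.
    subprocess.check_call(["psql", "-d", source_conninfo, "-c", "CHECKPOINT"])
    subprocess.check_call(build_rewind_command(target_datadir, source_conninfo))
```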

There is another weird case I cannot explain easily. If I do a single
promotion/rewind, while executing an insert that takes quite some time
(insert into test select id from generate_series(1, 1000) id) and
actually interrupting it by the Patroni promotion (which does a
checkpoint and pg_ctl -mf stop on the former master, and pg_ctl promote
on the replica candidate), I'm getting the following (postgresql0 is the
rewound node, note that the latest checkpoint location there is lower
than the prior checkpoint one):

$ pg_controldata data/postgresql0
pg_control version number:    942
Catalog version number:   201511071
Database system identifier:   6225948429980199864
Database cluster state:   in archive recovery
pg_control last modified: Tue Dec  8 17:20:16 2015
Latest checkpoint location:   0/628
Prior checkpoint location:0/6000138
Latest checkpoint's REDO location:0/628
Latest checkpoint's REDO WAL file:00010006
Latest checkpoint's TimeLineID:   1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:  0/944
Latest checkpoint's NextOID:  16390
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Latest checkpoint's oldestXID:931
Latest checkpoint's oldestXID's DB:   1
Latest checkpoint's oldestActiveXID:  0
Latest checkpoint's oldestMultiXid:   1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTs:   0
Latest checkpoint's newestCommitTs:   0
Time of latest checkpoint:    Tue Dec  8 17:20:04 2015
Fake LSN counter for unlogged rels:   0/1
Minimum recovery ending location: 0/6020580
Min recovery ending loc's timeline:   2
Backup start location:0/0
Backup end location:  0/0
End-of-backup record required:no
wal_level setting:hot_standby
wal_log_hints setting:on
max_connections setting:  100
max_worker_processes setting: 8
max_prepared_xacts setting:   0
max_locks_per_xact setting:   64
track_commit_timestamp setting:   off
Maximum data alignment:   8
Database block size:  8192
Blocks per segment of large relation: 131072
WAL block size:   8192
Bytes per WAL segment:16777216
Maximum length of identifiers:64
Maximum columns in an index:  32
Maximum size of a TOAST chunk:1996
Size of a large-object chunk: 2048
Date/time type storage:   64-bit integers
Float
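Incidentally, the "latest below prior" anomaly noted above can be checked
mechanically by parsing the two LSNs out of the pg_controldata output. An
illustrative helper (not part of pg_rewind or Patroni; names are made up):

```python
def parse_lsn(lsn):
    """Convert an LSN string like '0/6000138' to a comparable integer."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def checkpoint_regressed(controldata_text):
    """True when the latest checkpoint LSN is below the prior one."""
    fields = {}
    for line in controldata_text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    latest = parse_lsn(fields["Latest checkpoint location"])
    prior = parse_lsn(fields["Prior checkpoint location"])
    return latest < prior
```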

Re: [HACKERS] pg_rewind test race condition..?

2015-05-01 Thread Stephen Frost
Heikki,

* Heikki Linnakangas (hlinn...@iki.fi) wrote:
> On 04/29/2015 06:03 AM, Stephen Frost wrote:
> >Forcing a checkpoint in the regression tests and then providing a better
> >error message sounds reasonable to me.  I agree that it's very unlikely
> >to happen in the real world, even when you're bouncing between systems
> >for upgrades, etc, you're unlikely to do it fast enough for this issue
> >to exhibit itself, and a better error message would help any users who
> >manage to run into this (perhaps during their own testing).
> 
> I've committed this simple fix for now.

Looks good, I'm no longer seeing the failures.

Thanks!

Stephen




Re: [HACKERS] pg_rewind test race condition..?

2015-04-30 Thread Heikki Linnakangas

On 04/29/2015 06:03 AM, Stephen Frost wrote:

* Heikki Linnakangas (hlinn...@iki.fi) wrote:

--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7173,7 +7173,10 @@ StartupXLOG(void)
 * than is appropriate now that we're not in standby mode anymore.
 */
if (fast_promoted)
+   {
+   sleep(5);
RequestCheckpoint(CHECKPOINT_FORCE);
+   }
  }

The simplest fix would be to force a checkpoint in the regression
test, before running pg_rewind. It's a bit of a cop out, since you'd
still get the same issue when you tried to do the same thing in the
real world. It should be rare in practice - you'd not normally run
pg_rewind immediately after promoting the standby - but a better
error message at least would be nice..


Forcing a checkpoint in the regression tests and then providing a better
error message sounds reasonable to me.  I agree that it's very unlikely
to happen in the real world, even when you're bouncing between systems
for upgrades, etc, you're unlikely to do it fast enough for this issue
to exhibit itself, and a better error message would help any users who
manage to run into this (perhaps during their own testing).


I've committed this simple fix for now.
- Heikki





Re: [HACKERS] pg_rewind test race condition..?

2015-04-29 Thread Stephen Frost
Heikki,

* Heikki Linnakangas (hlinn...@iki.fi) wrote:
> The problem seems to be that when the standby is promoted, it's a
> so-called "fast promotion", where it writes an end-of-recovery
> record and starts accepting queries before creating a real
> checkpoint. pg_rewind looks at the TLI in the latest checkpoint, as
> it's in the control file, but that isn't updated until the
> checkpoint completes. I don't see it on my laptop normally, but I
> can reproduce it if I insert a "sleep(5)" in StartupXLog, just
> before it requests the checkpoint:

Ah, interesting.

> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -7173,7 +7173,10 @@ StartupXLOG(void)
>* than is appropriate now that we're not in standby mode anymore.
>*/
>   if (fast_promoted)
> + {
> + sleep(5);
>   RequestCheckpoint(CHECKPOINT_FORCE);
> + }
>  }
> 
> The simplest fix would be to force a checkpoint in the regression
> test, before running pg_rewind. It's a bit of a cop out, since you'd
> still get the same issue when you tried to do the same thing in the
> real world. It should be rare in practice - you'd not normally run
> pg_rewind immediately after promoting the standby - but a better
> error message at least would be nice..

Forcing a checkpoint in the regression tests and then providing a better
error message sounds reasonable to me.  I agree that it's very unlikely
to happen in the real world, even when you're bouncing between systems
for upgrades, etc, you're unlikely to do it fast enough for this issue
to exhibit itself, and a better error message would help any users who
manage to run into this (perhaps during their own testing).

Another thought would be to provide an option to pg_rewind to have it do
an explicit checkpoint before it reads the control file..  I'm not
against having it simply always do it as I don't see pg_rewind being a
commonly run thing, but I know some environments have very heavy
checkpoints and that might not be ideal.

Thanks!

Stephen




Re: [HACKERS] pg_rewind test race condition..?

2015-04-28 Thread Heikki Linnakangas

On 04/28/2015 11:02 AM, Stephen Frost wrote:

Heikki,

   Not sure if anyone else is seeing this, but I'm getting regression
   test failures when running the pg_rewind tests pretty consistently
   with 'make check'.  Specifically with "basic remote", I'm getting:

source and target cluster are on the same timeline
Failure, exiting

   in regress_log/pg_rewind_log_basic_remote.

   If I throw a "sleep(5);" into t/001_basic.pl before the call to
   RewindTest::run_pg_rewind($test_mode); then everything works fine.


The problem seems to be that when the standby is promoted, it's a 
so-called "fast promotion", where it writes an end-of-recovery record 
and starts accepting queries before creating a real checkpoint. 
pg_rewind looks at the TLI in the latest checkpoint, as it's in the 
control file, but that isn't updated until the checkpoint completes. I 
don't see it on my laptop normally, but I can reproduce it if I insert a 
"sleep(5)" in StartupXLog, just before it requests the checkpoint:


--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7173,7 +7173,10 @@ StartupXLOG(void)
 * than is appropriate now that we're not in standby mode anymore.
 */
if (fast_promoted)
+   {
+   sleep(5);
RequestCheckpoint(CHECKPOINT_FORCE);
+   }
 }

The simplest fix would be to force a checkpoint in the regression test, 
before running pg_rewind. It's a bit of a cop out, since you'd still get 
the same issue when you tried to do the same thing in the real world. It 
should be rare in practice - you'd not normally run pg_rewind 
immediately after promoting the standby - but a better error message at 
least would be nice..
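A caller that cannot tolerate the race could also poll the control file until
the post-promotion checkpoint has bumped the timeline. A rough illustration
with hypothetical helper names, assuming pg_controldata is on PATH (this is
not how pg_rewind itself works):

```python
import re
import subprocess
import time

def latest_checkpoint_tli(controldata_text):
    """Extract "Latest checkpoint's TimeLineID" from pg_controldata output."""
    m = re.search(r"Latest checkpoint's TimeLineID:\s*(\d+)", controldata_text)
    if m is None:
        raise ValueError("TimeLineID not found in pg_controldata output")
    return int(m.group(1))

def wait_for_new_timeline(datadir, old_tli, timeout=30.0):
    """Poll the control file until the post-promotion checkpoint lands."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        out = subprocess.check_output(["pg_controldata", datadir], text=True)
        if latest_checkpoint_tli(out) > old_tli:
            return True
        time.sleep(0.5)
    return False
```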

- Heikki





[HACKERS] pg_rewind test race condition..?

2015-04-28 Thread Stephen Frost
Heikki,

  Not sure if anyone else is seeing this, but I'm getting regression
  test failures when running the pg_rewind tests pretty consistently
  with 'make check'.  Specifically with "basic remote", I'm getting:

source and target cluster are on the same timeline
Failure, exiting

  in regress_log/pg_rewind_log_basic_remote.

  If I throw a "sleep(5);" into t/001_basic.pl before the call to
  RewindTest::run_pg_rewind($test_mode); then everything works fine.

  ./configure --silent --prefix="$INSTALL" --with-openssl --with-tcl
  --with-tclconfig=/usr/lib/tcl8.6 --with-perl --enable-debug
  --enable-cassert --enable-tap-tests --with-gssapi

  Full unsuccessful pg_rewind_log_basic_remote output:
-
waiting for server to startLOG:  database system was shut down at 2015-04-28 13:46:34 EDT
LOG:  database system is ready to accept connections
 done
server started
waiting for server to startLOG:  database system was interrupted; last known up at 2015-04-28 13:46:35 EDT
LOG:  entering standby mode
LOG:  redo starts at 0/228
LOG:  consistent recovery state reached at 0/2F8
LOG:  database system is ready to accept read only connections
LOG:  started streaming WAL from primary at 0/300 on timeline 1
 done
server started
server promoting
LOG:  received promote request
FATAL:  terminating walreceiver process due to administrator command
LOG:  invalid record length at 0/30EB410
LOG:  redo done at 0/30EB3A0
LOG:  last completed transaction was at log time 2015-04-28 13:46:37.540952-04
LOG:  selected new timeline ID: 2
LOG:  archive recovery complete
LOG:  database system is ready to accept connections
waiting for server to shut downLOG:  received fast shutdown request
LOG:  aborting any active transactions
LOG:  shutting down
LOG:  database system is shut down
 done
server stopped

source and target cluster are on the same timeline
Failure, exiting
waiting for server to startLOG:  database system was shut down at 2015-04-28 13:46:39 EDT
LOG:  entering standby mode
LOG:  consistent recovery state reached at 0/3311438
LOG:  invalid record length at 0/3311438
LOG:  database system is ready to accept read only connections
LOG:  fetching timeline history file for timeline 2 from primary server
LOG:  started streaming WAL from primary at 0/300 on timeline 1
LOG:  replication terminated by primary server
DETAIL:  End of WAL reached on timeline 1 at 0/30EB410.
LOG:  new timeline 2 forked off current database system timeline 1 before current recovery point 0/3311438
LOG:  restarted WAL streaming at 0/300 on timeline 1
LOG:  replication terminated by primary server
DETAIL:  End of WAL reached on timeline 1 at 0/30EB410.
LOG:  new timeline 2 forked off current database system timeline 1 before current recovery point 0/3311438
 done
server started
LOG:  received immediate shutdown request
LOG:  received immediate shutdown request
-

  Full successful pg_rewind_log_basic_remote output:

-
waiting for server to startLOG:  database system was shut down at 2015-04-28 13:54:04 EDT
LOG:  database system is ready to accept connections
 done
server started
waiting for server to startLOG:  database system was interrupted; last known up at 2015-04-28 13:54:05 EDT
LOG:  entering standby mode
LOG:  redo starts at 0/228
LOG:  consistent recovery state reached at 0/2F8
LOG:  database system is ready to accept read only connections
LOG:  started streaming WAL from primary at 0/300 on timeline 1
 done
server started
server promoting
LOG:  received promote request
FATAL:  terminating walreceiver process due to administrator command
LOG:  invalid record length at 0/30EB410
LOG:  redo done at 0/30EB3A0
LOG:  last completed transaction was at log time 2015-04-28 13:54:07.547533-04
LOG:  selected new timeline ID: 2
LOG:  archive recovery complete
LOG:  database system is ready to accept connections
waiting for server to shut downLOG:  received fast shutdown request
LOG:  aborting any active transactions
LOG:  shutting down
LOG:  database system is shut down
 done
server stopped
The servers diverged at WAL position 0/30EB410 on timeline 1.
Rewinding from last common checkpoint at 0/30EB3A0 on timeline 1
Done!
waiting for server to startLOG:  database system was interrupted while in recovery at log time 2015-04-28 13:54:08 EDT
HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
LOG:  entering standby mode
LOG:  redo starts at 0/30EB368
LOG:  consistent recovery state reached at 0/3113788
LOG:  invalid record length at 0/3113788
LOG:  database system is ready to accept read only connections
LOG:  started streaming WAL from primary at 0/300 on timeline 2
 done
server started
LOG:  received immediate shutdown request