Re: requested timeline ... does not contain minimum recovery point ...
> On Jul 12, 2018, at 19:54, Andres Freund wrote:
> Do you see a "checkpoint complete: wrote ..." message
> before the rewind started?

Checking, but I suspect that's exactly the problem.

This raises a question: Would it make sense for pg_rewind to either force a checkpoint or have a --checkpoint option along the lines of pg_basebackup? This scenario (pg_rewind being run very quickly after secondary promotion) is not uncommon when there's scripting around the switch-over process.

--
-- Christophe Pettus
   x...@thebuild.com
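For scripted switch-overs like the one described above, the checkpoint can be forced from the script itself before pg_rewind runs. The following is only a sketch under the assumption that the missing completed checkpoint really is the cause, which this thread has not confirmed. The function names, the connection string, and the data directory path are hypothetical; `CHECKPOINT`, `psql -X -d ... -c ...`, and pg_rewind's `--target-pgdata` and `--source-server` options are the only PostgreSQL interfaces used.

```python
import subprocess
from typing import List


def build_switchover_commands(source_conninfo: str, target_pgdata: str) -> List[List[str]]:
    """Build the two commands: checkpoint the promoted standby, then rewind the old primary.

    source_conninfo: libpq connection string for the promoted standby (hypothetical value).
    target_pgdata:   data directory of the cleanly shut down old primary (hypothetical value).
    """
    # CHECKPOINT returns only after the checkpoint has finished; it needs a superuser role.
    checkpoint_cmd = ["psql", "-X", "-d", source_conninfo, "-c", "CHECKPOINT"]
    # pg_rewind is run on the old primary's data directory, pulling from the new primary.
    rewind_cmd = [
        "pg_rewind",
        f"--target-pgdata={target_pgdata}",
        f"--source-server={source_conninfo}",
    ]
    return [checkpoint_cmd, rewind_cmd]


def run_switchover(source_conninfo: str, target_pgdata: str) -> None:
    """Run the commands in order, stopping at the first failure."""
    for cmd in build_switchover_commands(source_conninfo, target_pgdata):
        subprocess.run(cmd, check=True)
```

Keeping the command construction separate from execution makes the ordering easy to inspect or log before anything touches the cluster.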
Re: requested timeline ... does not contain minimum recovery point ...
On 2018-07-12 19:22:50 -0700, Christophe Pettus wrote:
>
> > On Jul 12, 2018, at 17:52, Michael Paquier wrote:
> > Wild guess: you did not issue a checkpoint on the promoted standby
> > before running pg_rewind.
>
> I don't believe a manual checkpoint was done on the target (promoted standby,
> new master), but it did one as usual during startup after the timeline switch:
>
> > 2018-07-10 19:28:38 UTC [5068]: [1-1] user=,db=,app=,client= LOG:
> > checkpoint starting: force
>
> The pg_rewind was started about 90 seconds later.

Note that that message doesn't indicate a completed checkpoint, just that one started. Do you see a "checkpoint complete: wrote ..." message before the rewind started?

Greetings,

Andres Freund
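The distinction between a started and a completed checkpoint can be checked mechanically in the server log. Below is a minimal sketch, assuming the log contains the two message prefixes quoted in this thread ("checkpoint starting:" and "checkpoint complete:"); the function name is hypothetical, and the caller is expected to pass only the log lines written before pg_rewind started.

```python
from typing import Iterable


def checkpoint_completed(log_lines: Iterable[str]) -> bool:
    """Return True if the most recent checkpoint in the log ran to completion.

    A "checkpoint starting:" line resets the state, so a start that is not
    followed by a matching "checkpoint complete:" line yields False.
    """
    completed = False
    for line in log_lines:
        if "checkpoint starting:" in line:
            completed = False  # a new checkpoint began; not finished yet
        elif "checkpoint complete:" in line:
            completed = True  # the checkpoint in progress has finished
    return completed
```

Applied to a log whose last checkpoint entry is the "checkpoint starting: force" line quoted above, this returns False, which is the situation the question is probing for.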
Re: requested timeline ... does not contain minimum recovery point ...
> On Jul 12, 2018, at 19:22, Christophe Pettus wrote:
>
>> On Jul 12, 2018, at 17:52, Michael Paquier wrote:
>> Wild guess: you did not issue a checkpoint on the promoted standby
>> before running pg_rewind.
>
> I don't believe a manual checkpoint was done on the target (promoted standby,
> new master), but it did one as usual during startup after the timeline switch:
>
>> 2018-07-10 19:28:38 UTC [5068]: [1-1] user=,db=,app=,client= LOG:
>> checkpoint starting: force
>
> The pg_rewind was started about 90 seconds later.

That being said, the pg_rewind output seems to indicate that the old divergence point was still being picked up, rather than the one on timeline 104:

> servers diverged at WAL position A58/5000 on timeline 103
> rewinding from last common checkpoint at A58/4E0689F0 on timeline 103

--
-- Christophe Pettus
   x...@thebuild.com
Re: requested timeline ... does not contain minimum recovery point ...
> On Jul 12, 2018, at 17:52, Michael Paquier wrote:
> Wild guess: you did not issue a checkpoint on the promoted standby
> before running pg_rewind.

I don't believe a manual checkpoint was done on the target (promoted standby, new master), but it did one as usual during startup after the timeline switch:

> 2018-07-10 19:28:38 UTC [5068]: [1-1] user=,db=,app=,client= LOG: checkpoint starting: force

The pg_rewind was started about 90 seconds later.

--
-- Christophe Pettus
   x...@thebuild.com
Re: requested timeline ... does not contain minimum recovery point ...
On Thu, Jul 12, 2018 at 02:26:17PM -0700, Christophe Pettus wrote:
> What surprises me about the error is that while the recovery point
> seems reasonable, it shouldn't be on timeline 103, but on timeline
> 105.

Wild guess: you did not issue a checkpoint on the promoted standby before running pg_rewind.
--
Michael
Re: requested timeline ... does not contain minimum recovery point ...
Hi,

On 2018-07-12 10:20:06 -0700, Christophe Pettus wrote:
> PostgreSQL 9.6.9, Windows Server 2012 Datacenter (64-bit).
>
> We're trying to diagnose the error:
>
> requested timeline 105 does not contain minimum recovery point
> A58/6B109F28 on timeline 103
>
> The error occurs when a WAL-shipping (not streaming) secondary starts up.
>
> These two machines have been part of a stress-test where, repeatedly, the
> secondary is promoted, the old primary is rewound using pg_rewind, and then
> attached to the new primary. This has worked for multiple iterations, but
> this error popped up. The last cycle was particularly fast: the new primary
> was only up for about 10 seconds (although it had completed recovery) before
> being shut down again, and pg_rewind applied to it to reconnect it with the
> promoted secondary.

This needs a lot more information before somebody can reasonably act on it.

Greetings,

Andres Freund