Re: requested timeline ... does not contain minimum recovery point ...

2018-07-13 Thread Christophe Pettus


> On Jul 12, 2018, at 19:54, Andres Freund wrote:
> Do you see a "checkpoint complete: wrote ..." message
> before the rewind started?

Checking, but I suspect that's exactly the problem.

This raises a question: Would it make sense for pg_rewind to either force a 
checkpoint or have a --checkpoint option along the lines of pg_basebackup?  
This scenario (pg_rewind being run very quickly after secondary promotion) is 
not uncommon when there's scripting around the switch-over process.
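
Until something like that exists, the scripted equivalent is to issue a CHECKPOINT on the promoted server and only then start pg_rewind on the old primary. A minimal sketch of that ordering follows; the helper names, the connection string, and the data directory are hypothetical placeholders, not taken from the actual switch-over scripts. It relies on the SQL CHECKPOINT command returning only once the checkpoint has finished, and on the connecting role being a superuser.

```python
import subprocess


def rewind_commands(source_conninfo, target_pgdata):
    """Build the two commands to run, in order (hypothetical helper).

    source_conninfo: libpq connection string for the promoted standby.
    target_pgdata:   data directory of the old primary to be rewound.
    """
    # psql treats a -d value containing '=' as a conninfo string.
    checkpoint = ["psql", "-X", "-d", source_conninfo, "-c", "CHECKPOINT"]
    rewind = [
        "pg_rewind",
        "--target-pgdata=" + target_pgdata,
        "--source-server=" + source_conninfo,
    ]
    return [checkpoint, rewind]


def run_checkpoint_then_rewind(source_conninfo, target_pgdata):
    """Run the checkpoint first; only run pg_rewind if it succeeded."""
    for cmd in rewind_commands(source_conninfo, target_pgdata):
        # check=True raises on a non-zero exit, so a failed checkpoint
        # stops the sequence before pg_rewind is started.
        subprocess.run(cmd, check=True)
```

Keeping the command construction separate from execution makes the ordering easy to review or log before anything touches the cluster.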

--
-- Christophe Pettus
   x...@thebuild.com




Re: requested timeline ... does not contain minimum recovery point ...

2018-07-12 Thread Andres Freund
On 2018-07-12 19:22:50 -0700, Christophe Pettus wrote:
> 
> > On Jul 12, 2018, at 17:52, Michael Paquier wrote:
> > Wild guess: you did not issue a checkpoint on the promoted standby
> > before running pg_rewind.
> 
> I don't believe a manual checkpoint was done on the target (promoted standby, 
> new master), but it did one as usual during startup after the timeline switch:
> 
> > 2018-07-10 19:28:38 UTC [5068]: [1-1] user=,db=,app=,client= LOG:  
> > checkpoint starting: force
> 
> 
> The pg_rewind was started about 90 seconds later.

Note that that message doesn't indicate a completed checkpoint, just
that one started. Do you see a "checkpoint complete: wrote ..." message
before the rewind started?
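
That check can be scripted against the server log. The sketch below assumes the log lines begin with a "YYYY-MM-DD HH:MM:SS" timestamp, as in the line quoted above, that messages are in English, and that the rewind start time is expressed in the same time zone as the log. The function names are hypothetical.

```python
from datetime import datetime


def _timestamp(line):
    """Parse the leading timestamp of a log line, or return None."""
    try:
        return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
    except ValueError:
        # Continuation lines and anything without a timestamp prefix.
        return None


def checkpoint_finished_before(log_lines, rewind_start):
    """True if the last checkpoint event logged at or before
    rewind_start is a completion, not merely a start."""
    state = None
    for line in log_lines:
        ts = _timestamp(line)
        if ts is None or ts > rewind_start:
            continue
        if "checkpoint starting:" in line:
            state = "started"
        elif "checkpoint complete:" in line:
            state = "complete"
    return state == "complete"
```

A result of False with a "checkpoint starting" line present means the rewind began while the checkpoint was still in progress.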

Greetings,

Andres Freund



Re: requested timeline ... does not contain minimum recovery point ...

2018-07-12 Thread Christophe Pettus


> On Jul 12, 2018, at 19:22, Christophe Pettus wrote:
> 
> 
>> On Jul 12, 2018, at 17:52, Michael Paquier wrote:
>> Wild guess: you did not issue a checkpoint on the promoted standby
>> before running pg_rewind.
> 
> I don't believe a manual checkpoint was done on the target (promoted standby, 
> new master), but it did one as usual during startup after the timeline switch:
> 
>> 2018-07-10 19:28:38 UTC [5068]: [1-1] user=,db=,app=,client= LOG:  
>> checkpoint starting: force
> 
> The pg_rewind was started about 90 seconds later.

That being said, the pg_rewind output seems to indicate that the old divergence 
point was still being picked up, rather than the one on timeline 104:

> servers diverged at WAL position A58/5000 on timeline 103
> rewinding from last common checkpoint at A58/4E0689F0 on timeline 103
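
One way to see which timeline the promoted server's control file reports is to read it with pg_controldata on that host. As far as I understand it, the control file keeps showing the pre-promotion timeline until the first checkpoint after promotion completes, which would be consistent with the output above. A minimal sketch, assuming English pg_controldata output and Python 3.7 or later; the function names and the data directory argument are hypothetical.

```python
import subprocess


def parse_controldata(text):
    """Turn 'Label:   value' lines from pg_controldata into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only; values may contain colons.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def latest_checkpoint_timeline(pgdata):
    """Return the timeline ID of the latest checkpoint recorded in the
    control file of the cluster at pgdata (must be run on that host)."""
    out = subprocess.run(
        ["pg_controldata", pgdata],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return int(parse_controldata(out)["Latest checkpoint's TimeLineID"])
```

If this still returns the old timeline on the promoted server, a switch-over script could wait or issue a checkpoint before starting pg_rewind.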

--
-- Christophe Pettus
   x...@thebuild.com




Re: requested timeline ... does not contain minimum recovery point ...

2018-07-12 Thread Christophe Pettus


> On Jul 12, 2018, at 17:52, Michael Paquier wrote:
> Wild guess: you did not issue a checkpoint on the promoted standby
> before running pg_rewind.

I don't believe a manual checkpoint was done on the target (promoted standby, 
new master), but it did one as usual during startup after the timeline switch:

> 2018-07-10 19:28:38 UTC [5068]: [1-1] user=,db=,app=,client= LOG:  checkpoint starting: force


The pg_rewind was started about 90 seconds later.

--
-- Christophe Pettus
   x...@thebuild.com




Re: requested timeline ... does not contain minimum recovery point ...

2018-07-12 Thread Michael Paquier
On Thu, Jul 12, 2018 at 02:26:17PM -0700, Christophe Pettus wrote:
> What surprises me about the error is that while the recovery point
> seems reasonable, it shouldn't be on timeline 103, but on timeline
> 105.

Wild guess: you did not issue a checkpoint on the promoted standby
before running pg_rewind.
--
Michael




Re: requested timeline ... does not contain minimum recovery point ...

2018-07-12 Thread Andres Freund
Hi,

On 2018-07-12 10:20:06 -0700, Christophe Pettus wrote:
> PostgreSQL 9.6.9, Windows Server 2012 Datacenter (64-bit).
> 
> We're trying to diagnose the error:
> 
>   requested timeline 105 does not contain minimum recovery point A58/6B109F28 on timeline 103
> 
> The error occurs when a WAL-shipping (not streaming) secondary starts up.
> 
> These two machines have been part of a stress-test where, repeatedly, the 
> secondary is promoted, the old primary is rewound using pg_rewind, and then 
> attached to the new primary.  This has worked for multiple iterations, but 
> this error popped up.  The last cycle was particularly fast: the new primary 
> was only up for about 10 seconds (although it had completed recovery) before 
> being shut down again, and pg_rewind applied to it to reconnect it with the 
> promoted secondary.

This needs a lot more information before somebody can reasonably act on
it.

Greetings,

Andres Freund