Re: PANIC during crash recovery of a recently promoted standby

2018-07-05 Thread Michael Paquier
On Thu, Jul 05, 2018 at 01:03:14PM +0530, Pavan Deolasee wrote: > Many thanks Michael for doing the gruelling work of coming up with a more > complete fix, verifying all the cases, in various back branches. No problem. I hope I got the credits right. If there is anything wrong please feel free to

Re: PANIC during crash recovery of a recently promoted standby

2018-07-05 Thread Pavan Deolasee
On Thu, Jul 5, 2018 at 7:20 AM, Michael Paquier wrote: > On Mon, Jul 02, 2018 at 10:41:05PM +0900, Michael Paquier wrote: > > I am planning to finish wrapping this patch luckily on Wednesday JST > > time, or in the worst case on Thursday. I got this problem on my mind > > for a couple of days

Re: PANIC during crash recovery of a recently promoted standby

2018-07-04 Thread Michael Paquier
On Mon, Jul 02, 2018 at 10:41:05PM +0900, Michael Paquier wrote: > I am planning to finish wrapping this patch luckily on Wednesday JST > time, or in the worst case on Thursday. I got this problem on my mind > for a couple of days now and I could not find a case where the approach > taken could

Re: PANIC during crash recovery of a recently promoted standby

2018-07-02 Thread Michael Paquier
On Mon, Jul 02, 2018 at 04:25:13PM +0900, Kyotaro HORIGUCHI wrote: > When minRecoveryPoint is invalid, there are only two possible > cases. It may be at the very beginning of archive recovery or may be > running a crash recovery. In the latter case, we have detected > crash recovery before redo starts.

Re: PANIC during crash recovery of a recently promoted standby

2018-06-27 Thread Michael Paquier
Adding Heikki and Andres in CC here for awareness.. On Wed, Jun 27, 2018 at 05:29:38PM +0900, Michael Paquier wrote: > I have spent a bit of time testing this on HEAD, 10 and 9.6. For 9.5, > 9.4 and 9.3 I have reproduced the failure and tested the patch, but I > lacked time to perform more

Re: PANIC during crash recovery of a recently promoted standby

2018-06-27 Thread Michael Paquier
On Fri, Jun 22, 2018 at 03:25:48PM +0900, Michael Paquier wrote: > On Fri, Jun 22, 2018 at 02:34:02PM +0900, Kyotaro HORIGUCHI wrote: >> Hello, sorry for the absence and I looked at the second patch. > > Thanks for the review! I have been spending some time testing and torturing the patch for all

Re: PANIC during crash recovery of a recently promoted standby

2018-06-22 Thread Michael Paquier
On Fri, Jun 22, 2018 at 02:34:02PM +0900, Kyotaro HORIGUCHI wrote: > Hello, sorry for the absence and I looked at the second patch. Thanks for the review! > At Fri, 22 Jun 2018 13:45:21 +0900, Michael Paquier > wrote in <20180622044521.gc5...@paquier.xyz> >> long as crash recovery runs. And

Re: PANIC during crash recovery of a recently promoted standby

2018-06-21 Thread Kyotaro HORIGUCHI
Hello, sorry for the absence and I looked at the second patch. At Fri, 22 Jun 2018 13:45:21 +0900, Michael Paquier wrote in <20180622044521.gc5...@paquier.xyz> > On Fri, Jun 22, 2018 at 10:08:24AM +0530, Pavan Deolasee wrote: > > On Fri, Jun 22, 2018 at 9:28 AM, Michael Paquier > > wrote: > >> So

Re: PANIC during crash recovery of a recently promoted standby

2018-06-21 Thread Michael Paquier
On Fri, Jun 22, 2018 at 10:08:24AM +0530, Pavan Deolasee wrote: > On Fri, Jun 22, 2018 at 9:28 AM, Michael Paquier > wrote: >> So an extra pair of eyes from another committer would be >> welcome. I am letting that cool down for a couple of days now. > > I am not a committer, so don't know if my

Re: PANIC during crash recovery of a recently promoted standby

2018-06-21 Thread Pavan Deolasee
On Fri, Jun 22, 2018 at 9:28 AM, Michael Paquier wrote: > > > This is not really a complicated patch, and it took a lot of energy from > me the last couple of days per the nature of the many scenarios to think > about... Thanks for the efforts. It wasn't an easy bug to chase to begin with. So

Re: PANIC during crash recovery of a recently promoted standby

2018-06-21 Thread Michael Paquier
On Thu, Jun 07, 2018 at 07:58:29PM +0900, Kyotaro HORIGUCHI wrote: > (I believe that) By definition recovery doesn't end until the > end-of-recovery checkpoint ends, so from that viewpoint I think it > is wrong to clear ControlFile->minRecoveryPoint before the end. > > Invalid-page checking during

Re: PANIC during crash recovery of a recently promoted standby

2018-06-20 Thread Michael Paquier
On Thu, Jun 07, 2018 at 07:58:29PM +0900, Kyotaro HORIGUCHI wrote: > Invalid-page checking during crash recovery is harmful rather than > useless. It is done by CheckRecoveryConsistency even in crash > recovery against its expectation because there's a case where > minRecoveryPoint is valid but

Re: PANIC during crash recovery of a recently promoted standby

2018-05-24 Thread Michael Paquier
On Mon, May 14, 2018 at 01:14:22PM +0530, Pavan Deolasee wrote: > Looks like I didn't understand Alvaro's comment when he mentioned it to me > off-list. But I now see what Michael and Alvaro mean and that indeed seems > like a problem. I was thinking that the test for (ControlFile->state == >

Re: PANIC during crash recovery of a recently promoted standby

2018-05-14 Thread Pavan Deolasee
On Fri, May 11, 2018 at 8:39 PM, Alvaro Herrera wrote: > Michael Paquier wrote: > > On Thu, May 10, 2018 at 10:52:12AM +0530, Pavan Deolasee wrote: > > > I propose that we should always clear the minRecoveryPoint after > promotion > > > to ensure that crash recovery

Re: PANIC during crash recovery of a recently promoted standby

2018-05-13 Thread Michael Paquier
On Sat, May 12, 2018 at 07:41:33AM +0900, Michael Paquier wrote: > pg_ctl promote would wait for the control file to be updated, so you > cannot use it in the TAP tests to trigger the promotion. Still I think > I found one after waking up? Please note I have not tested it: > - Use a custom

Re: PANIC during crash recovery of a recently promoted standby

2018-05-11 Thread Michael Paquier
On Fri, May 11, 2018 at 12:09:58PM -0300, Alvaro Herrera wrote: > Yeah, I had this exact comment, but I was unable to come up with a test > case that would cause a problem. pg_ctl promote would wait for the control file to be updated, so you cannot use it in the TAP tests to trigger the

Re: PANIC during crash recovery of a recently promoted standby

2018-05-11 Thread Alvaro Herrera
Michael Paquier wrote: > On Thu, May 10, 2018 at 10:52:12AM +0530, Pavan Deolasee wrote: > > I propose that we should always clear the minRecoveryPoint after promotion > > to ensure that crash recovery always runs to the end if a just-promoted > > standby crashes before completing its first regular

Re: PANIC during crash recovery of a recently promoted standby

2018-05-10 Thread Michael Paquier
On Thu, May 10, 2018 at 10:52:12AM +0530, Pavan Deolasee wrote: > I propose that we should always clear the minRecoveryPoint after promotion > to ensure that crash recovery always runs to the end if a just-promoted > standby crashes before completing its first regular checkpoint. A WIP patch > is

PANIC during crash recovery of a recently promoted standby

2018-05-09 Thread Pavan Deolasee
Hello, I recently investigated a problem where a standby is promoted to be the new master. The promoted standby crashes shortly thereafter for whatever reason. Upon running crash recovery, the promoted standby (now master) PANICs with a message such as: PANIC,XX000,"WAL contains references to