Re: [HACKERS] Fast promotion failure

2013-05-21 Thread Heikki Linnakangas
On 21.05.2013 00:00, Simon Riggs wrote: When we set the new timeline we should be updating all values that might be used elsewhere. If we do that, then no matter when or how we run GetXLogReplayRecPtr, it can't ever get it wrong in any backend. --- a/src/backend/access/transam/xlog.c +++

Re: [HACKERS] Fast promotion failure

2013-05-21 Thread Simon Riggs
On 21 May 2013 07:46, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 21.05.2013 00:00, Simon Riggs wrote: When we set the new timeline we should be updating all values that might be used elsewhere. If we do that, then no matter when or how we run GetXLogReplayRecPtr, it can't ever get

Re: [HACKERS] Fast promotion failure

2013-05-21 Thread Simon Riggs
On 21 May 2013 09:26, Simon Riggs si...@2ndquadrant.com wrote: I'm OK with that principle... Well, after fighting some more with that, I've gone with the, er, principle of slightly less ugliness. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7

Re: [HACKERS] Fast promotion failure

2013-05-20 Thread Heikki Linnakangas
On 19.05.2013 17:25, Simon Riggs wrote: However, there is a call to RecoveryInProgress() at the top of the main loop of the checkpointer, which does explicitly state that it initializes TimeLineID if it's not set yet. The checkpointer makes the decision about whether to run a restartpoint or a

Re: [HACKERS] Fast promotion failure

2013-05-20 Thread Simon Riggs
On 20 May 2013 18:47, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 19.05.2013 17:25, Simon Riggs wrote: So while I believe that the checkpointer might have an incorrect TLI and that you've seen a bug, what isn't clear is that the checkpointer is the only process that would see an

Re: [HACKERS] Fast promotion failure

2013-05-20 Thread Heikki Linnakangas
On 20.05.2013 22:18, Simon Riggs wrote: On 20 May 2013 18:47, Heikki Linnakangas hlinnakan...@vmware.com wrote: Not sure what the best fix would be. Perhaps change the code in CreateRestartPoint() to do something like this instead: GetXLogReplayRecPtr(replayTLI); if (RecoveryInProgress())

Re: [HACKERS] Fast promotion failure

2013-05-20 Thread Simon Riggs
On 20 May 2013 20:40, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 20.05.2013 22:18, Simon Riggs wrote: On 20 May 2013 18:47, Heikki Linnakangas hlinnakan...@vmware.com wrote: Not sure what the best fix would be. Perhaps change the code in CreateRestartPoint() to do something like

Re: [HACKERS] Fast promotion failure

2013-05-19 Thread Simon Riggs
On 7 May 2013 10:57, Heikki Linnakangas hlinnakan...@vmware.com wrote: While testing the bug from the Assertion failure at standby promotion, I bumped into a different bug in fast promotion. When the first checkpoint after fast promotion is performed, there is no guarantee that the

Re: [HACKERS] Fast promotion failure

2013-05-16 Thread Kyotaro HORIGUCHI
Hello, Is the point of this discussion that the patch may leave out some glitch about the timing of timeline-related changes and Heikki saw an egress of that? AFAIU, the committed patch has some gap in overall scenario which is the fast promotion issue. Right, the fast

Re: [HACKERS] Fast promotion failure

2013-05-16 Thread Simon Riggs
On 16 May 2013 07:02, Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp wrote: fast promotion issue. Excuse me for not joining the thread earlier. I'm not available today, but will join in later in my evening. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL

Re: [HACKERS] Fast promotion failure

2013-05-16 Thread Amit Kapila
On Thursday, May 16, 2013 11:33 AM Kyotaro HORIGUCHI wrote: Hello, Is the point of this discussion that the patch may leave out some glitch about the timing of timeline-related changes and Heikki saw an egress of that? AFAIU, the committed patch has some gap in overall

Re: [HACKERS] Fast promotion failure

2013-05-13 Thread Heikki Linnakangas
On 13.05.2013 06:07, Amit Kapila wrote: On Monday, May 13, 2013 5:54 AM Kyotaro HORIGUCHI wrote: Heikki said in the first message in this thread that he suspected the cause of the failure he had seen to be the wrong TLI on which the checkpointer runs. Nevertheless, the patch you suggested for me looks

Re: [HACKERS] Fast promotion failure

2013-05-13 Thread Amit Kapila
On Monday, May 13, 2013 1:13 PM Heikki Linnakangas wrote: On 13.05.2013 06:07, Amit Kapila wrote: On Monday, May 13, 2013 5:54 AM Kyotaro HORIGUCHI wrote: Heikki said in the first message in this thread that he suspected the cause of the failure he had seen to be the wrong TLI on which

Re: [HACKERS] Fast promotion failure

2013-05-12 Thread Kyotaro HORIGUCHI
2013/05/10 20:01 Amit Kapila amit.kap...@huawei.com: C 2013-05-10 15:32:32.170 JST 9242 FATAL: could not receive data from WAL stream: Is there any chance that a network glitch caused this one-time error? Unix domain sockets are hardly likely to have such troubles. This test

Re: [HACKERS] Fast promotion failure

2013-05-12 Thread Amit Kapila
On Monday, May 13, 2013 5:54 AM Kyotaro HORIGUCHI wrote: 2013/05/10 20:01 Amit Kapila amit.kap...@huawei.com: C 2013-05-10 15:32:32.170 JST 9242 FATAL: could not receive data from WAL stream: Is there any chance that a network glitch caused this one-time error? Unix

Re: [HACKERS] Fast promotion failure

2013-05-10 Thread Kyotaro HORIGUCHI
Thank you for notifying me of that. It seems to me that it is the same problem as discussed and fixed in the thread below. http://www.postgresql.org/message-id/51894942.4080...@vmware.com Could you try with the fixes given by Heikki? The first one settles the timeline transition problem for the

Re: [HACKERS] Fast promotion failure

2013-05-10 Thread Amit Kapila
On Friday, May 10, 2013 2:07 PM Kyotaro HORIGUCHI wrote: Thank you for notifying me of that. It seems to me that it is the same problem as discussed and fixed in the thread below. http://www.postgresql.org/message-id/51894942.4080...@vmware.com Could you try with the fixes given by Heikki? The

Re: [HACKERS] Fast promotion failure

2013-05-09 Thread Kyotaro HORIGUCHI
Hello, I think it can happen that the last checkpoint is on the old timeline and there are operations on the new timeline, which might have caused the problem Heikki has seen. I suppose I have seen that. After adding an SQL command to modify the DB on standby-B after passive (propagated?)

Re: [HACKERS] Fast promotion failure

2013-05-09 Thread Kyotaro HORIGUCHI
With some additional logging, the situation should be clearer. It seems that sby-B fails to promote to TLI=2; nevertheless, the history file for TLI=2 is somehow sent to sby-C. So sby-B remains on TLI=1 but sby-C alone switches to TLI=2. # Come to think of it, I suspect that

Re: [HACKERS] Fast promotion failure

2013-05-09 Thread Amit Kapila
On Thursday, May 09, 2013 2:14 PM Kyotaro HORIGUCHI wrote: With some additional logging, the situation should be clearer. It seems that sby-B fails to promote to TLI=2; nevertheless, the history file for TLI=2 is somehow sent to sby-C. So sby-B remains on TLI=1 but sby-C alone switches to TLI=2

Re: [HACKERS] Fast promotion failure

2013-05-08 Thread Fujii Masao
On Tue, May 7, 2013 at 6:57 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: While testing the bug from the Assertion failure at standby promotion, I bumped into a different bug in fast promotion. When the first checkpoint after fast promotion is performed, there is no guarantee that the

Re: [HACKERS] Fast promotion failure

2013-05-08 Thread Amit Kapila
On Thursday, May 09, 2013 6:29 AM Fujii Masao wrote: On Tue, May 7, 2013 at 6:57 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: While testing the bug from the Assertion failure at standby promotion, I bumped into a different bug in fast promotion. When the first checkpoint after

[HACKERS] Fast promotion failure

2013-05-07 Thread Heikki Linnakangas
While testing the bug from the Assertion failure at standby promotion, I bumped into a different bug in fast promotion. When the first checkpoint after fast promotion is performed, there is no guarantee that the checkpointer process is running with the correct, new, ThisTimeLineID. In
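
For readers skimming the thread, the failure mode described in this first message can be pictured with a toy model. This is only a sketch, not PostgreSQL source: every name in it (ToyShared, ToyProcess, toy_promote, toy_checkpoint_tli, toy_stale_tli_demo) is hypothetical and invented for illustration. It assumes a process keeps a private copy of the timeline ID and shows that, unless that copy is refreshed after promotion, a checkpoint would be attributed to the old timeline.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only; none of these names are PostgreSQL identifiers. */
typedef uint32_t ToyTimeLineID;

/* Stand-in for state that every process can see (hypothetical). */
typedef struct ToyShared {
    bool          recovery_in_progress;
    ToyTimeLineID tli;
} ToyShared;

/* Stand-in for one process's private copy of the timeline ID. */
typedef struct ToyProcess {
    ToyTimeLineID cached_tli;
} ToyProcess;

/* Promotion switches the shared timeline and ends recovery. */
static void toy_promote(ToyShared *shared, ToyTimeLineID new_tli)
{
    shared->tli = new_tli;
    shared->recovery_in_progress = false;
}

/*
 * Returns the timeline a checkpoint would be attributed to.
 * With refresh_first == false the process trusts its private copy,
 * which is the stale-value hazard the message describes.
 */
static ToyTimeLineID toy_checkpoint_tli(ToyProcess *proc,
                                        const ToyShared *shared,
                                        bool refresh_first)
{
    if (refresh_first)
        proc->cached_tli = shared->tli;
    return proc->cached_tli;
}

/* Walks through the scenario; returns 0 if it behaves as described. */
int toy_stale_tli_demo(void)
{
    ToyShared  shared = { true, 1 };
    ToyProcess checkpointer = { 1 };

    toy_promote(&shared, 2);

    /* Without a refresh the checkpoint still uses the old timeline. */
    assert(toy_checkpoint_tli(&checkpointer, &shared, false) == 1);
    /* Refreshing from shared state first picks up the new timeline. */
    assert(toy_checkpoint_tli(&checkpointer, &shared, true) == 2);
    return 0;
}
```

Calling toy_stale_tli_demo() from a main() exercises both paths. The model says nothing about where the real fix belongs; the replies above discuss that.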