Re: [HACKERS] Switching timeline over streaming replication
On 23.12.2012 16:37, Fujii Masao wrote:
> On Fri, Dec 21, 2012 at 1:48 AM, Fujii Masao wrote:
>> On Sat, Dec 15, 2012 at 9:36 AM, Fujii Masao wrote:
>>> I found another "requested timeline does not contain minimum recovery
>>> point" error scenario in HEAD:
>>>
>>> 1. Set up the master 'M', one standby 'S1', and one cascade standby 'S2'.
>>> 2. Shut down the master 'M', promote the standby 'S1', and wait for 'S2'
>>>    to reconnect to 'S1'.
>>> 3. Set up a new cascade standby 'S3' connecting to 'S2'.
>>>
>>> Then 'S3' fails to start the recovery because of the following error:
>>>
>>> FATAL: requested timeline 2 does not contain minimum recovery point
>>> 0/300 on timeline 1
>>> LOG: startup process (PID 33104) exited with exit code 1
>>> LOG: aborting startup due to startup process failure
>>>
>>> The result of pg_controldata of 'S3' is:
>>>
>>> Latest checkpoint location:           0/388
>>> Prior checkpoint location:            0/260
>>> Latest checkpoint's REDO location:    0/388
>>> Latest checkpoint's REDO WAL file:    00020003
>>> Latest checkpoint's TimeLineID:       2
>>> Min recovery ending location:         0/300
>>> Min recovery ending loc's timeline:   1
>>> Backup start location:                0/0
>>> Backup end location:                  0/0
>>>
>>> The content of the timeline history file '0002.history' is:
>>>
>>> 1	0/388	no recovery target specified
>>
>> I still could reproduce this problem. Attached is the shell script which
>> reproduces the problem.
>
> This problem happens when the new standby starts up from a backup taken
> from another standby, and its recovery starts from a shutdown checkpoint
> record which causes a timeline switch. In this case, the timeline of the
> minimum recovery point can be different from that of the latest checkpoint
> (i.e., the shutdown checkpoint). But the following check in StartupXLOG()
> wrongly assumes that they are always the same, so the problem happens.
>
>     /*
>      * The min recovery point should be part of the requested timeline's
>      * history, too.
>      */
>     if (!XLogRecPtrIsInvalid(ControlFile->minRecoveryPoint) &&
>         tliOfPointInHistory(ControlFile->minRecoveryPoint - 1, expectedTLEs) !=
>         ControlFile->minRecoveryPointTLI)
>         ereport(FATAL,
>                 (errmsg("requested timeline %u does not contain minimum recovery point %X/%X on timeline %u",
>                         recoveryTargetTLI,
>                         (uint32) (ControlFile->minRecoveryPoint >> 32),
>                         (uint32) ControlFile->minRecoveryPoint,
>                         ControlFile->minRecoveryPointTLI)));

No, it doesn't assume that the min recovery point is on the same timeline as the checkpoint record. This is another variant of the "timeline history files are not included in the backup" problem discussed on the other thread with subject "pg_basebackup from cascading standby after timeline switch".

If you remove the min recovery point check above, the test case still fails, with a different error message:

LOG: unexpected timeline ID 1 in log segment 00020003, offset 0

If you modify the test script to copy the 0002.history file to the data-standby3/pg_xlog after running pg_basebackup, the test case works. (We still need to fix it, of course.)

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
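[Editor's illustration] The check above hinges on mapping a WAL position onto a timeline via the history entries parsed from <TLI>.history files. The sketch below is not PostgreSQL's code (the real logic is tliOfPointInHistory() in timeline.c); the names HistoryEntry and tli_of_point are invented for illustration, and the fallback behavior is simplified: any position past the last recorded switchpoint is assumed to belong to the newest timeline.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Simplified model of one timeline history entry: the timeline ID and the
 * switchpoint, i.e. the first LSN that belongs to the *next* timeline.
 * The real server parses these from "<parentTLI> <switchpoint> <reason>"
 * lines in a .history file; this sketch takes them as an array.
 */
typedef struct
{
	uint32_t	tli;	/* timeline ID */
	uint64_t	end;	/* first LSN belonging to the next timeline */
} HistoryEntry;

/*
 * Return the timeline a WAL position belongs to: the first entry whose
 * switchpoint lies beyond the position wins; anything past the last
 * switchpoint is taken to be on the newest (requested) timeline.
 */
static uint32_t
tli_of_point(uint64_t ptr, const HistoryEntry *entries, size_t n,
			 uint32_t newest_tli)
{
	for (size_t i = 0; i < n; i++)
	{
		if (ptr < entries[i].end)
			return entries[i].tli;
	}
	return newest_tli;
}
```

With a history entry for timeline 1 present, a min recovery point at or before the switchpoint resolves to timeline 1 and the sanity check is satisfied. When the .history file is missing from the backup (the bug under discussion), there is no entry for timeline 1 at all, so the lookup can only answer with the newest timeline, which mismatches minRecoveryPointTLI = 1 and triggers the FATAL above.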
Re: [HACKERS] Switching timeline over streaming replication
On Fri, Dec 21, 2012 at 1:48 AM, Fujii Masao wrote:
> On Sat, Dec 15, 2012 at 9:36 AM, Fujii Masao wrote:
>> On Sat, Dec 8, 2012 at 12:51 AM, Heikki Linnakangas wrote:
>>> On 06.12.2012 15:39, Amit Kapila wrote:
>>>> On Thursday, December 06, 2012 12:53 AM Heikki Linnakangas wrote:
>>>>> On 05.12.2012 14:32, Amit Kapila wrote:
>>>>>> On Tuesday, December 04, 2012 10:01 PM Heikki Linnakangas wrote:
>>>>>>> After some diversions to fix bugs and refactor existing code, I've
>>>>>>> committed a couple of small parts of this patch, which just add some
>>>>>>> sanity checks to notice incorrect PITR scenarios. Here's a new
>>>>>>> version of the main patch based on current HEAD.
>>>>>>
>>>>>> After testing with the new patch, the following problems are observed.
>>>>>>
>>>>>> Defect - 1:
>>>>>>
>>>>>> 1. start primary A
>>>>>> 2. start standby B following A
>>>>>> 3. start cascade standby C following B.
>>>>>> 4. start another standby D following C.
>>>>>> 5. Promote standby B.
>>>>>> 6. After successful time line switch in cascade standby C & D, stop D.
>>>>>> 7. Restart D, Startup is successful and connecting to standby C.
>>>>>> 8. Stop C.
>>>>>> 9. Restart C, startup is failing.
>>>>>
>>>>> Ok, the error I get in that scenario is:
>>>>>
>>>>> C 2012-12-05 19:55:43.840 EET 9283 FATAL: requested timeline 2 does not
>>>>> contain minimum recovery point 0/3023F08 on timeline 1
>>>>> C 2012-12-05 19:55:43.841 EET 9282 LOG: startup process (PID 9283)
>>>>> exited with exit code 1
>>>>> C 2012-12-05 19:55:43.841 EET 9282 LOG: aborting startup due to
>>>>> startup process failure
>>>>>
>>>>> That mismatch causes the error. I'd like to fix this by always treating
>>>>> the checkpoint record to be part of the new timeline. That feels more
>>>>> correct. The most straightforward way to implement that would be to peek
>>>>> at the xlog record before updating replayEndRecPtr and replayEndTLI. If
>>>>> it's a checkpoint record that changes TLI, set replayEndTLI to the new
>>>>> timeline before calling the redo-function. But it's a bit of a
>>>>> modularity violation to peek into the record like that.
>>>>>
>>>>> Or we could just revert the sanity check at beginning of recovery that
>>>>> throws the "requested timeline 2 does not contain minimum recovery point
>>>>> 0/3023F08 on timeline 1" error. The error I added to redo of checkpoint
>>>>> record that says "unexpected timeline ID %u in checkpoint record, before
>>>>> reaching minimum recovery point %X/%X on timeline %u" checks basically
>>>>> the same thing, but at a later stage. However, the way
>>>>> minRecoveryPointTLI is updated still seems wrong to me, so I'd like to
>>>>> fix that.
>>>>>
>>>>> I'm thinking of something like the attached (with some more comments
>>>>> before committing). Thoughts?
>>>>
>>>> This has fixed the problem reported. However, I am not able to think
>>>> whether there will be any problem if we remove the check "requested
>>>> timeline 2 does not contain minimum recovery point 0/3023F08 on
>>>> timeline 1" at beginning of recovery and just update replayEndTLI with
>>>> ThisTimeLineID?
>>>
>>> Well, it seems wrong for the control file to contain a situation like this:
>>>
>>> pg_control version number:            932
>>> Catalog version number:               201211281
>>> Database system identifier:           5819228770976387006
>>> Database cluster state:               shut down in recovery
>>> pg_control last modified:             pe 7. joulukuuta 2012 17.39.57
>>> Latest checkpoint location:           0/3023EA8
>>> Prior checkpoint location:            0/260
>>> Latest checkpoint's REDO location:    0/3023EA8
>>> Latest checkpoint's REDO WAL file:    00020003
>>> Latest checkpoint's TimeLineID:       2
>>> ...
>>> Time of latest checkpoint:            pe 7. joulukuuta 2012 17.39.49
>>> Min recovery ending location:         0/3023F08
>>> Min recovery ending loc's timeline:   1
>>>
>>> Note the latest checkpoint location and its TimelineID, and compare them
>>> with the min recovery ending location. The min recovery ending location
>>> is ahead of the latest checkpoint's location; the min recovery ending
>>> location actually points to the end of the checkpoint record. But how
>>> come the min recovery ending location's timeline is 1, while the
>>> checkpoint record's timeline is 2?
>>>
>>> Now maybe that would happen to work if we remove the sanity check, but
>>> it still seems horribly confusing. I'm afraid that discrepancy will come
>>> back to haunt us later if we leave it like that. So I'd like to fix that.
>>>
>>> Mulling over this for some more, I propose the attached patch. With the
>>> patch, we peek into the checkpoint record, and actually perform the
>>> timeline switch (by changing ThisTimeLineID) before replaying it. That
>>> way the checkpoint record is really considered to be on the new timeline
>>> for all purposes.
Re: [HACKERS] Switching timeline over streaming replication
On 21 December 2012 18:13, Heikki Linnakangas wrote:
> On 21.12.2012 01:50, Thom Brown wrote:
>> Now I'm getting this on all standbys after promoting the first standby
>> in a chain.
>>
>> ...
>>
>> TRAP: FailedAssertion("!(((sentPtr) <= (SendRqstPtr)))", File:
>> "walsender.c", Line: 1425)
>
> Sigh. I'm sounding like a broken record, but I just committed another fix
> for this, should work now.

Thanks Heikki. Just quickly retested with a new set of 120 standbys and all looks fine as far as the logs are concerned:

LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1
LOG: record with zero length at 0/37902A0
LOG: fetching timeline history file for timeline 2 from primary server
LOG: restarted WAL streaming at 0/300 on timeline 1
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1
LOG: new target timeline is 2
LOG: restarted WAL streaming at 0/300 on timeline 2
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 2
LOG: record with zero length at 0/643E248
LOG: fetching timeline history file for timeline 3 from primary server
LOG: restarted WAL streaming at 0/600 on timeline 2
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 2
LOG: new target timeline is 3
LOG: restarted WAL streaming at 0/600 on timeline 3
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 3
LOG: record with zero length at 0/6BB13A8
LOG: fetching timeline history file for timeline 4 from primary server
LOG: restarted WAL streaming at 0/600 on timeline 3
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 3
LOG: new target timeline is 4
LOG: restarted WAL streaming at 0/600 on timeline 4

-- Thom
Re: [HACKERS] Switching timeline over streaming replication
On 21.12.2012 01:50, Thom Brown wrote:
> Now I'm getting this on all standbys after promoting the first standby
> in a chain.
>
> ...
>
> TRAP: FailedAssertion("!(((sentPtr) <= (SendRqstPtr)))", File:
> "walsender.c", Line: 1425)

Sigh. I'm sounding like a broken record, but I just committed another fix for this, should work now.

- Heikki
Re: [HACKERS] Switching timeline over streaming replication
On 20 December 2012 12:45, Heikki Linnakangas wrote:
> On 17.12.2012 15:05, Thom Brown wrote:
>> I just set up 120 chained standbys, and for some reason I'm seeing these
>> errors:
>>
>> LOG: replication terminated by primary server
>> DETAIL: End of WAL reached on timeline 1
>> LOG: record with zero length at 0/301EC10
>> LOG: fetching timeline history file for timeline 2 from primary server
>> LOG: restarted WAL streaming at 0/300 on timeline 1
>> LOG: replication terminated by primary server
>> DETAIL: End of WAL reached on timeline 1
>> LOG: new target timeline is 2
>> LOG: restarted WAL streaming at 0/300 on timeline 2
>> LOG: replication terminated by primary server
>> DETAIL: End of WAL reached on timeline 2
>> FATAL: error reading result of streaming command: ERROR: requested WAL
>> segment 00020003 has already been removed
>>
>> ERROR: requested WAL segment 00020003 has already been removed
>> LOG: started streaming WAL from primary at 0/300 on timeline 2
>> ERROR: requested WAL segment 00020003 has already been removed
>
> I just committed a patch that should make the "requested WAL segment
> 00020003 has already been removed" errors go away. The trick
> was for walsenders to not switch to the new timeline until at least one
> record has been replayed on it. That closes the window where the
> walsender already considers the new timeline to be the latest, but the
> WAL file has not been created yet.

Now I'm getting this on all standbys after promoting the first standby in a chain.

LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1
LOG: record with zero length at 0/301EC10
LOG: fetching timeline history file for timeline 2 from primary server
LOG: restarted WAL streaming at 0/300 on timeline 1
FATAL: could not receive data from WAL stream:
LOG: new target timeline is 2
FATAL: could not connect to the primary server: FATAL: the database system is in recovery mode
LOG: started streaming WAL from primary at 0/300 on timeline 2
TRAP: FailedAssertion("!(((sentPtr) <= (SendRqstPtr)))", File: "walsender.c", Line: 1425)
LOG: server process (PID 19917) was terminated by signal 6: Aborted
LOG: terminating any other active server processes
LOG: all server processes terminated; reinitializing
LOG: database system was interrupted while in recovery at log time 2012-12-20 23:41:23 GMT
HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
LOG: entering standby mode
FATAL: the database system is in recovery mode
LOG: redo starts at 0/228
LOG: consistent recovery state reached at 0/2E8
LOG: database system is ready to accept read only connections
LOG: record with zero length at 0/301EC70
LOG: started streaming WAL from primary at 0/300 on timeline 2
LOG: unexpected EOF on standby connection

And if I restart the new primary, the first new standby connected to it shows:

LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 2
FATAL: error reading result of streaming command: server closed the connection unexpectedly
	This probably means the server terminated abnormally before or while processing the request.
LOG: record with zero length at 0/301F1E0

However, all other standbys don't show any additional log output.

-- Thom
Re: [HACKERS] Switching timeline over streaming replication
> I just committed a patch that should make the "requested WAL segment
> 00020003 has already been removed" errors go away. The
> trick was for walsenders to not switch to the new timeline until at
> least one record has been replayed on it. That closes the window where
> the walsender already considers the new timeline to be the latest, but
> the WAL file has not been created yet.

OK, I'll download the snapshot in a couple days and make sure this didn't break something else.

--Josh
Re: [HACKERS] Switching timeline over streaming replication
On Sat, Dec 15, 2012 at 9:36 AM, Fujii Masao wrote:
> On Sat, Dec 8, 2012 at 12:51 AM, Heikki Linnakangas wrote:
>> On 06.12.2012 15:39, Amit Kapila wrote:
>>> On Thursday, December 06, 2012 12:53 AM Heikki Linnakangas wrote:
>>>> On 05.12.2012 14:32, Amit Kapila wrote:
>>>>> On Tuesday, December 04, 2012 10:01 PM Heikki Linnakangas wrote:
>>>>>> After some diversions to fix bugs and refactor existing code, I've
>>>>>> committed a couple of small parts of this patch, which just add some
>>>>>> sanity checks to notice incorrect PITR scenarios. Here's a new
>>>>>> version of the main patch based on current HEAD.
>>>>>
>>>>> After testing with the new patch, the following problems are observed.
>>>>>
>>>>> Defect - 1:
>>>>>
>>>>> 1. start primary A
>>>>> 2. start standby B following A
>>>>> 3. start cascade standby C following B.
>>>>> 4. start another standby D following C.
>>>>> 5. Promote standby B.
>>>>> 6. After successful time line switch in cascade standby C & D, stop D.
>>>>> 7. Restart D, Startup is successful and connecting to standby C.
>>>>> 8. Stop C.
>>>>> 9. Restart C, startup is failing.
>>>>
>>>> Ok, the error I get in that scenario is:
>>>>
>>>> C 2012-12-05 19:55:43.840 EET 9283 FATAL: requested timeline 2 does not
>>>> contain minimum recovery point 0/3023F08 on timeline 1
>>>> C 2012-12-05 19:55:43.841 EET 9282 LOG: startup process (PID 9283)
>>>> exited with exit code 1
>>>> C 2012-12-05 19:55:43.841 EET 9282 LOG: aborting startup due to
>>>> startup process failure
>>>>
>>>> That mismatch causes the error. I'd like to fix this by always treating
>>>> the checkpoint record to be part of the new timeline. That feels more
>>>> correct. The most straightforward way to implement that would be to peek
>>>> at the xlog record before updating replayEndRecPtr and replayEndTLI. If
>>>> it's a checkpoint record that changes TLI, set replayEndTLI to the new
>>>> timeline before calling the redo-function. But it's a bit of a
>>>> modularity violation to peek into the record like that.
>>>>
>>>> Or we could just revert the sanity check at beginning of recovery that
>>>> throws the "requested timeline 2 does not contain minimum recovery point
>>>> 0/3023F08 on timeline 1" error. The error I added to redo of checkpoint
>>>> record that says "unexpected timeline ID %u in checkpoint record, before
>>>> reaching minimum recovery point %X/%X on timeline %u" checks basically
>>>> the same thing, but at a later stage. However, the way
>>>> minRecoveryPointTLI is updated still seems wrong to me, so I'd like to
>>>> fix that.
>>>>
>>>> I'm thinking of something like the attached (with some more comments
>>>> before committing). Thoughts?
>>>
>>> This has fixed the problem reported. However, I am not able to think
>>> whether there will be any problem if we remove the check "requested
>>> timeline 2 does not contain minimum recovery point 0/3023F08 on
>>> timeline 1" at beginning of recovery and just update replayEndTLI with
>>> ThisTimeLineID?
>>
>> Well, it seems wrong for the control file to contain a situation like this:
>>
>> pg_control version number:            932
>> Catalog version number:               201211281
>> Database system identifier:           5819228770976387006
>> Database cluster state:               shut down in recovery
>> pg_control last modified:             pe 7. joulukuuta 2012 17.39.57
>> Latest checkpoint location:           0/3023EA8
>> Prior checkpoint location:            0/260
>> Latest checkpoint's REDO location:    0/3023EA8
>> Latest checkpoint's REDO WAL file:    00020003
>> Latest checkpoint's TimeLineID:       2
>> ...
>> Time of latest checkpoint:            pe 7. joulukuuta 2012 17.39.49
>> Min recovery ending location:         0/3023F08
>> Min recovery ending loc's timeline:   1
>>
>> Note the latest checkpoint location and its TimelineID, and compare them
>> with the min recovery ending location. The min recovery ending location
>> is ahead of the latest checkpoint's location; the min recovery ending
>> location actually points to the end of the checkpoint record. But how
>> come the min recovery ending location's timeline is 1, while the
>> checkpoint record's timeline is 2?
>>
>> Now maybe that would happen to work if we remove the sanity check, but
>> it still seems horribly confusing. I'm afraid that discrepancy will come
>> back to haunt us later if we leave it like that. So I'd like to fix that.
>>
>> Mulling over this for some more, I propose the attached patch. With the
>> patch, we peek into the checkpoint record, and actually perform the
>> timeline switch (by changing ThisTimeLineID) before replaying it. That
>> way the checkpoint record is really considered to be on the new timeline
>> for all purposes. At the moment, the only difference that makes in
>> practice is that we set replayEndTLI, and thus minRecoveryPointTLI
Re: [HACKERS] Switching timeline over streaming replication
On 2012-12-20 14:45:05 +0200, Heikki Linnakangas wrote:
> On 17.12.2012 15:05, Thom Brown wrote:
>> I just set up 120 chained standbys, and for some reason I'm seeing these
>> errors:
>>
>> LOG: replication terminated by primary server
>> DETAIL: End of WAL reached on timeline 1
>> LOG: record with zero length at 0/301EC10
>> LOG: fetching timeline history file for timeline 2 from primary server
>> LOG: restarted WAL streaming at 0/300 on timeline 1
>> LOG: replication terminated by primary server
>> DETAIL: End of WAL reached on timeline 1
>> LOG: new target timeline is 2
>> LOG: restarted WAL streaming at 0/300 on timeline 2
>> LOG: replication terminated by primary server
>> DETAIL: End of WAL reached on timeline 2
>> FATAL: error reading result of streaming command: ERROR: requested WAL
>> segment 00020003 has already been removed
>>
>> ERROR: requested WAL segment 00020003 has already been removed
>> LOG: started streaming WAL from primary at 0/300 on timeline 2
>> ERROR: requested WAL segment 00020003 has already been removed
>
> I just committed a patch that should make the "requested WAL segment
> 00020003 has already been removed" errors go away. The trick
> was for walsenders to not switch to the new timeline until at least one
> record has been replayed on it. That closes the window where the
> walsender already considers the new timeline to be the latest, but the
> WAL file has not been created yet.

I vote for introducing InvalidTimeLineID soon... 0 as an invalid TimeLineID seems to spread and is annoying to grep for.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Switching timeline over streaming replication
On 17.12.2012 15:05, Thom Brown wrote:
> I just set up 120 chained standbys, and for some reason I'm seeing these
> errors:
>
> LOG: replication terminated by primary server
> DETAIL: End of WAL reached on timeline 1
> LOG: record with zero length at 0/301EC10
> LOG: fetching timeline history file for timeline 2 from primary server
> LOG: restarted WAL streaming at 0/300 on timeline 1
> LOG: replication terminated by primary server
> DETAIL: End of WAL reached on timeline 1
> LOG: new target timeline is 2
> LOG: restarted WAL streaming at 0/300 on timeline 2
> LOG: replication terminated by primary server
> DETAIL: End of WAL reached on timeline 2
> FATAL: error reading result of streaming command: ERROR: requested WAL
> segment 00020003 has already been removed
>
> ERROR: requested WAL segment 00020003 has already been removed
> LOG: started streaming WAL from primary at 0/300 on timeline 2
> ERROR: requested WAL segment 00020003 has already been removed

I just committed a patch that should make the "requested WAL segment 00020003 has already been removed" errors go away. The trick was for walsenders to not switch to the new timeline until at least one record has been replayed on it. That closes the window where the walsender already considers the new timeline to be the latest, but the WAL file has not been created yet.

- Heikki
Re: [HACKERS] Switching timeline over streaming replication
Heikki,

The next time I get the issue, and I'm not paying for 5 cloud servers by the hour, I'll give you a login.

--Josh

- Original Message -
> On 19.12.2012 17:27, Heikki Linnakangas wrote:
> > On 19.12.2012 15:55, Heikki Linnakangas wrote:
> >> On 19.12.2012 04:57, Josh Berkus wrote:
> >>> Heikki,
> >>>
> >>> I ran into an unexpected issue while testing. I just wanted to
> >>> fire up a chain of 5 replicas to see if I could connect them in a
> >>> loop. However, I ran into a weird issue when starting up "r3": it
> >>> refused to come out of "the database is starting up" mode until I
> >>> did a write on the master. Then it came up fine.
> >>>
> >>> master-->r1-->r2-->r3-->r4
> >>>
> >>> I tried doing the full replication sequence (basebackup, startup,
> >>> test) with it twice and got the exact same results each time.
> >>>
> >>> This is very strange because I did not encounter the same issues
> >>> with r2 or r4. Nor have I seen this before in my tests.
> >>
> >> Ok.. I'm going to need some more details on how to reproduce this,
> >> I'm not seeing that when I set up four standbys.
> >
> > Ok, I managed to reproduce this now.
>
> Hmph, no I didn't, I replied to the wrong email. The problem I managed
> to reproduce was the one where you get "requested WAL
> segment 00020003 has already been removed" errors,
> reported by Thom.
>
> - Heikki
Re: [HACKERS] Switching timeline over streaming replication
Heikki,

> The problem goes away after some time, after the 1st standby has
> streamed the contents of 00020003 and written it to
> disk, and the cascaded standby reconnects. But it would be nice to
> avoid that situation. I'm not sure how to do that yet, we might need to
> track the timeline we're currently receiving/sending more carefully.
> Or perhaps we need to copy the previous WAL segment to the new name
> when switching recovery target timeline, like we do when a server is
> promoted. I'll try to come up with something...

Would it be accurate to say that this issue only happens when all of the replicated servers have no traffic?

--Josh
Re: [HACKERS] Switching timeline over streaming replication
On 19.12.2012 17:27, Heikki Linnakangas wrote:
> On 19.12.2012 15:55, Heikki Linnakangas wrote:
>> On 19.12.2012 04:57, Josh Berkus wrote:
>>> Heikki,
>>>
>>> I ran into an unexpected issue while testing. I just wanted to fire up
>>> a chain of 5 replicas to see if I could connect them in a loop.
>>> However, I ran into a weird issue when starting up "r3": it refused to
>>> come out of "the database is starting up" mode until I did a write on
>>> the master. Then it came up fine.
>>>
>>> master-->r1-->r2-->r3-->r4
>>>
>>> I tried doing the full replication sequence (basebackup, startup, test)
>>> with it twice and got the exact same results each time.
>>>
>>> This is very strange because I did not encounter the same issues with
>>> r2 or r4. Nor have I seen this before in my tests.
>>
>> Ok.. I'm going to need some more details on how to reproduce this, I'm
>> not seeing that when I set up four standbys.
>
> Ok, I managed to reproduce this now.

Hmph, no I didn't, I replied to the wrong email. The problem I managed to reproduce was the one where you get "requested WAL segment 00020003 has already been removed" errors, reported by Thom.

- Heikki
Re: [HACKERS] Switching timeline over streaming replication
On 19.12.2012 15:55, Heikki Linnakangas wrote:
> On 19.12.2012 04:57, Josh Berkus wrote:
>> Heikki,
>>
>> I ran into an unexpected issue while testing. I just wanted to fire up
>> a chain of 5 replicas to see if I could connect them in a loop.
>> However, I ran into a weird issue when starting up "r3": it refused to
>> come out of "the database is starting up" mode until I did a write on
>> the master. Then it came up fine.
>>
>> master-->r1-->r2-->r3-->r4
>>
>> I tried doing the full replication sequence (basebackup, startup, test)
>> with it twice and got the exact same results each time.
>>
>> This is very strange because I did not encounter the same issues with
>> r2 or r4. Nor have I seen this before in my tests.
>
> Ok.. I'm going to need some more details on how to reproduce this, I'm
> not seeing that when I set up four standbys.

Ok, I managed to reproduce this now. The problem seems to be a timing problem, when a standby switches to follow a new timeline. Four is not a magic number here; it can happen with just one cascading standby too.

When the timeline switch happens, for example, the standby changes recovery target timeline from 1 to 2 at WAL position 0/30002D8, and it has all the WAL up to that WAL position. However, it only has that WAL in file 00010003, corresponding to timeline 1, and not in the file 00020003, corresponding to the new timeline. When a cascaded standby connects, it requests to start streaming from point 0/300 at timeline 2 (we always start streaming from the beginning of a segment, to avoid leaving partially-filled segments in pg_xlog). The walsender in the 1st standby tries to read that from file 00020003, which does not exist yet.

The problem goes away after some time, after the 1st standby has streamed the contents of 00020003 and written it to disk, and the cascaded standby reconnects. But it would be nice to avoid that situation. I'm not sure how to do that yet, we might need to track the timeline we're currently receiving/sending more carefully. Or perhaps we need to copy the previous WAL segment to the new name when switching recovery target timeline, like we do when a server is promoted. I'll try to come up with something...

- Heikki
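[Editor's illustration] The race Heikki describes comes from the timeline ID being baked into the WAL segment file name itself, so the same WAL bytes live under a different name on each timeline. Below is a hedged sketch, not PostgreSQL's code, though it mirrors the layout produced by the XLogFileName() macro under the default 16 MB segment size; the function name wal_file_name is invented for illustration.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Default WAL segment size (16 MB) and segments per 4 GB "log" file. */
#define WAL_SEG_SIZE	(16 * 1024 * 1024)
#define SEGS_PER_LOG	(UINT64_C(0x100000000) / WAL_SEG_SIZE)	/* 256 */

/*
 * Build a 24-character WAL segment file name from a timeline ID and a
 * segment number: 8 hex digits each for the timeline, the "log" number,
 * and the segment within the log.  buf must hold at least 25 bytes.
 */
static void
wal_file_name(char *buf, uint32_t tli, uint64_t segno)
{
	snprintf(buf, 25, "%08" PRIX32 "%08" PRIX32 "%08" PRIX32,
			 tli,
			 (uint32_t) (segno / SEGS_PER_LOG),
			 (uint32_t) (segno % SEGS_PER_LOG));
}
```

For segment 3, timeline 1 yields 000000010000000000000003 while timeline 2 yields 000000020000000000000003. After a mid-segment switch to recovery target timeline 2, the first standby still holds those bytes only under the timeline-1 name, so a cascaded standby asking to stream that segment on timeline 2 points the walsender at a file that does not exist yet.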
Re: [HACKERS] Switching timeline over streaming replication
On 19.12.2012 04:57, Josh Berkus wrote:
> Heikki,
>
> I ran into an unexpected issue while testing. I just wanted to fire up
> a chain of 5 replicas to see if I could connect them in a loop.
> However, I ran into a weird issue when starting up "r3": it refused to
> come out of "the database is starting up" mode until I did a write on
> the master. Then it came up fine.
>
> master-->r1-->r2-->r3-->r4
>
> I tried doing the full replication sequence (basebackup, startup, test)
> with it twice and got the exact same results each time.
>
> This is very strange because I did not encounter the same issues with
> r2 or r4. Nor have I seen this before in my tests.

Ok.. I'm going to need some more details on how to reproduce this, I'm not seeing that when I set up four standbys.

> I'm also seeing Thom's spurious error message now. Each of r2, r3 and
> r4 have the following message once in their logs:
>
> LOG: database system was interrupted while in recovery at log time
> 2012-12-19 02:49:34 GMT
> HINT: If this has occurred more than once some data might be corrupted
> and you might need to choose an earlier recovery target.
>
> This message doesn't seem to signify anything.

Yep. You get that message when you start up the system from a base backup that was taken from a standby server. It's just noise; it would be nice if we could dial it down somehow. In general, streaming replication and backups tend to be awfully noisy. I've been meaning to go through all the messages that get printed during normal operation and think carefully about which ones are really necessary, and which ones could perhaps be merged into more compact messages. But I haven't gotten around to it; that would be a great project for someone who actually sets up these systems regularly in production.

- Heikki
Re: [HACKERS] Switching timeline over streaming replication
Heikki,

I ran into an unexpected issue while testing. I just wanted to fire up a chain of 5 replicas to see if I could connect them in a loop. However, I ran into a weird issue when starting up "r3": it refused to come out of "the database is starting up" mode until I did a write on the master. Then it came up fine.

master-->r1-->r2-->r3-->r4

I tried doing the full replication sequence (basebackup, startup, test) with it twice and got the exact same results each time. This is very strange because I did not encounter the same issues with r2 or r4. Nor have I seen this before in my tests.

I'm also seeing Thom's spurious error message now. Each of r2, r3 and r4 have the following message once in their logs:

LOG: database system was interrupted while in recovery at log time 2012-12-19 02:49:34 GMT
HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.

This message doesn't seem to signify anything.

-- Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Re: [HACKERS] Switching timeline over streaming replication
Since Thom already did the destruction test, I only chained 7 standbys, just to see if I could reproduce his error.

In the process, I accidentally connected one standby to itself. This failed, but the error message wasn't very helpful; it just gave me "FATAL: could not connect, the database system is starting up". Surely there's some way we could tell the user they've tried to connect a standby to itself?

Anyway, I was unable to reproduce Thom's error. I did not see the error message he did. Without any read queries running on the standbys, lag from master to replica7 averaged about 0.5 seconds, ranging between 0.1 seconds and 1.2 seconds.

-- Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Re: [HACKERS] Switching timeline over streaming replication
On 17 December 2012 12:07, Heikki Linnakangas wrote: > On 15.12.2012 01:09, Josh Berkus wrote: > >> Tested this on yesterday's snapshot. Worked great. >> > > Thanks for the testing! > > > Now I wanna test a chain of cascading replicas ... how far can we chain >> these? >> > > There's no limit in theory. I tested with one master and two chained > standbys myself. Give it a shot, I'm curious to hear how it works with a > chain of a hundred standbys ;-). > I just set up 120 chained standbys, and for some reason I'm seeing these errors: LOG: replication terminated by primary server DETAIL: End of WAL reached on timeline 1 LOG: record with zero length at 0/301EC10 LOG: fetching timeline history file for timeline 2 from primary server LOG: restarted WAL streaming at 0/300 on timeline 1 LOG: replication terminated by primary server DETAIL: End of WAL reached on timeline 1 LOG: new target timeline is 2 LOG: restarted WAL streaming at 0/300 on timeline 2 LOG: replication terminated by primary server DETAIL: End of WAL reached on timeline 2 FATAL: error reading result of streaming command: ERROR: requested WAL segment 00020003 has already been removed ERROR: requested WAL segment 00020003 has already been removed LOG: started streaming WAL from primary at 0/300 on timeline 2 ERROR: requested WAL segment 00020003 has already been removed The "End of WAL reached on timeline 2" appears on all standbys except the one streaming directly from the primary. However, changes continue to cascade to all standbys right to the end of the chain (it takes several minutes to propagate however). -- Thom
Re: [HACKERS] Switching timeline over streaming replication
On 15.12.2012 01:09, Josh Berkus wrote: Tested this on yesterday's snapshot. Worked great. Thanks for the testing! Now I wanna test a chain of cascading replicas ... how far can we chain these? There's no limit in theory. I tested with one master and two chained standbys myself. Give it a shot, I'm curious to hear how it works with a chain of a hundred standbys ;-). - Heikki
Re: [HACKERS] Switching timeline over streaming replication
On Sat, Dec 8, 2012 at 12:51 AM, Heikki Linnakangas wrote: > On 06.12.2012 15:39, Amit Kapila wrote: >> >> On Thursday, December 06, 2012 12:53 AM Heikki Linnakangas wrote: >>> >>> On 05.12.2012 14:32, Amit Kapila wrote: On Tuesday, December 04, 2012 10:01 PM Heikki Linnakangas wrote: > > After some diversions to fix bugs and refactor existing code, I've > committed a couple of small parts of this patch, which just add some > sanity checks to notice incorrect PITR scenarios. Here's a new > version of the main patch based on current HEAD. After testing with the new patch, the following problems are observed. Defect - 1: 1. start primary A 2. start standby B following A 3. start cascade standby C following B. 4. start another standby D following C. 5. Promote standby B. 6. After successful time line switch in cascade standby C& D, >>> >>> stop D. 7. Restart D, Startup is successful and connecting to standby C. 8. Stop C. 9. Restart C, startup is failing. >>> >>> >>> Ok, the error I get in that scenario is: >>> >>> C 2012-12-05 19:55:43.840 EET 9283 FATAL: requested timeline 2 does not >>> contain minimum recovery point 0/3023F08 on timeline 1 C 2012-12-05 >>> 19:55:43.841 EET 9282 LOG: startup process (PID 9283) exited with exit >>> code 1 C 2012-12-05 19:55:43.841 EET 9282 LOG: aborting startup due to >>> startup process failure >>> >> >>> >>> That mismatch causes the error. I'd like to fix this by always treating >>> the checkpoint record to be part of the new timeline. That feels more >>> correct. The most straightforward way to implement that would be to peek >>> at the xlog record before updating replayEndRecPtr and replayEndTLI. If >>> it's a checkpoint record that changes TLI, set replayEndTLI to the new >>> timeline before calling the redo-function. But it's a bit of a >>> modularity violation to peek into the record like that. 
>>> >>> Or we could just revert the sanity check at beginning of recovery that >>> throws the "requested timeline 2 does not contain minimum recovery point >>> 0/3023F08 on timeline 1" error. The error I added to redo of checkpoint >>> record that says "unexpected timeline ID %u in checkpoint record, before >>> reaching minimum recovery point %X/%X on timeline %u" checks basically >>> the same thing, but at a later stage. However, the way >>> minRecoveryPointTLI is updated still seems wrong to me, so I'd like to >>> fix that. >>> >>> I'm thinking of something like the attached (with some more comments >>> before committing). Thoughts? >> >> >> This has fixed the problem reported. >> However, I am not able to think will there be any problem if we remove >> check >> "requested timeline 2 does not contain minimum recovery point >>> >>> 0/3023F08 on timeline 1" at beginning of recovery and just update >> >> replayEndTLI with ThisTimeLineID? > > > Well, it seems wrong for the control file to contain a situation like this: > > pg_control version number:932 > Catalog version number: 201211281 > Database system identifier: 5819228770976387006 > Database cluster state: shut down in recovery > pg_control last modified: pe 7. joulukuuta 2012 17.39.57 > Latest checkpoint location: 0/3023EA8 > Prior checkpoint location:0/260 > Latest checkpoint's REDO location:0/3023EA8 > Latest checkpoint's REDO WAL file:00020003 > Latest checkpoint's TimeLineID: 2 > ... > Time of latest checkpoint:pe 7. joulukuuta 2012 17.39.49 > Min recovery ending location: 0/3023F08 > Min recovery ending loc's timeline: 1 > > Note the latest checkpoint location and its TimelineID, and compare them > with the min recovery ending location. The min recovery ending location is > ahead of latest checkpoint's location; the min recovery ending location > actually points to the end of the checkpoint record. 
But how come the min > recovery ending location's timeline is 1, while the checkpoint record's > timeline is 2. > > Now maybe that would happen to work if we remove the sanity check, but it still > seems horribly confusing. I'm afraid that discrepancy will come back to > haunt us later if we leave it like that. So I'd like to fix that. > > Mulling over this for some more, I propose the attached patch. With the > patch, we peek into the checkpoint record, and actually perform the timeline > switch (by changing ThisTimeLineID) before replaying it. That way the > checkpoint record is really considered to be on the new timeline for all > purposes. At the moment, the only difference that makes in practice is that > we set replayEndTLI, and thus minRecoveryPointTLI, to the new TLI, but it > feels logically more correct to do it that way. This patch has already been included in HEAD. Right? I found another "requested timeline does not contain minimum recovery point" error scenario in HEAD.
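The sanity check being debated here boils down to asking which timeline of the history contains a given LSN (StartupXLOG passes minRecoveryPoint - 1, because minRecoveryPoint points just past the end of the last replayed record). Below is a minimal, self-contained sketch of that lookup — an analogue of, not a copy of, the backend's tliOfPointInHistory — with purely illustrative LSN values:

```c
#include <assert.h>
#include <stdint.h>

/* One entry of a timeline history: timeline "tli" ends at LSN "end"
 * (the switch point).  The last, current timeline has no switch point,
 * represented here by UINT64_MAX. */
typedef struct
{
    uint32_t tli;
    uint64_t end;               /* first LSN NOT on this timeline */
} TLHistoryEntry;

/* Return the timeline that contains LSN "ptr": the first entry whose
 * switch point lies strictly beyond ptr. */
uint32_t
tli_of_point(const TLHistoryEntry *hist, int n, uint64_t ptr)
{
    for (int i = 0; i < n; i++)
        if (ptr < hist[i].end)
            return hist[i].tli;
    return hist[n - 1].tli;     /* past the end: latest timeline */
}
```

With a history where timeline 1 ends at 0/3000000, an LSN just before the switch point belongs to timeline 1, and the switch-point LSN itself already belongs to timeline 2 — which is exactly why a minRecoveryPoint that sits at a switch boundary, paired with the old TLI in pg_control, trips the check.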
Re: [HACKERS] Switching timeline over streaming replication
Heikki, Tested this on yesterday's snapshot. Worked great. Test: 4 Ubuntu 10.04 LTS Cloud Servers (GoGrid) Configuration: Compiled 9.3(12-12-12) with: pg_stat_statements, citext, ISN, btree_gist, pl/perl Setup Test: Master-Master Replicated to: master-replica using pg_basebackup -x. No archiving. Master-Replica replicated to Replica-Replica1 and Replica-Replica2 using pg_basebackup -x All came up on first try, with no issues. Ran customized pgbench (with waits); lag time to cascading replicas was < 1 second. Failover Test: 1. started customized pgbench on master-master. 2. shut down master-master (-fast) 3. promoted master-replica to new master 4. restarted custom pgbench, at master-replica Result: Replication to replica-replica1,2 working fine, no interruptions in existing connections to replica-replicas. Now I wanna test a chain of cascading replicas ... how far can we chain these? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] Switching timeline over streaming replication
> From: Heikki Linnakangas [mailto:hlinnakan...@vmware.com] > Sent: Friday, December 07, 2012 9:22 PM > To: Amit Kapila > Cc: 'PostgreSQL-development'; 'Thom Brown' > Subject: Re: [HACKERS] Switching timeline over streaming replication > > On 06.12.2012 15:39, Amit Kapila wrote: > > On Thursday, December 06, 2012 12:53 AM Heikki Linnakangas wrote: > >> On 05.12.2012 14:32, Amit Kapila wrote: > >>> On Tuesday, December 04, 2012 10:01 PM Heikki Linnakangas wrote: > >>>> After some diversions to fix bugs and refactor existing code, I've > >>>> committed a couple of small parts of this patch, which just add > >>>> some sanity checks to notice incorrect PITR scenarios. Here's a new > >>>> version of the main patch based on current HEAD. > >>> > >>> After testing with the new patch, the following problems are > observed. > >>> > >>> Defect - 1: > >>> > >>> 1. start primary A > >>> 2. start standby B following A > >>> 3. start cascade standby C following B. > >>> 4. start another standby D following C. > >>> 5. Promote standby B. > >>> 6. After successful time line switch in cascade standby C& > D, > >> stop D. > >>> 7. Restart D, Startup is successful and connecting to standby > C. > >>> 8. Stop C. > >>> 9. Restart C, startup is failing. > >> > >> Ok, the error I get in that scenario is: > >> > >> C 2012-12-05 19:55:43.840 EET 9283 FATAL: requested timeline 2 does > >> not contain minimum recovery point 0/3023F08 on timeline 1 C > >> 2012-12-05 > >> 19:55:43.841 EET 9282 LOG: startup process (PID 9283) exited with > >> exit code 1 C 2012-12-05 19:55:43.841 EET 9282 LOG: aborting startup > >> due to startup process failure > >> > > > >> > Well, it seems wrong for the control file to contain a situation like > this: > > pg_control version number:932 > Catalog version number: 201211281 > Database system identifier: 5819228770976387006 > Database cluster state: shut down in recovery > pg_control last modified: pe 7. 
joulukuuta 2012 17.39.57 > Latest checkpoint location: 0/3023EA8 > Prior checkpoint location:0/260 > Latest checkpoint's REDO location:0/3023EA8 > Latest checkpoint's REDO WAL file:00020003 > Latest checkpoint's TimeLineID: 2 > ... > Time of latest checkpoint:pe 7. joulukuuta 2012 17.39.49 > Min recovery ending location: 0/3023F08 > Min recovery ending loc's timeline: 1 > > Note the latest checkpoint location and its TimelineID, and compare them > with the min recovery ending location. The min recovery ending location > is ahead of latest checkpoint's location; the min recovery ending > location actually points to the end of the checkpoint record. But how > come the min recovery ending location's timeline is 1, while the > checkpoint record's timeline is 2. > > Now maybe that would happen to work if remove the sanity check, but it > still seems horribly confusing. I'm afraid that discrepancy will come > back to haunt us later if we leave it like that. So I'd like to fix > that. > > Mulling over this for some more, I propose the attached patch. With the > patch, we peek into the checkpoint record, and actually perform the > timeline switch (by changing ThisTimeLineID) before replaying it. That > way the checkpoint record is really considered to be on the new timeline > for all purposes. At the moment, the only difference that makes in > practice is that we set replayEndTLI, and thus minRecoveryPointTLI, to > the new TLI, but it feels logically more correct to do it that way. This has fixed both the problems reported in below link: http://archives.postgresql.org/pgsql-hackers/2012-12/msg00267.php The code is also fine. With Regards, Amit Kapila. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Switching timeline over streaming replication
On 06.12.2012 15:39, Amit Kapila wrote: On Thursday, December 06, 2012 12:53 AM Heikki Linnakangas wrote: On 05.12.2012 14:32, Amit Kapila wrote: On Tuesday, December 04, 2012 10:01 PM Heikki Linnakangas wrote: After some diversions to fix bugs and refactor existing code, I've committed a couple of small parts of this patch, which just add some sanity checks to notice incorrect PITR scenarios. Here's a new version of the main patch based on current HEAD. After testing with the new patch, the following problems are observed. Defect - 1: 1. start primary A 2. start standby B following A 3. start cascade standby C following B. 4. start another standby D following C. 5. Promote standby B. 6. After successful time line switch in cascade standby C& D, stop D. 7. Restart D, Startup is successful and connecting to standby C. 8. Stop C. 9. Restart C, startup is failing. Ok, the error I get in that scenario is: C 2012-12-05 19:55:43.840 EET 9283 FATAL: requested timeline 2 does not contain minimum recovery point 0/3023F08 on timeline 1 C 2012-12-05 19:55:43.841 EET 9282 LOG: startup process (PID 9283) exited with exit code 1 C 2012-12-05 19:55:43.841 EET 9282 LOG: aborting startup due to startup process failure That mismatch causes the error. I'd like to fix this by always treating the checkpoint record to be part of the new timeline. That feels more correct. The most straightforward way to implement that would be to peek at the xlog record before updating replayEndRecPtr and replayEndTLI. If it's a checkpoint record that changes TLI, set replayEndTLI to the new timeline before calling the redo-function. But it's a bit of a modularity violation to peek into the record like that. Or we could just revert the sanity check at beginning of recovery that throws the "requested timeline 2 does not contain minimum recovery point 0/3023F08 on timeline 1" error. 
The error I added to redo of checkpoint record that says "unexpected timeline ID %u in checkpoint record, before reaching minimum recovery point %X/%X on timeline %u" checks basically the same thing, but at a later stage. However, the way minRecoveryPointTLI is updated still seems wrong to me, so I'd like to fix that. I'm thinking of something like the attached (with some more comments before committing). Thoughts? This has fixed the problem reported. However, I am not able to think will there be any problem if we remove check "requested timeline 2 does not contain minimum recovery point 0/3023F08 on timeline 1" at beginning of recovery and just update replayEndTLI with ThisTimeLineID? Well, it seems wrong for the control file to contain a situation like this: pg_control version number:932 Catalog version number: 201211281 Database system identifier: 5819228770976387006 Database cluster state: shut down in recovery pg_control last modified: pe 7. joulukuuta 2012 17.39.57 Latest checkpoint location: 0/3023EA8 Prior checkpoint location:0/260 Latest checkpoint's REDO location:0/3023EA8 Latest checkpoint's REDO WAL file:00020003 Latest checkpoint's TimeLineID: 2 ... Time of latest checkpoint:pe 7. joulukuuta 2012 17.39.49 Min recovery ending location: 0/3023F08 Min recovery ending loc's timeline: 1 Note the latest checkpoint location and its TimelineID, and compare them with the min recovery ending location. The min recovery ending location is ahead of latest checkpoint's location; the min recovery ending location actually points to the end of the checkpoint record. But how come the min recovery ending location's timeline is 1, while the checkpoint record's timeline is 2. Now maybe that would happen to work if remove the sanity check, but it still seems horribly confusing. I'm afraid that discrepancy will come back to haunt us later if we leave it like that. So I'd like to fix that. Mulling over this for some more, I propose the attached patch. 
With the patch, we peek into the checkpoint record, and actually perform the timeline switch (by changing ThisTimeLineID) before replaying it. That way the checkpoint record is really considered to be on the new timeline for all purposes. At the moment, the only difference that makes in practice is that we set replayEndTLI, and thus minRecoveryPointTLI, to the new TLI, but it feels logically more correct to do it that way. - Heikki diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 2618c8d..9bd7f03 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -605,6 +605,7 @@ static void SetLatestXTime(TimestampTz xtime); static void SetCurrentChunkStartTime(TimestampTz xtime); static void CheckRequiredParameterValues(void); static void XLogReportParameters(void); +static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLine
Re: [HACKERS] Switching timeline over streaming replication
On Thursday, December 06, 2012 12:53 AM Heikki Linnakangas wrote: > On 05.12.2012 14:32, Amit Kapila wrote: > > On Tuesday, December 04, 2012 10:01 PM Heikki Linnakangas wrote: > >> After some diversions to fix bugs and refactor existing code, I've > >> committed a couple of small parts of this patch, which just add some > >> sanity checks to notice incorrect PITR scenarios. Here's a new > >> version of the main patch based on current HEAD. > > > > After testing with the new patch, the following problems are observed. > > > > Defect - 1: > > > > 1. start primary A > > 2. start standby B following A > > 3. start cascade standby C following B. > > 4. start another standby D following C. > > 5. Promote standby B. > > 6. After successful time line switch in cascade standby C& D, > stop D. > > 7. Restart D, Startup is successful and connecting to standby C. > > 8. Stop C. > > 9. Restart C, startup is failing. > > Ok, the error I get in that scenario is: > > C 2012-12-05 19:55:43.840 EET 9283 FATAL: requested timeline 2 does not > contain minimum recovery point 0/3023F08 on timeline 1 C 2012-12-05 > 19:55:43.841 EET 9282 LOG: startup process (PID 9283) exited with exit > code 1 C 2012-12-05 19:55:43.841 EET 9282 LOG: aborting startup due to > startup process failure > > > That mismatch causes the error. I'd like to fix this by always treating > the checkpoint record to be part of the new timeline. That feels more > correct. The most straightforward way to implement that would be to peek > at the xlog record before updating replayEndRecPtr and replayEndTLI. If > it's a checkpoint record that changes TLI, set replayEndTLI to the new > timeline before calling the redo-function. But it's a bit of a > modularity violation to peek into the record like that. > > Or we could just revert the sanity check at beginning of recovery that > throws the "requested timeline 2 does not contain minimum recovery point > 0/3023F08 on timeline 1" error. 
The error I added to redo of checkpoint > record that says "unexpected timeline ID %u in checkpoint record, before > reaching minimum recovery point %X/%X on timeline %u" checks basically > the same thing, but at a later stage. However, the way > minRecoveryPointTLI is updated still seems wrong to me, so I'd like to > fix that. > > I'm thinking of something like the attached (with some more comments > before committing). Thoughts? This has fixed the problem reported. However, I am not able to think will there be any problem if we remove check "requested timeline 2 does not contain minimum recovery point > 0/3023F08 on timeline 1" at beginning of recovery and just update replayEndTLI with ThisTimeLineID? With Regards, Amit Kapila. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Switching timeline over streaming replication
On 05.12.2012 14:32, Amit Kapila wrote: On Tuesday, December 04, 2012 10:01 PM Heikki Linnakangas wrote: After some diversions to fix bugs and refactor existing code, I've committed a couple of small parts of this patch, which just add some sanity checks to notice incorrect PITR scenarios. Here's a new version of the main patch based on current HEAD. After testing with the new patch, the following problems are observed. Defect - 1: 1. start primary A 2. start standby B following A 3. start cascade standby C following B. 4. start another standby D following C. 5. Promote standby B. 6. After successful time line switch in cascade standby C& D, stop D. 7. Restart D, Startup is successful and connecting to standby C. 8. Stop C. 9. Restart C, startup is failing. Ok, the error I get in that scenario is: C 2012-12-05 19:55:43.840 EET 9283 FATAL: requested timeline 2 does not contain minimum recovery point 0/3023F08 on timeline 1 C 2012-12-05 19:55:43.841 EET 9282 LOG: startup process (PID 9283) exited with exit code 1 C 2012-12-05 19:55:43.841 EET 9282 LOG: aborting startup due to startup process failure It seems that the commits I made to master already: http://archives.postgresql.org/pgsql-committers/2012-12/msg00116.php http://archives.postgresql.org/pgsql-committers/2012-12/msg00111.php were a few bricks shy of a load. The problem is that if recovery stops at a checkpoint record that changes timeline, so that minRecoveryPoint points to the end of the checkpoint record, we still record the old TLI as the TLI of minRecoveryPoint. This is because 1) there's a silly bug in the patch; replayEndTLI is not updated along with replayEndRecPtr. But even after fixing that, we're not good. The problem is that replayEndRecPtr is currently updated *before* calling the redo function. replayEndRecPtr is what becomes minRecoveryPoint when XLogFlush() is called. 
If the record is a checkpoint record, redoing it will switch recovery to the new timeline, but replayEndTLI will not be updated until the next record. IOW, as far as minRecoveryPoint is concerned, a checkpoint record that switches timeline is considered to be part of the old timeline. But when a server is promoted and a new timeline is created, the checkpoint record is considered to be part of the new timeline; that's what we write in the page header and in the control file. That mismatch causes the error. I'd like to fix this by always treating the checkpoint record to be part of the new timeline. That feels more correct. The most straightforward way to implement that would be to peek at the xlog record before updating replayEndRecPtr and replayEndTLI. If it's a checkpoint record that changes TLI, set replayEndTLI to the new timeline before calling the redo-function. But it's a bit of a modularity violation to peek into the record like that. Or we could just revert the sanity check at beginning of recovery that throws the "requested timeline 2 does not contain minimum recovery point 0/3023F08 on timeline 1" error. The error I added to redo of checkpoint record that says "unexpected timeline ID %u in checkpoint record, before reaching minimum recovery point %X/%X on timeline %u" checks basically the same thing, but at a later stage. However, the way minRecoveryPointTLI is updated still seems wrong to me, so I'd like to fix that. I'm thinking of something like the attached (with some more comments before committing). Thoughts? 
- Heikki diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 702ea7c..bdae7a4 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -5822,6 +5822,7 @@ StartupXLOG(void) */ do { +TimeLineID EndTLI; #ifdef WAL_DEBUG if (XLOG_DEBUG || (rmid == RM_XACT_ID && trace_recovery_messages <= DEBUG2) || @@ -5895,8 +5896,20 @@ StartupXLOG(void) * Update shared replayEndRecPtr before replaying this record, * so that XLogFlush will update minRecoveryPoint correctly. */ +if (record->xl_rmid == RM_XLOG_ID && + (record->xl_info & ~XLR_INFO_MASK) == XLOG_CHECKPOINT_SHUTDOWN) +{ + CheckPoint checkPoint; + + memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint)); + EndTLI = checkPoint.ThisTimeLineID; +} +else + EndTLI = ThisTimeLineID; + SpinLockAcquire(&xlogctl->info_lck); xlogctl->replayEndRecPtr = EndRecPtr; +xlogctl->replayEndTLI = EndTLI; recoveryPause = xlogctl->recoveryPause; SpinLockRelease(&xlogctl->info_lck); -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Switching timeline over streaming replication
On Tuesday, December 04, 2012 10:01 PM Heikki Linnakangas wrote: > After some diversions to fix bugs and refactor existing code, I've > committed a couple of small parts of this patch, which just add some > sanity checks to notice incorrect PITR scenarios. Here's a new version > of the main patch based on current HEAD. After testing with the new patch, the following problems are observed. Defect - 1: 1. start primary A 2. start standby B following A 3. start cascade standby C following B. 4. start another standby D following C. 5. Promote standby B. 6. After successful time line switch in cascade standby C & D, stop D. 7. Restart D, Startup is successful and connecting to standby C. 8. Stop C. 9. Restart C, startup is failing. Defect-2: 1. start primary A 2. start standby B following A 3. start cascade standby C following B. 4. Start another standby D following C. 5. Execute the following commands in the primary A. create table tbl(f int); insert into tbl values(generate_series(1,1000)); 6. Promote standby B. 7. Execute the following commands in the primary B. insert into tbl values(generate_series(1001,2000)); insert into tbl values(generate_series(2001,3000)); 8. Stop standby D normally and restart D. Restart is failing. 9. Stop standby C normally and restart C. Restart is failing. Note: Stop the node means doing a smart shutdown. With Regards, Amit Kapila. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Switching timeline over streaming replication
After some diversions to fix bugs and refactor existing code, I've committed a couple of small parts of this patch, which just add some sanity checks to notice incorrect PITR scenarios. Here's a new version of the main patch based on current HEAD. - Heikki streaming-tli-switch-8.patch.gz Description: GNU Zip compressed data
Re: [HACKERS] Switching timeline over streaming replication
On 03.12.2012 14:21, senthilnathan wrote: Is this patch available in version 9.2.1 ? Nope, this is for 9.3. - Heikki
Re: [HACKERS] Switching timeline over streaming replication
Is this patch available in version 9.2.1 ? Senthil
Re: Plugging fd leaks (was Re: [HACKERS] Switching timeline over streaming replication)
On 26.11.2012 14:53, Amit Kapila wrote: On Friday, November 23, 2012 7:03 PM Heikki Linnakangas This is what I came up with. It adds a new function, OpenFile, that returns a raw file descriptor like BasicOpenFile, but the file descriptor is associated with the current subtransaction and automatically closed at end-of-xact, in the same way that AllocateFile and AllocateDir work. In other words, OpenFile is to open() what AllocateFile is to fopen(). BasicOpenFile is unchanged, it returns a raw fd and it's solely the caller's responsibility to close it, but many of the places that used to call BasicOpenFile now use the safer OpenFile function instead. This patch plugs three existing fd (or virtual fd) leaks: 1. copy_file() - fixed by using OpenFile instead of BasicOpenFile 2. XLogFileInit() - fixed by adding close() calls to the error cases. Can't use OpenFile here because the fd is supposed to persist over transaction boundaries. 3. lo_import/lo_export - fixed by using OpenFile instead of PathNameOpenFile. I have gone through the patch and find it okay except for one minor suggestion 1. Can we put below log in OpenFile as well +DO_DB(elog(LOG, "CloseFile: Allocated %d", numAllocatedDescs)); Thanks. Added that and committed. I didn't dare to backpatch this, even though it could be fairly easily backpatched. The leaks exist in older versions too, but since they're extremely rare (zero complaints from the field and it's been like that forever), I didn't want to take the risk. Maybe later, after this has had more testing in master. One thing I'm not too fond of is the naming. I'm calling the new functions OpenFile and CloseFile. There's some danger of confusion there, as the function to close a virtual file opened with PathNameOpenFile is called FileClose. OpenFile is really the same kind of operation as AllocateFile and AllocateDir, but returns an unbuffered fd. So it would be nice if it was called AllocateSomething, too. But AllocateFile is already taken. 
And I don't much like the Allocate* naming for these anyway, you really would expect the name to contain "open". OpenFileInTrans OpenTransactionAwareFile In any case OpenFile is also okay. I ended up calling the functions OpenTransientFile and CloseTransientFile. Windows has a library function called "OpenFile", so that was a pretty bad choice after all. - Heikki
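The bookkeeping described above can be illustrated outside the backend. The following is a toy analogue, not PostgreSQL's actual fd.c code (the names open_transient, close_transient, cleanup_transient_files, and transient_count are invented for this sketch): fds opened through the wrapper are remembered in a list, and a cleanup routine run from the abort path closes anything still open — which is what makes a leak in a seldom-taken error branch harmless.

```c
#include <fcntl.h>
#include <unistd.h>

/* Toy analogue of fd.c's transient-file tracking.  Every fd handed out
 * by open_transient() is recorded; cleanup_transient_files() (run at
 * end-of-xact / on abort) closes any that the caller forgot. */
#define MAX_TRANSIENT 32
static int transient_fds[MAX_TRANSIENT];
static int num_transient = 0;

int transient_count(void) { return num_transient; }

int
open_transient(const char *path, int flags)
{
    int fd = open(path, flags, 0600);
    if (fd >= 0 && num_transient < MAX_TRANSIENT)
        transient_fds[num_transient++] = fd;    /* remember it */
    return fd;
}

void
close_transient(int fd)
{
    for (int i = 0; i < num_transient; i++)
        if (transient_fds[i] == fd)
        {
            /* forget by swapping in the last tracked fd */
            transient_fds[i] = transient_fds[--num_transient];
            close(fd);
            return;
        }
}

/* Abort path: release whatever the "transaction" leaked. */
void
cleanup_transient_files(void)
{
    while (num_transient > 0)
        close(transient_fds[--num_transient]);
}
```

The real implementation ties the tracking to the current subtransaction via the same descriptor array used by AllocateFile/AllocateDir, but the lifecycle — register on open, unregister on close, sweep on abort — is the same shape.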
Re: Plugging fd leaks (was Re: [HACKERS] Switching timeline over streaming replication)
Amit Kapila writes: > On Monday, November 26, 2012 7:01 PM Heikki Linnakangas wrote: >> Hmm, if it's just for locking purposes, how about using a lwlock or a >> heavy-weight lock instead? > Its not only for lock, the main idea is that we create temp file and write > modified configuration in that temp file. > In end if it's success, then we rename temp file to .conf file but if it > error out then at abort we need to delete temp file. > So in short, main point is to close/rename the file in case of success (at > end of command) and remove in case of abort. I'd go with the TRY/CATCH solution. It would be worth extending the fd.c infrastructure if there were multiple users of the feature, but there are not, nor do I see likely new candidates on the horizon. regards, tom lane
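The TRY/CATCH shape Tom suggests can be sketched with plain setjmp/longjmp — PostgreSQL's PG_TRY/PG_CATCH macros are built on the same mechanism but differ in detail, and rewrite_conf/elog_error here are hypothetical names, not backend functions: write the new contents to a temp file, and if an error unwinds out of the write, unlink the half-written file before reporting failure.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdio.h>

/* Toy stand-in for the backend's error-handling jump buffer. */
static jmp_buf err_env;

/* Stand-in for ereport(ERROR, ...): unwind to the enclosing "PG_TRY". */
static void
elog_error(void)
{
    longjmp(err_env, 1);
}

/* Write a temp config file; on error, remove it before propagating.
 * Returns 0 on success, -1 on failure. */
int
rewrite_conf(const char *tmppath, int simulate_failure)
{
    FILE *fp = fopen(tmppath, "w");
    if (fp == NULL)
        return -1;

    if (setjmp(err_env) == 0)
    {
        /* "PG_TRY" block: write the new contents */
        fprintf(fp, "work_mem = '64MB'\n");
        if (simulate_failure)
            elog_error();
        fclose(fp);
        return 0;               /* success: caller renames into place */
    }
    else
    {
        /* "PG_CATCH" block: clean up the half-written temp file */
        fclose(fp);
        remove(tmppath);
        return -1;
    }
}
```

Note that fp is assigned before setjmp() and not modified afterwards, so its value is well-defined in the cleanup branch; in real PG_TRY code the same concern is what the volatile qualifier rules are about.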
Re: Plugging fd leaks (was Re: [HACKERS] Switching timeline over streaming replication)
On Monday, November 26, 2012 7:01 PM Heikki Linnakangas wrote: > On 26.11.2012 14:53, Amit Kapila wrote: > > I have one usecase in feature (SQL Command to edit postgresql.conf) > very > > similar to OpenFile/CloseFile, but I want that when CloseFile is > called from > > abort, it should remove(unlink) the file as well and during open it > has to > > retry few times if open is not success. > > I have following options: > > 1. Extend OpenFile/CloseFile or PathNameOpenFile > > 2. Write new functions similar to OpenFile/CloseFile, something like > > OpenConfLockFile/CloseConfLockFile > > 3. Use OpenFile/CloseFile and handle my specific case with PG_TRY .. > > PG_CATCH > > > > Any suggestions? > > Hmm, if it's just for locking purposes, how about using a lwlock or a > heavy-weight lock instead? Its not only for lock, the main idea is that we create temp file and write modified configuration in that temp file. In end if it's success, then we rename temp file to .conf file but if it error out then at abort we need to delete temp file. So in short, main point is to close/rename the file in case of success (at end of command) and remove in case of abort. With Regards, Amit Kapila. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Plugging fd leaks (was Re: [HACKERS] Switching timeline over streaming replication)
On 26.11.2012 14:53, Amit Kapila wrote:
> I have one use case in a feature (SQL command to edit postgresql.conf) very similar to OpenFile/CloseFile, but I want that when CloseFile is called from abort, it should remove (unlink) the file as well, and during open it has to retry a few times if the open does not succeed.
> I have the following options:
> 1. Extend OpenFile/CloseFile or PathNameOpenFile
> 2. Write new functions similar to OpenFile/CloseFile, something like OpenConfLockFile/CloseConfLockFile
> 3. Use OpenFile/CloseFile and handle my specific case with PG_TRY .. PG_CATCH
>
> Any suggestions?

Hmm, if it's just for locking purposes, how about using a lwlock or a heavy-weight lock instead?

- Heikki
Re: Plugging fd leaks (was Re: [HACKERS] Switching timeline over streaming replication)
On Friday, November 23, 2012 7:03 PM Heikki Linnakangas wrote:
> On 15.11.2012 17:16, Heikki Linnakangas wrote:
> > On 15.11.2012 16:55, Tom Lane wrote:
> >> Heikki Linnakangas writes:
> >>> This is a fairly general issue, actually. Looking around, I can see
> >>> at least two similar cases in existing code, with BasicOpenFile,
> >>> where we will leak file descriptors on error:
> >>
> >> Um, don't we automatically clean those up during transaction abort?
> >
> > Not the ones allocated with PathNameOpenFile or BasicOpenFile. Files
> > allocated with AllocateFile() and OpenTemporaryFile() are cleaned up
> > at abort.
> >
> >> If we don't, we ought to think about that, not about cluttering
> >> calling code with certain-to-be-inadequate cleanup in error cases.
> >
> > Agreed. Cleaning up at end-of-xact won't help walsender or other
> > non-backend processes, though, because they don't do transactions. But
> > a top-level ResourceOwner that's reset in the sigsetjmp() cleanup
> > routine would work.
>
> This is what I came up with. It adds a new function, OpenFile, that
> returns a raw file descriptor like BasicOpenFile, but the file
> descriptor is associated with the current subtransaction and
> automatically closed at end-of-xact, in the same way that AllocateFile
> and AllocateDir work. In other words, OpenFile is to open() what
> AllocateFile is to fopen(). BasicOpenFile is unchanged: it returns a raw
> fd and it's solely the caller's responsibility to close it, but many of
> the places that used to call BasicOpenFile now use the safer OpenFile
> function instead.
>
> This patch plugs three existing fd (or virtual fd) leaks:
>
> 1. copy_file() - fixed by using OpenFile instead of BasicOpenFile
> 2. XLogFileInit() - fixed by adding close() calls to the error cases.
> Can't use OpenFile here because the fd is supposed to persist over
> transaction boundaries.
> 3. lo_import/lo_export - fixed by using OpenFile instead of
> PathNameOpenFile.
I have gone through the patch and find it okay, except for one minor suggestion:
1. Can we put the below log in OpenFile as well?
+DO_DB(elog(LOG, "CloseFile: Allocated %d", numAllocatedDescs));

> One thing I'm not too fond of is the naming. I'm calling the new
> functions OpenFile and CloseFile. There's some danger of confusion
> there, as the function to close a virtual file opened with
> PathNameOpenFile is called FileClose. OpenFile is really the same kind
> of operation as AllocateFile and AllocateDir, but returns an unbuffered
> fd. So it would be nice if it was called AllocateSomething, too. But
> AllocateFile is already taken. And I don't much like the Allocate*
> naming for these anyway, you really would expect the name to contain
> "open".

Suggestions: OpenFileInTrans or OpenTransactionAwareFile. In any case, OpenFile is also okay.

I have one use case in a feature (SQL command to edit postgresql.conf) very similar to OpenFile/CloseFile, but I want that when CloseFile is called from abort, it should remove (unlink) the file as well, and during open it has to retry a few times if the open does not succeed.
I have the following options:
1. Extend OpenFile/CloseFile or PathNameOpenFile
2. Write new functions similar to OpenFile/CloseFile, something like OpenConfLockFile/CloseConfLockFile
3. Use OpenFile/CloseFile and handle my specific case with PG_TRY .. PG_CATCH

Any suggestions?

With Regards, Amit Kapila.
Plugging fd leaks (was Re: [HACKERS] Switching timeline over streaming replication)
On 15.11.2012 17:16, Heikki Linnakangas wrote:
> On 15.11.2012 16:55, Tom Lane wrote:
>> Heikki Linnakangas writes:
>>> This is a fairly general issue, actually. Looking around, I can see at least two similar cases in existing code, with BasicOpenFile, where we will leak file descriptors on error:
>>
>> Um, don't we automatically clean those up during transaction abort?
>
> Not the ones allocated with PathNameOpenFile or BasicOpenFile. Files allocated with AllocateFile() and OpenTemporaryFile() are cleaned up at abort.
>
>> If we don't, we ought to think about that, not about cluttering calling code with certain-to-be-inadequate cleanup in error cases.
>
> Agreed. Cleaning up at end-of-xact won't help walsender or other non-backend processes, though, because they don't do transactions. But a top-level ResourceOwner that's reset in the sigsetjmp() cleanup routine would work.

This is what I came up with. It adds a new function, OpenFile, that returns a raw file descriptor like BasicOpenFile, but the file descriptor is associated with the current subtransaction and automatically closed at end-of-xact, in the same way that AllocateFile and AllocateDir work. In other words, OpenFile is to open() what AllocateFile is to fopen(). BasicOpenFile is unchanged: it returns a raw fd and it's solely the caller's responsibility to close it, but many of the places that used to call BasicOpenFile now use the safer OpenFile function instead.

This patch plugs three existing fd (or virtual fd) leaks:

1. copy_file() - fixed by using OpenFile instead of BasicOpenFile
2. XLogFileInit() - fixed by adding close() calls to the error cases. Can't use OpenFile here because the fd is supposed to persist over transaction boundaries.
3. lo_import/lo_export - fixed by using OpenFile instead of PathNameOpenFile.

In addition, this replaces many BasicOpenFile() calls with OpenFile() that were not leaking, because the code meticulously closed the file on error.
That wasn't strictly necessary, but IMHO it's good for robustness.

One thing I'm not too fond of is the naming. I'm calling the new functions OpenFile and CloseFile. There's some danger of confusion there, as the function to close a virtual file opened with PathNameOpenFile is called FileClose. OpenFile is really the same kind of operation as AllocateFile and AllocateDir, but returns an unbuffered fd. So it would be nice if it was called AllocateSomething, too. But AllocateFile is already taken. And I don't much like the Allocate* naming for these anyway; you really would expect the name to contain "open".

Do we want to backpatch this? We've had zero complaints, but this seems fairly safe to backpatch, and at least the leak in copy_file() can be quite annoying. If you run out of disk space in CREATE DATABASE, the target file is kept open even though it's deleted, so the space isn't reclaimed until you disconnect.

- Heikki

diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index dd69c23..cd60dd8 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -531,7 +531,7 @@ SlruInternalWritePage(SlruCtl ctl, int slotno, SlruFlush fdata)
 		int		i;

 		for (i = 0; i < fdata->num_files; i++)
-			close(fdata->fd[i]);
+			CloseFile(fdata->fd[i]);
 	}

 	/* Re-acquire control lock and update page state */
@@ -593,7 +593,7 @@ SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno)
 	 * SlruPhysicalWritePage). Hence, if we are InRecovery, allow the case
 	 * where the file doesn't exist, and return zeroes instead.
 	 */
-	fd = BasicOpenFile(path, O_RDWR | PG_BINARY, S_IRUSR | S_IWUSR);
+	fd = OpenFile(path, O_RDWR | PG_BINARY, S_IRUSR | S_IWUSR);
 	if (fd < 0)
 	{
 		if (errno != ENOENT || !InRecovery)
@@ -614,7 +614,7 @@ SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno)
 	{
 		slru_errcause = SLRU_SEEK_FAILED;
 		slru_errno = errno;
-		close(fd);
+		CloseFile(fd);
 		return false;
 	}
@@ -623,11 +623,11 @@ SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno)
 	{
 		slru_errcause = SLRU_READ_FAILED;
 		slru_errno = errno;
-		close(fd);
+		CloseFile(fd);
 		return false;
 	}
-	if (close(fd))
+	if (CloseFile(fd))
 	{
 		slru_errcause = SLRU_CLOSE_FAILED;
 		slru_errno = errno;
@@ -740,7 +740,7 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
 	 * don't use O_EXCL or O_TRUNC or anything like that.
 	 */
 	SlruFileName(ctl, path, segno);
-	fd = BasicOpenFile(path, O_RDWR | O_CREAT | PG_BINARY,
+	fd = OpenFile(path, O_RDWR | O_CREAT | PG_BINARY,
 				  S_IRUSR | S_IWUSR);
 	if (fd < 0)
 	{
@@ -773,7 +773,7 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
 		slru_errcause = SLRU_SEEK_FAILED;
 		slru_errno = errno;
 		if (!fdata)
-			close(fd);
+			CloseFile(fd);
 		return false;
 	}
@@ -786,7 +786,7 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
 		slru_errcause = SLRU_WRITE_FAILED;
 		slru_e
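The bookkeeping behind a transaction-aware OpenFile/CloseFile can be modeled in a few lines of standalone C. This is a toy sketch under stated assumptions, not fd.c itself: the real implementation ties each tracked descriptor to its owning subtransaction and handles EMFILE retry, and the `Tracked` names here are invented for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Toy model of the patch's idea: every raw fd handed out is remembered,
 * so an "abort" (or the sigsetjmp cleanup in a non-backend process) can
 * close whatever is still open instead of leaking it.
 */
#define MAX_TRACKED 64

static int	trackedFds[MAX_TRACKED];
static int	numTracked = 0;

/* like OpenFile: open() plus registration for automatic cleanup */
static int
OpenFileTracked(const char *path, int flags, int mode)
{
	int			fd = open(path, flags, mode);

	if (fd >= 0 && numTracked < MAX_TRACKED)
		trackedFds[numTracked++] = fd;
	return fd;
}

/* like CloseFile: close() plus de-registration */
static int
CloseFileTracked(int fd)
{
	int			i;

	for (i = 0; i < numTracked; i++)
		if (trackedFds[i] == fd)
			trackedFds[i] = trackedFds[--numTracked];
	return close(fd);
}

/* called from the end-of-xact / error-cleanup path: close anything leaked */
static void
AtAbortCleanupFds(void)
{
	while (numTracked > 0)
		close(trackedFds[--numTracked]);
}
```

A caller that errors out between OpenFileTracked() and CloseFileTracked() no longer leaks: the cleanup hook sweeps the table, which is exactly the property the patch gives OpenFile relative to BasicOpenFile.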
Re: [HACKERS] Switching timeline over streaming replication
On Wednesday, November 21, 2012 11:36 PM Heikki Linnakangas wrote:
> On 20.11.2012 15:33, Amit Kapila wrote:
> > Defect-2:
> > 1. start primary A
> > 2. start standby B following A
> > 3. start cascade standby C following B.
> > 4. Start another standby D following C.
> > 5. Execute the following commands in the primary A.
> > create table tbl(f int);
> > insert into tbl values(generate_series(1,1000));
> > 6. Promote standby B.
> > 7. Execute the following commands in the primary B.
> > insert into tbl values(generate_series(1001,2000));
> > insert into tbl values(generate_series(2001,3000));
> >
> > The following logs are observed on standby C:
> >
> > LOG: restarted WAL streaming at position 0/700 on tli 2
> > ERROR: requested WAL segment 00020007 has
> > already been removed
> > LOG: record with zero length at 0/7028190
> > LOG: record with zero length at 0/7048540
> > LOG: out-of-sequence timeline ID 1 (after 2) in log segment
> > 00020007, offset 0
>
> I propose the attached patch (against 9.2) to fix that. This should be
> backpatched to 9.0, where standby_mode was introduced. The code was the
> same in 8.4, too, but AFAICS there was no problem there because 8.4
> never tried to re-open the same WAL segment after replaying some of it.

Fixed.

With Regards, Amit Kapila.
Re: [HACKERS] Switching timeline over streaming replication
On 20.11.2012 15:33, Amit Kapila wrote:
> Defect-2:
> 1. start primary A
> 2. start standby B following A
> 3. start cascade standby C following B.
> 4. Start another standby D following C.
> 5. Execute the following commands in the primary A.
> create table tbl(f int);
> insert into tbl values(generate_series(1,1000));
> 6. Promote standby B.
> 7. Execute the following commands in the primary B.
> insert into tbl values(generate_series(1001,2000));
> insert into tbl values(generate_series(2001,3000));
>
> The following logs are observed on standby C:
> LOG: restarted WAL streaming at position 0/700 on tli 2
> ERROR: requested WAL segment 00020007 has already been removed
> LOG: record with zero length at 0/7028190
> LOG: record with zero length at 0/7048540
> LOG: out-of-sequence timeline ID 1 (after 2) in log segment 00020007, offset 0

Hmm, this one is actually a pre-existing bug. There's a sanity check that the sequence of timeline IDs seen in the XLOG page headers doesn't go backwards. In other words, if the last XLOG page that was read had timeline id X, the next page must have a tli >= X. The startup process keeps track of the last seen timeline id in lastPageTLI.

In standby_mode, when the startup process is reading from a pre-existing file in pg_xlog (typically put there by streaming replication) and it reaches the end of valid WAL (marked by an error in decoding it, ie. "record with zero length" in your case), it sleeps for five seconds and retries. At retry, the WAL file is re-opened, and as part of sanity checking it, the first page header in the file is validated.

Now, if there was a timeline change in the current WAL segment, and we've already replayed past that point, lastPageTLI will already be set to the new TLI, but the first page of the file contains the old TLI. When the file is re-opened and the first page is validated, you get the error.

The fix is quite straightforward: we should refrain from checking the TLI when we re-open a WAL file.
Or better yet, compare it against the TLI we saw at the beginning of the last WAL segment, not the last WAL page.

I propose the attached patch (against 9.2) to fix that. This should be backpatched to 9.0, where standby_mode was introduced. The code was the same in 8.4, too, but AFAICS there was no problem there because 8.4 never tried to re-open the same WAL segment after replaying some of it.

- Heikki

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8614907..045d21d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -572,6 +572,7 @@ static uint32 readRecordBufSize = 0;
 static XLogRecPtr ReadRecPtr;	/* start of last record read */
 static XLogRecPtr EndRecPtr;	/* end+1 of last record read */
 static TimeLineID lastPageTLI = 0;
+static TimeLineID lastSegmentTLI = 0;

 static XLogRecPtr minRecoveryPoint;	/* local copy of
 					 * ControlFile->minRecoveryPoint */
@@ -655,7 +656,7 @@ static void CleanupBackupHistory(void);
 static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord *ReadRecord(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt);
 static void CheckRecoveryConsistency(void);
-static bool ValidXLOGHeader(XLogPageHeader hdr, int emode);
+static bool ValidXLOGHeader(XLogPageHeader hdr, int emode, bool segmentonly);
 static XLogRecord *ReadCheckpointRecord(XLogRecPtr RecPtr, int whichChkpt);
 static List *readTimeLineHistory(TimeLineID targetTLI);
 static bool existsTimeLineHistory(TimeLineID probeTLI);
@@ -3927,7 +3928,7 @@ ReadRecord(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt)
 	 * to go backwards (but we can't reset that variable right here, since
 	 * we might not change files at all).
 	 */
-	lastPageTLI = 0;	/* see comment in ValidXLOGHeader */
+	lastPageTLI = lastSegmentTLI = 0;	/* see comment in ValidXLOGHeader */
 	randAccess = true;	/* allow curFileTLI to go backwards too */
 }
@@ -4190,7 +4191,7 @@ next_record_is_invalid:
 	 * ReadRecord.
 	 * It's not intended for use from anywhere else.
 	 */
 static bool
-ValidXLOGHeader(XLogPageHeader hdr, int emode)
+ValidXLOGHeader(XLogPageHeader hdr, int emode, bool segmentonly)
 {
 	XLogRecPtr	recaddr;
@@ -4285,18 +4286,31 @@ ValidXLOGHeader(XLogPageHeader hdr, int emode)
 	 * successive pages of a consistent WAL sequence.
 	 *
 	 * Of course this check should only be applied when advancing sequentially
-	 * across pages; therefore ReadRecord resets lastPageTLI to zero when
-	 * going to a random page.
+	 * across pages; therefore ReadRecord resets lastPageTLI and
+	 * lastSegmentTLI to zero when going to a random page.
+	 *
+	 * Sometimes we re-open a segment that's already been partially replayed.
+	 * In that case we cannot perform the normal TLI check: if there is a
+	 * timeline switch within the segment, the fir
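The relaxed sanity check the patch introduces can be condensed into a small standalone predicate. This is a sketch of the decision logic only; the variable names mirror the patch, but the helper function itself is hypothetical and not part of xlog.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TimeLineID;

/*
 * Page headers must never show a timeline older than what we've already
 * replayed.  But when a partially replayed segment is re-opened and only
 * its *first* page is re-validated, compare against the TLI seen at the
 * start of that segment (lastSegmentTLI) rather than the most recent
 * page (lastPageTLI), because a timeline switch may have happened
 * mid-segment after the first page was written.
 */
static bool
tli_is_valid(TimeLineID hdrTLI, TimeLineID lastPageTLI,
			 TimeLineID lastSegmentTLI, bool segmentonly)
{
	TimeLineID	reference = segmentonly ? lastSegmentTLI : lastPageTLI;

	/* zero means "no pages seen yet" (random access): accept anything */
	return reference == 0 || hdrTLI >= reference;
}
```

With the scenario from the bug report: the segment starts on TLI 1, a switch to TLI 2 happens mid-segment and is replayed, then the segment is re-opened. The old check (against lastPageTLI = 2) rejects the first page's TLI 1; the relaxed check (against lastSegmentTLI = 1) accepts it.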
Re: [HACKERS] Switching timeline over streaming replication
On Monday, November 19, 2012 10:54 PM Heikki Linnakangas wrote:
> On 10.10.2012 17:54, Thom Brown wrote:
> > Hmm... I get something different. When I promote standby B, standby
> > C's log shows:
> >
> > The following problems are observed while testing of the patch.
> > Defect-1:
> >
> > 1. start primary A
> > 2. start standby B following A
> > 3. start cascade standby C following B.
> > 4. Promote standby B.
> > 5. After successful time line switch in cascade standby C, stop C.
> > 6. Restart C, startup is failing with the following error.
> >
> > LOG: database system was shut down in recovery at 2012-11-16 16:26:29 IST
> > FATAL: requested timeline 2 does not contain minimum recovery point 0/30143A0 on timeline 1
> > LOG: startup process (PID 415) exited with exit code 1
> > LOG: aborting startup due to startup process failure
> >
> > The above defect is already discussed in the following link.
> > http://archives.postgresql.org/message-id/00a801cda6f3$4aba27b0$e02e7710$@ka p...@huawei.com
>
> Fixed now, sorry for neglecting this earlier. The problem was that if
> the primary switched to a new timeline at position X, and the standby
> followed that switch, on restart it would set minRecoveryPoint to X, and
> the new

Not sure if the above is fixed, as I don't see any code change for this, and in testing it fails again.

Below are the results of further testing. Some strange logs were observed during testing.
Note: "Stop the node" means doing a smart shutdown.

Scenario-1:
1. start primary A
2. start standby B following A
3. start cascade standby C following B.
4. Execute the following commands in the primary A.
create table tbl(f int);
insert into tbl values(generate_series(1,1000));
5. Promote standby B.
6. Execute the following commands in the primary B.
insert into tbl values(generate_series(1001,2000));
insert into tbl values(generate_series(2001,3000));

The following logs are present on standby C; please check whether these are proper or not:
LOG: restarted WAL streaming at position 0/B00 on tli 2
LOG: record with zero length at 0/B024C68
LOG: record with zero length at 0/B035528
LOG: out-of-sequence timeline ID 1 (after 2) in log segment 0002000B, offset 0

The following two defects were found while testing the new patch.

Defect-1:
1. start primary A
2. start standby B following A
3. start cascade standby C following B.
4. start another standby D following C.
5. Promote standby B.
6. After successful time line switch in cascade standby C & D, stop D.
7. Restart D, startup is successful and connecting to standby C.
8. Stop C.
9. Restart C, startup is failing.

Defect-2:
1. start primary A
2. start standby B following A
3. start cascade standby C following B.
4. Start another standby D following C.
5. Execute the following commands in the primary A.
create table tbl(f int);
insert into tbl values(generate_series(1,1000));
6. Promote standby B.
7. Execute the following commands in the primary B.
insert into tbl values(generate_series(1001,2000));
insert into tbl values(generate_series(2001,3000));

The following logs are observed on standby C:
LOG: restarted WAL streaming at position 0/700 on tli 2
ERROR: requested WAL segment 00020007 has already been removed
LOG: record with zero length at 0/7028190
LOG: record with zero length at 0/7048540
LOG: out-of-sequence timeline ID 1 (after 2) in log segment 00020007, offset 0

The following logs are observed on standby D:
LOG: restarted WAL streaming at position 0/700 on tli 2
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 2
FATAL: error reading result of streaming command: ERROR: requested WAL segment 00020007 has already been removed
LOG: streaming replication successfully connected to primary
8. Stop standby D normally and restart D. Restart is failing.

Code Review
--
1.
> Agreed. Cleaning up at end-of-xact won't help walsender or other
> non-backend processes, though, because they don't do transactions.
> But a top-level ResourceOwner that's reset in the sigsetjmp() cleanup
> routine would work.

Do you think the cleanup of files should be done as part of this patch, or should it be handled separately, since it already exists in other code paths? In that case, maybe a ToDo item can be added.

2. Also, for forbidden_in_wal_sender(firstchar);, instead of handling it as part of each message, isn't it better to call it only once, something like is_command_allowed(firstchar); switch (fi
Re: [HACKERS] Switching timeline over streaming replication
On 10.10.2012 17:54, Thom Brown wrote:
> Hmm... I get something different. When I promote standby B, standby
> C's log shows:
>
> LOG: walreceiver ended streaming and awaits new instructions
> LOG: re-handshaking at position 0/400 on tli 1
> LOG: fetching timeline history file for timeline 2 from primary server
> LOG: walreceiver ended streaming and awaits new instructions
> LOG: new target timeline is 2
>
> Then when I stop then start standby C I get:
>
> FATAL: timeline history was not contiguous
> LOG: startup process (PID 22986) exited with exit code 1
> LOG: aborting startup due to startup process failure

Found & fixed this one. A paren was misplaced in the tliOfPointInHistory() function.

On 16.11.2012 16:01, Amit Kapila wrote:
> The following problems are observed while testing of the patch.
> Defect-1:
> 1. start primary A
> 2. start standby B following A
> 3. start cascade standby C following B.
> 4. Promote standby B.
> 5. After successful time line switch in cascade standby C, stop C.
> 6. Restart C, startup is failing with the following error.
>
> LOG: database system was shut down in recovery at 2012-11-16 16:26:29 IST
> FATAL: requested timeline 2 does not contain minimum recovery point 0/30143A0 on timeline 1
> LOG: startup process (PID 415) exited with exit code 1
> LOG: aborting startup due to startup process failure
>
> The above defect is already discussed in the following link.
> http://archives.postgresql.org/message-id/00a801cda6f3$4aba27b0$e02e7710$@ka p...@huawei.com

Fixed now, sorry for neglecting this earlier. The problem was that if the primary switched to a new timeline at position X, and the standby followed that switch, on restart it would set minRecoveryPoint to X, and the new
> Try to stop the cascade standby C (which is failing and the server is not stopping; the WAL receiver process is still running and clients are not allowed to connect).
>
> Defect-2 happened only once in my test environment, I will try to reproduce it.

Found it. When restarting the streaming, I reused the WALRCV_STARTING state. But if you then exited recovery, WalRcvRunning() would think that the walreceiver is stuck starting up, because it's been longer than 10 seconds since it was launched and it's still in WALRCV_STARTING state, so it put it into WALRCV_STOPPED state. And walreceiver didn't expect to be put into STOPPED state after having started up successfully already. I added a new explicit WALRCV_RESTARTING state to handle that.

In addition to the above bug fixes, there are some small changes since the last patch version:
* I changed the LOG messages printed in various stages a bit, hopefully making it easier to follow what's happening. Feedback is welcome on when and how we should log, and whether some error messages need clarification.
* 'ps' display is updated when the walreceiver enters and exits idle mode
* Updated pg_controldata and pg_resetxlog to handle the new minRecoveryPointTLI field I added to the control file.
* startup process wakes up walsenders at the end of recovery, so that cascading standbys are notified immediately when the timeline changes. That removes some of the delay in the process.

- Heikki

streaming-tli-switch-7.patch.gz Description: GNU Zip compressed data
Re: [HACKERS] Switching timeline over streaming replication
On Thursday, November 15, 2012 6:05 PM Heikki Linnakangas wrote:
> On 15.11.2012 12:44, Heikki Linnakangas wrote:
> > Here's an updated version of this patch, rebased with master,
> > including the recent replication timeout changes, and some other cleanup.
> >
> > On 12.10.2012 09:34, Amit Kapila wrote:
> >> The test is finished from my side.
> >>
> >> one more issue:
> > ...
> >> ./pg_basebackup -P -D ../../data_sub -X fetch -p 2303
> >> pg_basebackup: COPY stream ended before last file was finished
> >
> > Fixed this.
> >
> > However, the test scenario you point to here:
> > http://archives.postgresql.org/message-id/00a801cda6f3$4aba27b0$e02e7710$@kap...@huawei.com
> > still seems to be broken, although I get a different error message now.
> > I'll dig into this..
>
> Ok, here's an updated patch again, with that bug fixed.

First, I started with testing of this patch.

Basic stuff:
- Patch applies OK
- Compiles cleanly with no warnings
- Regression tests pass except the "standbycheck".

From a glance, the "standbycheck" regression failures are because the sql scripts and expected outputs are a little old.

The following problems were observed while testing the patch.

Defect-1:
1. start primary A
2. start standby B following A
3. start cascade standby C following B.
4. Promote standby B.
5. After successful time line switch in cascade standby C, stop C.
6. Restart C, startup is failing with the following error.

LOG: database system was shut down in recovery at 2012-11-16 16:26:29 IST
FATAL: requested timeline 2 does not contain minimum recovery point 0/30143A0 on timeline 1
LOG: startup process (PID 415) exited with exit code 1
LOG: aborting startup due to startup process failure

The above defect is already discussed in the following link.
http://archives.postgresql.org/message-id/00a801cda6f3$4aba27b0$e02e7710$@ka p...@huawei.com

Defect-2:
1. start primary A
2. start standby B following A
3.
start cascade standby C following B with the 'recovery_target_timeline' option in recovery.conf disabled.
4. Promote standby B.
5. Cascade standby C is not able to follow the new master B because of the timeline difference.
6. Try to stop the cascade standby C (which is failing and the server is not stopping; the WAL receiver process is still running and clients are not allowed to connect).

Defect-2 happened only once in my test environment, I will try to reproduce it.

With Regards, Amit Kapila.
Re: [HACKERS] Switching timeline over streaming replication
On 15.11.2012 16:55, Tom Lane wrote:
> Heikki Linnakangas writes:
> > This is a fairly general issue, actually. Looking around, I can see at least two similar cases in existing code, with BasicOpenFile, where we will leak file descriptors on error:
>
> Um, don't we automatically clean those up during transaction abort?

Not the ones allocated with PathNameOpenFile or BasicOpenFile. Files allocated with AllocateFile() and OpenTemporaryFile() are cleaned up at abort.

> If we don't, we ought to think about that, not about cluttering calling code with certain-to-be-inadequate cleanup in error cases.

Agreed. Cleaning up at end-of-xact won't help walsender or other non-backend processes, though, because they don't do transactions. But a top-level ResourceOwner that's reset in the sigsetjmp() cleanup routine would work.

- Heikki
Re: [HACKERS] Switching timeline over streaming replication
Heikki Linnakangas writes:
> This is a fairly general issue, actually. Looking around, I can see at
> least two similar cases in existing code, with BasicOpenFile, where we
> will leak file descriptors on error:

Um, don't we automatically clean those up during transaction abort? If we don't, we ought to think about that, not about cluttering calling code with certain-to-be-inadequate cleanup in error cases.

regards, tom lane
Re: [HACKERS] Switching timeline over streaming replication
On Thursday, November 15, 2012 6:05 PM Heikki Linnakangas wrote:
> On 15.11.2012 12:44, Heikki Linnakangas wrote:
> > Here's an updated version of this patch, rebased with master,
> > including the recent replication timeout changes, and some other cleanup.
> >
> > On 12.10.2012 09:34, Amit Kapila wrote:
> >> The test is finished from my side.
> >>
> >> one more issue:
> > ...
> >> ./pg_basebackup -P -D ../../data_sub -X fetch -p 2303
> >> pg_basebackup: COPY stream ended before last file was finished
> >
> > Fixed this.
> >
> > However, the test scenario you point to here:
> > http://archives.postgresql.org/message-id/00a801cda6f3$4aba27b0$e02e7710$@kap...@huawei.com
> > still seems to be broken, although I get a different error message now.
> > I'll dig into this..
>
> Ok, here's an updated patch again, with that bug fixed.

I shall review and test the updated patch in the CommitFest.

With Regards, Amit Kapila.
Re: [HACKERS] Switching timeline over streaming replication
On 15.11.2012 12:44, Heikki Linnakangas wrote:
> Here's an updated version of this patch, rebased with master, including the recent replication timeout changes, and some other cleanup.
>
> On 12.10.2012 09:34, Amit Kapila wrote:
> > The test is finished from my side.
> >
> > one more issue:
> > ...
> > ./pg_basebackup -P -D ../../data_sub -X fetch -p 2303
> > pg_basebackup: COPY stream ended before last file was finished
>
> Fixed this.
>
> However, the test scenario you point to here:
> http://archives.postgresql.org/message-id/00a801cda6f3$4aba27b0$e02e7710$@kap...@huawei.com
> still seems to be broken, although I get a different error message now.
> I'll dig into this..

Ok, here's an updated patch again, with that bug fixed.

- Heikki

streaming-tli-switch-6.patch.gz Description: GNU Zip compressed data
Re: [HACKERS] Switching timeline over streaming replication
On 10.10.2012 17:26, Amit Kapila wrote:
> 36. +SendTimeLineHistory(TimeLineHistoryCmd *cmd) { ..
> if (nread <= 0)
> +ereport(ERROR,
> +(errcode_for_file_access(),
> + errmsg("could not read file \"%s\": %m",
> +path)));
> FileClose should be done in the error case as well.

Hmm, I think you're right. The straightforward fix to just call FileClose() before the ereport()s in that function would not be enough, though. You might run out of memory in pq_sendbytes(), for example, which would throw an error. We could use PG_TRY/CATCH for this, but that seems like overkill. Perhaps the simplest fix is to use a global (static) variable for the fd, and clean it up in WalSndErrorCleanup().

This is a fairly general issue, actually. Looking around, I can see at least two similar cases in existing code, with BasicOpenFile, where we will leak file descriptors on error:

copy_file: there are several error cases, including out-of-disk-space, with no attempt to close the fds.

XLogFileInit: again, no attempt to close the file descriptor on failure. This is called at checkpoint from the checkpointer process, to preallocate new xlog files.

Given that we haven't heard any complaints of anyone running into these, these are not a big deal in practice, but in theory at least the XLogFileInit leak could lead to serious problems, as it could cause the checkpointer to run out of file descriptors.

- Heikki
Re: [HACKERS] Switching timeline over streaming replication
Here's an updated version of this patch, rebased with master, including the recent replication timeout changes, and some other cleanup. On 12.10.2012 09:34, Amit Kapila wrote: The test is finished from myside. one more issue: > ... ./pg_basebackup -P -D ../../data_sub -X fetch -p 2303 pg_basebackup: COPY stream ended before last file was finished Fixed this. However, the test scenario you point to here: http://archives.postgresql.org/message-id/00a801cda6f3$4aba27b0$e02e7710$@kap...@huawei.com still seems to be broken, although I get a different error message now. I'll dig into this.. - Heikki streaming-tli-switch-5.patch.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Switching timeline over streaming replication
Heikki Linnakangas wrote:
> Attached is a new version of the patch. I committed the refactoring
> of XLogPageRead() already, as that was a readability improvement
> even without this patch. All the reported issues should be fixed
> now, although I will continue testing this tomorrow. I added various
> checks that the correct timeline is followed during recovery.
> minRecoveryPoint is now accompanied by a timeline ID, so that when
> we restart recovery, we check that we recover back to
> minRecoveryPoint along the same timeline as last time. Also, it now
> checks at beginning of recovery that the checkpoint record comes
> from the correct timeline. That fixes the problem that you reported
> above. I also adjusted the error messages on timeline history
> problems to be more clear.

Heikki, I see Amit found a problem with this patch. I assume you're going to work a bit more on it and submit/commit another version. I'm marking this one Returned with Feedback. Thanks.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Switching timeline over streaming replication
> -----Original Message-----
> From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Amit Kapila
> Sent: Wednesday, October 10, 2012 7:57 PM
> To: 'Heikki Linnakangas'
> Cc: 'PostgreSQL-development'
> Subject: Re: [HACKERS] Switching timeline over streaming replication
>
> On Tuesday, October 09, 2012 10:32 PM Heikki Linnakangas wrote:
> > On 06.10.2012 15:58, Amit Kapila wrote:
> > > One more test seems to be failed. Apart from this, other tests are passed.
> >
> > It seems there is one more defect, please check the same
> Defect:
> The test is finished from my side. One more issue:

Steps to reproduce the defect:
1. Do initdb.
2. Set port=2303, wal_level=hot_standby, hot_standby=off, max_walsenders=3 in the postgresql.conf file.
3. Enable the replication connection in pg_hba.conf.
4. Start the server.

Executing the following commands leads to failure:

./pg_basebackup -P -D ../../data_sub -X fetch -p 2303
pg_basebackup: COPY stream ended before last file was finished
rm -fr ../../data_sub
./pg_basebackup -P -D ../../data_sub -X fetch -p 2303
pg_basebackup: COPY stream ended before last file was finished

The following logs are observed in the server console:

ERROR: requested WAL segment 0002 has already been removed
ERROR: requested WAL segment 0003 has already been removed

With Regards,
Amit Kapila.

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
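For reference, the settings in step 2 would look like the fragment below in postgresql.conf. Note that the actual GUC is spelled max_wal_senders ("max_walsenders" in the report is a typo), and the pg_hba.conf entry is only an example assuming local testing over 127.0.0.1:

```ini
# postgresql.conf excerpt (step 2 of the reproduction)
port = 2303
wal_level = hot_standby
hot_standby = off
max_wal_senders = 3

# pg_hba.conf example entry (step 3), assuming local loopback testing:
# TYPE  DATABASE     USER  ADDRESS       METHOD
# host  replication  all   127.0.0.1/32  trust
```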
Re: [HACKERS] Switching timeline over streaming replication
On 10 October 2012 15:26, Amit Kapila wrote:
> On Tuesday, October 09, 2012 10:32 PM Heikki Linnakangas wrote:
>> On 06.10.2012 15:58, Amit Kapila wrote:
>> > One more test seems to be failed. Apart from this, other tests are passed.
>
> It seems there is one more defect, please check the same
> Defect:
>
> 1. start primary A
> 2. start standby B following A
> 3. start cascade standby C following B.
> 4. Promote standby B.
> 5. After successful time line switch in cascade standby C, stop C.
> 6. Restart C, startup is failing with the following error.
> FATAL: requested timeline 2 does not contain minimum recovery point
> 0/300 on timeline 1

Hmm... I get something different. When I promote standby B, standby C's log shows:

LOG: walreceiver ended streaming and awaits new instructions
LOG: re-handshaking at position 0/400 on tli 1
LOG: fetching timeline history file for timeline 2 from primary server
LOG: walreceiver ended streaming and awaits new instructions
LOG: new target timeline is 2

Then when I stop then start standby C I get:

FATAL: timeline history was not contiguous
LOG: startup process (PID 22986) exited with exit code 1
LOG: aborting startup due to startup process failure

--
Thom

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Switching timeline over streaming replication
On Tuesday, October 09, 2012 10:32 PM Heikki Linnakangas wrote:
> On 06.10.2012 15:58, Amit Kapila wrote:
> > One more test seems to be failed. Apart from this, other tests are passed.
>
> It seems there is one more defect, please check the same

Defect:
1. start primary A
2. start standby B following A
3. start cascade standby C following B.
4. Promote standby B.
5. After successful timeline switch in cascade standby C, stop C.
6. Restart C, startup is failing with the following error.
FATAL: requested timeline 2 does not contain minimum recovery point 0/300 on timeline 1

Review:
The following statement is present in the high-availability.sgml file, which also needs to be modified in the patch:

Promoting a cascading standby terminates the immediate downstream replication connections which it serves. This is because the timeline becomes different between standbys, and they can no longer continue replication. The affected standby(s) may reconnect to reestablish streaming replication.

I felt some of the minor comments are still not handled:

35.
+SendTimeLineHistory(TimeLineHistoryCmd *cmd)
{
..
+	fd = PathNameOpenFile(path, O_RDONLY | PG_BINARY, 0666);

Error handling for fd < 0 is missing.

36.
+SendTimeLineHistory(TimeLineHistoryCmd *cmd)
{
..
	if (nread <= 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read file \"%s\": %m", path)));

FileClose should be done in the error case as well.

With Regards,
Amit Kapila.

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Switching timeline over streaming replication
On 06.10.2012 15:58, Amit Kapila wrote:
One more test seems to be failed. Apart from this, other tests are passed.
2.
a. Master M-1
b. Standby S-1 follows M-1
c. insert 10 records on M-1. verify all records are visible on M-1, S-1
d. Stop S-1
e. insert 2 records on M-1.
f. Stop M-1
g. Start S-1
h. Promote S-1
i. Make M-1 recovery.conf such that it should connect to S-1
j. Start M-1. Below error comes on M-1, which is expected as M-1 has more data.

LOG: database system was shut down at 2012-10-05 16:45:39 IST
LOG: entering standby mode
LOG: consistent recovery state reached at 0/176A070
LOG: record with zero length at 0/176A070
LOG: database system is ready to accept read only connections
LOG: streaming replication successfully connected to primary
LOG: fetching timeline history file for timeline 2 from primary server
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1
LOG: walreceiver ended streaming and awaits new instructions
LOG: new timeline 2 forked off current database system timeline 1 before current recovery point 0/176A070
LOG: re-handshaking at position 0/100 on tli 1
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1
LOG: walreceiver ended streaming and awaits new instructions
LOG: new timeline 2 forked off current database system timeline 1 before current recovery point 0/176A070

k. Stop M-1. Start M-1. It is able to successfully connect to S-1, which is a problem.
l. check in S-1. Records inserted in step e are not present.
m. Now insert records in S-1. M-1 doesn't receive any records. On the M-1 server the following log is getting printed:

LOG: out-of-sequence timeline ID 1 (after 2) in log segment 00020001, offset 0
LOG: out-of-sequence timeline ID 1 (after 2) in log segment 00020001, offset 0
LOG: out-of-sequence timeline ID 1 (after 2) in log segment 00020001, offset 0
LOG: out-of-sequence timeline ID 1 (after 2) in log segment 00020001, offset 0
LOG: out-of-sequence timeline ID 1 (after 2) in log segment 00020001, offset 0

Hmm, seems we need to keep track of which timeline we've used to recover before. Before the restart, the master correctly notices that timeline 2 forked off earlier in its history, so it cannot recover to that timeline. But after the restart the master begins recovery from the previous checkpoint, and because timeline 2 forked off timeline 1 after the checkpoint, it concludes that it can follow that timeline. It doesn't realize that it had already recovered/flushed some WAL on timeline 1 after the fork point.

Attached is a new version of the patch. I committed the refactoring of XLogPageRead() already, as that was a readability improvement even without this patch. All the reported issues should be fixed now, although I will continue testing this tomorrow. I added various checks that the correct timeline is followed during recovery. minRecoveryPoint is now accompanied by a timeline ID, so that when we restart recovery, we check that we recover back to minRecoveryPoint along the same timeline as last time. Also, it now checks at the beginning of recovery that the checkpoint record comes from the correct timeline. That fixes the problem that you reported above. I also adjusted the error messages on timeline history problems to be more clear.

- Heikki

streaming-tli-switch-4.patch.gz
Description: GNU Zip compressed data

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Promoting a standby during base backup (was Re: [HACKERS] Switching timeline over streaming replication)
On 4 October 2012 18:07, Fujii Masao wrote: > On Thu, Oct 4, 2012 at 4:59 PM, Heikki Linnakangas > wrote: >> On 03.10.2012 18:15, Amit Kapila wrote: >>> >>> On Tuesday, October 02, 2012 4:21 PM Heikki Linnakangas wrote: Hmm, should a base backup be aborted when the standby is promoted? Does the promotion render the backup corrupt? >>> >>> >>> I think currently it does so. Pls refer >>> 1. >>> do_pg_stop_backup(char *labelfile, bool waitforarchive) >>> { >>> .. >>> if (strcmp(backupfrom, "standby") == 0&& !backup_started_in_recovery) >>> ereport(ERROR, >>> >>> (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), >>> errmsg("the standby was promoted during >>> online backup"), >>> errhint("This means that the backup >>> being >>> taken is corrupt " >>> "and should not be used. >>> " >>> "Try taking another >>> online >>> backup."))); >>> .. >>> >>> } >> >> >> Okay. I think that check in do_pg_stop_backup() actually already ensures >> that you don't end up with a corrupt backup, even if the standby is promoted >> while a backup is being taken. Admittedly it would be nicer to abort it >> immediately rather than error out at the end. >> >> But I wonder why promoting a standby renders the backup invalid in the first >> place? Fujii, Simon, can you explain that? > > Simon had the same question and I answered it before. > > http://archives.postgresql.org/message-id/cahgqgwfu04oo8yl5sucdjvq3brni7wtfzty9wa2kxtznhic...@mail.gmail.com > --- >> You say >> "If the standby is promoted to the master during online backup, the >> backup fails." >> but no explanation of why? >> >> I could work those things out, but I don't want to have to, plus we >> may disagree if I did. > > If the backup succeeds in that case, when we start an archive recovery from > that > backup, the recovery needs to cross between two timelines. Which means that > we need to set recovery_target_timeline before starting recovery. 
Whether > recovery_target_timeline needs to be set or not depends on whether the standby > was promoted during taking the backup. Leaving such a decision to a user seems > fragile. I accepted your answer before, but I think it should be challenged now. This is definitely a time when you really want that backup, so invalidating it for such a weak reason is not useful, even if I understand your original thought. Something that has concerned me is that we don't have an explicit "timeline change record". We *say* we do that at shutdown checkpoints, but that is recorded in the new timeline. So we have the strange situation of changing timeline at two separate places. When we change timeline we really should generate one last WAL on the old timeline that marks an explicit change of timeline and a single exact moment when the timeline change takes place. With PITR we are unable to do that, because any timeline can fork at any point. With smooth switchover we have a special case that is not "anything goes" and there is a good case for not incrementing the timeline at all. This is still a half-formed thought, but at least you should know I'm in the debate. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Promoting a standby during base backup (was Re: [HACKERS] Switching timeline over streaming replication)
On 04.10.2012 20:07, Fujii Masao wrote: On Thu, Oct 4, 2012 at 4:59 PM, Heikki Linnakangas But I wonder why promoting a standby renders the backup invalid in the first place? Fujii, Simon, can you explain that? Simon had the same question and I answered it before. http://archives.postgresql.org/message-id/cahgqgwfu04oo8yl5sucdjvq3brni7wtfzty9wa2kxtznhic...@mail.gmail.com --- You say "If the standby is promoted to the master during online backup, the backup fails." but no explanation of why? I could work those things out, but I don't want to have to, plus we may disagree if I did. If the backup succeeds in that case, when we start an archive recovery from that backup, the recovery needs to cross between two timelines. Which means that we need to set recovery_target_timeline before starting recovery. Whether recovery_target_timeline needs to be set or not depends on whether the standby was promoted during taking the backup. Leaving such a decision to a user seems fragile. pg_control is backed up last, it would contain the new timeline. No need to set recovery_target_timeline. pg_basebackup -x ensures that all required files are included in the backup and we can start recovery without restoring any file from the archive. But if the standby is promoted during the backup, the timeline history file would become an essential file for recovery, but it's not included in the backup. That is true. We could teach it to include the timeline history file, though. The situation may change if your switching-timeline patch has been committed. It's useful if we can continue the backup even if the standby is promoted. It wouldn't help with pg_basebackup -x, although it would allow streaming replication to fetch the timeline history file. I guess it's best to keep that restriction for now. But I'll add a TODO item for this. 
- Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
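Until base backups include timeline history files automatically, the manual workaround implied above is to copy the history file into the backup's pg_xlog by hand before starting recovery. A sketch with purely hypothetical placeholder paths (mktemp directories stand in for the real data directories, and the printf fakes the history file a promotion would have written):

```shell
# Hypothetical layout: $UPSTREAM_XLOG is the pg_xlog of the promoted
# server the backup was taken from; $BACKUP_DIR is the fresh base backup.
UPSTREAM_XLOG=$(mktemp -d)
BACKUP_DIR=$(mktemp -d)
mkdir -p "$BACKUP_DIR/pg_xlog"

# Stand-in for the history file written at promotion (timeline 1 -> 2).
printf '1\t0/3000000\tno recovery target specified\n' \
    > "$UPSTREAM_XLOG/00000002.history"

# The workaround: carry the history file over into the backup by hand,
# so recovery can follow the timeline switch without the archive.
cp "$UPSTREAM_XLOG"/*.history "$BACKUP_DIR/pg_xlog/"
ls "$BACKUP_DIR/pg_xlog"
```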
Re: [HACKERS] Switching timeline over streaming replication
On Thursday, October 04, 2012 7:22 PM Heikki Linnakangas wrote:
>> On Wednesday, October 03, 2012 8:45 PM Heikki Linnakangas wrote:
>> On Tuesday, October 02, 2012 4:21 PM Heikki Linnakangas wrote:
>>> Thanks for the thorough review! I committed the xlog.c refactoring patch
>>> now. Attached is a new version of the main patch, comments on specific
>>> points below. I didn't adjust the docs per your comments yet, will do
>>> that next.
>>
>> I have some doubts regarding the comments fixed by you and some more new
>> review comments.
>> After this I shall focus majorly towards testing of this Patch.

Testing
---
One more test seems to be failed. Apart from this, other tests are passed.
2.
a. Master M-1
b. Standby S-1 follows M-1
c. insert 10 records on M-1. verify all records are visible on M-1, S-1
d. Stop S-1
e. insert 2 records on M-1.
f. Stop M-1
g. Start S-1
h. Promote S-1
i. Make M-1 recovery.conf such that it should connect to S-1
j. Start M-1. Below error comes on M-1, which is expected as M-1 has more data.

LOG: database system was shut down at 2012-10-05 16:45:39 IST
LOG: entering standby mode
LOG: consistent recovery state reached at 0/176A070
LOG: record with zero length at 0/176A070
LOG: database system is ready to accept read only connections
LOG: streaming replication successfully connected to primary
LOG: fetching timeline history file for timeline 2 from primary server
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1
LOG: walreceiver ended streaming and awaits new instructions
LOG: new timeline 2 forked off current database system timeline 1 before current recovery point 0/176A070
LOG: re-handshaking at position 0/100 on tli 1
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1
LOG: walreceiver ended streaming and awaits new instructions
LOG: new timeline 2 forked off current database system timeline 1 before current recovery point 0/176A070

k. Stop M-1. Start M-1. It is able to successfully connect to S-1, which is a problem.
l. check in S-1. Records inserted in step e are not present.
m. Now insert records in S-1. M-1 doesn't receive any records. On the M-1 server the following log is getting printed:

LOG: out-of-sequence timeline ID 1 (after 2) in log segment 00020001, offset 0
LOG: out-of-sequence timeline ID 1 (after 2) in log segment 00020001, offset 0
LOG: out-of-sequence timeline ID 1 (after 2) in log segment 00020001, offset 0
LOG: out-of-sequence timeline ID 1 (after 2) in log segment 00020001, offset 0
LOG: out-of-sequence timeline ID 1 (after 2) in log segment 00020001, offset 0

With Regards,
Amit Kapila.

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Sharing more infrastructure between walsenders and regular backends (was Re: [HACKERS] Switching timeline over streaming replication)
On Thursday, October 04, 2012 8:40 PM Heikki Linnakangas wrote: > On 03.10.2012 18:15, Amit Kapila wrote: > > 35.WalSenderMain(void) > > { > > .. > > +if (walsender_shutdown_requested) > > +ereport(FATAL, > > + > (errcode(ERRCODE_ADMIN_SHUTDOWN), > > + errmsg("terminating > > + replication > > connection due to administrator command"))); > > + > > +/* Tell the client that we are ready to receive > > + commands */ > > > > +ReadyForQuery(DestRemote); > > + > > .. > > > > +if (walsender_shutdown_requested) > > +ereport(FATAL, > > + > (errcode(ERRCODE_ADMIN_SHUTDOWN), > > + errmsg("terminating > > + replication > > connection due to administrator command"))); > > + > > > > is it necessary to check walsender_shutdown_requested 2 times in a > > loop, if yes, then can we write comment why it is important to check > > it again. > > The idea was to check for shutdown request before and after the > pq_getbyte() call, because that can block for a long time. > > Looking closer, we don't currently (ie. without this patch) make any > effort to react to SIGTERM in a walsender, while it's waiting for a > command from the client. After starting replication, it does check > walsender_shutdown_requested in the loop, and it's also checked during a > base backup (although only when switching to send next file, which seems > too seldom). This issue is orthogonal to handling timeline changes over > streaming replication, although that patch will make it more important > to handle SIGTERM quickly while waiting for a command, because you stay > in that mode for longer and more often. > > I think walsender needs to share more infrastructure with regular > backends to handle this better. 
When we first implemented streaming > replication in 9.0, it made sense to implement just the bare minimum > needed to accept the handshake commands before entering the Copy state, > but now that the replication command set has grown to cover base > backups, and fetching timelines with the patch being discussed, we > should bite the bullet and make the command loop more feature-complete > and robust. Certainly there are benefits of making walsender and backend infrastructure similar as mentioned by Simon, Andres and Tom. However shall we not do this as a separate feature along with some other requirements and let switchtimeline patch get committed first as this is altogether a different feature. With Regards, Amit Kapila. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Sharing more infrastructure between walsenders and regular backends (was Re: [HACKERS] Switching timeline over streaming replication)
Simon Riggs writes: > On 4 October 2012 17:23, Heikki Linnakangas wrote: >> Perhaps we could make walsenders even more like regular backends than what I >> was proposing, so that the replication commands are parsed and executed just >> like regular utility commands. However, that'd require some transaction >> support in walsender, for starters, which seems messy. It might become >> sensible in the future if the replication command set gets even more >> complicated, but it doesn't seem like a good idea at the moment. > It's come up a few times now that people want to run a few queries > either before or after running a base backup. ... > Andres suggested to me the other day we make walsender more like > regular backends. At the time I wasn't sure I agreed, but reading this > it looks like a sensible way to go. That was what I was thinking too, but on reflection there's at least one huge problem: how could we run queries without being connected to a specific database? Which walsender isn't. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Sharing more infrastructure between walsenders and regular backends (was Re: [HACKERS] Switching timeline over streaming replication)
On 4 October 2012 17:23, Heikki Linnakangas wrote:
> On 04.10.2012 19:00, Tom Lane wrote:
>> Heikki Linnakangas writes:
>>> So I propose the attached patch. I made small changes to postgres.c to
>>> make it call exec_replication_command() instead of exec_simple_query(),
>>> and reject extend query protocol, in a WAL sender process. A lot of code
>>> related to handling the main command loop and signals is removed from
>>> walsender.c.
>>
>> Why do we need the forbidden_in_wal_sender stuff? If we're going in
>> this direction, I suggest there is little reason to restrict what the
>> replication client can do. This seems to be both ugly and a drag on
>> the performance of normal backends.
>
> Well, there's not much need for parameterized queries or cursors with the
> replication command set at the moment. I don't think it's worth it to try to
> support them. Fastpath function calls make no sense either, as you can't
> call user-defined functions in a walsender anyway.
>
> Perhaps we could make walsenders even more like regular backends than what I
> was proposing, so that the replication commands are parsed and executed just
> like regular utility commands. However, that'd require some transaction
> support in walsender, for starters, which seems messy. It might become
> sensible in the future if the replication command set gets even more
> complicated, but it doesn't seem like a good idea at the moment.

It's come up a few times now that people want to run a few queries either before or after running a base backup. Since the pg_basebackup stuff uses walsender, this makes such things impossible. So to support that, we need to allow two kinds of connection, one to "replication" and one to something else, and since the something else is not guaranteed to exist, that makes it even harder.

Andres suggested to me the other day we make walsender more like regular backends. At the time I wasn't sure I agreed, but reading this it looks like a sensible way to go.
-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Promoting a standby during base backup (was Re: [HACKERS] Switching timeline over streaming replication)
On Thu, Oct 4, 2012 at 4:59 PM, Heikki Linnakangas wrote: > On 03.10.2012 18:15, Amit Kapila wrote: >> >> On Tuesday, October 02, 2012 4:21 PM Heikki Linnakangas wrote: >>> >>> Hmm, should a base backup be aborted when the standby is promoted? Does >>> the promotion render the backup corrupt? >> >> >> I think currently it does so. Pls refer >> 1. >> do_pg_stop_backup(char *labelfile, bool waitforarchive) >> { >> .. >> if (strcmp(backupfrom, "standby") == 0&& !backup_started_in_recovery) >> ereport(ERROR, >> >> (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), >> errmsg("the standby was promoted during >> online backup"), >> errhint("This means that the backup >> being >> taken is corrupt " >> "and should not be used. >> " >> "Try taking another >> online >> backup."))); >> .. >> >> } > > > Okay. I think that check in do_pg_stop_backup() actually already ensures > that you don't end up with a corrupt backup, even if the standby is promoted > while a backup is being taken. Admittedly it would be nicer to abort it > immediately rather than error out at the end. > > But I wonder why promoting a standby renders the backup invalid in the first > place? Fujii, Simon, can you explain that? Simon had the same question and I answered it before. http://archives.postgresql.org/message-id/cahgqgwfu04oo8yl5sucdjvq3brni7wtfzty9wa2kxtznhic...@mail.gmail.com --- > You say > "If the standby is promoted to the master during online backup, the > backup fails." > but no explanation of why? > > I could work those things out, but I don't want to have to, plus we > may disagree if I did. If the backup succeeds in that case, when we start an archive recovery from that backup, the recovery needs to cross between two timelines. Which means that we need to set recovery_target_timeline before starting recovery. Whether recovery_target_timeline needs to be set or not depends on whether the standby was promoted during taking the backup. Leaving such a decision to a user seems fragile. 
pg_basebackup -x ensures that all required files are included in the backup and we can start recovery without restoring any file from the archive. But if the standby is promoted during the backup, the timeline history file would become an essential file for recovery, but it's not included in the backup. --- The situation may change if your switching-timeline patch has been committed. It's useful if we can continue the backup even if the standby is promoted. Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Sharing more infrastructure between walsenders and regular backends (was Re: [HACKERS] Switching timeline over streaming replication)
On 04.10.2012 19:00, Tom Lane wrote: Heikki Linnakangas writes: So I propose the attached patch. I made small changes to postgres.c to make it call exec_replication_command() instead of exec_simple_query(), and reject extend query protocol, in a WAL sender process. A lot of code related to handling the main command loop and signals is removed from walsender.c. Why do we need the forbidden_in_wal_sender stuff? If we're going in this direction, I suggest there is little reason to restrict what the replication client can do. This seems to be both ugly and a drag on the performance of normal backends. Well, there's not much need for parameterized queries or cursors with the replication command set at the moment. I don't think it's worth it to try to support them. Fastpath function calls make no sense either, as you can't call user-defined functions in a walsender anyway. Perhaps we could make walsenders even more like regular backends than what I was proposing, so that the replication commands are parsed and executed just like regular utility commands. However, that'd require some transaction support in walsender, for starters, which seems messy. It might become sensible in the future if the replication command set gets even more complicated, but it doesn't seem like a good idea at the moment. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Sharing more infrastructure between walsenders and regular backends (was Re: [HACKERS] Switching timeline over streaming replication)
Heikki Linnakangas writes: > So I propose the attached patch. I made small changes to postgres.c to > make it call exec_replication_command() instead of exec_simple_query(), > and reject extend query protocol, in a WAL sender process. A lot of code > related to handling the main command loop and signals is removed from > walsender.c. Why do we need the forbidden_in_wal_sender stuff? If we're going in this direction, I suggest there is little reason to restrict what the replication client can do. This seems to be both ugly and a drag on the performance of normal backends. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Sharing more infrastructure between walsenders and regular backends (was Re: [HACKERS] Switching timeline over streaming replication)
On 03.10.2012 18:15, Amit Kapila wrote:

35.
WalSenderMain(void)
{
..
+	if (walsender_shutdown_requested)
+		ereport(FATAL,
+				(errcode(ERRCODE_ADMIN_SHUTDOWN),
+				 errmsg("terminating replication connection due to administrator command")));
+
+	/* Tell the client that we are ready to receive commands */
+	ReadyForQuery(DestRemote);
+
..
+	if (walsender_shutdown_requested)
+		ereport(FATAL,
+				(errcode(ERRCODE_ADMIN_SHUTDOWN),
+				 errmsg("terminating replication connection due to administrator command")));
+

Is it necessary to check walsender_shutdown_requested two times in a loop? If yes, can we write a comment on why it is important to check it again?

The idea was to check for a shutdown request before and after the pq_getbyte() call, because that can block for a long time.

Looking closer, we don't currently (ie. without this patch) make any effort to react to SIGTERM in a walsender while it's waiting for a command from the client. After starting replication, it does check walsender_shutdown_requested in the loop, and it's also checked during a base backup (although only when switching to send the next file, which seems too seldom). This issue is orthogonal to handling timeline changes over streaming replication, although that patch will make it more important to handle SIGTERM quickly while waiting for a command, because you stay in that mode for longer and more often.

I think walsender needs to share more infrastructure with regular backends to handle this better. When we first implemented streaming replication in 9.0, it made sense to implement just the bare minimum needed to accept the handshake commands before entering the Copy state, but now that the replication command set has grown to cover base backups, and fetching timelines with the patch being discussed, we should bite the bullet and make the command loop more feature-complete and robust.
In a regular backend, the command loop reacts to SIGTERM immediately, setting ImmediateInterruptOK at the right places, and calling CHECK_FOR_INTERRUPTS() at strategic places. I propose that we let PostgresMain handle the main command loop for walsender processes too, like it does for regular backends, and use ProcDiePending and the regular die() signal handler for SIGTERM in walsender as well. So I propose the attached patch. I made small changes to postgres.c to make it call exec_replication_command() instead of exec_simple_query(), and reject extend query protocol, in a WAL sender process. A lot of code related to handling the main command loop and signals is removed from walsender.c. - Heikki *** a/src/backend/replication/basebackup.c --- b/src/backend/replication/basebackup.c *** *** 22,27 --- 22,28 #include "lib/stringinfo.h" #include "libpq/libpq.h" #include "libpq/pqformat.h" + #include "miscadmin.h" #include "nodes/pg_list.h" #include "replication/basebackup.h" #include "replication/walsender.h" *** *** 30,36 #include "storage/ipc.h" #include "utils/builtins.h" #include "utils/elog.h" - #include "utils/memutils.h" #include "utils/ps_status.h" typedef struct --- 31,36 *** *** 370,388 void SendBaseBackup(BaseBackupCmd *cmd) { DIR *dir; - MemoryContext backup_context; - MemoryContext old_context; basebackup_options opt; parse_basebackup_options(cmd->options, &opt); - backup_context = AllocSetContextCreate(CurrentMemoryContext, - "Streaming base backup context", - ALLOCSET_DEFAULT_MINSIZE, - ALLOCSET_DEFAULT_INITSIZE, - ALLOCSET_DEFAULT_MAXSIZE); - old_context = MemoryContextSwitchTo(backup_context); - WalSndSetState(WALSNDSTATE_BACKUP); if (update_process_title) --- 370,379 *** *** 403,411 SendBaseBackup(BaseBackupCmd *cmd) perform_base_backup(&opt, dir); FreeDir(dir); - - MemoryContextSwitchTo(old_context); - MemoryContextDelete(backup_context); } static void --- 394,399 *** *** 606,612 sendDir(char *path, int basepathlen, bool sizeonly) * error in that 
case. The error handler further up will call * do_pg_abort_backup() for us. */ ! if (walsender_shutdown_requested || walsender_ready_to_stop) ereport(ERROR, (errmsg("shutdown requested, aborting active base backup"))); --- 594,600 * error in that case. The error handler further up will call * do_pg_abort_backup() for us. */ ! if (ProcDiePending || walsender_ready_to_stop) ereport(ERROR, (errmsg("shutdown requested, aborting active base backup"))); *** a/src/backend/replication/walsender.c --- b/src/backend/replication/w
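The double shutdown-request check questioned in review point 35 can be sketched in isolation. The following is a hypothetical, simplified model, not walsender code: `shutdown_requested` stands in for `walsender_shutdown_requested` / `ProcDiePending`, and the injected `get_command` callback stands in for the blocking `pq_getbyte()` call. It only illustrates why the flag must be tested both before and after a blocking read:

```c
#include <signal.h>

/* Set asynchronously by a (simulated) SIGTERM handler. */
static volatile sig_atomic_t shutdown_requested = 0;

/* Returns the next command byte, or -1 if shutdown was requested.
 * get_command is a stand-in for the blocking pq_getbyte() call. */
static int read_command_checked(int (*get_command)(void))
{
    /* Check before blocking: the signal may have arrived while we were
     * busy processing the previous command. */
    if (shutdown_requested)
        return -1;

    int c = get_command();   /* may block for a long time */

    /* Check again after waking up: the handler may have fired while we
     * were blocked, and we must not begin executing a new command. */
    if (shutdown_requested)
        return -1;

    return c;
}

/* Test doubles used to exercise the two paths. */
static int quiet_command(void) { return 'Q'; }
static int command_with_shutdown(void) { shutdown_requested = 1; return 'Q'; }
```

With only one of the two checks, a request arriving on the other side of the blocking call would go unnoticed until the next command completed.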
Re: [HACKERS] Switching timeline over streaming replication
> On Wednesday, October 03, 2012 8:45 PM Heikki Linnakangas wrote: > On Tuesday, October 02, 2012 4:21 PM Heikki Linnakangas wrote: > > Thanks for the thorough review! I committed the xlog.c refactoring > patch > > now. Attached is a new version of the main patch, comments on specific > > points below. I didn't adjust the docs per your comments yet, will do > > that next. > > I have some doubts regarding the comments fixed by you and some more new > review comments. > After this I shall focus majorly towards testing of this Patch. > Testing --- Failed Case -- 1. promotion of standby to master and follow standby to new master. 2. Stop standby and master. Restart standby first and then master 3. Restart of standby gives below errors E:\pg_git_code\installation\bin>LOG: database system was shut down in recovery at 2012-10-04 18:36:00 IST LOG: entering standby mode LOG: consistent recovery state reached at 0/176B800 LOG: redo starts at 0/176B800 LOG: record with zero length at 0/176BD68 LOG: database system is ready to accept read only connections LOG: streaming replication successfully connected to primary LOG: out-of-sequence timeline ID 1 (after 2) in log segment 0002000 1, offset 0 FATAL: terminating walreceiver process due to administrator command LOG: out-of-sequence timeline ID 1 (after 2) in log segment 0002000 1, offset 0 LOG: out-of-sequence timeline ID 1 (after 2) in log segment 0002000 1, offset 0 LOG: out-of-sequence timeline ID 1 (after 2) in log segment 0002000 1, offset 0 LOG: out-of-sequence timeline ID 1 (after 2) in log segment 0002000 1, offset 0 Once this error appears, restarting master/standby in any order, or running operations on the master, always reproduces the above error on the standby. Passed Cases - 1. After promoting standby as new master, try to make previous master (having same WAL as new master) as standby. In this case recovery.conf recovery_target_timeline set to latest. It is able to connect to the new master and starts streaming as expected.
- As per expected behavior. 2. After promoting standby as new master, try to make previous master (having more WAL compare to new master) as standby, error is displayed. - As per expected behavior 3. After promoting standby as new master, try to make previous master (having same WAL as new master) as standby. In this case recovery.conf recovery_target_timeline is not set. Following LOG is displayed. LOG: fetching timeline history file for timeline 2 from primary server LOG: replication terminated by primary server DETAIL: End of WAL reached on timeline 1 LOG: walreceiver ended streaming and awaits new instructions LOG: re-handshaking at position 0/100 on tli 1 LOG: replication terminated by primary server DETAIL: End of WAL reached on timeline 1 LOG: walreceiver ended streaming and awaits new instructions LOG: re-handshaking at position 0/100 on tli 1 LOG: replication terminated by primary server DETAIL: End of WAL reached on timeline 1 - As per expected behavior Pending Cases which needs to be tested (these are scenarios, some more testing I will do based on these scenarios) --- 1. a. Master M-1 b. Standby S-1 follows M-1 c. Standby S-2 follows M-1 d. Promote S-1 as master e. Try to follow S-2 to S-1 -- operation should be success 2. a. Master M-1 b. Standby S-1 follows M-1 c. Stop S-1, M-1 d. Do the PITR in M-1 2 times. This is to increment timeline in M-1 e. try to follow standby S-1 to M-1 -- it should be success. 3. a. Master M-1 b. Standby S-1, S-2 follows M1 c. Standby S-3, S-4 follows S-1 d. Promote Standby which has highest WAL. e. follow all standby's to the new master. 4. a. Master M-1 b. Synchronous Standby S-1, S-2 c. Promote S-1 d. Follow M-1, S-2 to S-1 -- this operation should be success. Concurrent Operations --- 1. a. Master M-1 , Standby S-1 follows M-1, Standby S-2 follows M-1 b. Many concurrent operations on master M-1 c. During concurrent ops, Promote S-1 d. try S-2 to follow S-1 -- it should happen successfully. 2. 
During Promotion, call pg_basebackup 3. During Promotion, try to connect client Resource Testing -- 1. a.Make standby follow master which is many time lines ahead b. Observe if there is any resource leak c. Allow the streaming replication for 30 mins d. Observe if there is any resource leak Code Review - Libpqrcv_readtimelinehistoryfile() { .. .. + if (PQnfields(res) != 2 || PQntuples(res) != 1) + { + int ntuples = PQntuples(res); + int
Promoting a standby during base backup (was Re: [HACKERS] Switching timeline over streaming replication)
On 03.10.2012 18:15, Amit Kapila wrote: On Tuesday, October 02, 2012 4:21 PM Heikki Linnakangas wrote: Hmm, should a base backup be aborted when the standby is promoted? Does the promotion render the backup corrupt? I think currently it does so. Pls refer 1. do_pg_stop_backup(char *labelfile, bool waitforarchive) { .. if (strcmp(backupfrom, "standby") == 0&& !backup_started_in_recovery) ereport(ERROR, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("the standby was promoted during online backup"), errhint("This means that the backup being taken is corrupt " "and should not be used. " "Try taking another online backup."))); .. } Okay. I think that check in do_pg_stop_backup() actually already ensures that you don't end up with a corrupt backup, even if the standby is promoted while a backup is being taken. Admittedly it would be nicer to abort it immediately rather than error out at the end. But I wonder why promoting a standby renders the backup invalid in the first place? Fujii, Simon, can you explain that? - Heikki
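The guard quoted above can be reduced to a one-line predicate. A minimal sketch, with a hypothetical simplified signature (the real test lives in do_pg_stop_backup(): backupfrom comes from the backup label file, and the recovery state is sampled when the backup is stopped):

```c
#include <stdbool.h>
#include <string.h>

/* A backup whose label says it started on a standby must still be in
 * recovery when it finishes; if the server is no longer in recovery,
 * the standby was promoted mid-backup and the backup is rejected. */
static bool promoted_during_backup(const char *backupfrom, bool in_recovery_now)
{
    return strcmp(backupfrom, "standby") == 0 && !in_recovery_now;
}
```

Note that this catches the promotion only at stop time, which matches Heikki's observation that aborting immediately would be nicer.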
Re: [HACKERS] Switching timeline over streaming replication
On Tuesday, October 02, 2012 4:21 PM Heikki Linnakangas wrote: > Thanks for the thorough review! I committed the xlog.c refactoring patch > now. Attached is a new version of the main patch, comments on specific > points below. I didn't adjust the docs per your comments yet, will do > that next. I have some doubts regarding the comments fixed by you and some more new review comments. After this I shall focus majorly towards testing of this Patch. > On 01.10.2012 05:25, Amit kapila wrote: > > 1. In function readTimeLineHistory(), > >two mechanisms are used to fetch timeline from history file > >+sscanf(fline, "%u\t%X/%X", &tli, &switchpoint_hi, > > &switchpoint_lo); > > + > > > 8. In function writeTimeLineHistoryFile(), will it not be better to > > directly write rather than to later do pg_fsync(). > >as it is just one time write. > > Not sure I understood this right, but writeTimeLineHistoryFile() needs > to avoid putting a corrupt, e.g incomplete, file in pg_xlog. The same as > writeTimeLineHistory(). That's why the write+fsync+rename dance is > needed. > Why fsync, isn't the above purpose be resolved if write directly writes to file and then rename. > > 21. @@ -2411,27 +2411,6 @@ reaper(SIGNAL_ARGS) > > > > a. won't it impact stop of online basebackup functionality > because earlier on promotion > > I think this code will stop walsenders and basebackup stop > will also give error in such cases. > > Hmm, should a base backup be aborted when the standby is promoted? Does > the promotion render the backup corrupt? I think currently it does so. Pls refer 1. do_pg_stop_backup(char *labelfile, bool waitforarchive) { .. if (strcmp(backupfrom, "standby") == 0 && !backup_started_in_recovery) ereport(ERROR, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("the standby was promoted during online backup"), errhint("This means that the backup being taken is corrupt " "and should not be used. " "Try taking another online backup."))); .. } 2. 
In documentation of pg_basebackup there is a Note: .If the standby is promoted to the master during online backup, the backup fails. New Ones --- 35.WalSenderMain(void) { .. +if (walsender_shutdown_requested) +ereport(FATAL, +(errcode(ERRCODE_ADMIN_SHUTDOWN), + errmsg("terminating replication connection due to administrator command"))); + +/* Tell the client that we are ready to receive commands */ +ReadyForQuery(DestRemote); + .. +if (walsender_shutdown_requested) +ereport(FATAL, +(errcode(ERRCODE_ADMIN_SHUTDOWN), + errmsg("terminating replication connection due to administrator command"))); + is it necessary to check walsender_shutdown_requested 2 times in a loop, if yes, then can we write comment why it is important to check it again. 35. +SendTimeLineHistory(TimeLineHistoryCmd *cmd) { .. + fd = PathNameOpenFile(path, O_RDONLY | PG_BINARY, 0666); error handling for fd < 0 is missing. 36.+SendTimeLineHistory(TimeLineHistoryCmd *cmd) { .. if (nread <= 0) +ereport(ERROR, +(errcode_for_file_access(), + errmsg("could not read file \"%s\": %m", +path))); FileClose should be done in error case as well. 37. static void XLogSend(char *msgbuf, bool *caughtup) { .. if (currentTimeLineIsHistoric && XLByteLE(currentTimeLineValidUpto, sentPtr)) { /* * This was a historic timeline, and we've reached the point where * we switched to the next timeline. Let the client know we've * reached the end of this timeline, and what the next timeline is. */ /* close the current file. */ if (sendFile >= 0) close(sendFile); sendFile = -1; *caughtup = true; /* Send CopyDone */ pq_putmessage_noblock('c', NULL, 0); streamingDoneSending = true; return; } } I am not able to understand after sending CopyDone message from above code, how walreceiver is handling it and then replying it a CopyDone message. Basically I want to know the walreceiver code which handles it? 38. st
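On the earlier question of why writeTimeLineHistoryFile() needs the write+fsync+rename dance rather than plain write+rename: without the fsync, a crash shortly after the rename can leave the final file name pointing at data that never reached disk, because the rename can become durable before the file contents do. A minimal POSIX sketch of the pattern (a hypothetical stand-alone helper, not the actual PostgreSQL code, which writes under pg_xlog with its own error reporting):

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int write_file_atomically(const char *path, const char *contents)
{
    char tmppath[1024];

    /* Write the full contents under a temporary name first, so a crash
     * can never leave a partial file under the final name. */
    snprintf(tmppath, sizeof(tmppath), "%s.tmp", path);
    int fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    size_t len = strlen(contents);
    if (write(fd, contents, len) != (ssize_t) len || fsync(fd) != 0)
    {
        close(fd);
        unlink(tmppath);
        return -1;
    }
    if (close(fd) != 0)
    {
        unlink(tmppath);
        return -1;
    }

    /* Only a complete, durable file is ever renamed into place. */
    return rename(tmppath, path);
}
```

A fully crash-safe version would also fsync the containing directory after the rename; the sketch omits that for brevity.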
Re: [HACKERS] Switching timeline over streaming replication
> On Friday, September 28, 2012 6:38 PM Amit Kapila wrote: > On Tuesday, September 25, 2012 6:29 PM Heikki Linnakangas wrote: > On 25.09.2012 10:08, Heikki Linnakangas wrote: > > On 24.09.2012 16:33, Amit Kapila wrote: > >> In any case, it will be better if you can split it into multiple > patches: > >> 1. Having new functionality of "Switching timeline over streaming > >> replication" > >> 2. Refactoring related changes. > Ok, here you go. xlog-c-split-1.patch contains the refactoring of existing code, with no user-visible changes. > streaming-tli-switch-2.patch applies over xlog-c-split-1.patch, and contains the new functionality. > Please find the initial review of the patch. Still more review is pending, > but I thought whatever is done I shall post Some more review: 11. In function readTimeLineHistory() ereport(DEBUG3, (errmsg_internal("history of timeline %u is %s", targetTLI, nodeToString(result; Don't nodeToString(result) needs to be changed as it contain list of structure TimeLineHistoryEntry 12. In function @@ -3768,6 +3773,8 @@ rescanLatestTimeLine(void) + * The current timeline must be found in the history file, and the + * next timeline must've forked off from it *after* the current + * recovery location. */ - if (!list_member_int(newExpectedTLIs, - (int) recoveryTargetTLI)) - ereport(LOG, - (errmsg("new timeline %u is not a child of database system timeline %u", - newtarget, - ThisTimeLineID))); is there any logic in the current patch which ensures that above check is not require now? 13. In function @@ -3768,6 +3773,8 @@ rescanLatestTimeLine(void) + found = false; + foreach (cell, newExpectedTLIs) .. .. - list_free(expectedTLIs); + list_free_deep(expectedTLIs); whats the use of the found variable and freeing expectedTLIs in loop might cause problem. 14. 
In function @@ -3768,6 +3773,8 @@ rescanLatestTimeLine(void) newExpectedTLIs = readTimeLineHistory(newtarget); Shouldn't this variable be declared as newExpectedTLEs as the list returned by readTimeLineHistory contains target list entry 15. StartupXLOG /* Now we can determine the list of expected TLIs */ expectedTLIs = readTimeLineHistory(recoveryTargetTLI); Should expectedTLIs be changed to expectedTLEs as the list returned by readTimeLineHistory contains target list entry 16.@@ -5254,8 +5252,8 @@ StartupXLOG(void) writeTimeLineHistory(ThisTimeLineID, recoveryTargetTLI, - curFileTLI, endLogSegNo, reason); + curFileTLI, EndRecPtr, reason); if endLogSegNo is not used here, it needs to be removd from function declaration as well. 17.@@ -5254,8 +5252,8 @@ StartupXLOG(void) if (InArchiveRecovery) .. .. + + /* The history file can be archived immediately. */ + TLHistoryFileName(histfname, ThisTimeLineID); + XLogArchiveNotify(histfname); Shouldn't this be done archive_mode is on. Right now InArchiveRecovery is true even when we do recovery for standby 18. +static bool +WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, bool fetching_ckpt) { .. + if (XLByteLT(RecPtr, receivedUpto)) + havedata = true; + else + { + XLogRecPtr latestChunkStart; + + receivedUpto = GetWalRcvWriteRecPtr(&latestChunkStart, &receiveTLI); + if (XLByteLT(RecPtr, receivedUpto) && receiveTLI == curFileTLI) + { + havedata = true; + if (!XLByteLT(RecPtr, latestChunkStart)) + { + XLogReceiptTime = GetCurrentTimestamp(); + SetCurrentChunkStartTime(XLogReceiptTime); + } + } + else + havedata = false; + } In the above logic, it seems there is inconsistency in setting havedata = true; In the above code in else loop, let us say cond. receiveTLI == curFileTLI is false but XLByteLT(RecPtr, receivedUpto) is true, then in next round in for loop, the check if (XLByteLT(RecPtr, receivedUpto)) will get true and will set havedata = true; which seems to be contradictory. 19. 
+static bool +WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, bool fetching_ckpt) { .. + if (PrimaryConnInfo) + { + XLogRecPtr ptr = fetching_ckpt ? RedoStartLSN : RecPtr; + TimeLineID tli = timelineOfPointInHistory(ptr, expectedTLIs); + + if (tli < curFileTLI) I think in some cases if (tli < curFileTLI) might not make sense, as for case where curFileTLI =0 for randAccess. 20. Function name WaitForWALToBecomeAvailable suggests that it waits for WAL, but it also returns true when trigger file is present, which can be little misleading. 21. @@ -2411,27 +2411,6 @@ reaper(SIGNAL_ARGS) a. won't it impact stop of online basebackup functionality because earlier on promotion I think this code will stop walsenders and basebackup stop will also give error in such cases. 22. @@ -63,10 +66,17 @@ void _PG_init(void) { /* Tell walreceiver how to reach us */ - if (walrcv_connect != NULL || walrcv_receive != NULL || - walrcv_send != NULL || walrcv_disconnect != NULL) + if (walrcv_connect != NULL || walrcv_identify_system || + walrcv_readtimelinehistoryfile != NULL || check for walrcv
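Review point 18 above can be made concrete with a toy model. In this sketch (hypothetical and heavily simplified: no latestChunkStart bookkeeping, plain integer comparison standing in for XLByteLT), the cached receivedUpto is only allowed to satisfy a request when the data was received on the timeline of the file currently being read, which is the consistency the comment is asking for:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
typedef uint32_t TimeLineID;

/* Data received beyond the requested point only counts as available if
 * it was streamed on the timeline of the file we are currently reading;
 * otherwise a cached receivedUpto from another timeline could make the
 * fast-path check contradict the refreshed one. */
static bool wal_available(XLogRecPtr rec_ptr,
                          XLogRecPtr received_upto,
                          TimeLineID receive_tli,
                          TimeLineID cur_file_tli)
{
    return rec_ptr < received_upto && receive_tli == cur_file_tli;
}
```

The inconsistency flagged in the review is precisely that the fast path tests only the position, while the refreshed path tests position and timeline.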
Re: [HACKERS] Switching timeline over streaming replication
> On Tuesday, September 25, 2012 6:29 PM Heikki Linnakangas wrote: > On 25.09.2012 10:08, Heikki Linnakangas wrote: > > On 24.09.2012 16:33, Amit Kapila wrote: > >> In any case, it will be better if you can split it into multiple > patches: > >> 1. Having new functionality of "Switching timeline over streaming > >> replication" > >> 2. Refactoring related changes. > Ok, here you go. xlog-c-split-1.patch contains the refactoring of existing code, with no user-visible changes. > streaming-tli-switch-2.patch applies over xlog-c-split-1.patch, and contains the new functionality. Please find the initial review of the patch. Still more review is pending, but I thought whatever is done I shall post Basic stuff: -- - Patch applies OK - Compiles cleanly with no warnings - Regression tests pass. - Documentation changes are mostly fine. - Basic replication tests works. Testing - Start primary server Start standby server Start cascade standby server Stopped the primary server Promoted the standby server with ./pg_ctl -D data_repl promote In postgresql.conf file archive_mode = off The following logs are observing in the cascade standby server. LOG: replication terminated by primary server DETAIL: End of WAL reached on timeline 1 LOG: walreceiver ended streaming and awaits new instructions LOG: record with zero length at 0/17E3888 LOG: re-handshaking at position 0/100 on tli 1 LOG: fetching timeline history file for timeline 2 from primary server LOG: replication terminated by primary server DETAIL: End of WAL reached on timeline 1 LOG: walreceiver ended streaming and awaits new instructions LOG: re-handshaking at position 0/100 on tli 1 LOG: replication terminated by primary server DETAIL: End of WAL reached on timeline 1 In postgresql.conf file archive_mode = on The following logs are observing in the cascade standby server. 
LOG: replication terminated by primary server DETAIL: End of WAL reached on timeline 1 LOG: walreceiver ended streaming and awaits new instructions sh: /home/amit/installation/bin/data_sub/pg_xlog/archive_status/0001 0002: No such file or directory LOG: record with zero length at 0/20144B8 sh: /home/amit/installation/bin/data_sub/pg_xlog/archive_status/0001 0002: No such file or directory LOG: re-handshaking at position 0/200 on tli 1 LOG: fetching timeline history file for timeline 2 from primary server LOG: replication terminated by primary server DETAIL: End of WAL reached on timeline 1 LOG: walreceiver ended streaming and awaits new instructions sh: /home/amit/installation/bin/data_sub/pg_xlog/archive_status/0001 0002: No such file or directory sh: /home/amit/installation/bin/data_sub/pg_xlog/archive_status/0001 0002: No such file or directory LOG: re-handshaking at position 0/200 on tli 1 LOG: replication terminated by primary server DETAIL: End of WAL reached on timeline 1 LOG: walreceiver ended streaming and awaits new instructions Verified that files are present in respective directories. Code Review 1. In function readTimeLineHistory(), two mechanisms are used to fetch timeline from history file +sscanf(fline, "%u\t%X/%X", &tli, &switchpoint_hi, &switchpoint_lo); + +/* expect a numeric timeline ID as first field of line */ +tli = (TimeLineID) strtoul(ptr, &endptr, 0); If we use new mechanism, it will not be able to detect error as it is doing in current case. 2. In function readTimeLineHistory(), +fd = AllocateFile(path, "r"); +if (fd == NULL) +{ +if (errno != ENOENT) +ereport(FATAL, +(errcode_for_file_access(), + errmsg("could not open file \"%s\": %m", path))); +/* Not there, so assume no parents */ +return list_make1_int((int) targetTLI); +} still return list_make1_int((int) targetTLI); is used. 3. Function timelineOfPointInHistory(), should return the timeline of recptr passed to it. a. 
is it okay to decide based on xlog recordpointer that which timeline it belongs to, as different timelines can have same xlog recordpointer? b. it seems from logic that it will return timeline previous to the timeline of recptr passed. For example if the timeline 3's switchpoint is equal to recptr passed then it will return timeline 2. 4. In writeTimeLineHistory function variable endTLI is never used. 5. In header of function writeTimeLineHistory(), can give explanation about XLogRecPtr switchpoint 6. @@ -6869,11 +5947,35 @@ StartupXLOG(void) */ if (InArchiveRecovery) { +charreason[200]; + +/* +
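Regarding review point 1 (two parsing mechanisms in readTimeLineHistory()): a sketch of parsing one data line of a history file with the quoted sscanf format, kept error-detecting by checking the conversion count rather than using a bare strtoul. This is an illustrative stand-alone function, not the patch's actual code:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TimeLineID;
typedef uint64_t XLogRecPtr;

/* Parse "tli<TAB>hi/lo" (e.g. "1\t0/3000088"); returns false on a
 * malformed line, so garbage is still detected. */
static bool parse_history_line(const char *line, TimeLineID *tli, XLogRecPtr *switchpoint)
{
    unsigned int t;
    uint32_t hi, lo;

    if (sscanf(line, "%u\t%X/%X", &t, &hi, &lo) != 3)
        return false;
    *tli = (TimeLineID) t;
    *switchpoint = ((XLogRecPtr) hi << 32) | lo;  /* reassemble the 64-bit LSN */
    return true;
}

/* Small helper so the behaviour is easy to assert on. */
static bool check_parse(const char *line, unsigned expect_tli, uint64_t expect_sp)
{
    TimeLineID tli;
    XLogRecPtr sp;
    return parse_history_line(line, &tli, &sp) && tli == expect_tli && sp == expect_sp;
}
```

The hi/lo split mirrors how the WAL position is printed as two 32-bit hex halves in the history file.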
Re: [HACKERS] Switching timeline over streaming replication
On 27-09-2012 01:30, Amit Kapila wrote: > I understood this point, but currently in documentation of Timelines, this > usecase is not documented (Section 24.3.5). Timeline documentation was written during PITR implementation. There wasn't SR yet. AFAICS it doesn't cite SR but is sufficiently generic (it uses the 'wal records' term to explain the feature). Feel free to reword those paragraphs mentioning SR. -- Euler Taveira de Oliveira - Timbira http://www.timbira.com.br/ PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento
Re: [HACKERS] Switching timeline over streaming replication
On 09/26/2012 01:02 AM, m...@rpzdesign.com wrote: John: Who has the money for oracle RAC or funding arrogant bastard Oracle CEO Ellison to purchase another island? Postgres needs CHEAP, easy to setup, self healing, master-master-master-master and it needs it yesterday. I was able to patch the 9.2.0 code base in 1 day and change my entire architecture strategy for replication into self healing async master-master-master and the tiniest bit of sharding code imaginable Tell us about the compromises you had to make. It is an established fact that you can either have it replicate fast and loose or slow and correct. In the fast and loose case you have to be ready to do a lot of mopping-up in case of conflicts. That is why I suggest something to replace OIDs with ROIDs for replication ID. (CREATE TABLE with ROIDS) I implement ROIDs as a uniform design pattern for the table structures. Synchronous replication maybe between 2 local machines if absolutely no local hardware failure is acceptable, but cheap, scaleable synchronous, Scaleable / synchronous is probably doable, if we are ready to take the initial performance hit of lock propagation. TRANSACTIONAL, master-master-master-master is a real tough slog. I could implement global locks in the external replication layer if I choose, but there are much easier ways in routing requests thru the load balancer and request sharding than trying to manage global locks across the WAN. Good luck with your HA patch for Postgres. Thanks for all of the responses! You guys are 15 times more active than the MySQL developer group, likely because they do not have a single db engine that meets all the requirements like PG. marco On 9/25/2012 5:10 PM, John R Pierce wrote: On 09/25/12 11:01 AM, m...@rpzdesign.com wrote: At some point, every master - slave replicator gets to the point where they need to start thinking about master-master replication. 
master-master and transactional integrity are mutually exclusive, except perhaps in special cases like Oracle RAC, where the masters share a coherent cache and implement global locks.
Re: [HACKERS] Switching timeline over streaming replication
> On Thursday, September 27, 2012 6:30 AM Josh Berkus wrote: > > Yes that is correct. I thought timeline change happens only when > somebody > > does PITR. > > Can you please tell me why we change timeline after promotion, > because the > > original > > Timeline concept was for PITR and I am not able to trace from code > the > > reason > > why on promotion it is required? > > The idea behind the timeline switch is to prevent a server from > subscribing to a master which is actually behind it. For example, > consider this sequence: > > 1. M1->async->S1 > 2. M1 is at xid 2001 and fails. > 3. S1 did not receive transaction 2001 and is at xid 2000 > 4. S1 is promoted. > 5. S1 processed a new, different transaction 2001 > 6. M1 is repaired and brought back up > 7. M1 is subscribed to S1 > 8. M1 is now corrupt. > > That's why we need the timeline switch. Thanks. I understood this point, but currently in documentation of Timelines, this usecase is not documented (Section 24.3.5). With Regards, Amit Kapila.
Re: [HACKERS] Switching timeline over streaming replication
Josh: The good part is you are the first person to ask for a copy and I will send you the hook code that I have and you can be a good sport and put it on GitHub, that is great, you can give us both credit for a joint effort, I do the code, you put it on GitHub. The not so good part is that the community has a bunch of other trigger work and other stuff going on, so there was not much interest in non-WAL replication hook code. I do not have time to debate implementation nor wait for release of 9.3 with my needs not met, so I will just keep patching the hook code into whatever release code base comes along. The bad news is that I have not implemented the logic of the external replication daemon. The other good and bad news is that you are free to receive the messages from the hook code thru the unix socket and implement replication any way you want and the bad news is that you are free to IMPLEMENT replication any way you want. I am going to implement master-master-master-master SELF HEALING replication, but that is just my preference. Should take about a week to get it operational and another week to see how it works in my geographically dispersed servers in the cloud. Send me a note if it is ok to send you a zip file with the source code files that I touched in the 9.2 code base so you can shove it up on GitHub. Cheers, marco On 9/26/2012 6:48 PM, Josh Berkus wrote: I was able to patch the 9.2.0 code base in 1 day and change my entire architecture strategy for replication into self healing async master-master-master and the tiniest bit of sharding code imaginable Sounds cool. Do you have a fork available on Github? I'll try it out.
Re: [HACKERS] Switching timeline over streaming replication
> Yes that is correct. I thought timeline change happens only when somebody > does PITR. > Can you please tell me why we change timeline after promotion, because the > original > Timeline concept was for PITR and I am not able to trace from code the > reason > why on promotion it is required? The idea behind the timeline switch is to prevent a server from subscribing to a master which is actually behind it. For example, consider this sequence: 1. M1->async->S1 2. M1 is at xid 2001 and fails. 3. S1 did not receive transaction 2001 and is at xid 2000 4. S1 is promoted. 5. S1 processed a new, different transaction 2001 6. M1 is repaired and brought back up 7. M1 is subscribed to S1 8. M1 is now corrupt. That's why we need the timeline switch. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
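The sequence above is exactly what a timeline history makes detectable. A toy model (hypothetical structures, reusing the example's xid numbers as WAL positions purely for illustration): S1's history records that timeline 1 ended at position 2000, so M1, still on timeline 1 but already at 2001, must be rejected, while a standby that stopped at 1500 may follow.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TimeLineID;
typedef uint64_t XLogRecPtr;

/* One ancestor timeline in a server's history: timeline 'tli' was
 * forked away from at position 'end'. */
typedef struct { TimeLineID tli; XLogRecPtr end; } HistoryEntry;

/* A server at (my_tli, my_pos) may follow a master with this history
 * only if its timeline is an ancestor of the master's and its position
 * does not lie beyond the fork point. */
static bool can_follow(TimeLineID my_tli, XLogRecPtr my_pos,
                       const HistoryEntry *history, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (history[i].tli == my_tli)
            return my_pos <= history[i].end;
    return false;
}

/* S1's history after promotion: timeline 1 ended at 2000. */
static const HistoryEntry s1_history[] = { { 1, 2000 } };
```

Without distinct timelines, M1's divergent record at 2001 would be indistinguishable from S1's, and step 8 (silent corruption) could not be prevented.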
Re: [HACKERS] Switching timeline over streaming replication
> I was able to patch the 9.2.0 code base in 1 day and change my entire > architecture strategy for replication > into self healing async master-master-master and the tiniest bit of > sharding code imaginable Sounds cool. Do you have a fork available on Github? I'll try it out. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] Switching timeline over streaming replication
John: Who has the money for oracle RAC or funding arrogant bastard Oracle CEO Ellison to purchase another island? Postgres needs CHEAP, easy to setup, self healing, master-master-master-master and it needs it yesterday. I was able to patch the 9.2.0 code base in 1 day and change my entire architecture strategy for replication into self healing async master-master-master and the tiniest bit of sharding code imaginable. That is why I suggest something to replace OIDs with ROIDs for replication ID. (CREATE TABLE with ROIDS) I implement ROIDs as a uniform design pattern for the table structures. Synchronous replication may be fine between 2 local machines if absolutely no local hardware failure is acceptable, but cheap, scalable, synchronous, TRANSACTIONAL, master-master-master-master is a real tough slog. I could implement global locks in the external replication layer if I choose, but there are much easier ways in routing requests thru the load balancer and request sharding than trying to manage global locks across the WAN. Good luck with your HA patch for Postgres. Thanks for all of the responses! You guys are 15 times more active than the MySQL developer group, likely because they do not have a single db engine that meets all the requirements like PG. marco On 9/25/2012 5:10 PM, John R Pierce wrote: On 09/25/12 11:01 AM, m...@rpzdesign.com wrote: At some point, every master - slave replicator gets to the point where they need to start thinking about master-master replication. master-master and transactional integrity are mutually exclusive, except perhaps in special cases like Oracle RAC, where the masters share a coherent cache and implement global locks.
Re: [HACKERS] Switching timeline over streaming replication
On 09/25/12 11:01 AM, m...@rpzdesign.com wrote: At some point, every master - slave replicator gets to the point where they need to start thinking about master-master replication. master-master and transactional integrity are mutually exclusive, except perhaps in special cases like Oracle RAC, where the masters share a coherent cache and implement global locks. -- john r pierce N 37, W 122 santa cruz ca mid-left coast
Re: [HACKERS] Switching timeline over streaming replication
On Tue, Sep 25, 2012 at 11:01 AM, m...@rpzdesign.com wrote: > Amit: > > At some point, every master - slave replicator gets to the point where they > need > to start thinking about master-master replication. Even in a master-master system, the ability to cleanly swap leaders managing a member of the master-master cluster is very useful. This patch can make writing HA software for Postgres a lot less ridiculous. > Instead of getting stuck in the weeds to finally realize that master-master > is the ONLY way > to go, many developers do not start out planning for master - master, but > they should, out of habit. > > You can save yourself a lot of grief just be starting with master-master > architecture. I've seen more projects get stuck spinning their wheels on the one Master-Master system to rule them all than succeed and move on. It doesn't help that master-master does not have a single definition, and different properties are possible with different logical models, too, so that pervades its way up to the language layer. As-is, managing single-master HA Postgres is a huge pain without this patch. If there is work to be done on master-master, the logical replication and event trigger work are probably more relevant, and I know the authors of those projects are keen to make it more feasible to experiment. -- fdr
Re: [HACKERS] Switching timeline over streaming replication
Amit:

At some point, every master - slave replicator gets to the point where
they need to start thinking about master-master replication.

Instead of getting stuck in the weeds to finally realize that
master-master is the ONLY way to go, many developers do not start out
planning for master - master, but they should, out of habit.

You can save yourself a lot of grief just by starting with master-master
architecture. But you don't have to USE it; you can just not send WRITE
traffic to the servers that you do not want to WRITE to, but all of them
should be WRITE servers. That way, the only timeline you ever need is your
decision to send WRITE traffic to them, and there is nothing that prevents
you from running MASTER - MASTER all the time and skipping the whole slave
thing entirely.

At this point, I think synchronous replication is only for immediate local
replication needs, and async for all the master - master stuff.

cheers,

marco

On 9/24/2012 9:44 PM, Amit Kapila wrote:
> On Monday, September 24, 2012 9:08 PM m...@rpzdesign.com wrote:
>> What a disaster waiting to happen. Maybe the only replication should be
>> master-master replication, so there is no need to sequence timelines or
>> anything: all servers are ready masters, no backups or failovers.
>> If you really do not want a master serving, then it should only be
>> handled in the routing of traffic to that server and not in the
>> replication logic itself. The only thing that ever came about from
>> failovers was the failure to turn over. The above is opinion only.
>
> This feature is for users who want to use master-standby configurations.
>
> What do you mean by: "then it should only be handled in the routing of
> traffic to that server and not the replication logic itself."
> Do you have any idea other than the proposed implementation, or do you
> see any problem in the currently proposed solution?
On 9/24/2012 7:33 AM, Amit Kapila wrote:
> On Tuesday, September 11, 2012 10:53 PM Heikki Linnakangas wrote:
>> I've been working on the often-requested feature to handle timeline
>> changes over streaming replication. At the moment, if you kill the
>> master and promote a standby server, and you have another standby
>> server that you'd like to keep following the new master server, you
>> need a WAL archive in addition to streaming replication to make it
>> cross the timeline change. Streaming replication will just error out.
>> Having a WAL archive is usually a good idea in complex replication
>> scenarios anyway, but it would be good to not require it.
>
> Confirm my understanding of this feature:
>
> This feature is for the case when standby-1, which is going to be
> promoted to master, has archive_mode 'on', as in that case only its
> timeline will change.
>
> If the above is right, then there can be other similar scenarios where
> it can be used:
>
> Scenario-1 (1 Master, 1 Stand-by)
> 1. Master (archive_mode=on) goes down.
> 2. Master again comes up.
> 3. Stand-by tries to follow it.
>
> Now in the above scenario also, due to timeline mismatch it gives an
> error, but your patch should fix it.
>
>> Some parts of this patch are just refactoring that probably make sense
>> regardless of the new functionality. For example, I split off the
>> timeline history file related functions to a new file, timeline.c.
>> That's not very much code, but it's fairly isolated, and xlog.c is
>> massive, so I feel that anything that we can move off from xlog.c is a
>> good thing. I also moved the two functions RestoreArchivedFile() and
>> ExecuteRecoveryCommand() to a separate file. Those are also not much
>> code, but are fairly isolated. If no-one objects to those changes, and
>> the general direction this work is going in, I'm going to split off
>> those refactorings to separate patches and commit them separately.
>>
>> I also made the timeline history file a bit more detailed: instead of
>> recording just the WAL segment where the timeline was changed, it now
>> records the exact XLogRecPtr. That was required for the walsender to
>> know the switchpoint, without having to parse the XLOG records (it
>> reads and parses the history file, instead).
>
> IMO separating the timeline history file related functions to a new
> file is good. However, I am not sure about splitting
> RestoreArchivedFile() and ExecuteRecoveryCommand() into a separate
> file. How about splitting out all archive related functions:
>
> static void XLogArchiveNotify(const char *xlog);
> static void XLogArchiveNotifySeg(XLogSegNo segno);
> static bool XLogArchiveCheckDone(const char *xlog);
> static bool XLogArchiveIsBusy(const char *xlog);
> static void XLogArchiveCleanup(const char *xlog);
> ..
> ..
>
> In any case, it will be better if you can split it into multiple
> patches:
> 1. Having the new functionality of "Switching timeline over streaming
>    replication".
> 2. Refactoring related changes.
>
> It can make my testing and review of the new feature patch a little
> easier.
>
> With Regards,
> Amit Kapila.
Re: [HACKERS] Switching timeline over streaming replication
On Tuesday, September 25, 2012 6:29 PM Heikki Linnakangas wrote:
> On 25.09.2012 10:08, Heikki Linnakangas wrote:
>> On 24.09.2012 16:33, Amit Kapila wrote:
>>> In any case, it will be better if you can split it into multiple
>>> patches:
>>> 1. Having new functionality of "Switching timeline over streaming
>>> replication"
>>> 2. Refactoring related changes.
>>>
>>> It can make my testing and review for new feature patch little easier.
>>
>> Yep, I'll go ahead and split the patch. Thanks!
>
> Ok, here you go. xlog-c-split-1.patch contains the refactoring of
> existing code, with no user-visible changes.
> streaming-tli-switch-2.patch applies over xlog-c-split-1.patch, and
> contains the new functionality.

Thanks, it will make my review easier than before.

With Regards,
Amit Kapila.
Re: [HACKERS] Switching timeline over streaming replication
On 25.09.2012 14:10, Amit Kapila wrote:
> On Tuesday, September 25, 2012 12:39 PM Heikki Linnakangas wrote:
>> On 24.09.2012 16:33, Amit Kapila wrote:
>>> Confirm my understanding of this feature:
>>>
>>> This feature is for the case when standby-1, which is going to be
>>> promoted to master, has archive_mode 'on'.
>>
>> No. This is for the case where there is no WAL archive.
>> archive_mode='off' on all servers.
>>
>> Or to be precise, you can also have a WAL archive, but this patch
>> doesn't affect that in any way. This is strictly about streaming
>> replication.
>>
>>> As in that case only its timeline will change.
>>
>> The timeline changes whenever you promote a standby. It's not related
>> to whether you have a WAL archive or not.
>
> Yes, that is correct. I thought a timeline change happens only when
> somebody does PITR. Can you please tell me why we change the timeline
> after promotion? The original timeline concept was for PITR, and I am
> not able to trace from the code the reason why it is required on
> promotion.

Bumping the timeline helps to avoid confusion if, for example, the master
crashes and the standby isn't fully in sync with it. In that situation,
there are some WAL records in the master that are not in the standby, so
promoting the standby is effectively the same as doing PITR. If you
promote the standby, and later try to turn the old master into a standby
server that connects to the new master, things will go wrong. Assigning
the new master a new timeline ID helps the system and the administrator to
notice that.

It's not bulletproof; for example, you can easily avoid the timeline
change if you just remove recovery.conf and restart the server. But the
timelines help to manage such situations.

>>> If the above is right, then there can be other similar scenarios where
>>> it can be used:
>>>
>>> Scenario-1 (1 Master, 1 Stand-by)
>>> 1. Master (archive_mode=on) goes down.
>>> 2. Master again comes up.
>>> 3. Stand-by tries to follow it.
>>>
>>> Now in the above scenario also, due to timeline mismatch it gives an
>>> error, but your patch should fix it.
>>
>> If the master simply crashes or is shut down, and then restarted, the
>> timeline doesn't change. The standby will reconnect / poll the archive,
>> and sync up just fine, even without this patch.
>
> How about when the master does PITR when it comes up again?

Then the timeline will be bumped and this patch will be helpful. Assuming
the standby is behind the point in time that the master was recovered to,
it will be able to follow the master to the new timeline.

- Heikki
Re: [HACKERS] Switching timeline over streaming replication
On Tuesday, September 25, 2012 12:39 PM Heikki Linnakangas wrote:
> On 24.09.2012 16:33, Amit Kapila wrote:
>> On Tuesday, September 11, 2012 10:53 PM Heikki Linnakangas wrote:
>>> I've been working on the often-requested feature to handle timeline
>>> changes over streaming replication. At the moment, if you kill the
>>> master and promote a standby server, and you have another standby
>>> server that you'd like to keep following the new master server, you
>>> need a WAL archive in addition to streaming replication to make it
>>> cross the timeline change. Streaming replication will just error out.
>>> Having a WAL archive is usually a good idea in complex replication
>>> scenarios anyway, but it would be good to not require it.
>>
>> Confirm my understanding of this feature:
>>
>> This feature is for the case when standby-1, which is going to be
>> promoted to master, has archive_mode 'on'.
>
> No. This is for the case where there is no WAL archive.
> archive_mode='off' on all servers.
>
> Or to be precise, you can also have a WAL archive, but this patch
> doesn't affect that in any way. This is strictly about streaming
> replication.
>
>> As in that case only its timeline will change.
>
> The timeline changes whenever you promote a standby. It's not related
> to whether you have a WAL archive or not.

Yes, that is correct. I thought a timeline change happens only when
somebody does PITR. Can you please tell me why we change the timeline
after promotion? The original timeline concept was for PITR, and I am not
able to trace from the code the reason why it is required on promotion.

>> If the above is right, then there can be other similar scenarios where
>> it can be used:
>>
>> Scenario-1 (1 Master, 1 Stand-by)
>> 1. Master (archive_mode=on) goes down.
>> 2. Master again comes up.
>> 3. Stand-by tries to follow it.
>>
>> Now in the above scenario also, due to timeline mismatch it gives an
>> error, but your patch should fix it.
>
> If the master simply crashes or is shut down, and then restarted, the
> timeline doesn't change. The standby will reconnect / poll the archive,
> and sync up just fine, even without this patch.

How about when the master does PITR when it comes up again?

With Regards,
Amit Kapila.
Re: [HACKERS] Switching timeline over streaming replication
On 24.09.2012 16:33, Amit Kapila wrote:
> On Tuesday, September 11, 2012 10:53 PM Heikki Linnakangas wrote:
>> I've been working on the often-requested feature to handle timeline
>> changes over streaming replication. At the moment, if you kill the
>> master and promote a standby server, and you have another standby
>> server that you'd like to keep following the new master server, you
>> need a WAL archive in addition to streaming replication to make it
>> cross the timeline change. Streaming replication will just error out.
>> Having a WAL archive is usually a good idea in complex replication
>> scenarios anyway, but it would be good to not require it.
>
> Confirm my understanding of this feature:
>
> This feature is for the case when standby-1, which is going to be
> promoted to master, has archive_mode 'on'.

No. This is for the case where there is no WAL archive. archive_mode='off'
on all servers.

Or to be precise, you can also have a WAL archive, but this patch doesn't
affect that in any way. This is strictly about streaming replication.

> As in that case only its timeline will change.

The timeline changes whenever you promote a standby. It's not related to
whether you have a WAL archive or not.

> If the above is right, then there can be other similar scenarios where
> it can be used:
>
> Scenario-1 (1 Master, 1 Stand-by)
> 1. Master (archive_mode=on) goes down.
> 2. Master again comes up.
> 3. Stand-by tries to follow it.
>
> Now in the above scenario also, due to timeline mismatch it gives an
> error, but your patch should fix it.

If the master simply crashes or is shut down, and then restarted, the
timeline doesn't change. The standby will reconnect / poll the archive,
and sync up just fine, even without this patch.

> However, I am not sure about splitting RestoreArchivedFile() and
> ExecuteRecoveryCommand() into a separate file.
> How about splitting out all archive related functions:
>
> static void XLogArchiveNotify(const char *xlog);
> static void XLogArchiveNotifySeg(XLogSegNo segno);
> static bool XLogArchiveCheckDone(const char *xlog);
> static bool XLogArchiveIsBusy(const char *xlog);
> static void XLogArchiveCleanup(const char *xlog);

Hmm, sounds reasonable.

> In any case, it will be better if you can split it into multiple
> patches:
> 1. Having new functionality of "Switching timeline over streaming
> replication"
> 2. Refactoring related changes.
>
> It can make my testing and review for new feature patch little easier.

Yep, I'll go ahead and split the patch. Thanks!

- Heikki
Re: [HACKERS] Switching timeline over streaming replication
On Monday, September 24, 2012 9:08 PM m...@rpzdesign.com wrote:
> What a disaster waiting to happen. Maybe the only replication should be
> master-master replication, so there is no need to sequence timelines or
> anything: all servers are ready masters, no backups or failovers.
> If you really do not want a master serving, then it should only be
> handled in the routing of traffic to that server and not in the
> replication logic itself. The only thing that ever came about from
> failovers was the failure to turn over. The above is opinion only.

This feature is for users who want to use master-standby configurations.

What do you mean by: "then it should only be handled in the routing of
traffic to that server and not the replication logic itself."
Do you have any idea other than the proposed implementation, or do you see
any problem in the currently proposed solution?

> On 9/24/2012 7:33 AM, Amit Kapila wrote:
>> On Tuesday, September 11, 2012 10:53 PM Heikki Linnakangas wrote:
>>> I've been working on the often-requested feature to handle timeline
>>> changes over streaming replication. At the moment, if you kill the
>>> master and promote a standby server, and you have another standby
>>> server that you'd like to keep following the new master server, you
>>> need a WAL archive in addition to streaming replication to make it
>>> cross the timeline change. Streaming replication will just error out.
>>> Having a WAL archive is usually a good idea in complex replication
>>> scenarios anyway, but it would be good to not require it.
>>
>> Confirm my understanding of this feature:
>>
>> This feature is for the case when standby-1, which is going to be
>> promoted to master, has archive_mode 'on', as in that case only its
>> timeline will change.
>>
>> If the above is right, then there can be other similar scenarios where
>> it can be used:
>>
>> Scenario-1 (1 Master, 1 Stand-by)
>> 1. Master (archive_mode=on) goes down.
>> 2. Master again comes up.
>> 3. Stand-by tries to follow it.
>>
>> Now in the above scenario also, due to timeline mismatch it gives an
>> error, but your patch should fix it.
>>
>>> Some parts of this patch are just refactoring that probably make
>>> sense regardless of the new functionality. For example, I split off
>>> the timeline history file related functions to a new file,
>>> timeline.c. That's not very much code, but it's fairly isolated, and
>>> xlog.c is massive, so I feel that anything that we can move off from
>>> xlog.c is a good thing. I also moved the two functions
>>> RestoreArchivedFile() and ExecuteRecoveryCommand() to a separate
>>> file. Those are also not much code, but are fairly isolated. If
>>> no-one objects to those changes, and the general direction this work
>>> is going in, I'm going to split off those refactorings to separate
>>> patches and commit them separately.
>>>
>>> I also made the timeline history file a bit more detailed: instead of
>>> recording just the WAL segment where the timeline was changed, it now
>>> records the exact XLogRecPtr. That was required for the walsender to
>>> know the switchpoint, without having to parse the XLOG records (it
>>> reads and parses the history file, instead).
>>
>> IMO separating the timeline history file related functions to a new
>> file is good. However, I am not sure about splitting
>> RestoreArchivedFile() and ExecuteRecoveryCommand() into a separate
>> file. How about splitting out all archive related functions:
>> static void XLogArchiveNotify(const char *xlog);
>> static void XLogArchiveNotifySeg(XLogSegNo segno);
>> static bool XLogArchiveCheckDone(const char *xlog);
>> static bool XLogArchiveIsBusy(const char *xlog);
>> static void XLogArchiveCleanup(const char *xlog);
>> ..
>> ..
>>
>> In any case, it will be better if you can split it into multiple
>> patches:
>> 1. Having new functionality of "Switching timeline over streaming
>> replication"
>> 2. Refactoring related changes.
>>
>> It can make my testing and review for new feature patch little easier.
>>
>> With Regards,
>> Amit Kapila.
Re: [HACKERS] Switching timeline over streaming replication
On Tuesday, September 11, 2012 10:53 PM Heikki Linnakangas wrote:
> I've been working on the often-requested feature to handle timeline
> changes over streaming replication. At the moment, if you kill the
> master and promote a standby server, and you have another standby
> server that you'd like to keep following the new master server, you
> need a WAL archive in addition to streaming replication to make it
> cross the timeline change. Streaming replication will just error out.
> Having a WAL archive is usually a good idea in complex replication
> scenarios anyway, but it would be good to not require it.

Confirm my understanding of this feature:

This feature is for the case when standby-1, which is going to be promoted
to master, has archive_mode 'on', as in that case only its timeline will
change.

If the above is right, then there can be other similar scenarios where it
can be used:

Scenario-1 (1 Master, 1 Stand-by)
1. Master (archive_mode=on) goes down.
2. Master again comes up.
3. Stand-by tries to follow it.

Now in the above scenario also, due to timeline mismatch it gives an
error, but your patch should fix it.

> Some parts of this patch are just refactoring that probably make sense
> regardless of the new functionality. For example, I split off the
> timeline history file related functions to a new file, timeline.c.
> That's not very much code, but it's fairly isolated, and xlog.c is
> massive, so I feel that anything that we can move off from xlog.c is a
> good thing. I also moved the two functions RestoreArchivedFile() and
> ExecuteRecoveryCommand() to a separate file. Those are also not much
> code, but are fairly isolated. If no-one objects to those changes, and
> the general direction this work is going in, I'm going to split off
> those refactorings to separate patches and commit them separately.
>
> I also made the timeline history file a bit more detailed: instead of
> recording just the WAL segment where the timeline was changed, it now
> records the exact XLogRecPtr. That was required for the walsender to
> know the switchpoint, without having to parse the XLOG records (it
> reads and parses the history file, instead).

IMO separating the timeline history file related functions to a new file
is good. However, I am not sure about splitting RestoreArchivedFile() and
ExecuteRecoveryCommand() into a separate file.
How about splitting out all archive related functions:

static void XLogArchiveNotify(const char *xlog);
static void XLogArchiveNotifySeg(XLogSegNo segno);
static bool XLogArchiveCheckDone(const char *xlog);
static bool XLogArchiveIsBusy(const char *xlog);
static void XLogArchiveCleanup(const char *xlog);
..
..

In any case, it will be better if you can split it into multiple patches:
1. Having the new functionality of "Switching timeline over streaming
   replication".
2. Refactoring related changes.

It can make my testing and review of the new feature patch a little
easier.

With Regards,
Amit Kapila.