Re: [HACKERS] Point in Time Recovery
> > I was wondering about this point - might it not be just as reasonable for the copied file to *be* an exact image of pg_control? Then a very simple variant of pg_controldata (or maybe even just adding switches to pg_controldata itself) would enable the relevant info to be extracted.
>
> We didn't do that so admins could easily read the file contents.

If you use a readable file you will also need a feature for restore (or a tool) to create an appropriate pg_control file. Or are you intending to still require that pg_control be the first file backed up?

Another possibility would be that the start function writes the readable file and also copies pg_control.

Andreas

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] Point in Time Recovery
Zeugswetter Andreas SB SD wrote:
> > > I was wondering about this point - might it not be just as reasonable for the copied file to *be* an exact image of pg_control? Then a very simple variant of pg_controldata (or maybe even just adding switches to pg_controldata itself) would enable the relevant info to be extracted.
> >
> > We didn't do that so admins could easily read the file contents.
>
> If you use a readable file you will also need a feature for restore (or a tool) to create an appropriate pg_control file, or are you intending to still require that pg_control be the first file backed up.
>
> Another possibility would be that the start function writes the readable file and also copies pg_control.

We will back up pg_control in the tar file, but it doesn't have to contain all the correct information. The WAL replay will set it properly, I think. In fact it has to start recovery checkpoint settings, not the backup setting at all.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  [EMAIL PROTECTED]                    |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Re: [HACKERS] Point in Time Recovery
Zeugswetter Andreas SB SD [EMAIL PROTECTED] writes:
> If you use a readable file you will also need a feature for restore (or a tool) to create an appropriate pg_control file, or are you intending to still require that pg_control be the first file backed up.

No, the entire point of this exercise is to get rid of that assumption. You do need *a* copy of pg_control, but the only reason you'd need to back it up first rather than later is so that its checkpoint pointer points to the last checkpoint before the dump starts. That is the information we want to put in the archive-label file instead.

If a copy of pg_control were sufficient then I'd be all for using it as the archive-label file, but it's *not* sufficient because you also need the ending WAL offset. So we need a different file layout in any case, and we may as well take some pity on the poor DBA and make the file easily human-readable.

regards, tom lane
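To make the "easily human-readable" point concrete, here is a minimal sketch of reading such an archive-label file. The thread never fixes a layout, so the field names and the sample positions below are invented for illustration; only the requirement that the file carry both the starting checkpoint position and the ending WAL offset comes from the message above.

```python
# Hypothetical label contents. The field names below are invented for this
# sketch; the thread only asks for plain text holding both WAL positions.
SAMPLE_LABEL = """\
START WAL LOCATION: 0/93B032B9
STOP WAL LOCATION: 0/93C00010
"""


def parse_label(text):
    """Parse 'KEY: value' lines into a dict, skipping blank lines."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def wal_location_to_int(location):
    """Turn 'HI/LO' hexadecimal notation into one comparable integer."""
    high, low = location.split("/")
    return (int(high, 16) << 32) | int(low, 16)


label = parse_label(SAMPLE_LABEL)
start = wal_location_to_int(label["START WAL LOCATION"])
stop = wal_location_to_int(label["STOP WAL LOCATION"])
# A label is only plausible if the backup ended at or after where it began.
assert start <= stop
print(label)
```

Because the file is plain text, the same information is also available with nothing more than cat, which is the property argued for above.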
Re: [HACKERS] Point in Time Recovery
Ok - that is a much better way of doing it!

regards

Mark

Tom Lane wrote:
> Zeugswetter Andreas SB SD [EMAIL PROTECTED] writes:
> > If you use a readable file you will also need a feature for restore (or a tool) to create an appropriate pg_control file, or are you intending to still require that pg_control be the first file backed up.
>
> No, the entire point of this exercise is to get rid of that assumption. You do need *a* copy of pg_control, but the only reason you'd need to back it up first rather than later is so that its checkpoint pointer points to the last checkpoint before the dump starts. That is the information we want to put in the archive-label file instead.
>
> If a copy of pg_control were sufficient then I'd be all for using it as the archive-label file, but it's *not* sufficient because you also need the ending WAL offset. So we need a different file layout in any case, and we may as well take some pity on the poor DBA and make the file easily human-readable.
>
> regards, tom lane
Re: [HACKERS] Point in Time Recovery
I was wondering about this point - might it not be just as reasonable for the copied file to *be* an exact image of pg_control? Then a very simple variant of pg_controldata (or maybe even just adding switches to pg_controldata itself) would enable the relevant info to be extracted.

P.S.: I would love to be that volunteer; however, I am up to the eyeballs in Business Objects (cringe) and DB2 for the next week or so.

regards

Mark

Bruce Momjian wrote:
> We need someone to code two backend functions to complete PITR.

[snippage]

> However, once you decide to do things like that, there is no reason why the copied file has to be an exact image of pg_control. I claim it would be more useful if the copied file were plain text so that you could just cat it to find out the starting WAL position; that would let you determine without any special tools what range of WAL archive files you are going to need to bring back from your archives.
Re: [HACKERS] Point in Time Recovery
Mark Kirkwood wrote:
> I was wondering about this point - might it not be just as reasonable for the copied file to *be* an exact image of pg_control? Then a very simple variant of pg_controldata (or maybe even just adding switches to pg_controldata itself) would enable the relevant info to be extracted.

We didn't do that so admins could easily read the file contents.

--
  Bruce Momjian
Re: [HACKERS] Point in Time Recovery
Quoting Bruce Momjian [EMAIL PROTECTED]:
> Mark Kirkwood wrote:
> > I was wondering about this point - might it not be just as reasonable for the copied file to *be* an exact image of pg_control? Then a very simple variant of pg_controldata (or maybe even just adding switches to pg_controldata itself) would enable the relevant info to be extracted.
>
> We didn't do that so admins could easily read the file contents.

Ease of reading is a good thing, no argument there. However, using 'pg_controldata' (or similar) to perform the read is not really that much harder than using 'cat' (it is a wee bit harder, I grant you).

When I posted the original mail I was thinking that the pg_control image is good because it has much more information than just the last WAL offset, and could be used to perform a recovery in the event of the actual pg_control being unsuitable (e.g. backed up last instead of first on a busy system). Of course this thinking didn't make it into the original mail, sorry about that!

regards

Mark
Re: [HACKERS] Point in Time Recovery
We need someone to code two backend functions to complete PITR. The functions would be called at start/stop of a backup of the data directory. The functions would be checked during restore to make sure the requested xid is not between the start/stop xids of the backup. They would also contain timestamps so the admin can easily review the archive directory.

The start function needs to issue a checkpoint and create a file in the data directory that contains a few server parameters. At backup stop the function needs to move the file to pg_xlog and set the *.ready archive flag so it is archived.

As for checking during recovery, the file needs to be retrieved and checked to see that the recovery xid is valid. Tom and I can help you with that detail.

Don't worry about all the details of the email below. It is just a general summary. We can give you details once you volunteer.

---

Bruce Momjian wrote:
> OK, I think I have some solid ideas and reasons for them.
>
> First, I think we need server-side functions to call when we start/stop the backup. The advantage of these server-side functions is that they will do the required work of recording the pg_control values and creating needed files with little chance for user error. It also allows us to change the internal operations in later releases without requiring admins to change their procedures. We are even able to adjust the internal operation in minor releases without forcing a new procedure on users.
>
> Second, I think once we start a restore, we should rename recovery.conf to recovery.in_progress, and when complete rename that to recovery.done. If the postmaster starts and sees recovery.in_progress, it will fail to start knowing its recovery was interrupted. This allows the admin to take appropriate action. (I am not sure what that action would be. Does he bring back the backup files or just keep going?)
>
> Third, I think we need to put a file in the archive location once we complete a backup, recording the start/stop xid and wal/offsets. This gives the admin documentation on what archive logs to keep and what xids are available for recovery. Ideally the recover program would read that file and check the recover xid to make sure it is after the stop xid recorded in the file.
>
> How would the recover program know the name of that file? We need to create it in /data with start contents before the backup, then complete it with end contents and archive it.
>
> What should we name it? Ideally it would be named by the WAL name/offset of the start so it orders in the proper spot in the archive file listing, e.g.:
>
>     093a
>     093b
>     093b.032b9.start
>     093c
>
> Are people going to know they need 093b for 093b.032b9.start? I hope so. Another idea is to do:
>
>     093a.xlog
>     093b.032b9.start
>     093b.xlog
>     093c.xlog
>
> This would order properly. It might be a very good idea to add extensions to these log files now that we are archiving them in strange places. In fact, maybe we should use *.pg_xlog to document the directory they came from.
>
> ---
>
> Simon Riggs wrote:
> > On Fri, 2004-07-16 at 16:47, Tom Lane wrote:
> > > As far as the business about copying pg_control first goes: there is another way to think about it, which is to copy pg_control to another place that will be included in your backup. For example the standard backup procedure could be
> > >
> > > 1. [somewhat optional] Issue CHECKPOINT and wait till it finishes.
> > > 2. cp $PGDATA/global/pg_control $PGDATA/pg_control.dump
> > > 3. tar cf /dev/mt $PGDATA
> > > 4. do something to record ending WAL position
> > >
> > > If we standardized on this way, then the tar archive would automatically contain the pre-backup checkpoint position in ./pg_control.dump, and there is no need for any special assumptions about the order in which tar processes things.
> >
> > Sounds good. That would be familiar to Oracle DBAs doing BACKUP CONTROLFILE. We can document that and offer it as a suggested procedure.
> >
> > > However, once you decide to do things like that, there is no reason why the copied file has to be an exact image of pg_control. I claim it would be more useful if the copied file were plain text so that you could just cat it to find out the starting WAL position; that would let you determine without any special tools what range of WAL archive files you are going to need to bring back from your archives.
> >
> > I wouldn't be in favour of a manual mechanism. If you want an automated mechanism, what's wrong with using the one that's already there? You can use pg_controldata to read the controlfile, again what's wrong with that?
> >
> > We agreed some time back that an
Re: [HACKERS] Point in Time Recovery
Oh, here is something else we need to add --- a GUC to control whether pg_xlog is clean on recovery start. --- Tom Lane wrote: Bruce and I had another phone chat about the problems that can ensue if you restore a tar backup that contains old (incompletely filled) versions of WAL segment files. While the current code will ignore them during the recovery-from-archive run, leaving them laying around seems awfully dangerous. One nasty possibility is that the archiving mechanism will pick up these files and overwrite good copies in the archive area with the obsolete ones from the backup :-(. Bruce earlier proposed that we simply rm pg_xlog/* at the start of a recovery-from-archive run, but as I said I'm scared to death of code that does such a thing automatically. In particular this would make it impossible to handle scenarios where you want to do a PITR recovery but you need to use some recent WAL segments that didn't make it into your archive yet. (Maybe you could get around this by forcibly transferring such segments into the archive, but that seems like a bad idea for incomplete segments.) It would really be best for the DBA to make sure that the starting condition for the recovery run does not have any obsolete segment files in pg_xlog. He could do this either by setting up his backup policy so that pg_xlog isn't included in the tar backup in the first place, or by manually removing the included files just after restoring the backup, before he tries to start the recovery run. Of course the objection to that is what if the DBA forgets to do it? The idea that we came to on the phone was for the postmaster, when it enters recovery mode because a recovery.conf file exists, to look in pg_xlog for existing segment files and refuse to start if any are there --- *unless* the user has put a special, non-default overriding flag into recovery.conf. Call it use_unarchived_files or something like that. 
We'd have to provide good documentation and an extensive HINT of course, but basically the DBA would have two choices when he gets this refusal to start:

1. Remove all the segment files in pg_xlog. (This would be the right thing to do if he knows they all came off the backup.)

2. Verify that pg_xlog contains only segment files that are newer than what's stored in the WAL archive, and then set the override flag in recovery.conf. In this case the DBA is taking responsibility for leaving only segment files that are good to use.

One interesting point is that with such a policy, we could use locally available WAL segments in preference to pulling the same segments from archive, which would be at least marginally more efficient, and seems logically cleaner anyway.

In particular it seems that this would be a useful arrangement in cases where you have questionable WAL segments --- you're not sure if they're good or not. Rather than having to push questionable data into your WAL archive, you can leave it local, try a recovery run, and see if you like the resulting state. If not, it's a lot easier to do-over when you have not corrupted your archive area.

Comments? Better ideas?

regards, tom lane

--
  Bruce Momjian
Re: [ADMIN] [HACKERS] Point in Time Recovery
[ Sorry, sent to hackers now.] Here is another open PITR issue that I think will have to be addressed in 7.6. If you do a critical transaction, but do nothing else for eight hours, that critical transaction hasn't been archived yet. It is still sitting in pg_xlog until the WAL file fills. I think we will need to document this behavior and address it in some way in 7.6. We can't assume that we can send multiple copies of pg_xlog to the archive (partial and full ones) because we might be going to a tape drive. However, this is a non-intuitive behavior of our archiver. We might need to tell people to copy the most recent WAL file every minute to some other location or something. --- Tom Lane wrote: Bruce Momjian [EMAIL PROTECTED] writes: I think we should push the partially complete WAL file to the archive location before shutdown. ... When you are running and finally fill up the WAL file it would then overwrite the one in the archive but I think that is OK. I don't think this can fly at all. Here are some off-the-top-of-the-head objections: 1. We don't have the luxury of spending indefinite amounts of time to do a database shutdown. Commonly we are under a twenty-second sentence of death from init. I don't want to spend the 20 seconds waiting to see if the archiver will manage to push 16MB onto a slow tape drive. Also, if the archiver does fail to push the data in time, it'll likely leave a broken (partial) xlog file in the archive, which would be really bad news if the user then relies on that. 2. What if the archiver process entirely fails to push the file? (Maybe there's not enough disk space, for instance.) In normal operation we'll just retry every so often. We definitely can't do that during shutdown. 3. You're blithely assuming that the archival process can easily provide overwrite semantics for multiple pushes of the same xlog filename. Stop thinking about cp to some directory and start thinking dump to tape or burn onto CD or something like that. 
We'll be raising the ante considerably if we require the archive_command to deal with this.

I think the last one is really the most significant issue. We have to keep the archiver API as simple as possible.

regards, tom lane

--
  Bruce Momjian
Re: [PATCHES] [HACKERS] Point in Time Recovery
On 26 Jul, To: [EMAIL PROTECTED] wrote:

Sorry I wasn't clearer. I think I have a better idea about what's going on now. With the archiving enabled, it looks like the database is able to complete 1 transaction per database connection, but doesn't complete any subsequent transactions. I'm not sure how to see what's going on. Perhaps I should try a newer snapshot from CVS while I'm at it?

I tried to do an strace on the postmaster (and child processes) to see if that might show something, but when the postmaster starts the database isn't accepting any connections. I have the feeling it's not really starting up. Trying to shut it down seems to agree with that.

My wild guess is that the database is sitting waiting for something when a stored function is called, but I'm not sure how to verify that.

Mark
Re: [HACKERS] Point in Time Recovery
> > Hang on, are you supposed to MOVE or COPY away WAL segments?
>
> Copy. pg will delete them once they are archived.

Copy. pg will recycle them once they are archived.

Andreas
Re: [HACKERS] Point in Time Recovery
Simon Riggs wrote:
> On Sat, 2004-07-17 at 00:57, Bruce Momjian wrote:
> > OK, I think I have some solid ideas and reasons for them.
>
> Sorry for taking so long to reply...
>
> > First, I think we need server-side functions to call when we start/stop the backup. The advantage of these server-side functions is that they will do the required work of recording the pg_control values and creating needed files with little chance for user error. It also allows us to change the internal operations in later releases without requiring admins to change their procedures. We are even able to adjust the internal operation in minor releases without forcing a new procedure on users.
>
> Yes, I think we should go down this route. There's a but, and that is that we don't absolutely need it for correctness... and so I must decline adding it to THIS release. I don't imagine I'll stop being associated with this code for a while yet.
>
> Can we recommend that users should expect to have to call a start and end backup routine in later releases? Don't expect you'll agree to that...

I guess my big question is: if we don't do this for 7.5, how will people doing restores know if the xid they specify is valid for the backup they used? If we recover to the most recent time, is there any check that will tell them their backup is invalid because there are no archive records that span the time of their backup?

--
  Bruce Momjian
Re: [HACKERS] Point in Time Recovery
On Sat, 2004-07-17 at 00:57, Bruce Momjian wrote:
> OK, I think I have some solid ideas and reasons for them.

Sorry for taking so long to reply...

> First, I think we need server-side functions to call when we start/stop the backup. The advantage of these server-side functions is that they will do the required work of recording the pg_control values and creating needed files with little chance for user error. It also allows us to change the internal operations in later releases without requiring admins to change their procedures. We are even able to adjust the internal operation in minor releases without forcing a new procedure on users.

Yes, I think we should go down this route. There's a but, and that is that we don't absolutely need it for correctness... and so I must decline adding it to THIS release. I don't imagine I'll stop being associated with this code for a while yet.

Can we recommend that users should expect to have to call a start and end backup routine in later releases? Don't expect you'll agree to that...

> Second, I think once we start a restore, we should rename recovery.conf to recovery.in_progress, and when complete rename that to recovery.done. If the postmaster starts and sees recovery.in_progress, it will fail to start knowing its recovery was interrupted. This allows the admin to take appropriate action. (I am not sure what that action would be. Does he bring back the backup files or just keep going?)

Superseded by Tom's actions. Two states are required: start and stop. Recovery isn't going to be checkpoint-restartable anytime soon, IMHO.

Best regards, Simon Riggs
Re: [HACKERS] Point in Time Recovery
Bruce and I had another phone chat about the problems that can ensue if you restore a tar backup that contains old (incompletely filled) versions of WAL segment files. While the current code will ignore them during the recovery-from-archive run, leaving them laying around seems awfully dangerous. One nasty possibility is that the archiving mechanism will pick up these files and overwrite good copies in the archive area with the obsolete ones from the backup :-(.

Bruce earlier proposed that we simply rm pg_xlog/* at the start of a recovery-from-archive run, but as I said I'm scared to death of code that does such a thing automatically. In particular this would make it impossible to handle scenarios where you want to do a PITR recovery but you need to use some recent WAL segments that didn't make it into your archive yet. (Maybe you could get around this by forcibly transferring such segments into the archive, but that seems like a bad idea for incomplete segments.)

It would really be best for the DBA to make sure that the starting condition for the recovery run does not have any obsolete segment files in pg_xlog. He could do this either by setting up his backup policy so that pg_xlog isn't included in the tar backup in the first place, or by manually removing the included files just after restoring the backup, before he tries to start the recovery run.

Of course the objection to that is what if the DBA forgets to do it?

The idea that we came to on the phone was for the postmaster, when it enters recovery mode because a recovery.conf file exists, to look in pg_xlog for existing segment files and refuse to start if any are there --- *unless* the user has put a special, non-default overriding flag into recovery.conf. Call it use_unarchived_files or something like that. We'd have to provide good documentation and an extensive HINT of course, but basically the DBA would have two choices when he gets this refusal to start:

1. Remove all the segment files in pg_xlog. (This would be the right thing to do if he knows they all came off the backup.)

2. Verify that pg_xlog contains only segment files that are newer than what's stored in the WAL archive, and then set the override flag in recovery.conf. In this case the DBA is taking responsibility for leaving only segment files that are good to use.

One interesting point is that with such a policy, we could use locally available WAL segments in preference to pulling the same segments from archive, which would be at least marginally more efficient, and seems logically cleaner anyway.

In particular it seems that this would be a useful arrangement in cases where you have questionable WAL segments --- you're not sure if they're good or not. Rather than having to push questionable data into your WAL archive, you can leave it local, try a recovery run, and see if you like the resulting state. If not, it's a lot easier to do-over when you have not corrupted your archive area.

Comments? Better ideas?

regards, tom lane
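The refuse-to-start policy described in the message above can be sketched roughly as follows. This is only an illustration of the decision logic, not postmaster code: the function name is invented, use_unarchived_files is the placeholder name floated in the message, and the 24-hex-character segment naming is an assumption of the sketch.

```python
import os
import re
import tempfile

# Assumption: segment files are named with 24 hexadecimal characters.
SEGMENT_NAME = re.compile(r"^[0-9A-Fa-f]{24}$")


def check_recovery_start(pgdata, use_unarchived_files=False):
    """Return (allowed, message) for the proposed startup policy."""
    if not os.path.exists(os.path.join(pgdata, "recovery.conf")):
        return True, "no recovery.conf, normal startup"
    xlog_dir = os.path.join(pgdata, "pg_xlog")
    names = os.listdir(xlog_dir) if os.path.isdir(xlog_dir) else []
    segments = sorted(n for n in names if SEGMENT_NAME.match(n))
    if not segments:
        return True, "pg_xlog is clean, recovering from archive"
    if use_unarchived_files:
        # The DBA has taken responsibility for the local segments.
        return True, "override set, using %d local segment(s)" % len(segments)
    return False, "refusing to start: pg_xlog contains " + ", ".join(segments)


with tempfile.TemporaryDirectory() as pgdata:
    os.mkdir(os.path.join(pgdata, "pg_xlog"))
    open(os.path.join(pgdata, "recovery.conf"), "w").close()
    # Build a sample segment name from its three 8-digit parts.
    stale_name = "00000001" + "00000000" + "0000093B"
    stale = os.path.join(pgdata, "pg_xlog", stale_name)
    open(stale, "w").close()
    refused = check_recovery_start(pgdata)
    overridden = check_recovery_start(pgdata, use_unarchived_files=True)
    os.remove(stale)  # choice 1 from the message: clean out pg_xlog
    clean = check_recovery_start(pgdata)

print(refused[0], overridden[0], clean[0])
```

The sketch deliberately never deletes anything on its own; cleaning pg_xlog stays a manual step, which matches the concern about code that runs rm pg_xlog/* automatically.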
Re: [HACKERS] Point in Time Recovery
I've got a PITR set up here that's happily scp'ing WAL files across to another machine. However, the NIC in the machine is currently stuffed, so it gets like 50k/s :) What happens in general if you are always generating WAL file bytes faster than they can be copied off?

Also, does the archive dir just basically keep filling up forever? How do I know when I can prune some files? Anything older than the last full backup?

Chris
Re: [HACKERS] Point in Time Recovery
Christopher Kings-Lynne [EMAIL PROTECTED] writes:
> I've got a PITR set up here that's happily scp'ing WAL files across to another machine. However, the NIC in the machine is currently stuffed, so it gets like 50k/s :) What happens in general if you are generating WAL file bytes faster always than they can be copied off?

If you keep falling further and further behind, eventually your pg_xlog directory will fill the space available on its disk, and I think at that point PG will panic and shut down because it can't create any more xlog segments.

> Also, does the archive dir just basically keep filling up forever? How do I know when I can prune some files? Anything older than the last full backup?

Anything older than the starting checkpoint of the last full backup that you might want to restore to. We need to adjust the backup procedure so that the starting segment number for a backup is more readily visible; see recent discussions about logging that explicitly in some fashion.

regards, tom lane
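As a rough illustration of that pruning rule, the sketch below lists the archived segments that sort before the starting segment of the oldest backup you still want to restore from. It assumes fixed-width, upper-case hexadecimal segment names on a single timeline, so that plain string comparison matches WAL order; the helper names and sample segment numbers are invented for the example.

```python
def segment_name(number, timeline=1, log=0):
    """Build a fixed-width segment file name (assumed naming scheme)."""
    return "%08X%08X%08X" % (timeline, log, number)


def prunable_segments(archived_names, backup_start_segment):
    """Return archived segments strictly older than the backup's start."""
    return sorted(n for n in archived_names if n < backup_start_segment)


archive = [segment_name(n) for n in (0x939, 0x93A, 0x93B, 0x93C)]
# Starting segment of the oldest backup that must remain restorable.
backup_start = segment_name(0x93B)
old = prunable_segments(archive, backup_start)
print(old)
```

The starting segment itself is kept, since recovery has to begin replaying from the checkpoint recorded at backup start.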
Re: [HACKERS] Point in Time Recovery
> If you keep falling further and further behind, eventually your pg_xlog directory will fill the space available on its disk, and I think at that point PG will panic and shut down because it can't create any more xlog segments.

Hang on, are you supposed to MOVE or COPY away WAL segments?

Chris
Re: [HACKERS] Point in Time Recovery
Christopher Kings-Lynne [EMAIL PROTECTED] writes:
> > If you keep falling further and further behind, eventually your pg_xlog directory will fill the space available on its disk, and I think at that point PG will panic and shut down because it can't create any more xlog segments.
>
> Hang on, are you supposed to MOVE or COPY away WAL segments?

COPY. The checkpoint code will then delete or recycle the segment file, as appropriate.

regards, tom lane
Re: [HACKERS] Point in Time Recovery
Christopher Kings-Lynne wrote:
> > If you keep falling further and further behind, eventually your pg_xlog directory will fill the space available on its disk, and I think at that point PG will panic and shut down because it can't create any more xlog segments.
>
> Hang on, are you supposed to MOVE or COPY away WAL segments?

Copy. pg will delete them once they are archived.

--
  Bruce Momjian
Re: [HACKERS] Point in Time Recovery
> > Hang on, are you supposed to MOVE or COPY away WAL segments?
>
> COPY. The checkpoint code will then delete or recycle the segment file, as appropriate.

So what happens if you just move it? Postgres breaks?

Chris
Re: [HACKERS] Point in Time Recovery
Christopher Kings-Lynne [EMAIL PROTECTED] writes:
> > > Hang on, are you supposed to MOVE or COPY away WAL segments?
> >
> > COPY. The checkpoint code will then delete or recycle the segment file, as appropriate.
>
> So what happens if you just move it? Postgres breaks?

I don't think so, but it seems like a much less robust way to do things. What happens if you have a failure partway through? For instance, the archive machine dies and loses recent data right after you've rm'd the source file. The recommended COPY procedure at least provides some breathing room between when you install the data on the archive and when the original file is removed.

It's not like you save any effort by using a MOVE anyway. You're not going to have the archive on the same machine as the database (or if you are, you ain't gonna be *my* DBA ...)

regards, tom lane
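A copy-style archiving step along the lines recommended above might look like the sketch below. It is an illustration only: the function name, the ".partial" suffix, and the placeholder file names are invented, and the choice to refuse overwriting an existing archive file is a design decision of this sketch rather than something the thread settles. The key property is that the source segment is never removed by the archiving step.

```python
import os
import shutil
import tempfile


def archive_segment(source_path, archive_dir):
    """Copy one segment into archive_dir; the source is left untouched."""
    destination = os.path.join(archive_dir, os.path.basename(source_path))
    if os.path.exists(destination):
        raise FileExistsError("already archived: " + destination)
    partial = destination + ".partial"
    with open(source_path, "rb") as src, open(partial, "wb") as dst:
        shutil.copyfileobj(src, dst)
        dst.flush()
        os.fsync(dst.fileno())  # make sure the bytes are on disk
    os.rename(partial, destination)  # publish only a complete copy
    return destination


with tempfile.TemporaryDirectory() as work:
    xlog_dir = os.path.join(work, "pg_xlog")
    archive_dir = os.path.join(work, "archive")
    os.mkdir(xlog_dir)
    os.mkdir(archive_dir)
    segment = os.path.join(xlog_dir, "segment_093b")  # placeholder name
    with open(segment, "wb") as f:
        f.write(b"wal bytes")
    copied = archive_segment(segment, archive_dir)
    source_survives = os.path.exists(segment)
    with open(copied, "rb") as f:
        archived_bytes = f.read()
    try:
        archive_segment(segment, archive_dir)
        overwrite_refused = False
    except FileExistsError:
        overwrite_refused = True

print(source_survives, overwrite_refused)
```

Writing to a temporary name and renaming afterwards means a crash mid-copy leaves a ".partial" file rather than a truncated segment under the real name, which is the kind of breathing room the COPY procedure is meant to provide.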
Re: [HACKERS] Point in Time Recovery
> I don't think so, but it seems like a much less robust way to do things. What happens if you have a failure partway through? For instance archive machine dies and loses recent data right after you've rm'd the source file. The recommended COPY procedure at least provides some breathing room between when you install the data on the archive and when the original file is removed.

Well, I tried it in 'cross your fingers' mode and it works, at least:

    archive_command = 'rm %p'

:)

Chris
Re: [HACKERS] Point in Time Recovery
On Fri, 2004-07-16 at 04:49, Tom Lane wrote:
> Simon Riggs [EMAIL PROTECTED] writes:
> > On Fri, 2004-07-16 at 00:01, Alvaro Herrera wrote:
> > > My manpage for signal(2) says that you shouldn't assign SIG_IGN to SIGCHLD, according to POSIX.
> >
> > So - I should be setting this to SIG_DFL and that's good for everyone?
>
> Yeah, we learned the same lesson in the backend not too many releases back. SIG_IGN'ing SIGCHLD is bad voodoo; it'll work on some platforms but not others.

Many thanks all,

Best Regards, Simon Riggs
Re: [HACKERS] Point in Time Recovery
> > I'm aiming for the minimum feature set - which means we do need to take care over whether that set is insufficient and also to pull any part that doesn't stand up to close scrutiny over the next few days.
>
> As you can see, we are still chewing on NT. What PITR features are missing? I assume because we can't stop file system writes during backup that we will need a backup parameter file like I described. Is there anything else that PITR needs?

No, we don't need to stop writes! Not even to split a mirror; other databases need that to be able to restore, but we don't. We only need to tell people to back up pg_control first. The rest was only intended to enforce

1. that pg_control is the first file backed up
2. that the DBA uses a large enough PIT (or xid) for restore

I think the idea of an extra file with the WAL start position was overly complicated, since all you need is pg_control (plus the WAL end position to enforce 2.).

If we don't want to tell people to back up pg_control first, imho the next best plan would be to add a WAL start input parameter (e.g. xlog name) to recovery.conf that fixes pg_control.

Andreas
Re: [HACKERS] Point in Time Recovery
Zeugswetter Andreas SB SD [EMAIL PROTECTED] writes: We only need to tell people to backup pg_control first. The rest was only intended to enforce 1. that pg_control is the first file backed up 2. the dba uses a large enough PIT (or xid) for restore Right, but I think Bruce's point is that it is far too easy to get those things wrong; especially point 2 for which a straight tar dump will simply not contain the information you need to determine what is a safe stopping point. I agree with Bruce that we should have some mechanism that doesn't rely on the DBA to get this right. Exactly what the mechanism should be is certainly open for discussion... regards, tom lane
Re: [HACKERS] Point in Time Recovery
Zeugswetter Andreas SB SD [EMAIL PROTECTED] writes: We only need to tell people to backup pg_control first. The rest was only intended to enforce 1. that pg_control is the first file backed up 2. the dba uses a large enough PIT (or xid) for restore Right, but I think Bruce's point is that it is far too easy to get those things wrong; especially point 2 for which a straight tar dump will simply not contain the information you need to determine what is a safe stopping point. I agree with Bruce that we should have some mechanism that doesn't rely on the DBA to get this right. Exactly what the mechanism should be is certainly open for discussion... Right. I am wondering what process people would use to back up pg_control first? If they do:
    tar -cf $TAPE ./global/pg_control .
they will get two copies of pg_control, the early one, and one as part of the directory scan. On restore, they would restore the early one, but the directory scan would overwrite it. I suppose they could do:
    cp global/pg_control global/pg_control.backup
    tar -cf $TAPE .
then on restore once all the files are restored move the pg_control.backup to its original name. That gives us the checkpoint wal/offset but how do we get the start/stop information. Is that not required? Maybe we should just have start/stop server-side functions that create a file in the archive directory describing the start/stop counters and time and the admin would then have to find those values. Why are the start/stop wal/offset values needed anyway? I know why we need the checkpoint value. Do we need a checkpoint after the archiving starts but before the backup begins? Also, when you are in recovery mode, how do you get out of recovery mode, meaning if you have a power failure, how do you prevent the system from doing another recovery? Do you remove the recovery.conf file?
-- Bruce Momjian
Re: [HACKERS] Point in Time Recovery
Bruce Momjian [EMAIL PROTECTED] writes: Also, when you are in recovery mode, how do you get out of recovery mode, meaning if you have a power failure, how do you prevent the system from doing another recovery? Do you remove the recovery.conf file? I do not care for the idea of a recovery.conf file at all, and have been intending to look to see what we'd need to do to not have one. I find it hard to believe that there is anything one would put in it that is really persistent state. The above concern shows why it shouldn't be treated as a persistent configuration file. regards, tom lane
Re: [HACKERS] Point in Time Recovery
then on restore once all the files are restored move the pg_control.backup to its original name. That gives us the checkpoint wal/offset but how do we get the start/stop information. Is that not required? The checkpoint wal/offset is in pg_control, that is sufficient start information. The stop info is only necessary as a safeguard. Do we need a checkpoint after the archiving starts but before the backup begins? No. Also, when you are in recovery mode, how do you get out of recovery mode, meaning if you have a power failure, how do you prevent the system from doing another recovery? Do you remove the recovery.conf file? pg_control could be updated during rollforward (only if that actually does a checkpoint). So if pg_control is also the recovery start info, then we can continue from there if we have a power failure. For the first release it would imho also be ok to simply start over if you lose power. I think the filename 'recovery.conf' is misleading, since it is not a static configuration file, but a command file for one recovery. How about 'recovery.command' then 'recovery.inprogress', and on recovery completion it should be renamed to 'recovery.done' Andreas
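The file-name states proposed in this message can be sketched as follows. This is only an illustration assuming a POSIX shell; the helper names recovery_state, begin_recovery and finish_recovery are hypothetical, and nothing here reflects how the patch itself behaves.

```shell
# Hypothetical illustration of the proposed file-name states.
# Prints which state a data directory is in, judged only by file names.
recovery_state() {
    dir=$1
    if [ -e "$dir/recovery.inprogress" ]; then
        echo "interrupted"   # an earlier recovery run did not finish
    elif [ -e "$dir/recovery.command" ]; then
        echo "requested"     # the admin has asked for one archive recovery
    else
        echo "normal"        # nothing requested, or only recovery.done exists
    fi
}

# Mark the start and the end of one recovery run by renaming the file.
begin_recovery()  { mv "$1/recovery.command" "$1/recovery.inprogress"; }
finish_recovery() { mv "$1/recovery.inprogress" "$1/recovery.done"; }
```

Because the command file is renamed rather than left in place, a restart after completion finds no pending request, and a restart after a power failure can tell that a run was cut short.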
Re: [HACKERS] Point in Time Recovery
Zeugswetter Andreas SB SD [EMAIL PROTECTED] writes: Do we need a checkpoint after the archiving starts but before the backup begins? No. Actually yes. You have to start at a checkpoint record when replaying the log, so if no checkpoint occurred between starting to archive WAL and starting the tar backup, you have a useless backup. It would be reasonable to issue a CHECKPOINT just before starting the backup as part of the standard operating procedure for taking PITR dumps. We need not require this, but it would help to avoid this particular sort of mistake; and of course it might save a little bit of replay effort if the backup is ever used. As far as the business about copying pg_control first goes: there is another way to think about it, which is to copy pg_control to another place that will be included in your backup. For example the standard backup procedure could be
    1. [somewhat optional] Issue CHECKPOINT and wait till it finishes.
    2. cp $PGDATA/global/pg_control $PGDATA/pg_control.dump
    3. tar cf /dev/mt $PGDATA
    4. do something to record ending WAL position
If we standardized on this way, then the tar archive would automatically contain the pre-backup checkpoint position in ./pg_control.dump, and there is no need for any special assumptions about the order in which tar processes things. However, once you decide to do things like that, there is no reason why the copied file has to be an exact image of pg_control. I claim it would be more useful if the copied file were plain text so that you could just cat it to find out the starting WAL position; that would let you determine without any special tools what range of WAL archive files you are going to need to bring back from your archives. This is pretty much the same chain of reasoning that Bruce and I went through yesterday to come up with the idea of putting a label file inside the tar backups.
We concluded that it'd be worth putting both the backup starting time and the checkpoint WAL position into the label file --- the starting time isn't needed for restore but might be really helpful as documentation, if you needed to verify which dump file was which. regards, tom lane
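The four-step procedure and the plain-text label idea from this message might look roughly like the sketch below. It assumes a POSIX shell and tar; the function name base_backup, the file name backup_label.txt and the label layout are all hypothetical. The checkpoint location is passed in as an argument (for instance after reading it from pg_controldata output by hand), since the thread leaves open how it would be obtained.

```shell
# Hypothetical sketch of the backup procedure with a plain-text label.
# Usage: base_backup PGDATA TARFILE CHECKPOINT_LOCATION
base_backup() {
    pgdata=$1
    tarfile=$2           # must live outside PGDATA
    checkpoint_loc=$3    # assumed to be supplied by the caller

    # Step 1 (somewhat optional) would be issuing CHECKPOINT; omitted here.

    # Step 2: keep a pre-backup image of pg_control inside the data directory.
    cp "$pgdata/global/pg_control" "$pgdata/pg_control.dump" || return 1

    # Plain-text label, so the starting WAL position can be read with cat.
    {
        echo "backup start time: $(date -u '+%Y-%m-%d %H:%M:%S UTC')"
        echo "checkpoint location: $checkpoint_loc"
    } > "$pgdata/backup_label.txt" || return 1

    # Step 3: archive the whole data directory; member order does not matter.
    tar -cf "$tarfile" -C "$pgdata" . || return 1

    # Step 4 (recording the ending WAL position) is left open, as in the text.
}
```

After a restore, `cat backup_label.txt` in the restored directory would show where replay has to start, without any special tool.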
Re: [HACKERS] Point in Time Recovery
Do we need a checkpoint after the archiving starts but before the backup begins? No. Actually yes. Sorry, I failed to connect 'archiving' with the backed up xlogs :-( So yes, you need one checkpoint after archiving starts. Imho turning on xlog archiving should issue such a checkpoint just to be sure. Andreas
Re: [HACKERS] Point in Time Recovery
On Fri, 2004-07-16 at 16:58, Zeugswetter Andreas SB SD wrote: Do we need a checkpoint after the archiving starts but before the backup begins? No. Actually yes. Sorry, I did incorrectly not connect 'archiving' with the backed up xlogs :-( So yes, you need one checkpoint after archiving starts. Imho turning on xlog archiving should issue such a checkpoint just to be sure. By agreement, archive_mode can only be turned on at postmaster startup, which means you always have a checkpoint - either because you shut it down cleanly, or you didn't and it recovers, then writes one. There is always something to start the rollforward. So, non-issue. Best regards, Simon Riggs
Re: [HACKERS] Point in Time Recovery
On Fri, 2004-07-16 at 15:27, Bruce Momjian wrote: Also, when you are in recovery mode, how do you get out of recovery mode, meaning if you have a power failure, how do you prevent the system from doing another recovery? Do you remove the recovery.conf file? That was the whole point of the recovery.conf file: it prevents you from re-entering recovery accidentally, as would occur if the parameters were set in the normal .conf file. Best Regards, Simon Riggs
Re: [HACKERS] Point in Time Recovery
On Fri, 2004-07-16 at 16:25, Zeugswetter Andreas SB SD wrote: I think the filename 'recovery.conf' is misleading, since it is not a static configuration file, but a command file for one recovery. How about 'recovery.command' then 'recovery.inprogress', and on recovery completion it should be renamed to 'recovery.done' You understand this and your assessment is correct. recovery.conf isn't an attempt to persist information. It is a means of delivering a set of parameters to the recovery process, as well as signalling overall that archive recovery is required (because the system default remains the same, which is to recover from the logs it has locally available to it). I originally offered a design which used a command, similar to DB2/Oracle...that was overruled as too complex. The (whatever you call it) file is just a very simple way of specifying what's required. There is more to be said here...clearly some explanations are required and I will provide those later... Best regards, Simon Riggs
Re: [HACKERS] Point in Time Recovery
Simon Riggs wrote: On Fri, 2004-07-16 at 16:58, Zeugswetter Andreas SB SD wrote: Do we need a checkpoint after the archiving starts but before the backup begins? No. Actually yes. Sorry, I did incorrectly not connect 'archiving' with the backed up xlogs :-( So yes, you need one checkpoint after archiving starts. Imho turning on xlog archiving should issue such a checkpoint just to be sure. By agreement, archive_mode can only be turned on at postmaster startup, which means you always have a checkpoint - either because you shut it down cleanly, or you didn't and it recovers, then writes one. There is always something to start the rollforward. So, non-issue. I don't think so. I can imagine many cases where you want to do a nightly tar backup without turning archiving on/off or restarting the postmaster. In those cases, a manual checkpoint would have to be issued before the backup begins. Imagine a system that is up for a month, and they don't have enough archive space to keep a month's worth of WAL files. They would probably do nightly or weekend tar backups, and then discard the WAL archives. What procedure would they use? I assume they would copy all their old WAL files to a save directory, issue a checkpoint, do a tar backup, then they can delete the saved WAL files. Is that correct? -- Bruce Momjian
Re: [HACKERS] Point in Time Recovery
On Fri, 2004-07-16 at 19:30, Bruce Momjian wrote: Simon Riggs wrote: On Fri, 2004-07-16 at 16:58, Zeugswetter Andreas SB SD wrote: Do we need a checkpoint after the archiving starts but before the backup begins? No. Actually yes. Sorry, I did incorrectly not connect 'archiving' with the backed up xlogs :-( So yes, you need one checkpoint after archiving starts. Imho turning on xlog archiving should issue such a checkpoint just to be sure. By agreement, archive_mode can only be turned on at postmaster startup, which means you always have a checkpoint - either because you shut it down cleanly, or you didn't and it recovers, then writes one. There is always something to start the rollforward. So, non-issue. I was discussing the claim that there might not be a checkpoint to begin the rollforward from. There always is: if you are in archive_mode=true then you will always have a checkpoint that can be used for recovery. It may be a long way in the past, if there has been no write activity, but the rollforward will be very, very quick, since there will be no log records. I don't think so. I can imagine many cases where you want to do a nightly tar backup without turning archiving on/off or restarting the postmaster. This is a misunderstanding. I strongly agree with what you say: the whole system has been designed to avoid any benefit from turning on/off archiving and there is no requirement to restart postmaster to take backups. In those cases, a manual checkpoint would have to be issued before the backup begins. A manual checkpoint doesn't HAVE TO be issued. Presumably most systems will be running checkpoint every few minutes. Wherever the last one was is where the rollforward would start from. But you can if that's the way you want to do things, just wait long enough for the checkpoint to have completed, otherwise your objective of reducing rollforward time will not be met.
(please note my earlier reported rollback performance of approximately x10 rate of recovery v elapsed time - will require testing on your own systems). Imagine a system that is up for a month, and they don't have enough archive space to keep a months worth of WAL files. They would probably do nightly or weekend tar backups, and then discard the WAL archives. Yes, that would be normal practice. I would recommend keeping at least the last 3 full backups and all of the WAL logs to cover that period. What procedure would they use? I assume they would copy all their old WAL files to a save directory, issue a checkpoint, do a tar backup, then they can delete the saved WAL files. Is that correct? PITR is designed to interface with a wide range of systems, through the extensible archive/recovery program interface. We shouldn't focus on just tar backups - if you do, then the whole thing seems less feature-rich. The current design allows interfacing with tape, remote backup, internet backup providers, automated standby servers and the dozen major storage/archive vendors' solutions. Writing a procedure to backup, assign filenames, keep track of stuff isn't too difficult if you're a competent DBA with a mild knowledge of shell or perl scripting. But if data is important, people will want to invest the time and trouble to adopt one of the open source or licenced vendors that provide solutions in this area. Systems management is a discipline and procedures should be in place for everything. I fully agree with the automate everything dictum, but just don't want to constrain people too much to a particular way of doing things. -o-o- Overall, for first release, I think the complexity of this design is acceptable. PITR is similar to Oracle7 Backup/Recovery, and easily recognisable to any DBA with current experience of current SQL Server, DB2 (MVS, UDB) or Teradata systems. 
[I can't comment much on Ingres, Informix, Sybase etc] My main areas of concern are:
- the formal correctness of the recovery process. As a result of this concern, PITR makes ZERO alterations to the recovery code itself. The trick is to feed it the right xlog files and to stop, if required, at the right place and allow normal work to resume.
- the robustness and quality of my implementation. This requires quality checking of the code and full beta testing.
-o-o-
We've raised a couple of valid points on the lists in the last few days:
- it's probably a desirable feature (but not essential) to implement a write suspend feature on the bgwriter, if nothing else it will be a confidence building feature...as said previously, for many people, this will not be required, but people will no doubt keep asking
- there is a small window of risk around the possibility that a recovery target might be set by the user that doesn't rollforward all the way past the end of the backup. That is real, but in general, people aren't likely to be performing archive recovery within minutes of a backup being taken
Re: [HACKERS] Point in Time Recovery
On Fri, 2004-07-16 at 16:47, Tom Lane wrote: As far as the business about copying pg_control first goes: there is another way to think about it, which is to copy pg_control to another place that will be included in your backup. For example the standard backup procedure could be 1. [somewhat optional] Issue CHECKPOINT and wait till it finishes. 2. cp $PGDATA/global/pg_control $PGDATA/pg_control.dump 3. tar cf /dev/mt $PGDATA 4. do something to record ending WAL position If we standardized on this way, then the tar archive would automatically contain the pre-backup checkpoint position in ./pg_control.dump, and there is no need for any special assumptions about the order in which tar processes things. Sounds good. That would be familiar to Oracle DBAs doing BACKUP CONTROLFILE. We can document that and offer it as a suggested procedure. However, once you decide to do things like that, there is no reason why the copied file has to be an exact image of pg_control. I claim it would be more useful if the copied file were plain text so that you could just cat it to find out the starting WAL position; that would let you determine without any special tools what range of WAL archive files you are going to need to bring back from your archives. I wouldn't be in favour of a manual mechanism. If you want an automated mechanism, what's wrong with using the one that's already there? You can use pg_controldata to read the controlfile, again what's wrong with that? We agreed some time back that an off-line xlog file inspector would be required to allow us to inspect the logs and make a decision about where to end recovery. You'd still need that. It's scary enough having to specify the end point, let alone having to specify the starting point as well. At your request, and with Bruce's idea, I designed and built the recovery system so that you don't need to know what range of xlogs to bring back.
You just run it, it brings back the right files from archive and does recovery with them, then cleans up - and it works without running out of disk space on long recoveries. I've built it now and it works... This is pretty much the same chain of reasoning that Bruce and I went through yesterday to come up with the idea of putting a label file inside the tar backups. We concluded that it'd be worth putting both the backup starting time and the checkpoint WAL position into the label file --- the starting time isn't needed for restore but might be really helpful as documentation, if you needed to verify which dump file was which. ...if you are doing tar backups...what will you do if you're not using that mechanism? If you are: It's common practice to make up a backup filename from elements such as systemname, databasename, date and time etc. That gives you the start time, the file last mod date gives you the end time. I think it's perfectly fine for everybody to do backups any way they please. There are many licenced variants of PostgreSQL and it might be appropriate in those to specify particular ways of doing things. I'll be trusting the management of backup metadata and storage media to a solution designed for the purpose (open or closed source), just as I'll be trusting my data to a database solution designed for that purpose. That for me is one of the good things about PostgreSQL - we use the filesystem, we don't write our own, we provide language interfaces not invent our own proprietary server language etc.. Best Regards, Simon Riggs
Re: [HACKERS] Point in Time Recovery
OK, I think I have some solid ideas and reasons for them. First, I think we need server-side functions to call when we start/stop the backup. The advantage of these server-side functions is that they will do the required work of recording the pg_control values and creating needed files with little chance for user error. It also allows us to change the internal operations in later releases without requiring admins to change their procedures. We are even able to adjust the internal operation in minor releases without forcing a new procedure on users. Second, I think once we start a restore, we should rename recovery.conf to recovery.in_progress, and when complete rename that to recovery.done. If the postmaster starts and sees recovery.in_progress, it will fail to start knowing its recovery was interrupted. This allows the admin to take appropriate action. (I am not sure what that action would be. Does he bring back the backup files or just keep going?) Third, I think we need to put a file in the archive location once we complete a backup, recording the start/stop xid and wal/offsets. This gives the admin documentation on what archive logs to keep and what xids are available for recovery. Ideally the recover program would read that file and check the recover xid to make sure it is after the stop xid recorded in the file. How would the recover program know the name of that file? We need to create it in /data with start contents before the backup, then complete it with end contents and archive it. What should we name it? Ideally it would be named by the WAL name/offset of the start so it orders in the proper spot in the archive file listing, e.g.: 093a 093b 093b.032b9.start 093c Are people going to know they need 093b for 093b.032b9.start? I hope so. Another idea is to do: 093a.xlog 093b.032b9.start 093b.xlog 093c.xlog This would order properly. It might be a very good idea to add extensions to these log files now that we are archiving them in strange places. 
In fact, maybe we should use *.pg_xlog to document the directory they came from.
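The ordering claim behind the proposed archive file names can be checked with a plain bytewise sort. This is a small sketch assuming a POSIX shell; the helper name sorted_names is hypothetical, and the file names are the shortened examples from the message rather than real WAL segment names.

```shell
# Sort file names bytewise, the way a directory listing in the C locale would.
sorted_names() {
    printf '%s\n' "$@" | LC_ALL=C sort
}

# With extensions, the .start file lands just before the segment it names:
sorted_names 093c.xlog 093b.xlog 093b.032b9.start 093a.xlog
# prints:
#   093a.xlog
#   093b.032b9.start
#   093b.xlog
#   093c.xlog
```

Without extensions the same sort gives 093a, 093b, 093b.032b9.start, 093c, which matches the first listing in the message: the .start file then follows its segment instead of preceding it.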
Re: [HACKERS] Point in Time Recovery
Let me address your concerns about PITR getting into 7.5. I think a few people spoke last week expressing concern about our release process and wanting to take drastic action. However, looking at the release status report I am about to post, you will see we are on track for an August 1 beta. PITR has been neglected only because it has been moving along so well we haven't needed to get deeply involved. Simon has been able to address concerns as we raised them and make adjustments quickly with little guidance. Now, we certainly don't want to skip adding PITR by not giving it our full attention to get into 7.5. Once Tom completes the cursor issues with NT in the next day or so, I think that removes the last big NT stumbling block, and we will start to focus on PITR. Unless there is some major thing we are missing, we fully expect to get PITR in 7.5. We don't have a crystal ball to know for sure, but our intent is clear. I know Simon is going away July 26 so we want to get him feedback as soon as possible. If we wait until after July 26, we will have to make all the adjustments without Simon's guidance, which will be difficult. As far as the importance of PITR, it is a _key_ enterprise feature, even more key than NT. PITR is going to be one of the crowning jewels of the 7.5 release, and I don't want to go into beta without it unless we can't help it. So, I know the deadline is looming and everyone is getting nervous, but keep the faith. I can see the light at the end of the tunnel. I know this is a tighter schedule than we would like, but I know we can do it, and I expect we will do it. --- Simon Riggs wrote: On Thu, 2004-07-15 at 15:57, Bruce Momjian wrote: We will get there --- it just seems dark at this time. Thanks for that. My comments were heartfelt, but not useful right now. I'm badly overdrawn already on my time budget, though that is my concern alone. There is more to do than I have time for.
Pragmatically, if we aren't going to get there then I need to stop now, so I can progress other outstanding issues. All help is appreciated. I'm aiming for the minimum feature set - which means we do need to take care over whether that set is insufficient and also to pull any part that doesn't stand up to close scrutiny over the next few days. Overall, my primary goal is increased robustness and availability for PostgreSQL...and then to have a rest! Best Regards, Simon Riggs -- Bruce Momjian
Re: [HACKERS] Point in Time Recovery
On Thu, 2004-07-15 at 02:43, Mark Kirkwood wrote: I noticed that compiling with 5_1 patch applied fails due to XLOG_archive_dir being removed from xlog.c, but src/backend/commands/tablecmds.c still uses it. I did the following to tablecmds.c:
    5408c5408
    < extern char XLOG_archive_dir[];
    ---
    > extern char *XLogArchiveDest;
    5410c5410
    < use_wal = XLOG_archive_dir[0] && !rel->rd_istemp;
    ---
    > use_wal = XLogArchiveDest[0] && !rel->rd_istemp;
Yes, I discovered that myself. The fix is included in pitr_v5_2.patch... Your patch follows the right thinking and looks like it would have worked...
- XLogArchiveMode carries the main bool value for mode on/off
- XLogArchiveDest might also be used, though best to use the mode
Thanks for looking through the code... Best Regards, Simon Riggs
Re: [HACKERS] Point in Time Recovery
On Thu, 2004-07-15 at 03:02, Bruce Momjian wrote: I talked to Tom on the phone today and I think we have a procedure for doing backup/restore in a fairly foolproof way. As outlined below, we need to record the start/stop and checkpoint WAL file names and offsets, and somehow pass those on to restore. I think any system that requires users to link those values together is going to cause confusion and be error-prone. Unfortunately, it seems clear that many of my posts have not been read, nor has anyone here actually tried to use the patch. Everybody's views on what constitutes error-prone might well differ then. Speculation about additional requirements is just great, but please don't assume that I have infinite resources to apply to these problems. Documentation has still to be written. For a long time now, I've been adding one last feature to what is there, but we're still no nearer to anybody inspecting the patch or committing it. There is building consensus on other threads that PITR should not even be included in the release (3 tentative votes). This latest request feels more like the necessary excuse to take the decision to pull PITR. I would much rather that we took the brave decision and pull it NOW, rather than have me work like crazy to chase this release. :( Best Regards, Simon Riggs
Re: [HACKERS] Point in Time Recovery
I tried what I thought was a straightforward scenario, and seem to have broken it :-( Here is the little tale:
1) initdb
2) set archive_mode and archive_dest in postgresql.conf
3) startup
4) create database called 'test'
5) connect to 'test' and type 'checkpoint'
6) backup PGDATA using 'tar -zcvf'
7) create tables in 'test' and add data using COPY (exactly 2 logs worth)
8) shutdown and remove PGDATA
9) recover using 'tar -zxvf'
10) copy recovery.conf into PGDATA
11) startup
This is what I get:
LOG: database system was interrupted at 2004-07-15 21:24:04 NZST
LOG: recovery command file found...
LOG: restore_program = cp %s/%s %s
LOG: recovery_target_inclusive = true
LOG: recovery_debug_log = true
LOG: starting archive recovery
LOG: restored log file from archive
LOG: checkpoint record is at 0/A48054
LOG: redo record is at 0/A48054; undo record is at 0/0; shutdown FALSE
LOG: next transaction ID: 496; next OID: 25419
LOG: database system was not properly shut down; automatic recovery in progress
LOG: redo starts at 0/A48094
LOG: restored log file 0001 from archive
LOG: record with zero length at 0/1E0
LOG: redo done at 0/130
LOG: restored log file 0001 from archive
LOG: restored log file 0001 from archive
PANIC: concurrent transaction log activity while database system is shutting down
LOG: startup process (PID 13492) was terminated by signal 6
LOG: aborting startup due to startup process failure
The concurrent access is a bit of a puzzle, as this is my home machine (i.e. I am *sure* no one else is connected!) Mark
P.S.: CVS HEAD from about 1 hour ago, PITR 5.2, FreeBSD 4.10 on x86
Re: [HACKERS] Point in Time Recovery
> Other db's have commands for: start/end external backup

I see that the analogy to external backup was not good, since you are correct that dba's would expect that to stop all writes, so they can safely split their mirror or some such. Usually the time from start until end external backup is expected to be only seconds. I actually think we do not need this functionality, since in pg you can safely split the mirror any time you like.

My comment was meant to give dba's a familiar tool. The effect of it would only have been to create a separate backup of pg_control. Might as well tell people to always back up pg_control first.

I originally thought you would require restore to specify an xlog id from which recovery will start. You would search this log for the first checkpoint record, create an appropriate pg_control, and start rollforward. I still think this would be a nice feature, since then all that would be required for restore is a system backup (that includes pg data files) and the xlogs.

Andreas
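The "always back up pg_control first" advice above can be sketched as a small script. This is only an illustration of the file ordering Andreas describes, not part of the patch, and it makes no claim that the ordering alone is sufficient for a consistent backup. The function name `backup_with_pg_control_first` is hypothetical; the only PostgreSQL-specific assumption is that the control file lives at `global/pg_control` under PGDATA.

```python
import os
import tarfile

# Location of the control file relative to PGDATA.
PG_CONTROL = os.path.join("global", "pg_control")


def backup_with_pg_control_first(pgdata, archive_path):
    """Write a gzipped tar of pgdata with pg_control as the first member.

    Hypothetical helper for illustration. Only regular files are
    archived; empty directories (which a real backup would also need)
    are skipped to keep the sketch short. archive_path should live
    outside pgdata. Returns the relative paths in archive order.
    """
    relative_paths = []
    for root, _dirs, files in os.walk(pgdata):
        for name in files:
            full_path = os.path.join(root, name)
            relative_paths.append(os.path.relpath(full_path, pgdata))

    # pg_control sorts first (False < True); the rest follow by name.
    relative_paths.sort(key=lambda path: (path != PG_CONTROL, path))

    with tarfile.open(archive_path, "w:gz") as tar:
        for rel in relative_paths:
            tar.add(os.path.join(pgdata, rel), arcname=rel, recursive=False)
    return relative_paths
```

Because a tar archive is written sequentially, the first member added is also the first file read from disk, which is the property the discussion cares about.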
Re: [HACKERS] Point in Time Recovery
Dear Simon,

I've just tested pitr_v5_2.patch and got an error message during the archiving process, as follows.

-- begin
LOG: archive command=cp /usr/local/pgsql/data/pg_xlog/ /tmp,return code=-1
-- end

The command called in system(3) works, but it returns -1. system(3) cannot get the right exit code from its child process when SIGCHLD is set to SIG_IGN.

So I made the following change to pgarch_Main() in pgarch.c:

-- line 236 ---
- pgsignal(SIGCHLD, SIG_IGN);
-- line 236 ---
+ pgsignal(SIGCHLD, SIG_DFL);

After that, the error message doesn't come out and it seems to be working properly.

Regards,
Hisada, Masaki

On Wed, 14 Jul 2004 00:13:37 +0100 Simon Riggs [EMAIL PROTECTED] wrote:
> PITR Patch v5_1 just posted has Point in Time Recovery working
>
> Still some rough edges... but we really need some testers now to give this a try and let me know what you think.
>
> Klaus Naumann and Mark Wong are the only [non-committers] to have tried to run the code (and let me know about it), so please have a look at [PATCHES] and try it out.
>
> Many thanks, Simon Riggs

--
HISADA, Masaki [EMAIL PROTECTED]
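The behaviour reported above can be reproduced outside PostgreSQL. The following is a minimal sketch, assuming a Unix-like system and Python's `os.system`, which is a thin wrapper around system(3); the helper name `run_archive_command` is made up for illustration and is not taken from pgarch.c. It sets SIGCHLD to its default disposition for the duration of the call so that the child's exit status can be collected, which mirrors the SIG_IGN to SIG_DFL change proposed in the message.

```python
import os
import signal


def run_archive_command(command):
    """Run a shell command via system(3) and return its exit code.

    Hypothetical helper. SIGCHLD is set to SIG_DFL while the command
    runs; with SIG_IGN, some platforms reap the child automatically and
    system(3) reports -1 instead of the real exit status.
    """
    previous = signal.signal(signal.SIGCHLD, signal.SIG_DFL)
    try:
        status = os.system(command)
    finally:
        # Restore the caller's handler (None means it was not set from Python).
        if previous is not None:
            signal.signal(signal.SIGCHLD, previous)

    if status == -1:
        raise RuntimeError("system() could not collect the child's status")
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    raise RuntimeError("command terminated abnormally (wait status %d)" % status)
```

For example, `run_archive_command("cp some_segment /tmp")` would return 0 when the copy succeeds; `some_segment` is a placeholder, since the actual file name is elided in the log above.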
Re: [HACKERS] Point in Time Recovery
Sorry for the stupid question, but how do I get this patch if I do not receive the patches mails? The web interface html'ifies it, thus making it unusable.

Thanks
Andreas
Re: [HACKERS] Point in Time Recovery
Simon Riggs wrote:
> On Thu, 2004-07-15 at 03:02, Bruce Momjian wrote:
>> I talked to Tom on the phone today and I think we have a procedure for doing backup/restore in a fairly foolproof way. As outlined below, we need to record the start/stop and checkpoint WAL file names and offsets, and somehow pass those on to restore. I think any system that requires users to link those values together is going to cause confusion and be error-prone.
>
> Unfortunately, it seems clear that many of my posts have not been read, nor has anyone here actually tried to use the patch. Everybody's views on what constitutes error-prone might well differ then.
>
> Speculation about additional requirements is just great, but please don't assume that I have infinite resources to apply to these problems. Documentation has still to be written.
>
> For a long time now, I've been adding one last feature to what is there, but we're still no nearer to anybody inspecting the patch or committing it.

I totally understand your feeling this, and I would be feeling the exact same way (but would probably have complained much earlier :-) ).

Anyway, the problem is that Tom and I are serializing application of the major features in the pipeline. We decided to focus on nested transactions (NT) first (it is a larger patch), and that is why PITR has gotten so little attention from us. However, there is no sense that you had anything to do with it being placed behind NT in the queue, and therefore there is no feeling on our part that PITR is less important or deserves less time than NT. Certainly any system that made your patch less likely to be applied would be unfair and something we will not do.

My explanation about the file format was an attempt to address the method of passing the WAL filename/offset to the recovery process. If that isn't needed, I am sorry.

> There is building consensus on other threads that PITR should not even be included in the release (3 tentative votes). This latest request feels more like the necessary excuse to take the decision to pull PITR. I would much rather that we took the brave decision and pull it NOW, rather than have me work like crazy to chase this release.

Those three individuals are not representative of the group. Sorry it might seem that there is a lack of enthusiasm for PITR, but it isn't true from our end. You might have noticed that the patch queue has shrunk dramatically, and now we are focused on NT and PITR almost exclusively.

We will get there --- it just seems dark at this time.

--
Bruce Momjian
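The idea discussed above, recording the start/stop and checkpoint WAL file names and offsets during backup and handing them to restore in a form admins can read, could look roughly like this. It is a sketch under stated assumptions: the thread does not specify any file format, so the key names, the `key = value` layout, and the `BackupLabel`, `format_label`, and `parse_label` names are all hypothetical. Offsets are written in hex only because the logs in this thread print positions that way.

```python
from dataclasses import dataclass


@dataclass
class BackupLabel:
    """Hypothetical record of WAL positions noted during a backup."""
    start_wal_file: str
    start_wal_offset: int
    checkpoint_wal_file: str
    checkpoint_wal_offset: int
    stop_wal_file: str
    stop_wal_offset: int


_FILE_KEYS = ("start_wal_file", "checkpoint_wal_file", "stop_wal_file")
_OFFSET_KEYS = ("start_wal_offset", "checkpoint_wal_offset", "stop_wal_offset")


def format_label(label):
    """Render the label as human-readable 'key = value' lines."""
    lines = []
    for file_key, offset_key in zip(_FILE_KEYS, _OFFSET_KEYS):
        lines.append("%s = %s" % (file_key, getattr(label, file_key)))
        lines.append("%s = %X" % (offset_key, getattr(label, offset_key)))
    return "\n".join(lines) + "\n"


def parse_label(text):
    """Parse text produced by format_label back into a BackupLabel."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # tolerate blank lines and comments added by an admin
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    fields = {key: values[key] for key in _FILE_KEYS}
    fields.update({key: int(values[key], 16) for key in _OFFSET_KEYS})
    return BackupLabel(**fields)
```

A readable file keeps the information easy to inspect, at the cost of needing a parser on the restore side, which is the trade-off raised earlier in the thread.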
Re: [HACKERS] Point in Time Recovery
Simon Riggs wrote:
> On Wed, 2004-07-14 at 10:57, Zeugswetter Andreas SB SD wrote:
>>> The recovery mechanism doesn't rely upon you knowing 1 or 3. The recovery reads pg_control (from the backup) and then attempts to de-archive the appropriate xlog segment file and then starts rollforward
>>
>> Unfortunately this only works if pg_control was the first file to be backed up (or by chance no checkpoint happened after backup start and pg_control backup)
>>
>> Other db's have commands for: start/end external backup
>
> OK...this idea has come up a few times. Here's my take:
>
> - OS and hardware facilities exist now to make instant copies of sets of files. Some of these are open source, others not. If you use these, you have no requirement for this functionality... but these alone are no replacement for archive recovery. I accept that some people may not wish to go to the expense or effort to use those options, but in my mind these are the people that will not be using archive_mode anyway.
>
> - all we would really need to do is to stop the bgwriter from doing anything during backup. pgcontrol is only updated at checkpoint. The current xlog is updated constantly, but this need not be copied because we are already archiving it as soon as it's full. That leaves the bgwriter, which is now responsible for both lazy writing and checkpoints. So, put a switch into bgwriter to halt for a period, then turn it back on at the end. Could be a SIGHUP GUC...or...

I don't think we can turn off all file system writes during a backup. Imagine writing to a tape. Preventing file system writes would make the system useless.

> - please could somebody else code that?... my time is limited

Yes, I think someone else could code this.

--
Bruce Momjian
Re: [HACKERS] Point in Time Recovery
On Thu, 2004-07-15 at 10:47, Mark Kirkwood wrote:
> I tried what I thought was a straightforward scenario, and seem to have broken it :-(
>
> Here is the little tale:
>
> 1) initdb
> 2) set archive_mode and archive_dest in postgresql.conf
> 3) startup
> 4) create database called 'test'
> 5) connect to 'test' and type 'checkpoint'
> 6) backup PGDATA using 'tar -zcvf'
> 7) create tables in 'test' and add data using COPY (exactly 2 logs worth)
> 8) shutdown and remove PGDATA
> 9) recover using 'tar -zxvf'
> 10) copy recovery.conf into PGDATA
> 11) startup
>
> This is what I get:
>
> LOG: database system was interrupted at 2004-07-15 21:24:04 NZST
> LOG: recovery command file found...
> LOG: restore_program = cp %s/%s %s
> LOG: recovery_target_inclusive = true
> LOG: recovery_debug_log = true
> LOG: starting archive recovery
> LOG: restored log file from archive
> LOG: checkpoint record is at 0/A48054
> LOG: redo record is at 0/A48054; undo record is at 0/0; shutdown FALSE
> LOG: next transaction ID: 496; next OID: 25419
> LOG: database system was not properly shut down; automatic recovery in progress
> LOG: redo starts at 0/A48094
> LOG: restored log file 0001 from archive
> LOG: record with zero length at 0/1E0
> LOG: redo done at 0/130
> LOG: restored log file 0001 from archive
> LOG: restored log file 0001 from archive
> PANIC: concurrent transaction log activity while database system is shutting down
> LOG: startup process (PID 13492) was terminated by signal 6
> LOG: aborting startup due to startup process failure
>
> The concurrent access is a bit of a puzzle, as this is my home machine (i.e. I am *sure* no one else is connected!)

First, thanks for sticking with it to test this.

I've not received such a message myself - this is interesting.

Is it possible to copy that directory to one side and re-run the test? Add another parameter in postgresql.conf called archive_debug = true. Does it happen identically the second time?

What time difference was there between steps 5 and 6? I think I can hear Andreas saying "told you". I'm thinking the backup might be somehow corrupted because the checkpoint occurred during the backup. Hmmm...

Could you also post me the recovery.log file? (don't post to list)

Thanks, Simon Riggs
Re: [HACKERS] Point in Time Recovery
On Thu, 2004-07-15 at 15:57, Bruce Momjian wrote:
> We will get there --- it just seems dark at this time.

Thanks for that. My comments were heartfelt, but not useful right now.

I'm badly overdrawn already on my time budget, though that is my concern alone. There is more to do than I have time for. Pragmatically, if we aren't going to get there then I need to stop now, so I can progress other outstanding issues. All help is appreciated.

I'm aiming for the minimum feature set - which means we do need to take care over whether that set is insufficient and also to pull any part that doesn't stand up to close scrutiny over the next few days.

Overall, my primary goal is increased robustness and availability for PostgreSQL...and then to have a rest!

Best Regards, Simon Riggs
Re: [HACKERS] Point in Time Recovery
Simon,

On Thu, 15 Jul 2004, Simon Riggs wrote:
>> We will get there --- it just seems dark at this time.
>
> Thanks for that. My comments were heartfelt, but not useful right now. I'm badly overdrawn already on my time budget, though that is my concern alone. There is more to do than I have time for. Pragmatically, if we aren't going to get there then I need to stop now, so I can progress other outstanding issues. All help is appreciated.

Personally, as a PostgreSQL Advocate, I believe that PITR is one of the most important missing features in PostgreSQL. I've been keeping 'all' of your e-mails about PITR and I'm really excited about that feature.

Please do not stop working on PITR. I'm pretty sure that most of the 'silent' people in the lists are waiting for PITR for an {Oracle, DB2, ...}-killer database. In my country (Turkey), too many people spend a lot of money on proprietary databases, just for some missing features in PostgreSQL. If you finish your work on PITR (and other guys on NT, Win32 port, etc), then we'll feel more concentrated on PostgreSQL advocacy, so that PostgreSQL will be used more and more. (Oh, we also need native clustering...)

Maybe I should send this e-mail offlist, but I wanted everyone to learn my feelings.

Regards and best wishes,
--
Devrim GUNDUZ
devrim~gunduz.org devrim.gunduz~linux.org.tr
http://www.tdmsoft.com http://www.gunduz.org
Re: [HACKERS] Point in Time Recovery
On Thu, 2004-07-15 at 13:16, HISADAMasaki wrote:
> Dear Simon,
>
> I've just tested pitr_v5_2.patch and got an error message during the archiving process, as follows.
>
> -- begin
> LOG: archive command=cp /usr/local/pgsql/data/pg_xlog/ /tmp,return code=-1
> -- end
>
> The command called in system(3) works, but it returns -1. system(3) cannot get the right exit code from its child process when SIGCHLD is set to SIG_IGN.
>
> So I made the following change to pgarch_Main() in pgarch.c:
>
> -- line 236 ---
> - pgsignal(SIGCHLD, SIG_IGN);
> -- line 236 ---
> + pgsignal(SIGCHLD, SIG_DFL);

Thank you for testing the patch. Very much appreciated.

I was aware of the potential issues of incorrect return codes, and that exact part of the code is the part I'm least happy with.

I'm not sure I understand why it's returned -1, though I'll take your recommendation. I've not witnessed such an issue. What system are you running, or is it a default shell issue?

Do people think that the change is appropriate for all systems, or just the one you're using?

Best Regards, Simon Riggs
Re: [HACKERS] Point in Time Recovery
On Thu, 2004-07-15 at 23:18, Devrim GUNDUZ wrote:

Thanks for the vote of confidence, on or off list.

> too many people spend a lot of money for proprietary databases, just for some missing features in PostgreSQL

Agreed - PITR isn't aimed at existing users of PostgreSQL. If you use it already, even though it doesn't have it, then you are quite likely to be able to keep going without it.

Most commercial users won't touch anything that doesn't have PITR.

> (Oh, we also need native clustering...)

Next week, OK? :)

Best Regards, Simon
Re: [HACKERS] Point in Time Recovery
On Thu, Jul 15, 2004 at 11:44:02PM +0100, Simon Riggs wrote:
> On Thu, 2004-07-15 at 13:16, HISADAMasaki wrote:
>> -- line 236 ---
>> - pgsignal(SIGCHLD, SIG_IGN);
>> -- line 236 ---
>> + pgsignal(SIGCHLD, SIG_DFL);
>
> I'm not sure I understand why it's returned -1, though I'll take your recommendation. I've not witnessed such an issue. What system are you running, or is it a default shell issue?
>
> Do people think that the change is appropriate for all systems, or just the one you're using?

My manpage for signal(2) says that you shouldn't assign SIG_IGN to SIGCHLD, according to POSIX. It goes on to say that BSD and SysV behaviors differ on this aspect.

(This is on Linux, BTW.)

--
Alvaro Herrera (alvherre[a]dcc.uchile.cl)
"Experience tells us that man has peeled potatoes millions of times, but it was necessary to admit the possibility that in one case in millions, the potatoes would peel the man" (Ijon Tichy)
Re: [HACKERS] Point in Time Recovery
Simon Riggs wrote:
> On Thu, 2004-07-15 at 23:18, Devrim GUNDUZ wrote:
>
> Thanks for the vote of confidence, on or off list.
>
>> too many people spend a lot of money for proprietary databases, just for some missing features in PostgreSQL
>
> Agreed - PITR isn't aimed at existing users of PostgreSQL. If you use it already, even though it doesn't have it, then you are quite likely to be able to keep going without it.
>
> Most commercial users won't touch anything that doesn't have PITR.

Agreed. I am surprised at how few requests we have gotten for PITR. I assume people are either using replication or not considering us.

--
Bruce Momjian
Re: [HACKERS] Point in Time Recovery
Simon Riggs wrote:
> First, thanks for sticking with it to test this.
>
> I've not received such a message myself - this is interesting.
>
> Is it possible to copy that directory to one side and re-run the test? Add another parameter in postgresql.conf called archive_debug = true. Does it happen identically the second time?

Yes, identical results - I re-initdb'ed and ran the process again, rather than reuse the files.

> What time difference was there between steps 5 and 6? I think I can hear Andreas saying "told you". I'm thinking the backup might be somehow corrupted because the checkpoint occurred during the backup. Hmmm...

I was wondering about this, so left a bit more time in between, and forced a sync as well for good measure.

5) $ psql -d test -c checkpoint; sleep 30;sync;sleep 30
6) $ tar -zcvf /data1/dump/pgdata-7.5.tar.gz *

> Thanks, Simon Riggs
Re: [HACKERS] Point in Time Recovery
Simon Riggs wrote:
> On Thu, 2004-07-15 at 15:57, Bruce Momjian wrote:
>> We will get there --- it just seems dark at this time.
>
> Thanks for that. My comments were heartfelt, but not useful right now.
>
> I'm badly overdrawn already on my time budget, though that is my concern alone. There is more to do than I have time for. Pragmatically, if we aren't going to get there then I need to stop now, so I can progress other outstanding issues. All help is appreciated.
>
> I'm aiming for the minimum feature set - which means we do need to take care over whether that set is insufficient and also to pull any part that doesn't stand up to close scrutiny over the next few days.

As you can see, we are still chewing on NT. What PITR features are missing? I assume because we can't stop file system writes during backup that we will need a backup parameter file like I described. Is there anything else that PITR needs?

--
Bruce Momjian
Re: [HACKERS] Point in Time Recovery
> Simon Riggs wrote:
>> On Thu, 2004-07-15 at 23:18, Devrim GUNDUZ wrote:
>>
>> Thanks for the vote of confidence, on or off list.
>>
>>> too many people spend a lot of money for proprietary databases, just for some missing features in PostgreSQL
>>
>> Agreed - PITR isn't aimed at existing users of PostgreSQL. If you use it already, even though it doesn't have it, then you are quite likely to be able to keep going without it.
>>
>> Most commercial users won't touch anything that doesn't have PITR.
>
> Agreed. I am surprised at how few requests we have gotten for PITR. I assume people are either using replication or not considering us.

Don't forget that there are (must be) lots of us that know it's coming and are just waiting until it's available. I haven't requested per se, but believe me, I'm waiting for it :-)
Re: [HACKERS] Point in Time Recovery
Couldn't agree more. Maybe we should have made more noise :-)

Glen Parker wrote:
>> Simon Riggs wrote:
>>> On Thu, 2004-07-15 at 23:18, Devrim GUNDUZ wrote:
>>>
>>> Thanks for the vote of confidence, on or off list.
>>>
>>>> too many people spend a lot of money for proprietary databases, just for some missing features in PostgreSQL
>>>
>>> Agreed - PITR isn't aimed at existing users of PostgreSQL. If you use it already, even though it doesn't have it, then you are quite likely to be able to keep going without it.
>>>
>>> Most commercial users won't touch anything that doesn't have PITR.
>>
>> Agreed. I am surprised at how few requests we have gotten for PITR. I assume people are either using replication or not considering us.
>
> Don't forget that there are (must be) lots of us that know it's coming and are just waiting until it's available. I haven't requested per se, but believe me, I'm waiting for it :-)
Re: [HACKERS] Point in Time Recovery
Simon Riggs wrote:
> So far:
>
> I've tried to re-create the problem as exactly as I can, but it works for me. This is clearly an important case to chase down.
>
> I assume that this is the very first time you tried recovery? Second and subsequent recoveries using the same set have a potential loophole, which we have been discussing.
>
> Right now, I'm thinking that the exactly 2 logs worth of data has brought you very close to the end of the log file (E0) ending with 1 and the shutdown checkpoint that is then subsequently written is failing.
>
> Can you repeat this your end?

It is repeatable at my end. It is actually fairly easy to recreate the example I am using: download http://sourceforge.net/projects/benchw and generate the dataset for Pg - but trim the large fact0.dat dump file using head -10. Thus step 7 consists of creating the 4 tables and COPYing in the data for them.

> The nearest I can get to the exact record pointers you show are to start recovery at A4807C and to end at with 88.
>
> Overall, PITR changes the recovery process very little, if at all. The main areas of effect are to do with sequencing of actions and matching up the right logs with the right backup. I'm not looking for bugs in the code but in subtle side-effects and edge cases. Everything you can tell me will help me greatly in chasing that down.

I agree - I will try this sort of example again, but will change the number of rows I am COPYing (currently 10) and see if that helps.

> Best Regards, Simon Riggs

By way of contrast, using the *same* procedure (1-11), but generating 2 logs worth of INSERTS/UPDATES using 10 concurrent processes *works fine* - e.g.:

LOG: database system was interrupted at 2004-07-16 11:17:52 NZST
LOG: recovery command file found...
LOG: restore_program = cp %s/%s %s
LOG: recovery_target_inclusive = true
LOG: recovery_debug_log = true
LOG: starting archive recovery
LOG: restored log file from archive
LOG: checkpoint record is at 0/A4803C
LOG: redo record is at 0/A4803C; undo record is at 0/0; shutdown FALSE
LOG: next transaction ID: 496; next OID: 25419
LOG: database system was not properly shut down; automatic recovery in progress
LOG: redo starts at 0/A4807C
postmaster starting
[EMAIL PROTECTED] 7.5]$ LOG: restored log file 0001 from archive
cp: cannot stat `/data1/pgdata/7.5-archive/0002': No such file or directory
LOG: could not restore 0002 from archive
LOG: could not open file /data1/pgdata/7.5/pg_xlog/0002 (log file 0, segment 2): No such file or directory
LOG: redo done at 0/1D4
LOG: archive recovery complete
LOG: database system is ready
LOG: archiver started
Re: [HACKERS] Point in Time Recovery
On Fri, 2004-07-16 at 00:01, Alvaro Herrera wrote:
> On Thu, Jul 15, 2004 at 11:44:02PM +0100, Simon Riggs wrote:
>> On Thu, 2004-07-15 at 13:16, HISADAMasaki wrote:
>>> -- line 236 ---
>>> - pgsignal(SIGCHLD, SIG_IGN);
>>> -- line 236 ---
>>> + pgsignal(SIGCHLD, SIG_DFL);
>>
>> I'm not sure I understand why it's returned -1, though I'll take your recommendation. I've not witnessed such an issue. What system are you running, or is it a default shell issue?
>>
>> Do people think that the change is appropriate for all systems, or just the one you're using?
>
> My manpage for signal(2) says that you shouldn't assign SIG_IGN to SIGCHLD, according to POSIX. It goes on to say that BSD and SysV behaviors differ on this aspect.

POSIX rules OK!

So - I should be setting this to SIG_DFL and that's good for everyone?

OK. Will do.

Best regards, Simon Riggs
Re: [HACKERS] Point in Time Recovery
On Fri, 2004-07-16 at 00:46, Mark Kirkwood wrote:
> By way of contrast, using the *same* procedure (1-11), but generating 2 logs worth of INSERTS/UPDATES using 10 concurrent processes *works fine* - e.g.:

Great...at least we have shown that something works (or can work) and have begun to isolate the problem, whatever it is.

Best Regards, Simon Riggs
Re: [HACKERS] Point in Time Recovery
Simon Riggs [EMAIL PROTECTED] writes:
> On Fri, 2004-07-16 at 00:01, Alvaro Herrera wrote:
>> My manpage for signal(2) says that you shouldn't assign SIG_IGN to SIGCHLD, according to POSIX.
>
> So - I should be setting this to SIG_DFL and that's good for everyone?

Yeah, we learned the same lesson in the backend not too many releases back. SIG_IGN'ing SIGCHLD is bad voodoo; it'll work on some platforms but not others.

You could do worse than to look at the existing handling of signals in the postmaster and its children; that code has been beat on pretty heavily ...

regards, tom lane
Re: [HACKERS] Point in Time Recovery
> Thanks for that. My comments were heartfelt, but not useful right now.

Hi Simon,

I'm sorry if I gave the impression that I thought your work wasn't worthwhile, it is :(

> I'm badly overdrawn already on my time budget, though that is my concern alone. There is more to do than I have time for. Pragmatically, if we aren't going to get there then I need to stop now, so I can progress other outstanding issues. All help is appreciated.

I've got your patch applied (but I'm having some compilation problem), and I'm really not sure what to test. I don't really understand the whole thing fully :/

> I'm aiming for the minimum feature set - which means we do need to take care over whether that set is insufficient and also to pull any part that doesn't stand up to close scrutiny over the next few days.
>
> Overall, my primary goal is increased robustness and availability for PostgreSQL...and then to have a rest!

Definitely!

Chris
Re: [HACKERS] Point in Time Recovery
On Wed, 2004-07-14 at 03:31, Christopher Kings-Lynne wrote:
> Can you give us some suggestions of what kind of stuff to test? Is there a way we can artificially kill the backend in all sorts of nasty spots to see if recovery works? Does kill -9 simulate a 'power off'?

I was hoping some fiendish plans would be presented to me...

But please start with "this feels like typical usage" and we'll go from there...the important thing is to try the first one.

I've not done power off tests, yet. They need to be done just to check...actually you don't need to do this to test PITR...

We need exhaustive tests of...
- power off
- scp and cross network copies
- all the permuted recovery options
- archive_mode = off (i.e. current behaviour)
- deliberately incorrectly set options (idiot-proof testing)

I'd love some help assembling a test document with numbered tests...

Best regards, Simon Riggs
Re: [HACKERS] Point in Time Recovery
> The recovery mechanism doesn't rely upon you knowing 1 or 3. The recovery reads pg_control (from the backup) and then attempts to de-archive the appropriate xlog segment file and then starts rollforward

Unfortunately this only works if pg_control was the first file to be backed up (or by chance no checkpoint happened after backup start and pg_control backup).

Other db's have commands for: start/end external backup

Maybe we should add those two commands, which would initially only do the following:

start external backup:
- (checkpoint as an option)
- make a copy of pg_control

end external backup:
- record WAL position (helps choose an allowed minimum PIT)

Those commands would actually not be obligatory but recommended, and would only help with the restore process.

Restore would then either take the existing pg_control backup, or ask the dba where rollforward should start and create a corresponding pg_control. A helper utility could list possible checkpoints in a given xlog.

Andreas
Re: [HACKERS] Point in Time Recovery
Simon Riggs [EMAIL PROTECTED] writes:
> I've not done power off tests, yet. They need to be done just to check...actually you don't need to do this to test PITR...

I agree, power off is not really the point here. What we need to check into is (a) the mechanics of archiving WAL segments and (b) the process of restoring given a backup and a bunch of WAL segments.

regards, tom lane
Re: [HACKERS] Point in Time Recovery
On 14 Jul, Simon Riggs wrote:
> PITR Patch v5_1 just posted has Point in Time Recovery working
>
> Still some rough edges... but we really need some testers now to give this a try and let me know what you think.
>
> Klaus Naumann and Mark Wong are the only [non-committers] to have tried to run the code (and let me know about it), so please have a look at [PATCHES] and try it out.
>
> Many thanks, Simon Riggs

Simon,

I just tried applying the v5_1 patch against the cvs tip today and got a couple of rejections. I'll copy the patch output here. Let me know if you want to see the reject files or anything else:

$ patch -p0 < ../../../pitr-v5_1.diff
patching file backend/access/nbtree/nbtsort.c
Hunk #2 FAILED at 221.
1 out of 2 hunks FAILED -- saving rejects to file backend/access/nbtree/nbtsort.c.rej
patching file backend/access/transam/xlog.c
Hunk #11 FAILED at 1802.
Hunk #15 FAILED at 2152.
Hunk #16 FAILED at 2202.
Hunk #21 FAILED at 3450.
Hunk #23 FAILED at 3539.
Hunk #25 FAILED at 3582.
Hunk #26 FAILED at 3833.
Hunk #27 succeeded at 3883 with fuzz 2.
Hunk #28 FAILED at 4446.
Hunk #29 succeeded at 4470 with fuzz 2.
8 out of 29 hunks FAILED -- saving rejects to file backend/access/transam/xlog.c.rej
patching file backend/postmaster/Makefile
patching file backend/postmaster/postmaster.c
Hunk #3 succeeded at 1218 with fuzz 2 (offset 70 lines).
Hunk #4 succeeded at 1827 (offset 70 lines).
Hunk #5 succeeded at 1874 (offset 70 lines).
Hunk #6 succeeded at 1894 (offset 70 lines).
Hunk #7 FAILED at 1985.
Hunk #8 succeeded at 2039 (offset 70 lines).
Hunk #9 succeeded at 2236 (offset 70 lines).
Hunk #10 succeeded at 2996 with fuzz 2 (offset 70 lines).
1 out of 10 hunks FAILED -- saving rejects to file backend/postmaster/postmaster.c.rej
patching file backend/storage/smgr/md.c
Hunk #1 succeeded at 162 with fuzz 2.
patching file backend/utils/misc/guc.c
Hunk #1 succeeded at 342 (offset 9 lines).
Hunk #2 succeeded at 1387 (offset 9 lines).
patching file backend/utils/misc/postgresql.conf.sample
Hunk #1 succeeded at 113 (offset 10 lines).
patching file bin/initdb/initdb.c
patching file include/access/xlog.h
patching file include/storage/pmsignal.h
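For testers reporting rejections like the ones above, a short script can condense patch output into a per-file tally. This is a convenience sketch only, assuming the "patching file ..." and "N out of M hunks FAILED" line shapes shown in the message; the function name `failed_hunks_by_file` is hypothetical.

```python
import re

_PATCHING = re.compile(r"patching file (\S+)")
_FAILED = re.compile(r"(\d+) out of (\d+) hunks? FAILED")


def failed_hunks_by_file(patch_output):
    """Map each file with rejected hunks to a (failed, total) tuple.

    Files that patched cleanly are left out of the result.
    """
    results = {}
    current_file = None
    for line in patch_output.splitlines():
        line = line.strip()
        patching = _PATCHING.match(line)
        if patching:
            current_file = patching.group(1)
            continue
        failed = _FAILED.match(line)
        if failed and current_file is not None:
            results[current_file] = (int(failed.group(1)), int(failed.group(2)))
    return results
```

Applied to the output quoted above, this would report three files with rejects: nbtsort.c (1 of 2), xlog.c (8 of 29) and postmaster.c (1 of 10).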
Re: [HACKERS] Point in Time Recovery
On Wed, 2004-07-14 at 16:55, [EMAIL PROTECTED] wrote:
> On 14 Jul, Simon Riggs wrote:
>> PITR Patch v5_1 just posted has Point in Time Recovery working
>>
>> Still some rough edges... but we really need some testers now to give this a try and let me know what you think.
>>
>> Klaus Naumann and Mark Wong are the only [non-committers] to have tried to run the code (and let me know about it), so please have a look at [PATCHES] and try it out.
>
> I just tried applying the v5_1 patch against the cvs tip today and got a couple of rejections. I'll copy the patch output here. Let me know if you want to see the reject files or anything else:

I'm on it. Sorry 'bout that all - midnight fingers.
Re: [HACKERS] Point in Time Recovery
On Wed, 2004-07-14 at 10:57, Zeugswetter Andreas SB SD wrote:
> > The recovery mechanism doesn't rely upon you knowing 1 or 3. The recovery reads pg_control (from the backup) and then attempts to de-archive the appropriate xlog segment file and then starts rollforward
> Unfortunately this only works if pg_control was the first file to be backed up (or by chance no checkpoint happened after backup start and pg_control backup)
> Other db's have commands for: start/end external backup

OK...this idea has come up a few times. Here's my take:

- OS and hardware facilities exist now to make instant copies of sets of files. Some of these are open source, others not. If you use these, you have no requirement for this functionality... but these alone are no replacement for archive recovery. I accept that some people may not wish to go to the expense or effort to use those options, but in my mind these are the people that will not be using archive_mode anyway.

- all we would really need to do is to stop the bgwriter from doing anything during backup. pg_control is only updated at checkpoint. The current xlog is updated constantly, but this need not be copied because we are already archiving it as soon as it's full. That leaves the bgwriter, which is now responsible for both lazy writing and checkpoints. So, put a switch into bgwriter to halt for a period, then turn it back on at the end. Could be a SIGHUP GUC...or...

...and with my greatest respects - please could somebody else code that?... my time is limited

Best regards, Simon Riggs
Re: [HACKERS] Point in Time Recovery
I noticed that compiling with the 5_1 patch applied fails due to XLOG_archive_dir being removed from xlog.c, but src/backend/commands/tablecmds.c still uses it. I did the following to tablecmds.c:

    5408c5408
    < extern char XLOG_archive_dir[];
    ---
    > extern char *XLogArchiveDest;
    5410c5410
    < use_wal = XLOG_archive_dir[0] && !rel->rd_istemp;
    ---
    > use_wal = XLogArchiveDest[0] && !rel->rd_istemp;

Now I have to see if I have broken it with this change :-)

regards
Mark

Simon Riggs wrote:
> On Wed, 2004-07-14 at 16:55, [EMAIL PROTECTED] wrote:
> > On 14 Jul, Simon Riggs wrote:
> > > PITR Patch v5_1 just posted has Point in Time Recovery working. Still some rough edges... but we really need some testers now to give this a try and let me know what you think. Klaus Naumann and Mark Wong are the only [non-committers] to have tried to run the code (and let me know about it), so please have a look at [PATCHES] and try it out.
> > I just tried applying the v5_1 patch against the cvs tip today and got a couple of rejections. I'll copy the patch output here. Let me know if you want to see the reject files or anything else:
> I'm on it. Sorry 'bout that all - midnight fingers.
Re: [HACKERS] Point in Time Recovery
Hi, folks.

My colleagues and I are planning to test PITR after the 7.5 beta release. We are now designing test items, but some specifications are not clear enough (to us). For example, we are not clear about which resource managers are required to store log records:
- whether some access methods (like B-tree) are required to log their data or not.
- whether create/drop actions on a tablespace are stored in the log or not.
We would be grateful if someone could tell us.

The test set we'll proceed with has the following items:
- PITR can recover ordinary committed transactions' data.
  - tuple data themselves
  - index data associated with them
- PITR can recover some special committed transactions' data.
  - DDL: create database, table, index and so on
  - maintenance commands (handling large amounts of data): truncate, vacuum, reindex and so on.

The items above are the 'data aspects' of the test. Other aspects are as follows:
- Place of the archival log's drive; PITR can recover a database from archived log data
  - stored in the same drive as xlog.
  - stored in a different drive on the same machine on which PostgreSQL runs.
  - stored in a different drive on a different machine.
- Duration between a checkpoint and recovery; PITR can recover a database long enough after a checkpoint.
- Time to recover:
  - to end of the log.
  - to some specified time.
- Type of failures:
  - system down --- kill the PostgreSQL process (as a simulation).
  - media lost --- delete database files (as a simulation).
  - These two cases will be tested in a simulated situation first, and we would try some 'real' failures afterwards (a real power-down of the test machine for the first case, and unplugging the disk drive for the second one; these actions could damage the test machine, which is why we plan them after the 'ordinary' test items).

The test set is under construction and we'll test the 7.5 beta for some weeks, and report the results of the test here.

Sincerely yours.
Tetsuo SAKATA.
--
sakata.tetsuo _at_ lab.ntt.co.jp
SAKATA, Tetsuo. Yokosuka JAPAN.
Re: [HACKERS] Point in Time Recovery
I talked to Tom on the phone today and I think we have a procedure for doing backup/restore in a fairly foolproof way. As outlined below, we need to record the start/stop and checkpoint WAL file names and offsets, and somehow pass those on to restore. I think any system that requires users to link those values together is going to cause confusion and be error-prone.

My idea is to do much of this automatically. First, create a server-side function called pitr_backup_start() which creates a file in the /data directory which contains the WAL filename/offsets for last checkpoint and start. Then do the backup of the data directory. Then call pitr_backup_stop() which adds the stop filename/offsets to the file, and archive that file in the same place as the WAL files.

To restore, you untar the backup of /data. Then the recover backend reads the file created by pitr_backup_start() to find the name of the backup parameter file. It then recovers that file from the archive location and uses the start/stop/checkpoint filename/offset information to do the restore.

The advantage of this is that the tar backup contains everything needed to find the proper parameter file for restore. Ideally we could get all the parameters into the tar backup, but that isn't possible because we can't push the stop counters into the backup after the backup has completed.

I recommend the pitr_backup_start() file be named for the current WAL filename/offset, perhaps 032c.3da390.backup or something like that. The file would be a simple text file in pg_xlog/archive_status:

    # start 2004-07-14 21:35:22.324579-04
    wal_checkpoint = 0319.021233
    wal_start = 032c.92a9cb
    ...added after backup completes...
    wal_stop = 034a.3db030
    # stop 2004-07-14 21:32:22.0923213-04

The timestamps are for documentation only. These files give admins looking in the archive directory information on backup times.

(As an idea, there is no need for the user to specify a recovery mode. If the postmaster starts and sees the pitr_backup_start() file in /data, it can go into recovery mode automatically. If the archiver can't find the file in the archive location, it can assume that it is just being started from power failure mode. However if it finds the file in the archive location, it can assume it is to enter recovery mode. There is a race condition that a crash during copy of the file to the archive location would be a problem. The solution would be to create a special flag file before copying the file to archive, and then archive it and remove the flag file. If the postmaster starts up and sees the pitr_backup_start() file in /data and in the archive location, and it doesn't see the flag file, it then knows it is doing a restore because the flag file would never appear in a backup. Anyway, this is just an idea.)

---

Simon Riggs wrote:
> On Wed, 2004-07-14 at 10:57, Zeugswetter Andreas SB SD wrote:
> > > The recovery mechanism doesn't rely upon you knowing 1 or 3. The recovery reads pg_control (from the backup) and then attempts to de-archive the appropriate xlog segment file and then starts rollforward
> > Unfortunately this only works if pg_control was the first file to be backed up (or by chance no checkpoint happened after backup start and pg_control backup)
> > Other db's have commands for: start/end external backup
>
> OK...this idea has come up a few times. Here's my take:
>
> - OS and hardware facilities exist now to make instant copies of sets of files. Some of these are open source, others not. If you use these, you have no requirement for this functionality... but these alone are no replacement for archive recovery. I accept that some people may not wish to go to the expense or effort to use those options, but in my mind these are the people that will not be using archive_mode anyway.
>
> - all we would really need to do is to stop the bgwriter from doing anything during backup. pg_control is only updated at checkpoint. The current xlog is updated constantly, but this need not be copied because we are already archiving it as soon as it's full. That leaves the bgwriter, which is now responsible for both lazy writing and checkpoints. So, put a switch into bgwriter to halt for a period, then turn it back on at the end. Could be a SIGHUP GUC...or...
>
> ...and with my greatest respects - please could somebody else code that?... my time is limited
>
> Best regards, Simon Riggs

--
Bruce Momjian | http://candle.pha.pa.us
[EMAIL PROTECTED] | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your
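The backup parameter file proposed above can be sketched in a few lines. This is only an illustration of the idea, not the patch's code: the function names backup_start(), backup_stop() and read_backup_file() are hypothetical stand-ins for pitr_backup_start()/pitr_backup_stop(), and naming the file after wal_start is an assumption (the message only says "named for the current WAL filename/offset").

```python
import tempfile
from pathlib import Path


def backup_start(status_dir, wal_checkpoint, wal_start, started_at):
    """Write the start half of the parameter file (hypothetical helper)."""
    # Assumption: the file is named after the WAL position at backup start.
    path = Path(status_dir) / f"{wal_start}.backup"
    path.write_text(
        f"# start {started_at}\n"
        f"wal_checkpoint = {wal_checkpoint}\n"
        f"wal_start = {wal_start}\n"
    )
    return path


def backup_stop(path, wal_stop, stopped_at):
    """Append the stop position once the copy of the data directory is done."""
    with open(path, "a") as f:
        f.write(f"wal_stop = {wal_stop}\n# stop {stopped_at}\n")


def read_backup_file(path):
    """Parse 'key = value' lines; '#' lines are documentation only."""
    params = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        params[key.strip()] = value.strip()
    return params


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        p = backup_start(d, "0319.021233", "032c.92a9cb", "2004-07-14 21:35:22")
        backup_stop(p, "034a.3db030", "2004-07-14 21:52:22")
        print(read_backup_file(p))
```

Because the timestamps live only in comment lines, a restore that parses the file this way never depends on them, which matches the "for documentation only" remark.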
Re: [HACKERS] Point in Time Recovery
On Tue, 2004-07-06 at 22:40, Simon Riggs wrote:
On Mon, 2004-07-05 at 22:46, Tom Lane wrote:
Simon Riggs [EMAIL PROTECTED] writes:

- when we stop, keep reading records until EOF, just don't apply them. When we write a checkpoint at end of recovery, the unapplied transactions are buried alive, never to return.
- stop where we stop, then force zeros to EOF, so that no possible record remains of previous transactions.

Go with plan B; it's best not to destroy data (what if you chose the wrong restart point the first time)?

Actually this now reminds me of a discussion I had with Patrick Macdonald some time ago. The DB2 practice in this connection is that you *never* overwrite existing logfile data when recovering. Instead you start a brand new xlog segment file, which is given a new branch number so it can be distinguished from the future-time xlog segments that you chose not to apply. I don't recall what the DB2 terminology was exactly --- not branch number I don't think --- but anyway the idea is that when you restart the database after an incomplete recovery, you are now in a sort of parallel universe that has its own history after the branch point (PITR stop point). You need to be able to distinguish archived log segments of this parallel universe from those of previous and subsequent incarnations. I'm not sure whether Vadim intended our StartUpID to serve this purpose, but it could perhaps be used that way, if we reflected it in the WAL file names.

Some more thoughts...focusing on the what do we do after we've finished recovering.

The objectives, as I see them, are to put the system into a state that preserves these features:
1. we never overwrite files, in case we want to re-run recovery
2. we never write files that MIGHT have been written previously
3. we need to ensure that any xlog records skipped at admin's request (in PITR mode) are never in a position to be re-applied to this timeline.
4. ensure we can re-recover, if we need to, without further problems

Tom's concept above, I'm going to call timelines. A timeline is the sequence of logs created by the execution of a server. If you recover the database, you create a new timeline. [This is because, if you've invoked PITR you absolutely definitely want log records written to, say, xlog15 to be different to those that were written to xlog15 in a previous timeline that you have chosen not to reapply.]

Objective (1) is complex. When we are restoring, we always start with archived copies of the xlog, to make sure we don't finish too soon. We roll forward until we either reach PITR stop point, or we hit end of archived logs. If we hit end of logs on archive, then we switch to a local copy, if one exists that is higher than those, we carry on rolling forward until either we reach PITR stop point, or we hit end of that log. (Hopefully, there isn't more than one local xlog higher than the archive, but it's possible). If we are rolling forward on local copies, then they are our only copies. We'd really like to archive them ASAP, but the archiver's not running yet - we don't want to force that situation in case the archive device (say a tape) is the one being used to recover right now. So we write an archive_status of .ready for that file, ensuring that the checkpoint won't remove it until it gets copied to archive, whenever that starts working again. Objective (1) met.

When we have finished recovering we:
- create a new xlog at the start of a new ++timeline
- copy the last applied xlog record to it as the first record
- set the record pointer so that it matches
That way, when we come up and begin running, we never overwrite files that might have been written previously. Objective (2) met. We do the other stuff because recovery finishes up by pointing to the last applied record...which is what was causing all of this extra work in the first place.

At this point, we also reset the secondary checkpoint record, so that should recovery be required again before next checkpoint AND the shutdown checkpoint record written after recovery completes is wrong/damaged, the recovery will not autorewind back past the PITR stop point and attempt to recover the records we have just tried so hard to reverse/ignore. Objective (3) met. (Clearly, that situation seems unlikely, but I feel we must deal with it...a newly restored system is actually very fragile, so a crash again within 3 minutes or so is very commonplace, as far as these things go).

Should we need to re-recover, we can do so because the new timeline xlogs are further forward than the old timeline, so never get seen by any processes (all of which look backwards). Re-recovery is possible without problems, if required.

This means you're a lot safer from some of the mistakes you might have made, such as deciding you need to go into recovery, then realising it wasn't required (or some other
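To make the timeline idea in this message concrete, here is a small sketch of why bumping a timeline counter satisfies objective (2). Everything in it is an assumption made for illustration: the XlogFile type, the "timeline.segment" file name format, and the choice to keep the segment number when the timeline changes are not taken from the patch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class XlogFile:
    timeline: int
    segment: int

    def name(self) -> str:
        # Hypothetical naming scheme: timeline and segment number in hex.
        return f"{self.timeline:04X}.{self.segment:08X}"


def start_new_timeline(last_applied: XlogFile, known_timelines: set) -> XlogFile:
    """Pick the first xlog file to write after recovery has finished.

    The new timeline number is higher than any timeline seen so far, so the
    resulting file name cannot collide with a file written earlier, even one
    with the same segment number that we chose not to apply.
    """
    new_timeline = max(known_timelines) + 1
    return XlogFile(new_timeline, last_applied.segment)


if __name__ == "__main__":
    stopped_at = XlogFile(timeline=1, segment=15)
    first_new = start_new_timeline(stopped_at, {1})
    print(stopped_at.name(), "->", first_new.name())
```

A second recovery from the same backup would pass both known timelines and get a third one, so re-recovery never reuses a file name either.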
Re: [HACKERS] Point in Time Recovery
> The starting a new timeline thought works for xlogs, but not for clogs. No matter how far you go into the future, there is a small (yet vanishing) possibility that there is a yet undiscovered committed transaction in the future. (Because transactions are ordered in the clog because xids are assigned sequentially at txn start, but not ordered in the xlog where they are recorded in the order the txns complete).

Won't ExtendCLOG take care of this with ZeroCLOGPage? Else the same problem would arise at xid wraparound, no?

Andreas
Re: [HACKERS] Point in Time Recovery
On Tue, 2004-07-13 at 13:18, Zeugswetter Andreas SB SD wrote:
> > The starting a new timeline thought works for xlogs, but not for clogs. No matter how far you go into the future, there is a small (yet vanishing) possibility that there is a yet undiscovered committed transaction in the future. (Because transactions are ordered in the clog because xids are assigned sequentially at txn start, but not ordered in the xlog where they are recorded in the order the txns complete).
> Won't ExtendCLOG take care of this with ZeroCLOGPage? Else the same problem would arise at xid wraparound, no?

Sounds like a good start... When PITR ends, we need to stop mid-way through a file. Does that handle that situation?

Simon Riggs
Re: [HACKERS] Point in Time Recovery
Simon Riggs [EMAIL PROTECTED] writes:
> Please tell me that we can ignore the state of the clog,

We can.

The reason that keeping track of timelines is interesting for xlog is simply to take pity on the poor DBA who needs to distinguish the various archived xlog files he's got laying about, and so that we can detect errors like supplying inconsistent sets of xlog segments during restore. This does not apply to clog because it's not archived. It's no more than a data file. If you think you have trouble recreating clog then you have the same issues recreating data files.

regards, tom lane
Re: [HACKERS] Point in Time Recovery
On Tue, 2004-07-13 at 15:29, Tom Lane wrote:
> Simon Riggs [EMAIL PROTECTED] writes:
> > Please tell me that we can ignore the state of the clog,
> We can.

In general, you are of course correct.

> The reason that keeping track of timelines is interesting for xlog is simply to take pity on the poor DBA who needs to distinguish the various archived xlog files he's got laying about, and so that we can detect errors like supplying inconsistent sets of xlog segments during restore. This does not apply to clog because it's not archived. It's no more than a data file. If you think you have trouble recreating clog then you have the same issues recreating data files.

I'm getting carried away with the improbable... but this is the rather strange, but possible scenario I foresee:

A sequence of times...
1. We start archiving xlogs
2. We take a checkpoint
3. we commit an important transaction
4. We take a backup
5. We take a checkpoint

As stands currently, when we restore the backup, controlfile says that last checkpoint was at 2, so we rollforward from 2 THRU 4 and continue on past 5 until end of logs. Normally, end of logs isn't until after 4...

When we specify a recovery target, it is possible to specify the rollforward to complete just before point 3. So we use the backup taken at 4 to rollforward to a point in the past (from the backup's perspective). The backup taken at 4 may now have data and clog records written by bgwriter. Given that time between checkpoints is likely to be longer than previously was the case...this becomes a non-zero situation.

I was trying to solve this problem head on, but the best way is to make sure we never allow ourselves such a muddled situation: ISTM the way to avoid this is to insist that we always rollforward through at least one checkpoint to guarantee that this will not occur.

...then I can forget I ever mentioned the ** clog again.

I'm ignoring this issue for now... whether it exists or not!

Best Regards, Simon Riggs
Re: [HACKERS] Point in Time Recovery
Simon Riggs [EMAIL PROTECTED] writes:
> I'm getting carried away with the improbable... but this is the rather strange, but possible scenario I foresee:
> A sequence of times...
> 1. We start archiving xlogs
> 2. We take a checkpoint
> 3. we commit an important transaction
> 4. We take a backup
> 5. We take a checkpoint
> When we specify a recovery target, it is possible to specify the rollforward to complete just before point 3.

No, it isn't possible. The recovery *must* proceed at least as far as wherever the end of the log was at the time the backup was completed. Otherwise everything is broken, not only clog, because you may have disk blocks in your backup that postdate where you stopped log replay.

To have a consistent recovery at all, you must replay the log starting from a checkpoint before the backup began and extending to the time that the backup finished. You only get to decide where to stop after that point.

regards, tom lane
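The rule in this reply can be written down as a small check. This is a sketch under stated assumptions: WAL positions are modelled as plain integers (here simply the step numbers of the quoted scenario), and check_recovery_target() is a hypothetical helper, not a function in the patch, which at this point in the thread had no such enforcement.

```python
def check_recovery_target(checkpoint_pos, backup_end_pos, target_pos=None):
    """Return the (start, stop) replay range, or raise if the target is unsafe.

    checkpoint_pos : position of a checkpoint taken before the backup began
    backup_end_pos : end-of-log position at the time the backup completed
    target_pos     : requested stop point, or None for "replay to end of logs"
    """
    if checkpoint_pos > backup_end_pos:
        raise ValueError("checkpoint must precede the end of the backup")
    if target_pos is not None and target_pos < backup_end_pos:
        # The backup may contain disk blocks newer than the stop point.
        raise ValueError("recovery target precedes the end of the backup")
    return checkpoint_pos, target_pos


if __name__ == "__main__":
    # Scenario from the thread: checkpoint at 2, backup finished at 4.
    print(check_recovery_target(2, 4, 5))      # stopping after the backup is fine
    print(check_recovery_target(2, 4, None))   # so is replaying to end of logs
    try:
        check_recovery_target(2, 4, 3)          # "just before point 3"
    except ValueError as exc:
        print("rejected:", exc)
```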
Re: [HACKERS] Point in Time Recovery
On Tue, 2004-07-13 at 22:19, Tom Lane wrote:
> To have a consistent recovery at all, you must replay the log starting from a checkpoint before the backup began and extending to the time that the backup finished. You only get to decide where to stop after that point.

So the situation is:
- You must only stop recovery at a point in time (in the logs) after the backup had completed.

No way to enforce that currently, apart from procedurally. Not exactly frequent, so I think I just document that and move on, eh?

Thanks for your help,

Best regards, Simon Riggs
Re: [HACKERS] Point in Time Recovery
Simon Riggs [EMAIL PROTECTED] writes:
> So the situation is:
> - You must only stop recovery at a point in time (in the logs) after the backup had completed.

Right.

> No way to enforce that currently, apart from procedurally. Not exactly frequent, so I think I just document that and move on, eh?

The procedure that generates a backup has got to be responsible for recording both the start and stop times. If it does not do so then it's fatally flawed. (Note also that you had better be careful to get the time as seen on the server machine's clock ... this could be a nasty gotcha if the backup is run on a different machine, such as an NFS server.)

regards, tom lane
Re: [HACKERS] Point in Time Recovery
On Tue, 2004-07-13 at 23:42, Bruce Momjian wrote:
> Simon Riggs wrote:
> > On Tue, 2004-07-13 at 22:19, Tom Lane wrote:
> > > To have a consistent recovery at all, you must replay the log starting from a checkpoint before the backup began and extending to the time that the backup finished. You only get to decide where to stop after that point.
> > So the situation is:
> > - You must only stop recovery at a point in time (in the logs) after the backup had completed.
> > No way to enforce that currently, apart from procedurally. Not exactly frequent, so I think I just document that and move on, eh?
> If it happens, could you use your previous full backup and the PITR logs from before stop stopped logging, and then after?

Yes.

> Is there a period where they could not restore reliably?

Good question. No is the answer. The situation is that the backup isn't timestamped with respect to the logs, so it's possible to attempt to use the wrong backup for recovery. The solution is procedural - make sure you timestamp your backup files, so you know which ones to recover with...

Best Regards, Simon Riggs
Re: [HACKERS] Point in Time Recovery
Tom Lane wrote:
> Simon Riggs [EMAIL PROTECTED] writes:
> > So the situation is:
> > - You must only stop recovery at a point in time (in the logs) after the backup had completed.
> Right.
> > No way to enforce that currently, apart from procedurally. Not exactly frequent, so I think I just document that and move on, eh?
> The procedure that generates a backup has got to be responsible for recording both the start and stop times. If it does not do so then it's fatally flawed. (Note also that you had better be careful to get the time as seen on the server machine's clock ... this could be a nasty gotcha if the backup is run on a different machine, such as an NFS server.)

OK, but procedurally, how do you correlate the start/stop time of the tar backup with the WAL numeric file names?

--
Bruce Momjian | http://candle.pha.pa.us
[EMAIL PROTECTED] | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Re: [HACKERS] Point in Time Recovery
Bruce Momjian [EMAIL PROTECTED] writes:
> OK, but procedurally, how do you correlate the start/stop time of the tar backup with the WAL numeric file names?

Ideally the procedure for making a backup would go something like:

1. Inquire of the server its current time and the WAL position of the most recent checkpoint record (which is what you really need).
2. Make the backup.
3. Inquire of the server its current time and the current end-of-WAL position.
4. Record items 1 and 3 along with the backup itself.

I think the current theory was you could fake #1 by copying pg_control before everything else, but this doesn't really help for step #3, so it would probably be better to add some server functions to get this info.

regards, tom lane
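The four steps above can be sketched as a driver routine. The server functions Tom asks for did not exist yet, so they are passed in as plain callables here; server_time, last_checkpoint_pos, end_of_wal_pos and copy_data_directory are all hypothetical names, and the sketch only shows the ordering of the steps.

```python
def run_backup(server_time, last_checkpoint_pos, end_of_wal_pos, copy_data_directory):
    """Make a backup and return it together with its start/stop record."""
    # Step 1: ask the server (not the backup host) for its time and the
    # position of the most recent checkpoint record.
    start = {"time": server_time(), "checkpoint": last_checkpoint_pos()}
    # Step 2: make the backup itself.
    backup = copy_data_directory()
    # Step 3: ask the server again, this time for the current end of WAL.
    stop = {"time": server_time(), "end_of_wal": end_of_wal_pos()}
    # Step 4: keep items 1 and 3 with the backup.
    return {"backup": backup, "start": start, "stop": stop}


if __name__ == "__main__":
    clock = iter(["2004-07-14 21:35:22", "2004-07-14 21:52:22"])
    record = run_backup(
        server_time=lambda: next(clock),
        last_checkpoint_pos=lambda: "0319.021233",
        end_of_wal_pos=lambda: "034a.3db030",
        copy_data_directory=lambda: "base.tar",
    )
    print(record)
```

Taking both timestamps from the server side is the point of the earlier warning about running the backup on a different machine such as an NFS server.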
Re: [HACKERS] Point in Time Recovery
On Wed, 2004-07-14 at 00:28, Tom Lane wrote:
> Bruce Momjian [EMAIL PROTECTED] writes:
> > OK, but procedurally, how do you correlate the start/stop time of the tar backup with the WAL numeric file names?
> Ideally the procedure for making a backup would go something like:
> 1. Inquire of the server its current time and the WAL position of the most recent checkpoint record (which is what you really need).
> 2. Make the backup.
> 3. Inquire of the server its current time and the current end-of-WAL position.
> 4. Record items 1 and 3 along with the backup itself.
> I think the current theory was you could fake #1 by copying pg_control before everything else, but this doesn't really help for step #3, so it would probably be better to add some server functions to get this info.

err...I think at this point we should review the PITR patch

The recovery mechanism doesn't rely upon you knowing 1 or 3. The recovery reads pg_control (from the backup) and then attempts to de-archive the appropriate xlog segment file and then starts rollforward from there. Effectively, restore assumes it has access to an infinite timeline of logs... which clearly isn't the case, but it's up to *you* to check that you have the logs that go with the backups. (Or put another way, if this sounds hard, buy some software that administers the procedure for you). That's the mechanism that allows infinite recovery.

In brief, the code path is as identical as possible to the current crash recovery situation...archive recovery restores the files from archive when they are needed, just as if they had always been in pg_xlog, in a way that ensures pg_xlog never runs out of space.

Recovery ends when:
- it reaches the recovery target you specified, or
- it runs out of xlogs (first it runs out of archived xlogs, then tries to find a more recent local copy if there is one).

> I think the current theory was you could fake #1 by copying pg_control before everything else, but this doesn't really help for step #3, so it would probably be better to add some server functions to get this info.

Not sure what you mean by fake

Best Regards, Simon Riggs
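The two ways recovery ends, as described in this message, can be shown with a toy replay loop. This is a simplified model and not the xlog.c code path: segments are numbered with small integers, the archive and pg_xlog are plain dictionaries, records are strings, and recover() is a hypothetical name.

```python
def recover(start_segment, archived, local, target=None):
    """Replay records segment by segment and report why replay stopped.

    archived / local map a segment number to the list of records in it.
    The archived copy of a segment is always preferred; a local copy is
    only used for segments the archive does not have.
    """
    applied = []
    segno = start_segment
    while True:
        records = archived.get(segno)
        if records is None:
            records = local.get(segno)
        if records is None:
            return applied, "out of xlogs"
        for rec in records:
            applied.append(rec)
            if target is not None and rec == target:
                return applied, "target reached"
        segno += 1


if __name__ == "__main__":
    archived = {1: ["a", "b"], 2: ["c"]}
    local = {2: ["stale copy"], 3: ["d", "e"]}   # segment 3 was never archived
    print(recover(1, archived, local))
    print(recover(1, archived, local, target="d"))
```

The stop here is inclusive of the target record; the thread later discusses inclusive versus exclusive stops for timestamp targets.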
Re: [HACKERS] Point in Time Recovery
PITR Patch v5_1 just posted has Point in Time Recovery working. Still some rough edges... but we really need some testers now to give this a try and let me know what you think.

Klaus Naumann and Mark Wong are the only [non-committers] to have tried to run the code (and let me know about it), so please have a look at [PATCHES] and try it out.

Many thanks, Simon Riggs
Re: [HACKERS] Point in Time Recovery
Simon Riggs wrote:
> On Tue, 2004-07-13 at 22:19, Tom Lane wrote:
> > To have a consistent recovery at all, you must replay the log starting from a checkpoint before the backup began and extending to the time that the backup finished. You only get to decide where to stop after that point.
> So the situation is:
> - You must only stop recovery at a point in time (in the logs) after the backup had completed.
> No way to enforce that currently, apart from procedurally. Not exactly frequent, so I think I just document that and move on, eh?

If it happens, could you use your previous full backup and the PITR logs from before stop stopped logging, and then after? Is there a period where they could not restore reliably?

--
Bruce Momjian | http://candle.pha.pa.us
[EMAIL PROTECTED] | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Re: [HACKERS] Point in Time Recovery
On Wed, 2004-07-14 at 00:01, Bruce Momjian wrote:
> Tom Lane wrote:
> > Simon Riggs [EMAIL PROTECTED] writes:
> > > So the situation is:
> > > - You must only stop recovery at a point in time (in the logs) after the backup had completed.
> > Right.
> > > No way to enforce that currently, apart from procedurally. Not exactly frequent, so I think I just document that and move on, eh?
> > The procedure that generates a backup has got to be responsible for recording both the start and stop times. If it does not do so then it's fatally flawed. (Note also that you had better be careful to get the time as seen on the server machine's clock ... this could be a nasty gotcha if the backup is run on a different machine, such as an NFS server.)
> OK, but procedurally, how do you correlate the start/stop time of the tar backup with the WAL numeric file names?

No need. You just correlate the recovery target with the backup file times. Mostly, you'll only ever use your last backup and won't need to fuss with the times.

Backup should begin with a CHECKPOINT...then wait for that to complete, just to make the backup as current as possible.

If you want to start purging your archives of old archived xlogs, you can use the filedate (assuming you preserved that on your copy to archive - but even if not, they'll be fairly close).

Best regards, Simon Riggs
Re: [HACKERS] Point in Time Recovery
Can you give us some suggestions of what kind of stuff to test? Is there a way we can artificially kill the backend in all sorts of nasty spots to see if recovery works? Does kill -9 simulate a 'power off'?

Chris

Simon Riggs wrote:
> PITR Patch v5_1 just posted has Point in Time Recovery working. Still some rough edges... but we really need some testers now to give this a try and let me know what you think.
> Klaus Naumann and Mark Wong are the only [non-committers] to have tried to run the code (and let me know about it), so please have a look at [PATCHES] and try it out.
> Many thanks, Simon Riggs
Re: [HACKERS] Point in Time Recovery
On 7/6/2004 3:58 PM, Simon Riggs wrote:
> On Tue, 2004-07-06 at 08:38, Zeugswetter Andreas SB SD wrote:
> > > - by time - but the time stamp on each xlog record only specifies to the second, which could easily be 10 or more commits (we hope)
> > > Should we use a different datatype than time_t for the commit timestamp, one that offers more fine grained differentiation between checkpoints?
> > Imho seconds is really sufficient. If you know a more precise position you will probably know it from backend log or an xlog sniffer. With those you can easily use the TransactionId way.

TransactionID and timestamp is only sufficient if the transactions are selected by their commit order. Especially in read committed mode, consider this execution:

    xid-1: start
    xid-2: start
    xid-2: update field x
    xid-2: commit
    xid-1: update field y
    xid-1: commit

In this case, the update done by xid-1 depends on the row created by xid-2. So logically xid-2 precedes xid-1, because it made its changes earlier. So you have to apply the log until you find the commit record of the transaction you want to apply last, and then stamp all transactions that were in progress at that time as aborted.

Jan

> OK, thanks. I'll just leave the time_t datatype just the way it is.
> Best Regards, Simon Riggs

--
# It's easier to get forgiveness for being wrong than for being right.
# Let's break this rule - forgive me.
# [EMAIL PROTECTED]
Re: [HACKERS] Point in Time Recovery
On Sat, 2004-07-10 at 15:17, Jan Wieck wrote:

On 7/6/2004 3:58 PM, Simon Riggs wrote:

On Tue, 2004-07-06 at 08:38, Zeugswetter Andreas SB SD wrote:

- by time - but the time stamp on each xlog record only specifies to the second, which could easily be 10 or more commits (we hope) Should we use a different datatype than time_t for the commit timestamp, one that offers more fine grained differentiation between checkpoints?

Imho seconds is really sufficient. If you know a more precise position you will probably know it from backend log or an xlog sniffer. With those you can easily use the TransactionId way.

TransactionID and timestamp is only sufficient if the transactions are selected by their commit order. Especially in read committed mode, consider this execution:

xid-1: start
xid-2: start
xid-2: update field x
xid-2: commit
xid-1: update field y
xid-1: commit

In this case, the update done by xid-1 depends on the row created by xid-2. So logically xid-2 precedes xid-1, because it made its changes earlier. So you have to apply the log until you find the commit record of the transaction you want to apply last, and then stamp all transactions that were in progress at that time as aborted.

Agreed. I've implemented this exactly as you say. This turns out to be very easy because:

- when looking where to stop we only ever stop at commit or aborts
- these are the only records that have timestamps anyway...
- any record that isn't specifically committed is not updated in the clog and therefore not visible.

The clog starts in indeterminate state, 0 and is then updated to either committed or aborted. Aborted and indeterminate are handled similarly in the current code, to allow for crash recovery - PITR doesn't change anything there. So, PITR doesn't do anything that crash recovery doesn't already do. Crash recovery makes no attempt to keep track of in-progress transactions and doesn't make a special journey to the clog to specifically mark them as aborted - they just are by default.

So, what we mean by stop at a transactionId is stop applying redo at the commit/abort record for that transactionId. It has to be an exact match, not a greater than, for exactly the reason you mention. That means that although we stop at the commit record of transactionId X, we may also have applied records for transactions with later transactionIds e.g. X+1, X+2...etc (without limit or restriction).

(I'll even admit that at first, I did think we could get away with the less than test that you are warning me about. Overall, I've spent more time on theory/analysis than on coding, on the idea that you can improve poor code, but wrong code just needs to be thrown away).

Timestamps are more vague... When time is used, there might easily be 10+ transactions whose commit/abort records have identical timestamp values. So we either stop at the first or last record depending upon whether we specified inclusive or exclusive on the recovery target value.

The hard bit, IMHO, is what we do with the part of the log that we have chosen not to apply, which has been discussed on list in detail also.

Thanks for keeping an eye out for possible errors - this one is something I'd thought through and catered for (there are comments in my current latest published code to that effect, although I have not yet finished coding the clean-up-after-stopping part). This implies nothing with regard to other possible errors or oversights and so I very much welcome any questioning of this nature - I am as prone to error as the next man. It's important we get this right.

Best regards, Simon Riggs
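The stopping rule described in this message can be illustrated with a small sketch. This is not the patch's code: the `XLogRecord` class, the `replay` function and the `inclusive` flag are hypothetical names chosen for illustration, and the record layout is heavily simplified. The sketch assumes that only commit/abort records carry a timestamp, that an xid target must match exactly, and that any transaction without a replayed commit record simply stays uncommitted.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set, Tuple


@dataclass(frozen=True)
class XLogRecord:
    """Hypothetical, simplified stand-in for an xlog record."""
    xid: int
    kind: str                        # "update", "commit" or "abort"
    timestamp: Optional[int] = None  # whole seconds; only commit/abort carry one


def replay(records: Iterable[XLogRecord],
           target_xid: Optional[int] = None,
           target_time: Optional[int] = None,
           inclusive: bool = True) -> Tuple[List[XLogRecord], Set[int]]:
    """Replay records in log order and return (applied records, committed xids).

    Stopping is only ever considered at commit/abort records. Transactions
    whose commit record was not replayed are never added to the committed
    set, so they behave as aborted without any extra bookkeeping.
    """
    applied: List[XLogRecord] = []
    committed: Set[int] = set()
    for rec in records:
        is_end = rec.kind in ("commit", "abort")
        if is_end and target_time is not None:
            # inclusive: replay every record stamped with the target second;
            # exclusive: stop before the first record stamped with it.
            past_target = (rec.timestamp > target_time if inclusive
                           else rec.timestamp >= target_time)
            if past_target:
                break
        applied.append(rec)
        if rec.kind == "commit":
            committed.add(rec.xid)
        # Exact match on the xid, not "greater than".
        if is_end and target_xid is not None and rec.xid == target_xid:
            break
    return applied, committed
```

Fed Jan's interleaving (xid-2 commits before xid-1), stopping at xid 2 leaves xid 1 uncommitted, while stopping at xid 1 also replays xid 2, which is the "may also have applied records for later transactionIds" case in reverse.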
Re: [HACKERS] Point in Time Recovery
On Tue, 6 Jul 2004, Zeugswetter Andreas SB SD wrote: Should we use a different datatype than time_t for the commit timestamp, one that offers more fine grained differentiation between checkpoints? Imho seconds is really sufficient. If you know a more precise position you will probably know it from backend log or an xlog sniffer. With those you can easily use the TransactionId way. I'd also think that seconds are absolutely sufficient. From my daily experience I can say that you're normally lucky to know the time on one minute basis. If you need to get closer, a log sniffer is unavoidable ... Greetings, Klaus -- Klaus Naumann | (http://www.mgnet.de/) (Germany)
Re: [HACKERS] Point in Time Recovery
On Thu, 8 Jul 2004, Simon Riggs wrote:

We don't need to mention timelines in the docs, nor do we need to alter pg_controldata to display it...just a comment in the code to explain why we add a large number to the LogId after each recovery completes.

I'd disagree on that. Knowing what exactly happens when recovering the database is a must. It makes people feel more safe about the process. This is because the software doesn't do things you wouldn't expect. On Oracle e.g. you create a new database incarnation when you recover a database (incomplete). Which means your System Change Number and your Log Sequence is reset to 0. Knowing this is crucial because you otherwise would wonder "Why is it doing that? Is this a bug or a feature?"

Just my 2ct :-))

Greetings, Klaus -- Klaus Naumann | (http://www.mgnet.de/) (Germany)
Re: [HACKERS] Point in Time Recovery
On Thu, 2004-07-08 at 07:57, [EMAIL PROTECTED] wrote:

On Thu, 8 Jul 2004, Simon Riggs wrote:

We don't need to mention timelines in the docs, nor do we need to alter pg_controldata to display it...just a comment in the code to explain why we add a large number to the LogId after each recovery completes.

I'd disagree on that. Knowing what exactly happens when recovering the database is a must. It makes people feel more safe about the process. This is because the software doesn't do things you wouldn't expect. On Oracle e.g. you create a new database incarnation when you recover a database (incomplete). Which means your System Change Number and your Log Sequence is reset to 0. Knowing this is crucial because you otherwise would wonder "Why is it doing that? Is this a bug or a feature?"

OK, will do.

Best regards, Simon Riggs
Re: [HACKERS] Point in Time Recovery
Well, Tom does seem to have something with regard to StartUpIds. I feel it is easier to force a new timeline by adding a very large number to the LogId IF, and only if, we have performed an archive recovery. That way, we do not change at all the behaviour of the system for people that choose not to implement archive_mode.

Imho you should take a close look at StartUpId, I think it is exactly this large number. Maybe you can add +2 to intentionally leave a hole. Once you increment, I think it is very essential to checkpoint and double check pg_control, because otherwise a crash recovery would read the wrong xlogs.

Should we implement timelines?

Yes :-)

Andreas
Re: [HACKERS] Point in Time Recovery
On Wed, 2004-07-07 at 14:17, Zeugswetter Andreas SB SD wrote:

Well, Tom does seem to have something with regard to StartUpIds. I feel it is easier to force a new timeline by adding a very large number to the LogId IF, and only if, we have performed an archive recovery. That way, we do not change at all the behaviour of the system for people that choose not to implement archive_mode.

Imho you should take a close look at StartUpId, I think it is exactly this large number. Maybe you can add +2 to intentionally leave a hole. Once you increment, I think it is very essential to checkpoint and double check pg_control, because otherwise a crash recovery would read the wrong xlogs.

Thanks for your thoughts - you have made me rethink this over some hours. Trouble is, on this occasion, the other suggestion still seems the best one, IMVHO.

If we number timelines based upon StartUpId, then there is still some potential for conflict and this is what we're seeking to avoid. Simply adding to the LogId puts the new timeline so far into the previous timeline's future that there aren't any problems. We only increment the timeline when we recover, so we're not eating up the numbers quickly. Simply adding a number means that there isn't any conflict with any normal operations. The timelines aren't numbered separately, so I'm not introducing a new version of StartUpID...technically there isn't a new timeline, just a chunk of geological time between them.

We don't need to mention timelines in the docs, nor do we need to alter pg_controldata to display it...just a comment in the code to explain why we add a large number to the LogId after each recovery completes.

Best regards, Simon Riggs
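As a rough illustration of the "add a large number to the LogId" idea, here is a minimal sketch. Everything concrete in it is an assumption made for the example: the gap value `TIMELINE_GAP`, the helper names `next_logid` and `segment_name`, the treatment of LogId as an unsigned 32-bit counter, and the 16-hex-digit file name layout. None of these are taken from the patch under discussion.

```python
TIMELINE_GAP = 0x01000000  # hypothetical "very large number"; not from the thread
MAX_LOGID = 0xFFFFFFFF     # assumes LogId is an unsigned 32-bit counter


def next_logid(current_logid: int, did_archive_recovery: bool,
               gap: int = TIMELINE_GAP) -> int:
    """Return the LogId to continue with once startup has finished.

    The jump happens only after an archive recovery, so systems that never
    use archiving keep their existing numbering unchanged.
    """
    if not did_archive_recovery:
        return current_logid
    new_logid = current_logid + gap
    if new_logid > MAX_LOGID:
        raise OverflowError("LogId would wrap around")
    return new_logid


def segment_name(logid: int, seg: int) -> str:
    """Assumed file name layout: two zero-padded 8-digit hex fields."""
    return f"{logid:08X}{seg:08X}"
```

Because the fields are zero-padded, segments written after the jump sort after every segment of the abandoned history, which is the "so far into the previous timeline's future" property the message relies on. The sketch also makes the cost visible: each recovery consumes `gap` LogIds from a finite range.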
Re: [HACKERS] Point in Time Recovery
- by time - but the time stamp on each xlog record only specifies to the second, which could easily be 10 or more commits (we hope) Should we use a different datatype than time_t for the commit timestamp, one that offers more fine grained differentiation between checkpoints?

Imho seconds is really sufficient. If you know a more precise position you will probably know it from backend log or an xlog sniffer. With those you can easily use the TransactionId way.

- when we stop, keep reading records until EOF, just don't apply them. When we write a checkpoint at end of recovery, the unapplied transactions are buried alive, never to return.
- stop where we stop, then force zeros to EOF, so that no possible record remains of previous transactions.

I'm tempted by the first plan, because it is more straightforward and stands much less chance of me introducing 50 weird bugs just before close.

But what if you restore because after that PIT everything went haywire including the log? Did you mean to apply the remaining changes but not commit those xids? I think it is essential to only leave xlogs around that allow a subsequent rollforward from the same old full backup. Or is an instant new full backup required after restore?

Andreas
Re: [HACKERS] Point in Time Recovery
On Tue, 2004-07-06 at 08:38, Zeugswetter Andreas SB SD wrote: - by time - but the time stamp on each xlog record only specifies to the second, which could easily be 10 or more commits (we hope) Should we use a different datatype than time_t for the commit timestamp, one that offers more fine grained differentiation between checkpoints? Imho seconds is really sufficient. If you know a more precise position you will probably know it from backend log or an xlog sniffer. With those you can easily use the TransactionId way. OK, thanks. I'll just leave the time_t datatype just the way it is. Best Regards, Simon Riggs
Re: [HACKERS] Point in Time Recovery
On Mon, 2004-07-05 at 22:46, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: - when we stop, keep reading records until EOF, just don't apply them. When we write a checkpoint at end of recovery, the unapplied transactions are buried alive, never to return. - stop where we stop, then force zeros to EOF, so that no possible record remains of previous transactions. Go with plan B; it's best not to destroy data (what if you chose the wrong restart point the first time)? Actually this now reminds me of a discussion I had with Patrick Macdonald some time ago. The DB2 practice in this connection is that you *never* overwrite existing logfile data when recovering. Instead you start a brand new xlog segment file, which is given a new branch number so it can be distinguished from the future-time xlog segments that you chose not to apply. I don't recall what the DB2 terminology was exactly --- not branch number I don't think --- but anyway the idea is that when you restart the database after an incomplete recovery, you are now in a sort of parallel universe that has its own history after the branch point (PITR stop point). You need to be able to distinguish archived log segments of this parallel universe from those of previous and subsequent incarnations. I'm not sure whether Vadim intended our StartUpID to serve this purpose, but it could perhaps be used that way, if we reflected it in the WAL file names. Some more thoughts...focusing on the what do we do after we've finished recovering. The objectives, as I see them, are to put the system into a state, that preserves these features: 1. we never overwrite files, in case we want to re-run recovery 2. we never write files that MIGHT have been written previously 3. we need to ensure that any xlog records skipped at admins request (in PITR mode) are never in a position to be re-applied to this timeline. 4. ensure we can re-recover, if we need to, without further problems Tom's concept above, I'm going to call timelines. 
A timeline is the sequence of logs created by the execution of a server. If you recover the database, you create a new timeline. [This is because, if you've invoked PITR you absolutely definitely want log records written to, say, xlog15 to be different to those that were written to xlog15 in a previous timeline that you have chosen not to reapply.]

Objective (1) is complex. When we are restoring, we always start with archived copies of the xlog, to make sure we don't finish too soon. We roll forward until we either reach PITR stop point, or we hit end of archived logs. If we hit end of logs on archive, then we switch to a local copy, if one exists that is higher than those, we carry on rolling forward until either we reach PITR stop point, or we hit end of that log. (Hopefully, there isn't more than one local xlog higher than the archive, but it's possible).

If we are rolling forward on local copies, then they are our only copies. We'd really like to archive them ASAP, but the archiver's not running yet - we don't want to force that situation in case the archive device (say a tape) is the one being used to recover right now. So we write an archive_status of .ready for that file, ensuring that the checkpoint won't remove it until it gets copied to archive, whenever that starts working again. Objective (1) met.

When we have finished recovering we:

- create a new xlog at the start of a new ++timeline
- copy the last applied xlog record to it as the first record
- set the record pointer so that it matches

That way, when we come up and begin running, we never overwrite files that might have been written previously. Objective (2) met. We do the other stuff because recovery finishes up by pointing to the last applied record...which is what was causing all of this extra work in the first place.

At this point, we also reset the secondary checkpoint record, so that should recovery be required again before next checkpoint AND the shutdown checkpoint record written after recovery completes is wrong/damaged, the recovery will not autorewind back past the PITR stop point and attempt to recover the records we have just tried so hard to reverse/ignore. Objective (3) met. (Clearly, that situation seems unlikely, but I feel we must deal with it...a newly restored system is actually very fragile, so a crash again within 3 minutes or so is very commonplace, as far as these things go).

Should we need to re-recover, we can do so because the new timeline xlogs are further forward than the old timeline, so never get seen by any processes (all of which look backwards). Re-recovery is possible without problems, if required. This means you're a lot safer from some of the mistakes you might have made, such as deciding you need to go into recovery, then realising it wasn't required (or some other painful flapping as goes on in computer rooms at 3am).

How do we implement timelines? The main presumption in the code is that xlogs are
Re: [HACKERS] Point in Time Recovery
On Tue, 2004-07-06 at 20:00, Richard Huxton wrote: Simon Riggs wrote: On Mon, 2004-07-05 at 22:46, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: Should we use a different datatype than time_t for the commit timestamp, one that offers more fine grained differentiation between checkpoints? Pretty much everybody supports gettimeofday() (time_t and separate integer microseconds); you might as well use that. Note that the actual resolution is not necessarily microseconds, and it'd still not be certain that successive commits have distinct timestamps --- so maybe this refinement would be pointless. You'll still have to design a user interface that allows selection without the assumption of distinct timestamps. Well, I agree, though without the desired-for UI now, I think some finer grained mechanism would be good. This means extending the xlog commit record by a couple of bytes...OK, lets live a little. At the risk of irritating people, I'll repeat what I suggested a few weeks ago... All feedback is good. Thanks. Add a table: pg_pitr_checkpt (pitr_id SERIAL, pitr_ts timestamptz, pitr_comment text) Let the user insert rows in transactions as desired. Let them stop the restore when a specific (pitr_ts,pitr_comment) gets inserted (or on pitr_id if they record it). It's a good plan, but the recovery is currently offline recovery and no SQL is possible. So no way to insert, no way to access tables until recovery completes. I like that plan and probably would have used it if it was viable. IMHO time is seldom relevant, event boundaries are. Agreed, but time is the universally agreed way of describing two events as being simultaneous. No other way to say recover to the point when the message queue went wild. As of last post to Andreas, I've said I'll not bother changing the granularity of the timestamp. If you want to add special syntax for this, fine. If not, an INSERT statement is a convenient way to do this anyway. 
The special syntax isn't hugely important - I did suggest a kind of SQL-like syntax previously, but that's gone now. Invoking recovery via a command file IS, so we are able to tell the system it's not in crash recovery AND that when you've finished I want you to respond to crashes without re-entering archive recovery.

Thanks for your comments. I'm not making this more complex than needs be; in fact much of the code is very simple - it's just the planning that's complex.

Best regards, Simon Riggs
[HACKERS] Point in Time Recovery
Taking advantage of the freeze bubble allowed us... there are some last minute features to add. Summarising earlier thoughts, with some detailed digging and design from myself in last few days - we're now in a position to add Point-in-Time Recovery, on top of what's been achieved.

The target for the last record to recover to can be specified in 2 ways:

- by transactionId - not that useful, unless you have a means of identifying what has happened from the log, then using that info to specify how to recover - coming later - not in next few days :(
- by time - but the time stamp on each xlog record only specifies to the second, which could easily be 10 or more commits (we hope)

Should we use a different datatype than time_t for the commit timestamp, one that offers more fine grained differentiation between checkpoints? If we did, would that be portable? Suggestions welcome, because I know very little of the details of various *nix systems and win* on that topic.

Only COMMIT and ABORT records have timestamps, allowing us to circumvent any discussion about partial transaction recovery and nested transactions.

When we do recover, stopping at the timestamp is just half the battle. We need to leave the xlog in which we stop in a state from which we can enter production smoothly and cleanly. To do this, we could:

- when we stop, keep reading records until EOF, just don't apply them. When we write a checkpoint at end of recovery, the unapplied transactions are buried alive, never to return.
- stop where we stop, then force zeros to EOF, so that no possible record remains of previous transactions.

I'm tempted by the first plan, because it is more straightforward and stands much less chance of me introducing 50 weird bugs just before close.

Also, I think it is straightforward to introduce control file duplexing, with a second copy stored and maintained in the pg_xlog directory. This would provide additional protection for pg_control, which takes on more importance now that archive recovery is working. pg_xlog is a natural home, since on busy systems it's on its own disk away from everything else, ensuring that at least one copy survives. I can't see a downside to that, but others might... We can introduce user specifiable duplexing, in later releases.

For later, I envisage an off-line utility that can be used to inspect xlog records. This could provide a number of features:

- validate archived xlogs, to check they are sound.
- produce summary reports, to allow identification of transactionIds and the effects of particular transactions
- performance analysis to allow decisions to be made about whether group commit features could be utilised to good effect

(Not now...)

Best regards, Simon Riggs
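For readers weighing the two clean-up options in the message above, the second one ("force zeros to EOF") amounts to something like the following sketch. It works on an in-memory buffer rather than a real segment file, and `zero_tail` and `stop_offset` are hypothetical names used only for this illustration.

```python
def zero_tail(segment: bytearray, stop_offset: int) -> None:
    """Overwrite everything from stop_offset to the end of the buffer with zeros.

    The buffer keeps its length; only the bytes after the stop point change.
    """
    if not 0 <= stop_offset <= len(segment):
        raise ValueError("stop_offset lies outside the segment")
    segment[stop_offset:] = bytes(len(segment) - stop_offset)
```

The sketch makes the trade-off easy to see: once the tail is zeroed, the records after the stop point are gone from that copy of the segment, so a second attempt with a later stop point would need another copy of the same log.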