subject:"\[HACKERS\] Point in Time Recovery"

Re: [HACKERS] Point in Time Recovery

2004-07-30 Thread Zeugswetter Andreas SB SD


  I was wondering about this point - might it not be just as reasonable 
  for the copied file to *be* an exact image of pg_control?  Then a very 
  simple variant of pg_controldata (or maybe even just adding switches to 
  pg_controldata itself) would enable the relevant info to be extracted
 
 We didn't do that so admins could easily read the file contents.

If you use a readable file you will also need a feature for restore (or a tool) 
to create an appropriate pg_control file, or are you intending to still require
that pg_control be the first file backed up. 
Another possibility would be that the start function writes the readable file and
also copies pg_control.

Andreas

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings

Re: [HACKERS] Point in Time Recovery

2004-07-30 Thread Bruce Momjian

Zeugswetter Andreas SB SD wrote:
 
   I was wondering about this point - might it not be just as reasonable 
   for the copied file to *be* an exact image of pg_control?  Then a very 
   simple variant of pg_controldata (or maybe even just adding switches to 
   pg_controldata itself) would enable the relevant info to be extracted
  
  We didn't do that so admins could easily read the file contents.
 
 If you use a readable file you will also need a feature for restore (or a tool) 
 to create an appropriate pg_control file, or are you intending to still require
 that pg_control be the first file backed up. 
 Another possibility would be that the start function writes the readable file and
 also copies pg_control.

We will back up pg_control in the tar file but it doesn't have to have
all correct information. The WAL replay will set it properly I think. 
In fact it has to start recovery checkpoint settings, not the backup
setting at all.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster

Re: [HACKERS] Point in Time Recovery

2004-07-30 Thread Tom Lane

Zeugswetter Andreas SB SD [EMAIL PROTECTED] writes:
 If you use a readable file you will also need a feature for restore
 (or a tool) to create an appropriate pg_control file, or are you
 intending to still require that pg_control be the first file backed
 up.

No, the entire point of this exercise is to get rid of that assumption.
You do need *a* copy of pg_control, but the only reason you'd need to
back it up first rather than later is so that its checkpoint pointer
points to the last checkpoint before the dump starts.  Which is the
information we want to put in the archive-label file insted.

If a copy of pg_control were sufficient then I'd be all for using it as
the archive-label file, but it's *not* sufficient because you also need
the ending WAL offset.  So we need a different file layout in any case,
and we may as well take some pity on the poor DBA and make the file
easily human-readable.

regards, tom lane

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly

Re: [HACKERS] Point in Time Recovery

2004-07-30 Thread Mark Kirkwood

Ok - that is a much better way of doing it!
regards
Mark
Tom Lane wrote:
Zeugswetter Andreas SB SD [EMAIL PROTECTED] writes:
 

If you use a readable file you will also need a feature for restore
(or a tool) to create an appropriate pg_control file, or are you
intending to still require that pg_control be the first file backed
up.
   

No, the entire point of this exercise is to get rid of that assumption.
You do need *a* copy of pg_control, but the only reason you'd need to
back it up first rather than later is so that its checkpoint pointer
points to the last checkpoint before the dump starts.  Which is the
information we want to put in the archive-label file insted.
If a copy of pg_control were sufficient then I'd be all for using it as
the archive-label file, but it's *not* sufficient because you also need
the ending WAL offset.  So we need a different file layout in any case,
and we may as well take some pity on the poor DBA and make the file
easily human-readable.
regards, tom lane
---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
 subscribe-nomail command to [EMAIL PROTECTED] so that your
 message can get through to the mailing list cleanly
 

---(end of broadcast)---
TIP 8: explain analyze is your friend

Re: [HACKERS] Point in Time Recovery

2004-07-29 Thread Mark Kirkwood

I was wondering about this point - might it not be just as reasonable 
for the copied file to *be* an exact image of pg_control?  Then a very 
simple variant of pg_controldata (or maybe even just adding switches to 
pg_controldata itself) would enable the relevant info to be extracted

P.s : would love to be that volunteer - however up to the eyeballs in 
Business Objects (cringe) and Db2 for the next week or so

regards
Mark
Bruce Momjian wrote:
We need someone to code two backend functions to complete PITR. 

snippage
 

However, once you decide to do things like that, there is no reason why
the copied file has to be an exact image of pg_control.  I claim it
would be more useful if the copied file were plain text so that you
could just cat it to find out the starting WAL position; that would
let you determine without any special tools what range of WAL archive
files you are going to need to bring back from your archives.
   


---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
 joining column's datatypes do not match

Re: [HACKERS] Point in Time Recovery

2004-07-29 Thread Bruce Momjian

Mark Kirkwood wrote:
 I was wondering about this point - might it not be just as reasonable 
 for the copied file to *be* an exact image of pg_control?  Then a very 
 simple variant of pg_controldata (or maybe even just adding switches to 
 pg_controldata itself) would enable the relevant info to be extracted

We didn't do that so admins could easily read the file contents.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 8: explain analyze is your friend

Re: [HACKERS] Point in Time Recovery

2004-07-29 Thread markir

Quoting Bruce Momjian [EMAIL PROTECTED]:

 Mark Kirkwood wrote:
  I was wondering about this point - might it not be just as reasonable
  for the copied file to *be* an exact image of pg_control?  Then a very
  simple variant of pg_controldata (or maybe even just adding switches to
  pg_controldata itself) would enable the relevant info to be extracted

 We didn't do that so admins could easily read the file contents.

Ease of reading is a good thing, no argument there.

However using 'pg_controldata' (or similar) to perform the read is not really
that much harder than using 'cat' - (it is a wee bit harder, I grant you)

When I posted the original mail I was thinking that the pg_control image is good
because it has much more information than just the last wal offset, and could
be used to perform a recovery in the advent of the actual pg_control being
unsuitable (e.g. backed up last instead of first on a busy system).

Of couse this thinking didn't make it into the original mail, sorry about that!

regards

Mark



---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org

Re: [HACKERS] Point in Time Recovery

2004-07-28 Thread Bruce Momjian


We need someone to code two backend functions to complete PITR.  The
function would be called at start/stop of backup of the data directory.
The functions would be checked during restore to make sure the requested
xid is not between the start/stop xids of the backup.  They would also
contain timestamps so the admin can easily review the archive directory.

The start needs to call checkpoint and create file in the data directory
that contains a few server parameters.  At backup stop the function
needs to move the file to pg_xlog and set the *.ready archive flag so it
is archived.

As for checking during recover, the file needs to be retrieved and
checked to see the xid recovery is valid.  Tom and I can help you with
that detail.

DOn't worry about all the details of the email below.  It is just a
general summary.  We can give you details once you volunteer.

---

Bruce Momjian wrote:
 
 OK, I think I have some solid ideas and reasons for them.
 
 First, I think we need server-side functions to call when we start/stop
 the backup.  The advantage of these server-side functions is that they
 will do the required work of recording the pg_control values and
 creating needed files with little chance for user error.  It also allows
 us to change the internal operations in later releases without requiring
 admins to change their procedures.  We are even able to adjust the
 internal operation in minor releases without forcing a new procedure on
 users.
 
 Second, I think once we start a restore, we should rename recovery.conf
 to recovery.in_progress, and when complete rename that to
 recovery.done.  If the postmaster starts and sees recovery.in_progress,
 it will fail to start knowing its recovery was interrupted.  This allows
 the admin to take appropriate action.  (I am not sure what that action
 would be. Does he bring back the backup files or just keep going?)
 
 Third, I think we need to put a file in the archive location once we
 complete a backup, recording the start/stop xid and wal/offsets.  This
 gives the admin documentation on what archive logs to keep and what xids
 are available for recovery.  Ideally the recover program would read that
 file and check the recover xid to make sure it is after the stop xid
 recorded in the file.
 
 How would the recover program know the name of that file?  We need to
 create it in /data with start contents before the backup, then complete
 it with end contents and archive it.
 
 What should we name it?  Ideally it would be named by the WAL
 name/offset of the start so it orders in the proper spot in the archive
 file listing, e.g.:
 
   093a
   093b
   093b.032b9.start
   093c
 
 Are people going to know they need 093b for
 093b.032b9.start?  I hope so.  Another idea is to do:
 
 
   093a.xlog
   093b.032b9.start
   093b.xlog
   093c.xlog
 
 This would order properly.  It might be a very good idea to add
 extensions to these log files now that we are archiving them in strange
 places.  In fact, maybe we should use *.pg_xlog to document the
 directory they came from.
 
 ---
 
 
 Simon Riggs wrote:
  On Fri, 2004-07-16 at 16:47, Tom Lane wrote:
   As far as the business about copying pg_control first goes: there is
   another way to think about it, which is to copy pg_control to another
   place that will be included in your backup.  For example the standard
   backup procedure could be
   
   1. [somewhat optional] Issue CHECKPOINT and wait till it finishes.
   
   2. cp $PGDATA/global/pg_control $PGDATA/pg_control.dump
   
   3. tar cf /dev/mt $PGDATA
   
   4. do something to record ending WAL position
   
   If we standardized on this way, then the tar archive would automatically
   contain the pre-backup checkpoint position in ./pg_control.dump, and
   there is no need for any special assumptions about the order in which
   tar processes things.
   
  
  Sounds good. That would be familiar to Oracle DBAs doing BACKUP
  CONTROLFILE. We can document that and off it as a suggested procedure.
  
   However, once you decide to do things like that, there is no reason why
   the copied file has to be an exact image of pg_control.  I claim it
   would be more useful if the copied file were plain text so that you
   could just cat it to find out the starting WAL position; that would
   let you determine without any special tools what range of WAL archive
   files you are going to need to bring back from your archives.
  
  I wouldn't be in favour of a manual mechanism. If you want an automated
  mechanism, whats wrong with using the one thats already there? You can
  use pg_controldata to read the controlfile, again whats wrong with that?
  
  We agreed some time back that an

Re: [HACKERS] Point in Time Recovery

2004-07-28 Thread Bruce Momjian


Oh, here is something else we need to add --- a GUC to control whether
pg_xlog is clean on recovery start.

---

Tom Lane wrote:
 Bruce and I had another phone chat about the problems that can ensue
 if you restore a tar backup that contains old (incompletely filled)
 versions of WAL segment files.  While the current code will ignore them
 during the recovery-from-archive run, leaving them laying around seems
 awfully dangerous.  One nasty possibility is that the archiving
 mechanism will pick up these files and overwrite good copies in the
 archive area with the obsolete ones from the backup :-(.
 
 Bruce earlier proposed that we simply rm pg_xlog/* at the start of
 a recovery-from-archive run, but as I said I'm scared to death of code
 that does such a thing automatically.  In particular this would make it
 impossible to handle scenarios where you want to do a PITR recovery but
 you need to use some recent WAL segments that didn't make it into your
 archive yet.  (Maybe you could get around this by forcibly transferring
 such segments into the archive, but that seems like a bad idea for
 incomplete segments.)
 
 It would really be best for the DBA to make sure that the starting
 condition for the recovery run does not have any obsolete segment files
 in pg_xlog.  He could do this either by setting up his backup policy so
 that pg_xlog isn't included in the tar backup in the first place, or by
 manually removing the included files just after restoring the backup,
 before he tries to start the recovery run.
 
 Of course the objection to that is what if the DBA forgets to do it?
 
 The idea that we came to on the phone was for the postmaster, when it
 enters recovery mode because a recovery.conf file exists, to look in
 pg_xlog for existing segment files and refuse to start if any are there
 --- *unless* the user has put a special, non-default overriding flag
 into recovery.conf.  Call it use_unarchived_files or something like
 that.  We'd have to provide good documentation and an extensive HINT of
 course, but basically the DBA would have two choices when he gets this
 refusal to start:
 
 1. Remove all the segment files in pg_xlog.  (This would be the right
 thing to do if he knows they all came off the backup.)
 
 2. Verify that pg_xlog contains only segment files that are newer than
 what's stored in the WAL archive, and then set the override flag in
 recovery.conf.  In this case the DBA is taking responsibility for
 leaving only segment files that are good to use.
 
 One interesting point is that with such a policy, we could use locally
 available WAL segments in preference to pulling the same segments from
 archive, which would be at least marginally more efficient, and seems
 logically cleaner anyway.
 
 In particular it seems that this would be a useful arrangement in cases
 where you have questionable WAL segments --- you're not sure if they're
 good or not.  Rather than having to push questionable data into your WAL
 archive, you can leave it local, try a recovery run, and see if you like
 the resulting state.  If not, it's a lot easier to do-over when you have
 not corrupted your archive area.
 
 Comments?  Better ideas?
 
   regards, tom lane
 

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings

Re: [ADMIN] [HACKERS] Point in Time Recovery

2004-07-28 Thread Bruce Momjian


[ Sorry, sent to hackers now.]

Here is another open PITR issue that I think will have to be addressed
in 7.6.  If you do a critical transaction, but do nothing else for eight
hours, that critical transaction hasn't been archived yet.  It is still
sitting in pg_xlog until the WAL file fills.

I think we will need to document this behavior and address it in some
way in 7.6.  We can't assume that we can send multiple copies of pg_xlog
to the archive (partial and full ones) because we might be going to a
tape drive.  However, this is a non-intuitive behavior of our archiver.
We might need to tell people to copy the most recent WAL file every
minute to some other location or something.

---

Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  I think we should push the partially complete WAL file to the archive
  location before shutdown. ...
  When you are running and finally fill up the WAL file it would then
  overwrite the one in the archive but I think that is OK.
 
 I don't think this can fly at all.  Here are some off-the-top-of-the-head
 objections:
 
 1. We don't have the luxury of spending indefinite amounts of time to
 do a database shutdown.  Commonly we are under a twenty-second sentence
 of death from init.  I don't want to spend the 20 seconds waiting to see
 if the archiver will manage to push 16MB onto a slow tape drive.  Also,
 if the archiver does fail to push the data in time, it'll likely leave a
 broken (partial) xlog file in the archive, which would be really bad
 news if the user then relies on that.
 
 2. What if the archiver process entirely fails to push the file?  (Maybe
 there's not enough disk space, for instance.)  In normal operation we'll
 just retry every so often.  We definitely can't do that during shutdown.
 
 3. You're blithely assuming that the archival process can easily provide
 overwrite semantics for multiple pushes of the same xlog filename.  Stop
 thinking about cp to some directory and start thinking dump to tape
 or burn onto CD or something like that.  We'll be raising the ante
 considerably if we require the archive_command to deal with this.
 
 I think the last one is really the most significant issue.  We have to
 keep the archiver API as simple as possible.
 
   regards, tom lane
 

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings

Re: [PATCHES] [HACKERS] Point in Time Recovery

2004-07-27 Thread markw

On 26 Jul, To: [EMAIL PROTECTED] wrote:
 Sorry I wasn't clearer.  I think I have a better idea about what's going
 on now.  With the archiving enabled, it looks like the database is able
 to complete 1 transaction per database connection, but doesn't complete
 any subsequent transactions. I'm not sure how to see what's going on.
 Perhaps I should try a newer snapshot from CVS while I'm at it?

I tried to do an strace on the postmaster (and child processes) to see
if that might show something, but when the postmaster starts the
database isn't accepting any connections.  I have the feeling it's not
really starting up.  Trying to shut it down seems to agrees with that.
My wild guess is that the database is sitting waiting for something when
a stored function is called but I'm not sure how to verify that.

Mark

---(end of broadcast)---
TIP 8: explain analyze is your friend

Re: [HACKERS] Point in Time Recovery

2004-07-20 Thread Zeugswetter Andreas SB SD


  Hang on, are you supposed to MOVE or COPY away WAL segments?
 
 Copy.  pg will delete them once they are archived.

Copy. pg will recycle them once they are archived.

Andreas

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match

Re: [HACKERS] Point in Time Recovery

2004-07-20 Thread Bruce Momjian

Simon Riggs wrote:
 On Sat, 2004-07-17 at 00:57, Bruce Momjian wrote:
  OK, I think I have some solid ideas and reasons for them.
  
 
 Sorry for taking so long to reply...
 
  First, I think we need server-side functions to call when we start/stop
  the backup.  The advantage of these server-side functions is that they
  will do the required work of recording the pg_control values and
  creating needed files with little chance for user error.  It also allows
  us to change the internal operations in later releases without requiring
  admins to change their procedures.  We are even able to adjust the
  internal operation in minor releases without forcing a new procedure on
  users.
 
 Yes, I think we should go down this route. there's a but and that
 is we don't absolutely need it for correctnessand so I must decline
 adding it to THIS release. I don't imagine I'll stop be associated with
 this code for a while yet
 
 Can we recommend that users should expect to have to call a start and
 end backup routine in later releases? Don't expect you'll agree to
 that..

I guess my big question is that if we don't do this for 7.5 how will
people doing restores know if the xid they specify is valid for the
backup they used.  If we recover to most recent time, is there any check
that will tell them their backup is invalid because there are no archive
records that span the time of their backup?

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly

Re: [HACKERS] Point in Time Recovery

2004-07-19 Thread Simon Riggs

On Sat, 2004-07-17 at 00:57, Bruce Momjian wrote:
 OK, I think I have some solid ideas and reasons for them.
 

Sorry for taking so long to reply...

 First, I think we need server-side functions to call when we start/stop
 the backup.  The advantage of these server-side functions is that they
 will do the required work of recording the pg_control values and
 creating needed files with little chance for user error.  It also allows
 us to change the internal operations in later releases without requiring
 admins to change their procedures.  We are even able to adjust the
 internal operation in minor releases without forcing a new procedure on
 users.

Yes, I think we should go down this route. there's a but and that
is we don't absolutely need it for correctnessand so I must decline
adding it to THIS release. I don't imagine I'll stop be associated with
this code for a while yet

Can we recommend that users should expect to have to call a start and
end backup routine in later releases? Don't expect you'll agree to
that..

 
 Second, I think once we start a restore, we should rename recovery.conf
 to recovery.in_progress, and when complete rename that to
 recovery.done.  If the postmaster starts and sees recovery.in_progress,
 it will fail to start knowing its recovery was interrupted.  This allows
 the admin to take appropriate action.  (I am not sure what that action
 would be. Does he bring back the backup files or just keep going?)
 

Superceded by Tom's actions. Two states are required: start and stop.
Recovery isn't going to be checkpoint-restartable anytime soon, IMHO.

Best regards, Simon Riggs


---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly

Re: [HACKERS] Point in Time Recovery

2004-07-19 Thread Tom Lane

Bruce and I had another phone chat about the problems that can ensue
if you restore a tar backup that contains old (incompletely filled)
versions of WAL segment files.  While the current code will ignore them
during the recovery-from-archive run, leaving them laying around seems
awfully dangerous.  One nasty possibility is that the archiving
mechanism will pick up these files and overwrite good copies in the
archive area with the obsolete ones from the backup :-(.

Bruce earlier proposed that we simply rm pg_xlog/* at the start of
a recovery-from-archive run, but as I said I'm scared to death of code
that does such a thing automatically.  In particular this would make it
impossible to handle scenarios where you want to do a PITR recovery but
you need to use some recent WAL segments that didn't make it into your
archive yet.  (Maybe you could get around this by forcibly transferring
such segments into the archive, but that seems like a bad idea for
incomplete segments.)

It would really be best for the DBA to make sure that the starting
condition for the recovery run does not have any obsolete segment files
in pg_xlog.  He could do this either by setting up his backup policy so
that pg_xlog isn't included in the tar backup in the first place, or by
manually removing the included files just after restoring the backup,
before he tries to start the recovery run.

Of course the objection to that is what if the DBA forgets to do it?

The idea that we came to on the phone was for the postmaster, when it
enters recovery mode because a recovery.conf file exists, to look in
pg_xlog for existing segment files and refuse to start if any are there
--- *unless* the user has put a special, non-default overriding flag
into recovery.conf.  Call it use_unarchived_files or something like
that.  We'd have to provide good documentation and an extensive HINT of
course, but basically the DBA would have two choices when he gets this
refusal to start:

1. Remove all the segment files in pg_xlog.  (This would be the right
thing to do if he knows they all came off the backup.)

2. Verify that pg_xlog contains only segment files that are newer than
what's stored in the WAL archive, and then set the override flag in
recovery.conf.  In this case the DBA is taking responsibility for
leaving only segment files that are good to use.

One interesting point is that with such a policy, we could use locally
available WAL segments in preference to pulling the same segments from
archive, which would be at least marginally more efficient, and seems
logically cleaner anyway.

In particular it seems that this would be a useful arrangement in cases
where you have questionable WAL segments --- you're not sure if they're
good or not.  Rather than having to push questionable data into your WAL
archive, you can leave it local, try a recovery run, and see if you like
the resulting state.  If not, it's a lot easier to do-over when you have
not corrupted your archive area.

Comments?  Better ideas?

regards, tom lane

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster

Re: [HACKERS] Point in Time Recovery

2004-07-19 Thread Christopher Kings-Lynne

I've got a PITR set up here that's happily scp'ing WAL files across to 
another machine.  However, the NIC in the machine is currently stuffed, 
so it gets like 50k/s :)  What happens in general if you are generating 
WAL file bytes faster always than they can be copied off?

Also, does the archive dir just basically keep filling up forever?  How 
do I know when I can prune some files?  Anything older than the last 
full backup?

Chris
---(end of broadcast)---
TIP 6: Have you searched our list archives?
  http://archives.postgresql.org

Re: [HACKERS] Point in Time Recovery

2004-07-19 Thread Tom Lane

Christopher Kings-Lynne [EMAIL PROTECTED] writes:
 I've got a PITR set up here that's happily scp'ing WAL files across to 
 another machine.  However, the NIC in the machine is currently stuffed, 
 so it gets like 50k/s :)  What happens in general if you are generating 
 WAL file bytes faster always than they can be copied off?

If you keep falling further and further behind, eventually your pg_xlog
directory will fill the space available on its disk, and I think at that
point PG will panic and shut down because it can't create any more xlog
segments.

 Also, does the archive dir just basically keep filling up forever?  How 
 do I know when I can prune some files?  Anything older than the last 
 full backup?

Anything older than the starting checkpoint of the last full backup that
you might want to restore to.  We need to adjust the backup procedure so
that the starting segment number for a backup is more readily visible;
see recent discussions about logging that explicitly in some fashion.

regards, tom lane

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster

Re: [HACKERS] Point in Time Recovery

2004-07-19 Thread Christopher Kings-Lynne

If you keep falling further and further behind, eventually your pg_xlog
directory will fill the space available on its disk, and I think at that
point PG will panic and shut down because it can't create any more xlog
segments.
Hang on, are you supposed to MOVE or COPY away WAL segments?
Chris
---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]

Re: [HACKERS] Point in Time Recovery

2004-07-19 Thread Tom Lane

Christopher Kings-Lynne [EMAIL PROTECTED] writes:
 If you keep falling further and further behind, eventually your pg_xlog
 directory will fill the space available on its disk, and I think at that
 point PG will panic and shut down because it can't create any more xlog
 segments.

 Hang on, are you supposed to MOVE or COPY away WAL segments?

COPY.  The checkpoint code will then delete or recycle the segment file,
as appropriate.

regards, tom lane

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html

Re: [HACKERS] Point in Time Recovery

2004-07-19 Thread Bruce Momjian

Christopher Kings-Lynne wrote:
  If you keep falling further and further behind, eventually your pg_xlog
  directory will fill the space available on its disk, and I think at that
  point PG will panic and shut down because it can't create any more xlog
  segments.
 
 Hang on, are you supposed to MOVE or COPY away WAL segments?

Copy.  pg will delete them once they are archived.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html

Re: [HACKERS] Point in Time Recovery

2004-07-19 Thread Christopher Kings-Lynne

Hang on, are you supposed to MOVE or COPY away WAL segments?
COPY.  The checkpoint code will then delete or recycle the segment file,
as appropriate.
So what happens if you just move it?  Postgres breaks?
Chris
---(end of broadcast)---
TIP 8: explain analyze is your friend

Re: [HACKERS] Point in Time Recovery

2004-07-19 Thread Tom Lane

Christopher Kings-Lynne [EMAIL PROTECTED] writes:
 Hang on, are you supposed to MOVE or COPY away WAL segments?
 
 COPY.  The checkpoint code will then delete or recycle the segment file,
 as appropriate.

 So what happens if you just move it?  Postgres breaks?

I don't think so, but it seems like a much less robust way to do things.
What happens if you have a failure partway through?  For instance
archive machine dies and loses recent data right after you've rm'd the
source file.  The recommended COPY procedure at least provides some
breathing room between when you install the data on the archive and when
the original file is removed.

It's not like you save any effort by using a MOVE anyway.  You're not
going to have the archive on the same machine as the database (or if you
are, you ain't gonna be *my* DBA ...)

regards, tom lane

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])

Re: [HACKERS] Point in Time Recovery

2004-07-19 Thread Christopher Kings-Lynne

I don't think so, but it seems like a much less robust way to do things.
What happens if you have a failure partway through?  For instance
archive machine dies and loses recent data right after you've rm'd the
source file.  The recommended COPY procedure at least provides some
breathing room between when you install the data on the archive and when
the original file is removed.
Well, I tried it in 'cross your fingers' mode and it works, at least:
archive_command = 'rm %p'
:)
Chris
---(end of broadcast)---
TIP 8: explain analyze is your friend

Re: [HACKERS] Point in Time Recovery

2004-07-16 Thread Simon Riggs

On Fri, 2004-07-16 at 04:49, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  On Fri, 2004-07-16 at 00:01, Alvaro Herrera wrote:
  My manpage for signal(2) says that you shouldn't assign SIG_IGN to
  SIGCHLD, according to POSIX.
 
  So - I should be setting this to SIG_DFL and thats good for everyone?
 
 Yeah, we learned the same lesson in the backend not too many releases
 back.  SIG_IGN'ing SIGCHLD is bad voodoo; it'll work on some platforms
 but not others.

Many thanks all, Best Regards Simon Riggs





---(end of broadcast)---
TIP 8: explain analyze is your friend

Re: [HACKERS] Point in Time Recovery

2004-07-16 Thread Zeugswetter Andreas SB SD


  I'm aiming for the minimum feature set - which means we do need to take
  care over whether that set is insufficient and also to pull any part
  that doesn't stand up to close scrutiny over the next few days.
 
 As you can see, we are still chewing on NT.  What PITR features are
 missing?  I assume because we can't stop file system writes during
 backup that we will need a backup parameter file like I described.  Is
 there anything else that PITR needs?

No, we don't need to stop writes ! Not even to split a mirror,
other db's need that to be able to restore, but we dont.
We only need to tell people to backup pg_control first. The rest was only 
intended to enforce 
1. that pg_control is the first file backed up
2. the dba uses a large enough PIT (or xid) for restore

I think the idea with an extra file with WAL start position was overly
complicated, since all you need is pg_control (+ WAL end position to enforce 2.).

If we don't want to tell people to backup pg_control first, imho the next 
best plan would be to add a WAL start input (e.g. xlog name) parameter 
to recovery.conf, that fixes pg_control.

Andreas

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])

Re: [HACKERS] Point in Time Recovery

2004-07-16 Thread Tom Lane

Zeugswetter Andreas SB SD [EMAIL PROTECTED] writes:
 We only need to tell people to backup pg_control first. The rest was only 
 intended to enforce 
 1. that pg_control is the first file backed up
 2. the dba uses a large enough PIT (or xid) for restore

Right, but I think Bruce's point is that it is far too easy to get those
things wrong; especially point 2 for which a straight tar dump will
simply not contain the information you need to determine what is a safe
stopping point.

I agree with Bruce that we should have some mechanism that doesn't rely
on the DBA to get this right.  Exactly what the mechanism should be is
certainly open for discussion...

regards, tom lane

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster

Re: [HACKERS] Point in Time Recovery

2004-07-16 Thread Bruce Momjian

 Zeugswetter Andreas SB SD [EMAIL PROTECTED] writes:
  We only need to tell people to backup pg_control first. The rest was only 
  intended to enforce 
  1. that pg_control is the first file backed up
  2. the dba uses a large enough PIT (or xid) for restore
 
 Right, but I think Bruce's point is that it is far too easy to get those
 things wrong; especially point 2 for which a straight tar dump will
 simply not contain the information you need to determine what is a safe
 stopping point.
 
 I agree with Bruce that we should have some mechanism that doesn't rely
 on the DBA to get this right.  Exactly what the mechanism should be is
 certainly open for discussion...

Right.  I am wondering what process people would use to backup
pg_control first?  If they do:

tar -f $TAPE ./global/pg_control .

They will get two copies or pg_control, the early one, and one as part
of the directory scan.  On restore, they would restore the early one,
but the directory scan would overwrite it.  I suppose they could do:

cp global/pg_control global/pg_control.backup
tar -f $TAPE .

then on restore once all the files are restored move the
pg_control.backup to its original name.  That gives us the checkpoint
wal/offset but how do we get the start/stop information.  Is that not
required?  Maybe we should just have a start/stop server-side functions
that create a file in the archive directory describing the start/stop
counters and time and the admin would then have to find those values.
Why are the start/stop wal/offset values needed anyway?  I know why we
need the checkpoint value.  Do we need a checkpoint after the archiving
starts but before the backup begins?

Also, when you are in recovery mode, how do you get out of recovery
mode, meaning if you have a power failure, how do you prevent the system
from doing another recovery?  Do you remove the recovery.conf file?

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly

Re: [HACKERS] Point in Time Recovery

2004-07-16 Thread Tom Lane

Bruce Momjian [EMAIL PROTECTED] writes:
 Also, when you are in recovery mode, how do you get out of recovery
 mode, meaning if you have a power failure, how do you prevent the system
 from doing another recovery?  Do you remove the recovery.conf file?

I do not care for the idea of a recovery.conf file at all, and have been
intending to look to see what we'd need to do to not have one.  I find
it hard to believe that there is anything one would put in it that is
really persistent state.  The above concern shows why it shouldn't be
treated as a persistent configuration file.

regards, tom lane

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]

Re: [HACKERS] Point in Time Recovery

2004-07-16 Thread Zeugswetter Andreas SB SD


 then on restore once all the files are restored move the
 pg_control.backup to its original name.  That gives us the checkpoint
 wal/offset but how do we get the start/stop information.  Is that not
 required?

The checkpoint wal/offset is in pg_control, that is sufficient start 
information. The stop info is only necessary as a safeguard.

 Do we need a checkpoint after the archiving
 starts but before the backup begins?

No.

 Also, when you are in recovery mode, how do you get out of recovery
 mode, meaning if you have a power failure, how do you prevent the system
 from doing another recovery?  Do you remove the recovery.conf file?

pg_control could be updated during rollforward (only if that actually 
does a checkpoint). So if pg_control is also the recovery start info, then 
we can continue from there if we have a power failure.
For the first release it would imho also be ok to simply start over if
you loose power.

I think the filename 'recovery.conf' is misleading, since it is not a 
static configuration file, but a command file for one recovery.
How about 'recovery.command' then 'recovery.inprogress', and on recovery 
completion it should be renamed to 'recovery.done'

Andreas

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])

Re: [HACKERS] Point in Time Recovery

2004-07-16 Thread Tom Lane

Zeugswetter Andreas SB SD [EMAIL PROTECTED] writes:
 Do we need a checkpoint after the archiving
 starts but before the backup begins?

 No.

Actually yes.  You have to start at a checkpoint record when replaying
the log, so if no checkpoint occurred between starting to archive WAL
and starting the tar backup, you have a useless backup.

It would be reasonable to issue a CHECKPOINT just before starting the
backup as part of the standard operating procedure for taking PITR
dumps.  We need not require this, but it would help to avoid this
particular sort of mistake; and of course it might save a little bit of
replay effort if the backup is ever used.

As far as the business about copying pg_control first goes: there is
another way to think about it, which is to copy pg_control to another
place that will be included in your backup.  For example the standard
backup procedure could be

1. [somewhat optional] Issue CHECKPOINT and wait till it finishes.

2. cp $PGDATA/global/pg_control $PGDATA/pg_control.dump

3. tar cf /dev/mt $PGDATA

4. do something to record ending WAL position

If we standardized on this way, then the tar archive would automatically
contain the pre-backup checkpoint position in ./pg_control.dump, and
there is no need for any special assumptions about the order in which
tar processes things.

However, once you decide to do things like that, there is no reason why
the copied file has to be an exact image of pg_control.  I claim it
would be more useful if the copied file were plain text so that you
could just cat it to find out the starting WAL position; that would
let you determine without any special tools what range of WAL archive
files you are going to need to bring back from your archives.

This is pretty much the same chain of reasoning that Bruce and I went
through yesterday to come up with the idea of putting a label file
inside the tar backups.  We concluded that it'd be worth putting
both the backup starting time and the checkpoint WAL position into
the label file --- the starting time isn't needed for restore but
might be really helpful as documentation, if you needed to verify
which dump file was which.

regards, tom lane

---(end of broadcast)---
TIP 8: explain analyze is your friend

Re: [HACKERS] Point in Time Recovery

2004-07-16 Thread Zeugswetter Andreas SB SD


  Do we need a checkpoint after the archiving
  starts but before the backup begins?
 
  No.
 
 Actually yes.

Sorry, I did incorrectly not connect 'archiving' with the backed up xlogs :-(
So yes, you need one checkpoint after archiving starts. Imho turning on xlog
archiving should issue such a checkpoint just to be sure. 

Andreas

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings

Re: [HACKERS] Point in Time Recovery

2004-07-16 Thread Simon Riggs

On Fri, 2004-07-16 at 16:58, Zeugswetter Andreas SB SD wrote:
   Do we need a checkpoint after the archiving
   starts but before the backup begins?
  
   No.
  
  Actually yes.
 
 Sorry, I did incorrectly not connect 'archiving' with the backed up xlogs :-(
 So yes, you need one checkpoint after archiving starts. Imho turning on xlog
 archiving should issue such a checkpoint just to be sure. 
 

By agreement, archive_mode can only be turned on at postmaster startup,
which means you always have a checkpoint - either because you shut it
down cleanly, or you didn't and it recovers, then writes one.

There is always something to start the rollforward. 

So, non-issue.

Best regards, Simon Riggs


---(end of broadcast)---
TIP 8: explain analyze is your friend

Re: [HACKERS] Point in Time Recovery

2004-07-16 Thread Simon Riggs

On Fri, 2004-07-16 at 15:27, Bruce Momjian wrote:

 Also, when you are in recovery mode, how do you get out of recovery
 mode, meaning if you have a power failure, how do you prevent the system
 from doing another recovery?  Do you remove the recovery.conf file?

That was the whole point of the recovery.conf file:
it prevents you from re-entering recovery accidentally, as would occur
if the parameters were set in the normal .conf file.

Best Regards, Simon Riggs


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])

Re: [HACKERS] Point in Time Recovery

2004-07-16 Thread Simon Riggs

On Fri, 2004-07-16 at 16:25, Zeugswetter Andreas SB SD wrote:

 I think the filename 'recovery.conf' is misleading, since it is not a 
 static configuration file, but a command file for one recovery.
 How about 'recovery.command' then 'recovery.inprogress', and on recovery 
 completion it should be renamed to 'recovery.done'

You understand this and your assessment is correct.

recovery.conf isn't an attempt to persist information. It is a means of
delivering a set of parameters to the recovery process, as well as
signalling overall that archive recovery is required (because the system
default remains the same, which is to recover from the logs it has
locally available to it).

I originally offered a design which used a command, similar to
DB2/Oracle...that was overruled as too complex. The (whatever you call
it) file is just a very simple way of specifying whats required.

There is more to be said here...clearly some explanations are required
and I will provide those later...

Best regards, Simon Riggs


---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings

Re: [HACKERS] Point in Time Recovery

2004-07-16 Thread Bruce Momjian

Simon Riggs wrote:
 On Fri, 2004-07-16 at 16:58, Zeugswetter Andreas SB SD wrote:
Do we need a checkpoint after the archiving
starts but before the backup begins?
   
No.
   
   Actually yes.
  
  Sorry, I did incorrectly not connect 'archiving' with the backed up xlogs :-(
  So yes, you need one checkpoint after archiving starts. Imho turning on xlog
  archiving should issue such a checkpoint just to be sure. 
  
 
 By agreement, archive_mode can only be turned on at postmaster startup,
 which means you always have a checkpoint - either because you shut it
 down cleanly, or you didn't and it recovers, then writes one.
 
 There is always something to start the rollforward. 
 
 So, non-issue.

I don't think so.  I can imagine many cases where you want to do a
nightly tar backup without turning archiving on/off or restarting the
postmaster.  In those cases, a manual checkpoint would have to be issued
before the backup begins.

Imagine a system that is up for a month, and they don't have enough
archive space to keep a months worth of WAL files.  They would probably
do nightly or weekend tar backups, and then discard the WAL archives.

What procedure would they use?  I assume they would copy all their old
WAL files to a save directory, issue a checkpoint, do a tar backup, then
they can delete the saved WAL files.  Is that correct?

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html

Re: [HACKERS] Point in Time Recovery

2004-07-16 Thread Simon Riggs

On Fri, 2004-07-16 at 19:30, Bruce Momjian wrote:
 Simon Riggs wrote:
  On Fri, 2004-07-16 at 16:58, Zeugswetter Andreas SB SD wrote:
 Do we need a checkpoint after the archiving
 starts but before the backup begins?

 No.

Actually yes.
   
   Sorry, I did incorrectly not connect 'archiving' with the backed up xlogs :-(
   So yes, you need one checkpoint after archiving starts. Imho turning on xlog
   archiving should issue such a checkpoint just to be sure. 
   
  
  By agreement, archive_mode can only be turned on at postmaster startup,
  which means you always have a checkpoint - either because you shut it
  down cleanly, or you didn't and it recovers, then writes one.
  
  There is always something to start the rollforward. 
  
  So, non-issue.
 

I was discussing the claim that there might not be a checkpoint to begin
the rollforward from. There always is: if you are in archive_mode=true
then you will always have a checkpoint that can be used for recovery. It
may be a long way in the past, if there has been no write activity,
but the rollforward will very very quick, since there will be no log
records.

 I don't think so.  I can imagine many cases where you want to do a
 nightly tar backup without turning archiving on/off or restarting the
 postmaster.  

This is a misunderstanding. I strongly agree with what you say: the
whole system has been designed to avoid any benefit from turning on/off
archiving and there is no requirement to restart postmaster to take
backups.

 In those cases, a manual checkpoint would have to be issued
 before the backup begins.

A manual checkpoint doesn't HAVE TO be issued. Presumably most systems
will be running checkpoint every few minutes. Wherever the last one was
is where the rollforward would start from.

But you can if thats the way you want to do things, just wait long
enough for the checkpoint to have completed, otherwise your objective of
reducing rollforward time will not be met.

(please note my earlier reported rollback performance of approximately
x10 rate of recovery v elapsed time - will require testing on your own
systems).

 Imagine a system that is up for a month, and they don't have enough
 archive space to keep a months worth of WAL files.  They would probably
 do nightly or weekend tar backups, and then discard the WAL archives.
 

Yes, that would be normal practice. I would recommend keeping at least
the last 3 full backups and all of the WAL logs to cover that period.

 What procedure would they use?  I assume they would copy all their old
 WAL files to a save directory, issue a checkpoint, do a tar backup, then
 they can delete the saved WAL files.  Is that correct?

PITR is designed to interface with a wide range of systems, through the
extensible archive/recovery program interface. We shouldn't focus on
just tar backups - if you do, then the whole thing seems less
feature-rich. The current design allows interfacing with tape, remote
backup, internet backup providers, automated standby servers and the
dozen major storage/archive vendors' solutions.

Writing a procedure to backup, assign filenames, keep track of stuff
isn't too difficult if you're a competent DBA with a mild knowledge of
shell or perl scripting. But if data is important, people will want to
invest the time and trouble to adopt one of the open source or licenced
vendors that provide solutions in this area.

Systems management is a discipline and procedures should be in place for
everything. I fully agree with the automate everything dictum, but
just don't want to constrain people too much to a particular way of
doing things.

-o-o-

Overall, for first release, I think the complexity of this design is
acceptable. PITR is similar to Oracle7 Backup/Recovery, and easily
recognisable to any DBA with current experience of current SQL Server,
DB2 (MVS, UDB) or Teradata systems. [I can't comment much on Ingres,
Informix, Sybase etc]

My main areas of concern are:
- the formal correctness of the recovery process
As a result of this concern, PITR makes ZERO alterations to the recovery
code itself. The trick is to feed it the right xlog files and to stop,
if required, at the right place and allow normal work to resume.

- the robustness and quality of my implementation
This requires quality checking of the code and full beta testing

-o-o-

We've raised a couple of valid points on the lists in the last few days:
- its probably a desirable feature (but not essential) to implement a
write suspend feature on the bgwriter, if nothing else it will be a
confidence building feature...as said previously, for many people, this
will not be required, but people will no doubt keep asking
- there is a small window of risk around the possibility that a recovery
target might be set by the user that doesn't rollforward all the way
past the end of the backup. That is real, but in general, people aren't
likely to be performing archive recovery within minutes of a backup
being taken -

Re: [HACKERS] Point in Time Recovery

2004-07-16 Thread Simon Riggs

On Fri, 2004-07-16 at 16:47, Tom Lane wrote:
 As far as the business about copying pg_control first goes: there is
 another way to think about it, which is to copy pg_control to another
 place that will be included in your backup.  For example the standard
 backup procedure could be
 
 1. [somewhat optional] Issue CHECKPOINT and wait till it finishes.
 
 2. cp $PGDATA/global/pg_control $PGDATA/pg_control.dump
 
 3. tar cf /dev/mt $PGDATA
 
 4. do something to record ending WAL position
 
 If we standardized on this way, then the tar archive would automatically
 contain the pre-backup checkpoint position in ./pg_control.dump, and
 there is no need for any special assumptions about the order in which
 tar processes things.
 

Sounds good. That would be familiar to Oracle DBAs doing BACKUP
CONTROLFILE. We can document that and off it as a suggested procedure.

 However, once you decide to do things like that, there is no reason why
 the copied file has to be an exact image of pg_control.  I claim it
 would be more useful if the copied file were plain text so that you
 could just cat it to find out the starting WAL position; that would
 let you determine without any special tools what range of WAL archive
 files you are going to need to bring back from your archives.

I wouldn't be in favour of a manual mechanism. If you want an automated
mechanism, whats wrong with using the one thats already there? You can
use pg_controldata to read the controlfile, again whats wrong with that?

We agreed some time back that an off-line xlog file inspector would be
required to allow us to inspect the logs and make a decision about where
to end recovery. You'd still need that.

It's scary enough having to specify the end point, let alone having to
specify the starting point as well.

At your request, and with Bruce's idea, I designed and built the
recovery system so that you don't need to know what range of xlogs to
bring back. You just run it, it brings back the right files from archive
and does recovery with them, then cleans up - and it works without
running out of disk space on long recoveries.

I've built it now and it works...

 This is pretty much the same chain of reasoning that Bruce and I went
 through yesterday to come up with the idea of putting a label file
 inside the tar backups.  We concluded that it'd be worth putting
 both the backup starting time and the checkpoint WAL position into
 the label file --- the starting time isn't needed for restore but
 might be really helpful as documentation, if you needed to verify
 which dump file was which.

...if you are doing tar backups...what will you do if you're not using
that mechanism?

If you are: It's common practice to make up a backup filename from
elements such as systemname, databasename, date and time etc. That gives
you the start time, the file last mod date gives you the end time. 

I think its perfectly fine for everybody to do backups any way they
please. There are many licenced variants of PostgreSQL and it might be
appropriate in those to specify particular ways of doing things.

I'll be trusting the management of backup metadata and storage media to
a solution designed for the purpose (open or closed source), just as
I'll be trusting my data to a database solution designed for that
purpose. That for me is one of the good things about PostgreSQL - we use
the filesystem, we don't write our own, we provide language interfaces
not invent our own proprietary server language etc..

Best Regards, Simon Riggs


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html

Re: [HACKERS] Point in Time Recovery

2004-07-16 Thread Bruce Momjian


OK, I think I have some solid ideas and reasons for them.

First, I think we need server-side functions to call when we start/stop
the backup.  The advantage of these server-side functions is that they
will do the required work of recording the pg_control values and
creating needed files with little chance for user error.  It also allows
us to change the internal operations in later releases without requiring
admins to change their procedures.  We are even able to adjust the
internal operation in minor releases without forcing a new procedure on
users.

Second, I think once we start a restore, we should rename recovery.conf
to recovery.in_progress, and when complete rename that to
recovery.done.  If the postmaster starts and sees recovery.in_progress,
it will fail to start knowing its recovery was interrupted.  This allows
the admin to take appropriate action.  (I am not sure what that action
would be. Does he bring back the backup files or just keep going?)

Third, I think we need to put a file in the archive location once we
complete a backup, recording the start/stop xid and wal/offsets.  This
gives the admin documentation on what archive logs to keep and what xids
are available for recovery.  Ideally the recover program would read that
file and check the recover xid to make sure it is after the stop xid
recorded in the file.

How would the recover program know the name of that file?  We need to
create it in /data with start contents before the backup, then complete
it with end contents and archive it.

What should we name it?  Ideally it would be named by the WAL
name/offset of the start so it orders in the proper spot in the archive
file listing, e.g.:

093a
093b
093b.032b9.start
093c

Are people going to know they need 093b for
093b.032b9.start?  I hope so.  Another idea is to do:


093a.xlog
093b.032b9.start
093b.xlog
093c.xlog

This would order properly.  It might be a very good idea to add
extensions to these log files now that we are archiving them in strange
places.  In fact, maybe we should use *.pg_xlog to document the
directory they came from.

---


Simon Riggs wrote:
 On Fri, 2004-07-16 at 16:47, Tom Lane wrote:
  As far as the business about copying pg_control first goes: there is
  another way to think about it, which is to copy pg_control to another
  place that will be included in your backup.  For example the standard
  backup procedure could be
  
  1. [somewhat optional] Issue CHECKPOINT and wait till it finishes.
  
  2. cp $PGDATA/global/pg_control $PGDATA/pg_control.dump
  
  3. tar cf /dev/mt $PGDATA
  
  4. do something to record ending WAL position
  
  If we standardized on this way, then the tar archive would automatically
  contain the pre-backup checkpoint position in ./pg_control.dump, and
  there is no need for any special assumptions about the order in which
  tar processes things.
  
 
 Sounds good. That would be familiar to Oracle DBAs doing BACKUP
 CONTROLFILE. We can document that and off it as a suggested procedure.
 
  However, once you decide to do things like that, there is no reason why
  the copied file has to be an exact image of pg_control.  I claim it
  would be more useful if the copied file were plain text so that you
  could just cat it to find out the starting WAL position; that would
  let you determine without any special tools what range of WAL archive
  files you are going to need to bring back from your archives.
 
 I wouldn't be in favour of a manual mechanism. If you want an automated
 mechanism, whats wrong with using the one thats already there? You can
 use pg_controldata to read the controlfile, again whats wrong with that?
 
 We agreed some time back that an off-line xlog file inspector would be
 required to allow us to inspect the logs and make a decision about where
 to end recovery. You'd still need that.
 
 It's scary enough having to specify the end point, let alone having to
 specify the starting point as well.
 
 At your request, and with Bruce's idea, I designed and built the
 recovery system so that you don't need to know what range of xlogs to
 bring back. You just run it, it brings back the right files from archive
 and does recovery with them, then cleans up - and it works without
 running out of disk space on long recoveries.
 
 I've built it now and it works...
 
  This is pretty much the same chain of reasoning that Bruce and I went
  through yesterday to come up with the idea of putting a label file
  inside the tar backups.  We concluded that it'd be worth putting
  both the backup starting time and the checkpoint WAL position into
  the label file --- the starting time isn't needed for restore but
  might be really helpful as documentation, if you needed to verify
  which

Re: [HACKERS] Point in Time Recovery

2004-07-16 Thread Bruce Momjian


Let me address you concerns about PITR getting into 7.5. I think a few
people spoke last week expressing concern about our release process and
wanting to take drastic action.  However, looking at the release status
report I am about to post, you will see we are on track for an August 1
beta.

PITR has been neglected only because it has been moving along so well we
haven't needed to get deeply involved.  Simon has been able to address
concerns as we raised them and make adjustments quickly with little
guidance.  

Now, we certainly don't want to skip adding PITR by not giving it our
full attention to get into 7.5.  Once Tom completes the cursor issues
with NT in the next day or so,  I think that removes the last big NT
stumbling block, and we will start to focus on PITR.  Unless there is
some major thing we are missing, we fully expect to get PITR in 7.5.  We
don't have a crystal ball to know for sure, but our intent is clear.

I know Simon is going away July 26 so we want to get him feedback as
soon as possible.  If we wait until after July 26, we will have to make
all the adjustments without Simon's guidance, which will be difficult.

As far as the importance of PITR, it is a _key_ enterprise feature, even
more key than NT.  PITR is going to be one of the crowning jewels of the
7.5 release, and I don't want to go into beta without it unless we can't
help it.

So, I know with the deadline looming, and everyone it getting nervous,
but keep the faith.  I can see the light at the end of the tunnel.  I
know this is a tighter schedule than we would like, but I know we can do
it, and I expect we will do it.

---

Simon Riggs wrote:
 On Thu, 2004-07-15 at 15:57, Bruce Momjian wrote:
 
  We will get there --- it just seems dark at this time.
 
 Thanks for that. My comments were heartfelt, but not useful right now. 
 
 I'm badly overdrawn already on my time budget, though that is my concern
 alone. There is more to do than I have time for. Pragmatically, if we
 aren't going to get there then I need to stop now, so I can progress
 other outstanding issues. All help is appreciated.
 
 I'm aiming for the minimum feature set - which means we do need to take
 care over whether that set is insufficient and also to pull any part
 that doesn't stand up to close scrutiny over the next few days.
 
 Overall, my primary goal is increased robustness and availability for
 PostgreSQL...and then to have a rest!
 
 Best Regards, Simon Riggs
 
 
 ---(end of broadcast)---
 TIP 4: Don't 'kill -9' the postmaster
 

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html

Re: [HACKERS] Point in Time Recovery

2004-07-15 Thread Simon Riggs

On Thu, 2004-07-15 at 02:43, Mark Kirkwood wrote:
 I noticed that compiling with 5_1 patch applied fails due to 
 XLOG_archive_dir being removed from xlog.c , but 
 src/backend/commands/tablecmds.c still uses it.
 
 I did the following to tablecmds.c :
 
 5408c5408
extern char XLOG_archive_dir[];
 ---
 extern char *XLogArchiveDest;
 5410c5410
use_wal = XLOG_archive_dir[0]  !rel-rd_istemp;
 ---
 use_wal = XLogArchiveDest[0]  !rel-rd_istemp;
 
 

Yes, I discovered that myself.

The fix is included in pitr_v5_2.patch...

Your patch follows the right thinking and looks like it would have
worked...
- XLogArchiveMode carries the main bool value for mode on/off
- XLogArchiveDest might also be used, though best to use the mode

Thanks for looking through the code...

Best Regards, Simon Riggs


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]

Re: [HACKERS] Point in Time Recovery

2004-07-15 Thread Simon Riggs

On Thu, 2004-07-15 at 03:02, Bruce Momjian wrote:
 I talked to Tom on the phone today and and I think we have a procedure
 for doing backup/restore in a fairly foolproof way.
 
 As outlined below, we need to record the start/stop and checkpoint WAL
 file names and offsets, and somehow pass those on to restore.  I think
 any system that requires users to link those values together is going
 to cause confusion and be error-prone.
 

Unfortunately, it seems clear that many of my posts have not been read,
nor has anyone here actually tried to use the patch. Everybody's views
on what constitutes error-prone might well differ then.

Speculation about additional requirements is just great, but please
don't assume that I have infinite resources to apply to these problems.
Documentation has still to be written.

For a long time now, I've been adding one last feature to what is
there, but we're still no nearer to anybody inspecting the patch or
committing it.

There is building consensus on other threads that PITR should not even
be included in the release (3 tentative votes). This latest request
feels more like the necessary excuse to take the decision to pull PITR.
I would much rather that we took the brave decision and pull it NOW,
rather than have me work like crazy to chase this release.

:(

Best Regards, Simon Riggs




---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match

Re: [HACKERS] Point in Time Recovery

2004-07-15 Thread Mark Kirkwood

I tried what I thought was a straightforward scenario, and seem to have 
broken it :-(

Here is the little tale
1) initdb
2) set archive_mode and archive_dest in postgresql.conf
3) startup
4) create database called 'test'
5) connect to 'test' and type 'checkpoint'
6) backup PGDATA using 'tar -zcvf'
7) create tables in 'test' and add data using COPY (exactly 2 logs worth)
8) shutdown and remove PGDATA
9)  recover using 'tar -zxvf'
10) copy recovery.conf into PGDATA
11) startup
This is what I get :
LOG:  database system was interrupted at 2004-07-15 21:24:04 NZST
LOG:  recovery command file found...
LOG:  restore_program = cp %s/%s %s
LOG:  recovery_target_inclusive = true
LOG:  recovery_debug_log = true
LOG:  starting archive recovery
LOG:  restored log file  from archive
LOG:  checkpoint record is at 0/A48054
LOG:  redo record is at 0/A48054; undo record is at 0/0; shutdown FALSE
LOG:  next transaction ID: 496; next OID: 25419
LOG:  database system was not properly shut down; automatic recovery in 
progress
LOG:  redo starts at 0/A48094
LOG:  restored log file 0001 from archive
LOG:  record with zero length at 0/1E0
LOG:  redo done at 0/130
LOG:  restored log file 0001 from archive
LOG:  restored log file 0001 from archive
PANIC:  concurrent transaction log activity while database system is 
shutting down
LOG:  startup process (PID 13492) was terminated by signal 6
LOG:  aborting startup due to startup process failure

The concurrent access is a bit of a puzzle, as this is my home machine 
(i.e. I am *sure* noone else is connected!)

Mark
P.s : CVS HEAD from about 1 hour ago, PITR 5.2, FreeBSD 4.10 on x86
---(end of broadcast)---
TIP 8: explain analyze is your friend

Re: [HACKERS] Point in Time Recovery

2004-07-15 Thread Zeugswetter Andreas SB SD


  Other db's have commands for:
  start/end external backup 

I see that the analogy to external backup was not good, since you are correct
that dba's would expect that to stop all writes, so they can safely split
their mirror or some such. Usually the expected time from start
until end external backup is expected to be only seconds. I actually think we
do not need this functionality, since in pg you can safely split the mirror any 
time you like.

My comment was meant to give dba's a familiar tool. The effect of it
would only have been to create a separate backup of pg_control.
Might as well tell people to always backup pg_control first.

I originally thought you would require restore to specify an xlog id
from which recovery will start. You would search this log for the first 
checkpoint record, create an appropriate pg_control, and start rollforward.

I still think this would be a nice feature, since then all that would be required
for restore is a system backup (that includes pg data files) and the xlogs.

Andreas

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster

Re: [HACKERS] Point in Time Recovery

2004-07-15 Thread HISADAMasaki

Dear Simon,

I've just tested pitr_v5_2.patch and got an error message
during archiving process as follows.

-- begin
 LOG:  archive command=cp /usr/local/pgsql/data/pg_xlog/ /tmp,return 
code=-1
-- end

The command called in system(3) works, but it returns -1.
system(3) can not get right exit code from its child process,
when SIGCHLD is set as SIG_IGN.

So I did following change to pgarch_Main() in pgarch.c

-- line 236 ---
- pgsignal(SIGCHLD, SIG_IGN);

-- line 236 ---
+ pgsignal(SIGCHLD, SIG_DFL);

After that, 
the error message doen't come out and it seems to be working propery.

Regards,
Hisada, Masaki

On Wed, 14 Jul 2004 00:13:37 +0100
Simon Riggs [EMAIL PROTECTED] wrote:

 PITR Patch v5_1 just posted has Point in Time Recovery working
 
 Still some rough edgesbut we really need some testers now to give
 this a try and let me know what you think.
 
 Klaus Naumann and Mark Wong are the only [non-committers] to have tried
 to run the code (and let me know about it), so please have a look at
 [PATCHES] and try it out.
 
 Many thanks,
 
 Simon Riggs
 
 
 ---(end of broadcast)---
 TIP 8: explain analyze is your friend

-- 
HISADA, Masaki [EMAIL PROTECTED]



---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match

Re: [HACKERS] Point in Time Recovery

2004-07-15 Thread Zeugswetter Andreas SB SD


Sorry for the stupid question, but how do I get this patch if I do not
receive the patches mails ?

The web interface html'ifies it, thus making it unusable.

Thanks
Andreas

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html

Re: [HACKERS] Point in Time Recovery

2004-07-15 Thread Bruce Momjian

Simon Riggs wrote:
 On Thu, 2004-07-15 at 03:02, Bruce Momjian wrote:
  I talked to Tom on the phone today and and I think we have a procedure
  for doing backup/restore in a fairly foolproof way.
  
  As outlined below, we need to record the start/stop and checkpoint WAL
  file names and offsets, and somehow pass those on to restore.  I think
  any system that requires users to link those values together is going
  to cause confusion and be error-prone.
  
 
 Unfortunately, it seems clear that many of my posts have not been read,
 nor has anyone here actually tried to use the patch. Everybody's views
 on what constitutes error-prone might well differ then.
 
 Speculation about additional requirements is just great, but please
 don't assume that I have infinite resources to apply to these problems.
 Documentation has still to be written.
 
 For a long time now, I've been adding one last feature to what is
 there, but we're still no nearer to anybody inspecting the patch or
 committing it.

I totally understand your feeling this, and I would be feeling the exact
same way (but would probably have complained much earlier  :-)  ). 
Anyway, the problem is that Tom and I are serializing application of the
major features in the pipeline.  We decided to focus on nested
transactions (NT) first (it is a larger patch), and that is why PITR has
gotten so little attention from us.   However, there is no sense that
you had anything to do with it being places behind NT in the queue, and
therefore there is no feeling on our part that PITR is less important or
deserves less time than NT.  Certainly any system that made you less
likely to be applied would be unfair and something we will not do.

My explanation about the file format was an attempt to address the
method of passing the wal filename/offset to the recover process.  If
that isn't needed, I am sorry.

 There is building consensus on other threads that PITR should not even
 be included in the release (3 tentative votes). This latest request
 feels more like the necessary excuse to take the decision to pull PITR.
 I would much rather that we took the brave decision and pull it NOW,
 rather than have me work like crazy to chase this release.

Those three individuals are not representative of the group.  Sorry it
might seem there there is lack of enthusiasm for PITR, but it isn't true
from our end.  You might have noticed that the patch queue has shrunk
dramatically, and now we are focused on NT and PITR almost exclusively.

We will get there --- it just seems dark at this time.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org

Re: [HACKERS] Point in Time Recovery

2004-07-15 Thread Bruce Momjian

Simon Riggs wrote:
 On Wed, 2004-07-14 at 10:57, Zeugswetter Andreas SB SD wrote:
   The recovery mechanism doesn't rely upon you knowing 1 or 3. The
   recovery reads pg_control (from the backup) and then attempts to
   de-archive the appropriate xlog segment file and then starts 
   rollforward
  
  Unfortunately this only works if pg_control was the first file to be 
  backed up (or by chance no checkpoint happened after backup start and 
  pg_control backup)
  
  Other db's have commands for:
  start/end external backup
  
 
 OK...this idea has come up a few times. Here's my take:
 
 - OS and hardware facilities exist now to make instant copies of sets of
 files. Some of these are open source, others not. If you use these, you
 have no requirement for this functionalitybut these alone are no
 replacement for archive recovery I accept that some people may not
 wish to go to the expense or effort to use those options, but in my mind
 these are the people that will not be using archive_mode anyway.
 
 - all we would really need to do is to stop the bgwriter from doing
 anything during backup. pgcontrol is only updated at checkpoint. The
 current xlog is updated constantly, but this need not be copied because
 we are already archiving it as soon as its full. That leaves the
 bgwriter, which is now responsible for both lazy writing and
 checkpoints.
 So, put a switch into bgwriter to halt for a period, then turn it back
 on at the end. Could be a SIGHUP GUC...or...

I don't think we can turn off all file system writes during a backup. 
Imagine writing to a tape.  Preventing file system writes would make the
system useless.

 - please could somebody else code that?... my time is limited

Yes, I think someone else could code this.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html

Re: [HACKERS] Point in Time Recovery

2004-07-15 Thread Simon Riggs

On Thu, 2004-07-15 at 10:47, Mark Kirkwood wrote:
 I tried what I thought was a straightforward scenario, and seem to have 
 broken it :-(
 
 Here is the little tale
 
 1) initdb
 2) set archive_mode and archive_dest in postgresql.conf
 3) startup
 4) create database called 'test'
 5) connect to 'test' and type 'checkpoint'
 6) backup PGDATA using 'tar -zcvf'
 7) create tables in 'test' and add data using COPY (exactly 2 logs worth)
 8) shutdown and remove PGDATA
 9)  recover using 'tar -zxvf'
 10) copy recovery.conf into PGDATA
 11) startup
 
 This is what I get :
 
 LOG:  database system was interrupted at 2004-07-15 21:24:04 NZST
 LOG:  recovery command file found...
 LOG:  restore_program = cp %s/%s %s
 LOG:  recovery_target_inclusive = true
 LOG:  recovery_debug_log = true
 LOG:  starting archive recovery
 LOG:  restored log file  from archive
 LOG:  checkpoint record is at 0/A48054
 LOG:  redo record is at 0/A48054; undo record is at 0/0; shutdown FALSE
 LOG:  next transaction ID: 496; next OID: 25419
 LOG:  database system was not properly shut down; automatic recovery in 
 progress
 LOG:  redo starts at 0/A48094
 LOG:  restored log file 0001 from archive
 LOG:  record with zero length at 0/1E0
 LOG:  redo done at 0/130
 LOG:  restored log file 0001 from archive
 LOG:  restored log file 0001 from archive
 PANIC:  concurrent transaction log activity while database system is 
 shutting down
 LOG:  startup process (PID 13492) was terminated by signal 6
 LOG:  aborting startup due to startup process failure
 
 The concurrent access is a bit of a puzzle, as this is my home machine 
 (i.e. I am *sure* noone else is connected!)

First, thanks for sticking with it to test this.

I've not received such a message myself - this is interesting.

Is it possible to copy that directory to one side and re-run the test?
Add another parameter in postgresql.conf called archive_debug = true
Does it happen identically the second time?

What time difference was there between steps 5 and 6? I think I can here
Andreas saying told you I'm thinking the backup might be somehow
corrupted because the checkpoint occurred during the backup. Hmmm...

Could you also post me the recovery.log file? (don't post to list)

Thanks, Simon Riggs


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html

Re: [HACKERS] Point in Time Recovery

2004-07-15 Thread Simon Riggs

On Thu, 2004-07-15 at 15:57, Bruce Momjian wrote:

 We will get there --- it just seems dark at this time.

Thanks for that. My comments were heartfelt, but not useful right now. 

I'm badly overdrawn already on my time budget, though that is my concern
alone. There is more to do than I have time for. Pragmatically, if we
aren't going to get there then I need to stop now, so I can progress
other outstanding issues. All help is appreciated.

I'm aiming for the minimum feature set - which means we do need to take
care over whether that set is insufficient and also to pull any part
that doesn't stand up to close scrutiny over the next few days.

Overall, my primary goal is increased robustness and availability for
PostgreSQL...and then to have a rest!

Best Regards, Simon Riggs


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster

Re: [HACKERS] Point in Time Recovery

2004-07-15 Thread Devrim GUNDUZ

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


Simon,

On Thu, 15 Jul 2004, Simon Riggs wrote:

  We will get there --- it just seems dark at this time.
 
 Thanks for that. My comments were heartfelt, but not useful right now. 
 
 I'm badly overdrawn already on my time budget, though that is my concern
 alone. There is more to do than I have time for. Pragmatically, if we
 aren't going to get there then I need to stop now, so I can progress
 other outstanding issues. All help is appreciated.

Personally, as a PostgreSQL Advocate, I believe that PITR is one of the 
most important missing features in PostgreSQL. I've been keeping 'all' of 
you e-mails about PITR and I'm really excited with that feature. 

Please do not stop working on PITR. I'm pretty sure that most of the 
'silent' people in the lists are waiting for PITR for an {Oracle, DB2, ...}-killer 
database. In my country (Turkey), too many people spend a lot of 
money for  proprietary databases, just for some missing features in 
PostgreSQL. If you finish your work on PITR (and other guys on NT, Win32 
port, etc), then we'll feel more concentrated on PostgreSQL Advocation, so 
that PostgreSQL will be used more and more. (Oh, we also need native 
clustering...)

Maybe I should send this e-mail offlist, but I wanted everyone to learn my 
feelings.

Regards and best wishes,
- -- 
Devrim GUNDUZ  
devrim~gunduz.org   devrim.gunduz~linux.org.tr 
http://www.tdmsoft.com
http://www.gunduz.org
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.1 (GNU/Linux)

iD8DBQFA9wKxtl86P3SPfQ4RAms+AJ95RfFi0lVwMD7u7zQ/DzLFEBC8MACgvRzd
HRqAjVqI3hekwImPpqelj9U=
=l445
-END PGP SIGNATURE-


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]

Re: [HACKERS] Point in Time Recovery

2004-07-15 Thread Simon Riggs

On Thu, 2004-07-15 at 13:16, HISADAMasaki wrote:
 Dear Simon,
 
 I've just tested pitr_v5_2.patch and got an error message
 during archiving process as follows.
 
 -- begin
  LOG:  archive command=cp /usr/local/pgsql/data/pg_xlog/ 
 /tmp,return code=-1
 -- end
 
 The command called in system(3) works, but it returns -1.
 system(3) can not get right exit code from its child process,
 when SIGCHLD is set as SIG_IGN.
 
 So I did following change to pgarch_Main() in pgarch.c
 
 -- line 236 ---
 - pgsignal(SIGCHLD, SIG_IGN);
 
 -- line 236 ---
 + pgsignal(SIGCHLD, SIG_DFL);
 

Thank you for testing the patch. Very much appreciated.

I was aware of the potential issues of incorrect return codes, and that
exact part of the code is the part I'm least happy with. 

I'm not sure I understand why its returned -1, though I'll take you
recommendation. I've not witnessed such an issue. What system are you
running, or is it a default shell issue?

Do people think that the change is appropriate for all systems, or just
the one you're using?

Best Regards, Simon Riggs


---(end of broadcast)---
TIP 8: explain analyze is your friend

Re: [HACKERS] Point in Time Recovery

2004-07-15 Thread Simon Riggs

 On Thu, 2004-07-15 at 23:18, Devrim GUNDUZ wrote: 

Thanks for the vote of confidence, on or off list.

  too many people spend a lot of 
 money for  proprietary databases, just for some missing features in 
 PostgreSQL

Agreed - PITR isn't aimed at existing users of PostgreSQL. If you use it
already, even though it doesn't have it, then you are quite likely to be
able to keep going without it.

Most commercial users won't touch anything that doesn't have PITR.

  (Oh, we also need native 
 clustering...)

Next week, OK? :)


Best Regards, Simon




---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])

Re: [HACKERS] Point in Time Recovery

2004-07-15 Thread Alvaro Herrera

On Thu, Jul 15, 2004 at 11:44:02PM +0100, Simon Riggs wrote:
 On Thu, 2004-07-15 at 13:16, HISADAMasaki wrote:

  -- line 236 ---
  - pgsignal(SIGCHLD, SIG_IGN);
  
  -- line 236 ---
  + pgsignal(SIGCHLD, SIG_DFL);
 
 I'm not sure I understand why its returned -1, though I'll take you
 recommendation. I've not witnessed such an issue. What system are you
 running, or is it a default shell issue?
 
 Do people think that the change is appropriate for all systems, or just
 the one you're using?

My manpage for signal(2) says that you shouldn't assign SIG_IGN to
SIGCHLD, according to POSIX.  It goes on to say that BSD and SysV
behaviors differ on this aspect.

(This is on linux BTW)

-- 
Alvaro Herrera (alvherre[a]dcc.uchile.cl)
La experiencia nos dice que el hombre peló millones de veces las patatas,
pero era forzoso admitir la posibilidad de que en un caso entre millones,
las patatas pelarían al hombre (Ijon Tichy)


---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org

Re: [HACKERS] Point in Time Recovery

2004-07-15 Thread Bruce Momjian

Simon Riggs wrote:
  On Thu, 2004-07-15 at 23:18, Devrim GUNDUZ wrote: 
 
 Thanks for the vote of confidence, on or off list.
 
   too many people spend a lot of 
  money for  proprietary databases, just for some missing features in 
  PostgreSQL
 
 Agreed - PITR isn't aimed at existing users of PostgreSQL. If you use it
 already, even though it doesn't have it, then you are quite likely to be
 able to keep going without it.
 
 Most commercial users won't touch anything that doesn't have PITR.

Agreed. I am surprised at how few requests we have gotten for PITR.  I
assume people are either using replication or not considering us.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster

Re: [HACKERS] Point in Time Recovery

2004-07-15 Thread Mark Kirkwood


Simon Riggs wrote:
First, thanks for sticking with it to test this.
I've not received such a message myself - this is interesting.
Is it possible to copy that directory to one side and re-run the test?
Add another parameter in postgresql.conf called archive_debug = true
Does it happen identically the second time?
 

Yes, identical results - I re-initdb'ed and ran the process again, 
rather than reuse the files.

What time difference was there between steps 5 and 6? I think I can here
Andreas saying told you I'm thinking the backup might be somehow
corrupted because the checkpoint occurred during the backup. Hmmm...
 

I was wondering about this, so  left a bit more time in between, and 
forced a sync as well for good measure.

5) $ psql -d test -c checkpoint; sleep 30;sync;sleep 30
6) $ tar -zcvf /data1/dump/pgdata-7.5.tar.gz *
Thanks, Simon Riggs
 

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
 joining column's datatypes do not match

Re: [HACKERS] Point in Time Recovery

2004-07-15 Thread Bruce Momjian

Simon Riggs wrote:
 On Thu, 2004-07-15 at 15:57, Bruce Momjian wrote:
 
  We will get there --- it just seems dark at this time.
 
 Thanks for that. My comments were heartfelt, but not useful right now. 
 
 I'm badly overdrawn already on my time budget, though that is my concern
 alone. There is more to do than I have time for. Pragmatically, if we
 aren't going to get there then I need to stop now, so I can progress
 other outstanding issues. All help is appreciated.
 
 I'm aiming for the minimum feature set - which means we do need to take
 care over whether that set is insufficient and also to pull any part
 that doesn't stand up to close scrutiny over the next few days.

As you can see, we are still chewing on NT.  What PITR features are
missing?  I assume because we can't stop file system writes during
backup that we will need a backup parameter file like I described.  Is
there anything else that PITR needs?

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster

Re: [HACKERS] Point in Time Recovery

2004-07-15 Thread Glen Parker

 Simon Riggs wrote:
   On Thu, 2004-07-15 at 23:18, Devrim GUNDUZ wrote:
 
  Thanks for the vote of confidence, on or off list.
 
too many people spend a lot of
   money for  proprietary databases, just for some missing features in
   PostgreSQL
 
  Agreed - PITR isn't aimed at existing users of PostgreSQL. If you use it
  already, even though it doesn't have it, then you are quite likely to be
  able to keep going without it.
 
  Most commercial users won't touch anything that doesn't have PITR.

 Agreed. I am surprised at how few requests we have gotten for PITR.  I
 assume people are either using replication or not considering us.

Don't forget that there are (must be) lots of us that know it's coming and
are just waiting until it's available.  I haven't requested per se, but
believe me, I'm waiting for it :-)


---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match

Re: [HACKERS] Point in Time Recovery

2004-07-15 Thread Mark Kirkwood

Couldn't agree more. Maybe we should have made more noise :-)
Glen Parker wrote:
Simon Riggs wrote:
   

On Thu, 2004-07-15 at 23:18, Devrim GUNDUZ wrote:
   

Thanks for the vote of confidence, on or off list.
 

too many people spend a lot of
money for  proprietary databases, just for some missing features in
PostgreSQL
   

Agreed - PITR isn't aimed at existing users of PostgreSQL. If you use it
already, even though it doesn't have it, then you are quite likely to be
able to keep going without it.
Most commercial users won't touch anything that doesn't have PITR.
 

Agreed. I am surprised at how few requests we have gotten for PITR.  I
assume people are either using replication or not considering us.
   

Don't forget that there are (must be) lots of us that know it's coming and
are just waiting until it's available.  I haven't requested per se, but
believe me, I'm waiting for it :-)
---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
 joining column's datatypes do not match
 

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
 subscribe-nomail command to [EMAIL PROTECTED] so that your
 message can get through to the mailing list cleanly

Re: [HACKERS] Point in Time Recovery

2004-07-15 Thread Mark Kirkwood

Simon Riggs wrote:
So far:
I've tried to re-create the problem as exactly as I can, but it works
for me. 

This is clearly an important case to chase down.
I assume that this is the very first time you tried recovery? Second and
subsequent recoveries using the same set have a potential loophole,
which we have been discussing.
Right now, I'm thinking that the exactly 2 logs worth of data has
brought you very close to the end of the log file (E0) ending with 1
and the shutdown checkpoint that is then subsequently written is
failing.
Can you repeat this your end?
 

It is repeatable at my end. It is actually fairly easy to recreate the 
example I am using, download 

http://sourceforge.net/projects/benchw
and generate the dataset for Pg - but trim the large fact0.dat dump 
file using head -10.
Thus step 7 consists of creating the 4 tables and COPYing in the data 
for them.

The nearest I can get to the exact record pointers you show are to start
recovery at A4807C and to end at with 88.
Overall, PITR changes the recovery process very little, if at all. The
main areas of effect are to do with sequencing of actions and matching
up the right logs with the right backup. I'm not looking for bugs in the
code but in subtle side-effects and edge cases. Everything you can
tell me will help me greatly in chasing that down. 

 

I agree - I will try this sort of example again, but will change the 
number of rows I am COPYing (currently 10) and see if that helps.

Best Regards, Simon Riggs
 

By way of contrast, using the *same* procedure (1-11), but generating 2 
logs worth of INSERTS/UPDATES using 10 concurrent process *works fine* - 
e.g :

LOG:  database system was interrupted at 2004-07-16 11:17:52 NZST
LOG:  recovery command file found...
LOG:  restore_program = cp %s/%s %s
LOG:  recovery_target_inclusive = true
LOG:  recovery_debug_log = true
LOG:  starting archive recovery
LOG:  restored log file  from archive
LOG:  checkpoint record is at 0/A4803C
LOG:  redo record is at 0/A4803C; undo record is at 0/0; shutdown FALSE
LOG:  next transaction ID: 496; next OID: 25419
LOG:  database system was not properly shut down; automatic recovery in 
progress
LOG:  redo starts at 0/A4807C
postmaster starting
[EMAIL PROTECTED] 7.5]$ LOG:  restored log file 0001 
from archive
cp: cannot stat `/data1/pgdata/7.5-archive/0002': No such 
file or directory
LOG:  could not restore 0002 from archive
LOG:  could not open file /data1/pgdata/7.5/pg_xlog/0002 
(log file 0, segment 2): No such file or directory
LOG:  redo done at 0/1D4
LOG:  archive recovery complete
LOG:  database system is ready
LOG:  archiver started


---(end of broadcast)---
TIP 6: Have you searched our list archives?
  http://archives.postgresql.org

Re: [HACKERS] Point in Time Recovery

2004-07-15 Thread Simon Riggs

On Fri, 2004-07-16 at 00:01, Alvaro Herrera wrote:
 On Thu, Jul 15, 2004 at 11:44:02PM +0100, Simon Riggs wrote:
  On Thu, 2004-07-15 at 13:16, HISADAMasaki wrote:
 
   -- line 236 ---
   - pgsignal(SIGCHLD, SIG_IGN);
   
   -- line 236 ---
   + pgsignal(SIGCHLD, SIG_DFL);
  
  I'm not sure I understand why its returned -1, though I'll take you
  recommendation. I've not witnessed such an issue. What system are you
  running, or is it a default shell issue?
  
  Do people think that the change is appropriate for all systems, or just
  the one you're using?
 
 My manpage for signal(2) says that you shouldn't assign SIG_IGN to
 SIGCHLD, according to POSIX.  It goes on to say that BSD and SysV
 behaviors differ on this aspect.
 

POSIX rules OK!

So - I should be setting this to SIG_DFL and thats good for everyone?

OK. Will do.

Best regards, Simon Riggs


---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match

Re: [HACKERS] Point in Time Recovery

2004-07-15 Thread Simon Riggs

On Fri, 2004-07-16 at 00:46, Mark Kirkwood wrote:

 
 By way of contrast, using the *same* procedure (1-11), but generating 2 
 logs worth of INSERTS/UPDATES using 10 concurrent process *works fine* - 
 e.g :
 

Great...at least we have shown that something works (or can work) and
have begun to isolate the problem whatever it is.

Best Regards, Simon Riggs


---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings

Re: [HACKERS] Point in Time Recovery

2004-07-15 Thread Tom Lane

Simon Riggs [EMAIL PROTECTED] writes:
 On Fri, 2004-07-16 at 00:01, Alvaro Herrera wrote:
 My manpage for signal(2) says that you shouldn't assign SIG_IGN to
 SIGCHLD, according to POSIX.

 So - I should be setting this to SIG_DFL and thats good for everyone?

Yeah, we learned the same lesson in the backend not too many releases
back.  SIG_IGN'ing SIGCHLD is bad voodoo; it'll work on some platforms
but not others.

You could do worse than to look at the existing handling of signals in
the postmaster and its children; that code has been beat on pretty
heavily ...

regards, tom lane

---(end of broadcast)---
TIP 8: explain analyze is your friend

Re: [HACKERS] Point in Time Recovery

2004-07-15 Thread Christopher Kings-Lynne

Thanks for that. My comments were heartfelt, but not useful right now. 
Hi Simon,  I'm sorry if I gave the impression that I thought your work 
wasn't worthwhile, it is :(

I'm badly overdrawn already on my time budget, though that is my concern
alone. There is more to do than I have time for. Pragmatically, if we
aren't going to get there then I need to stop now, so I can progress
other outstanding issues. All help is appreciated.
I've got your patch applied (but having some compilation problem), but 
I'm really not sure what to test really.  I don't really understand the 
whole thing fully :/

I'm aiming for the minimum feature set - which means we do need to take
care over whether that set is insufficient and also to pull any part
that doesn't stand up to close scrutiny over the next few days.
Overall, my primary goal is increased robustness and availability for
PostgreSQL...and then to have a rest!
Definitely!
Chris
---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
 joining column's datatypes do not match

Re: [HACKERS] Point in Time Recovery

2004-07-14 Thread Simon Riggs

On Wed, 2004-07-14 at 03:31, Christopher Kings-Lynne wrote:
 Can you give us some suggestions of what kind of stuff to test?  Is 
 there a way we can artificially kill the backend in all sorts of nasty 
 spots to see if recovery works?  Does kill -9 simulate a 'power off'?
 

I was hoping some fiendish plans would be presented to me...

But please start with this feels like typical usage and we'll go from
there...the important thing is to try the first one.

I've not done power off tests, yet. They need to be done just to
check...actually you don't need to do this to test PITR...

We need to exhaustive tests of...
- power off
- scp and cross network copies
- all the permuted recovery options
- archive_mode = off (i.e. current behaviour)
- deliberately incorrectly set options (idiot-proof testing)

I'd love some help assembling a test document with numbered tests...

Best regards, Simon Riggs


---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org

Re: [HACKERS] Point in Time Recovery

2004-07-14 Thread Zeugswetter Andreas SB SD


 The recovery mechanism doesn't rely upon you knowing 1 or 3. The
 recovery reads pg_control (from the backup) and then attempts to
 de-archive the appropriate xlog segment file and then starts 
 rollforward

Unfortunately this only works if pg_control was the first file to be 
backed up (or by chance no checkpoint happened after backup start and 
pg_control backup)

Other db's have commands for:
start/end external backup

Maybe we should add those two commands that would initially only do 
the following:

start external backup:
- (checkpoint as an option)
- make a copy of pg_control
end external backup:
- record WAL position (helps choose an allowed minimum PIT)

Those commands would actually not be obligatory but recommended, and would 
only help with the restore process.

Restore would then eighter take the existing pg_control backup, or ask
the dba where rollforward should start and create a corresponding pg_control.
A helper utility could list possible checkpoints in a given xlog.

Andreas

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]

Re: [HACKERS] Point in Time Recovery

2004-07-14 Thread Tom Lane

Simon Riggs [EMAIL PROTECTED] writes:
 I've not done power off tests, yet. They need to be done just to
 check...actually you don't need to do this to test PITR...

I agree, power off is not really the point here.  What we need to check
into is (a) the mechanics of archiving WAL segments and (b) the
process of restoring given a backup and a bunch of WAL segments.

regards, tom lane

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly

Re: [HACKERS] Point in Time Recovery

2004-07-14 Thread markw

On 14 Jul, Simon Riggs wrote:
 PITR Patch v5_1 just posted has Point in Time Recovery working
 
 Still some rough edgesbut we really need some testers now to give
 this a try and let me know what you think.
 
 Klaus Naumann and Mark Wong are the only [non-committers] to have tried
 to run the code (and let me know about it), so please have a look at
 [PATCHES] and try it out.
 
 Many thanks,
 
 Simon Riggs

Simon,

I just tried applying the v5_1 patch against the cvs tip today and got a
couple of rejections.  I'll copy the patch output here.  Let me know if
you want to see the reject files or anything else:

$ patch -p0  ../../../pitr-v5_1.diff
patching file backend/access/nbtree/nbtsort.c
Hunk #2 FAILED at 221.
1 out of 2 hunks FAILED -- saving rejects to file backend/access/nbtree/nbtsort.c.rej
patching file backend/access/transam/xlog.c
Hunk #11 FAILED at 1802.
Hunk #15 FAILED at 2152.
Hunk #16 FAILED at 2202.
Hunk #21 FAILED at 3450.
Hunk #23 FAILED at 3539.
Hunk #25 FAILED at 3582.
Hunk #26 FAILED at 3833.
Hunk #27 succeeded at 3883 with fuzz 2.
Hunk #28 FAILED at 4446.
Hunk #29 succeeded at 4470 with fuzz 2.
8 out of 29 hunks FAILED -- saving rejects to file backend/access/transam/xlog.c.rej
patching file backend/postmaster/Makefile
patching file backend/postmaster/postmaster.c
Hunk #3 succeeded at 1218 with fuzz 2 (offset 70 lines).
Hunk #4 succeeded at 1827 (offset 70 lines).
Hunk #5 succeeded at 1874 (offset 70 lines).
Hunk #6 succeeded at 1894 (offset 70 lines).
Hunk #7 FAILED at 1985.
Hunk #8 succeeded at 2039 (offset 70 lines).
Hunk #9 succeeded at 2236 (offset 70 lines).
Hunk #10 succeeded at 2996 with fuzz 2 (offset 70 lines).
1 out of 10 hunks FAILED -- saving rejects to file backend/postmaster/postmaster.c.rej
patching file backend/storage/smgr/md.c
Hunk #1 succeeded at 162 with fuzz 2.
patching file backend/utils/misc/guc.c
Hunk #1 succeeded at 342 (offset 9 lines).
Hunk #2 succeeded at 1387 (offset 9 lines).
patching file backend/utils/misc/postgresql.conf.sample
Hunk #1 succeeded at 113 (offset 10 lines).
patching file bin/initdb/initdb.c
patching file include/access/xlog.h
patching file include/storage/pmsignal.h


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster

Re: [HACKERS] Point in Time Recovery

2004-07-14 Thread Simon Riggs

On Wed, 2004-07-14 at 16:55, [EMAIL PROTECTED] wrote:
 On 14 Jul, Simon Riggs wrote:
  PITR Patch v5_1 just posted has Point in Time Recovery working
  
  Still some rough edgesbut we really need some testers now to give
  this a try and let me know what you think.
  
  Klaus Naumann and Mark Wong are the only [non-committers] to have tried
  to run the code (and let me know about it), so please have a look at
  [PATCHES] and try it out.
  

 I just tried applying the v5_1 patch against the cvs tip today and got a
 couple of rejections.  I'll copy the patch output here.  Let me know if
 you want to see the reject files or anything else:
 

I'm on it. Sorry 'bout that all - midnight fingers.


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html

Re: [HACKERS] Point in Time Recovery

2004-07-14 Thread Simon Riggs

On Wed, 2004-07-14 at 10:57, Zeugswetter Andreas SB SD wrote:
  The recovery mechanism doesn't rely upon you knowing 1 or 3. The
  recovery reads pg_control (from the backup) and then attempts to
  de-archive the appropriate xlog segment file and then starts 
  rollforward
 
 Unfortunately this only works if pg_control was the first file to be 
 backed up (or by chance no checkpoint happened after backup start and 
 pg_control backup)
 
 Other db's have commands for:
 start/end external backup
 

OK...this idea has come up a few times. Here's my take:

- OS and hardware facilities exist now to make instant copies of sets of
files. Some of these are open source, others not. If you use these, you
have no requirement for this functionalitybut these alone are no
replacement for archive recovery I accept that some people may not
wish to go to the expense or effort to use those options, but in my mind
these are the people that will not be using archive_mode anyway.

- all we would really need to do is to stop the bgwriter from doing
anything during backup. pgcontrol is only updated at checkpoint. The
current xlog is updated constantly, but this need not be copied because
we are already archiving it as soon as its full. That leaves the
bgwriter, which is now responsible for both lazy writing and
checkpoints.
So, put a switch into bgwriter to halt for a period, then turn it back
on at the end. Could be a SIGHUP GUC...or...

...and with my greatest respects

- please could somebody else code that?... my time is limited

Best regards, Simon Riggs


---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org

Re: [HACKERS] Point in Time Recovery

2004-07-14 Thread Mark Kirkwood

I noticed that compiling with 5_1 patch applied fails due to 
XLOG_archive_dir being removed from xlog.c , but 
src/backend/commands/tablecmds.c still uses it.

I did the following to tablecmds.c :
5408c5408
   extern char XLOG_archive_dir[];
---
   extern char *XLogArchiveDest;
5410c5410
   use_wal = XLOG_archive_dir[0]  !rel-rd_istemp;
---
   use_wal = XLogArchiveDest[0]  !rel-rd_istemp;
Now I have to see if I have broken it with this change :-)
regards
Mark
Simon Riggs wrote:
On Wed, 2004-07-14 at 16:55, [EMAIL PROTECTED] wrote:
 

On 14 Jul, Simon Riggs wrote:
   

PITR Patch v5_1 just posted has Point in Time Recovery working
Still some rough edgesbut we really need some testers now to give
this a try and let me know what you think.
Klaus Naumann and Mark Wong are the only [non-committers] to have tried
to run the code (and let me know about it), so please have a look at
[PATCHES] and try it out.
 

 

I just tried applying the v5_1 patch against the cvs tip today and got a
couple of rejections.  I'll copy the patch output here.  Let me know if
you want to see the reject files or anything else:
   

I'm on it. Sorry 'bout that all - midnight fingers.
---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?
  http://www.postgresql.org/docs/faqs/FAQ.html
 

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings

Re: [HACKERS] Point in Time Recovery

2004-07-14 Thread SAKATA Tetsuo

Hi, folks.
My colleages and I are planning to test PITR after the 7.5 beta release.
Now we are desinging test items, but some specification are enough clear
(to us).
For example, we are not clear which resouce manager order to store log
records.
  - some access method (like B-tree) require to log its date or not.
  - create/drop action of table space to be stored to the log or not.
We'll be pleased if someone informs them.
The test set we'll proceed has following items;
  - PITR can recover ordinary commited transaction's data.
- tuple data themselves
- index data associated with them
  - PITR can recover commited some special transaction's data.
- DDL; create database, table, index and so on
- maintenance commands (handling large amount of data);
  truncate, vacuum, reindex and so on.
Items above are 'data aspects' of the test. Other aspects are as follows
  - Place of the archival log's drive;
PITR can recover a database from archived log data
   - stored in the same drive as xlog.
   - stored in a different drive on the same machine
 in which the PostgreSQL runs.
   - stored in a different drive on a different machine.
  - Duration between a checkpoint and recovery;
PITR can recover a database enough long after a checkpoint.
  - Time to Recover;
- to end of the log.
- to some specified time.
  - Type of failures;
- system down --- kill the PostgreSQL process (as a simulation).
- media lost  --- delete database files (as a simulation).
- These two case will be tested by a simulated situation first,
  and we would try some 'real' failure after.
  (real power down of the test machine to the first case,
   and 'plug off' the disk drive to the second one.
   these action would damage test machine, this is because
   we plan them after 'ordinary' test items.)
The test set is under construction and we'll test the 7.5 beta
for some weeks, and report the result of the test here.
Sincerely yours.
Tetsuo SAKATA.
--
sakata.tetsuo _at_ lab.ntt.co.jp
SAKATA, Tetsuo. Yokosuka JAPAN.
---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
 subscribe-nomail command to [EMAIL PROTECTED] so that your
 message can get through to the mailing list cleanly

Re: [HACKERS] Point in Time Recovery

2004-07-14 Thread Bruce Momjian


I talked to Tom on the phone today and and I think we have a procedure
for doing backup/restore in a fairly foolproof way.

As outlined below, we need to record the start/stop and checkpoint WAL
file names and offsets, and somehow pass those on to restore.  I think
any system that requires users to link those values together is going
to cause confusion and be error-prone.

My idea is to do much of this automatically.  First, create a
server-side function called pitr_backup_start() which creates a file in
the /data directory which contains the WAL filename/offsets for
last checkpoint and start.  Then do the backup of the data directory. 
Then call pitr_backup_stop() which adds the stop filename/offsets to the
file, and archive that file in the same place as the WAL files.

To restore, you untar the backup of /data.  Then the recover backend
reads the file created by pitr_backup_start() to find the name of the
backup parameter file.  It then recovers that file from the archive
location and uses the start/stop/checkpoint filename/offset information
to the restore.  

The advantage of this is that the tar backup contains everything needed
to find the proper parameter file for restore.  Ideally we could get all
the parameters into the tar backup, but that isn't possible because we
can't push the stop counters into the backup after the backup has
completed.

I recommend the pitr_backup_start() file be named for the current WAL
filename/offset, perhaps 032c.3da390.backup or something
like that.  The file would be a simple text file in
pg_xlog/archive_status:

# start 2004-07-14 21:35:22.324579-04
wal_checkpoint = 0319.021233
wal_start = 032c.92a9cb
...added after backup completes...
wal_stop = 034a.3db030
# stop 2004-07-14 21:32:22.0923213-04

The timestamps are for documentation only.  These files give admins
looking in the archive directory information on backup times.

(As an idea, there is no need for the user to specify a recovery mode. 
If the postmaster starts and sees the pitr_backup_start() file in /data,
it can go into recovery mode automatically.  If the archiver can't find
the file in the archive location, it can assume that it is just being
started from power failure mode.  However if it finds the file in the
archive location, it can assume it is to enter recovery mode.  There is
a race condition that a crash during copy of the file to the archive
location would be a problem.   The solution would be to create a special
flag file before copying the file to archive, and then archive it and
remove the flag file.  If the postmaster starts up and sees the
pitr_backup_start() file in /data and in the archive location, and it
doesn't see the flag file, it then knows it is doing a restore because
the flag file would never appear in a backup.  Anyway, this is just an
idea.)

---

Simon Riggs wrote:
 On Wed, 2004-07-14 at 10:57, Zeugswetter Andreas SB SD wrote:
   The recovery mechanism doesn't rely upon you knowing 1 or 3. The
   recovery reads pg_control (from the backup) and then attempts to
   de-archive the appropriate xlog segment file and then starts 
   rollforward
  
  Unfortunately this only works if pg_control was the first file to be 
  backed up (or by chance no checkpoint happened after backup start and 
  pg_control backup)
  
  Other db's have commands for:
  start/end external backup
  
 
 OK...this idea has come up a few times. Here's my take:
 
 - OS and hardware facilities exist now to make instant copies of sets of
 files. Some of these are open source, others not. If you use these, you
 have no requirement for this functionalitybut these alone are no
 replacement for archive recovery I accept that some people may not
 wish to go to the expense or effort to use those options, but in my mind
 these are the people that will not be using archive_mode anyway.
 
 - all we would really need to do is to stop the bgwriter from doing
 anything during backup. pgcontrol is only updated at checkpoint. The
 current xlog is updated constantly, but this need not be copied because
 we are already archiving it as soon as its full. That leaves the
 bgwriter, which is now responsible for both lazy writing and
 checkpoints.
 So, put a switch into bgwriter to halt for a period, then turn it back
 on at the end. Could be a SIGHUP GUC...or...
 
 ...and with my greatest respects
 
 - please could somebody else code that?... my time is limited
 
 Best regards, Simon Riggs
 
 
 ---(end of broadcast)---
 TIP 6: Have you searched our list archives?
 
http://archives.postgresql.org
 

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your

Re: [HACKERS] Point in Time Recovery

2004-07-13 Thread Simon Riggs

On Tue, 2004-07-06 at 22:40, Simon Riggs wrote:
 On Mon, 2004-07-05 at 22:46, Tom Lane wrote:
  Simon Riggs [EMAIL PROTECTED] writes:
 
   - when we stop, keep reading records until EOF, just don't apply them.
   When we write a checkpoint at end of recovery, the unapplied
   transactions are buried alive, never to return.
   - stop where we stop, then force zeros to EOF, so that no possible
   record remains of previous transactions.
  
  Go with plan B; it's best not to destroy data (what if you chose the
  wrong restart point the first time)?
  
  Actually this now reminds me of a discussion I had with Patrick
  Macdonald some time ago.  The DB2 practice in this connection is that
  you *never* overwrite existing logfile data when recovering.  Instead
  you start a brand new xlog segment file, which is given a new branch
  number so it can be distinguished from the future-time xlog segments
  that you chose not to apply.  I don't recall what the DB2 terminology
  was exactly --- not branch number I don't think --- but anyway the
  idea is that when you restart the database after an incomplete recovery,
  you are now in a sort of parallel universe that has its own history
  after the branch point (PITR stop point).  You need to be able to
  distinguish archived log segments of this parallel universe from those
  of previous and subsequent incarnations.  I'm not sure whether Vadim
  intended our StartUpID to serve this purpose, but it could perhaps be
  used that way, if we reflected it in the WAL file names.
  
 
 Some more thoughts...focusing on the what do we do after we've finished
 recovering. The objectives, as I see them, are to put the system into a
 state, that preserves these features:
 1. we never overwrite files, in case we want to re-run recovery
 2. we never write files that MIGHT have been written previously
 3. we need to ensure that any xlog records skipped at admins request (in
 PITR mode) are never in a position to be re-applied to this timeline.
 4. ensure we can re-recover, if we need to, without further problems
 
 Tom's concept above, I'm going to call timelines. A timeline is the
 sequence of logs created by the execution of a server. If you recover
 the database, you create a new timeline. [This is because, if you've
 invoked PITR you absolutely definitely want log records written to, say,
 xlog15 to be different to those that were written to xlog15 in a
 previous timeline that you have chosen not to reapply.]
 
 Objective (1) is complex.
 When we are restoring, we always start with archived copies of the xlog,
 to make sure we don't finish too soon. We roll forward until we either
 reach PITR stop point, or we hit end of archived logs. If we hit end of
 logs on archive, then we switch to a local copy, if one exists that is
 higher than those, we carry on rolling forward until either we reach
 PITR stop point, or we hit end of that log. (Hopefully, there isn't more
 than one local xlog higher than the archive, but its possible). 
 If we are rolling forward on local copies, then they are our only
 copies. We'd really like to archive them ASAP, but the archiver's not
 running yet - we don't want to force that situation in case the archive
 device (say a tape) is the one being used to recover right now. So we
 write an archive_status of .ready for that file, ensuring that the
 checkpoint won't remove it until it gets copied to archive, whenever
 that starts working again. Objective (1) met.
 
 When we have finished recovering we:
 - create a new xlog at the start of a new ++timeline
 - copy the last applied xlog record to it as the first record
 - set the record pointer so that it matches
 That way, when we come up and begin running, we never overwrite files
 that might have been written previously. Objective (2) met.
 We do the other stuff because recovery finishes up by pointing to the
 last applied record...which is what was causing all of this extra work
 in the first place.
 
 At this point, we also reset the secondary checkpoint record, so that
 should recovery be required again before next checkpoint AND the
 shutdown checkpoint record written after recovery completes is
 wrong/damaged, the recovery will not autorewind back past the PITR stop
 point and attempt to recover the records we have just tried so hard to
 reverse/ignore. Objective (3) met. (Clearly, that situation seems
 unlikely, but I feel we must deal with it...a newly restored system is
 actually very fragile, so a crash again within 3 minutes or so is very
 commonplace, as far as these things go).
 
 Should we need to re-recover, we can do so because the new timeline
 xlogs are further forward than the old timeline, so never get seen by
 any processes (all of which look backwards). Re-recovery is possible
 without problems, if required. This means you're a lot safer from some
 of the mistakes you might of made, such as deciding you need to go into
 recovery, then realising it wasn't required (or some other

Re: [HACKERS] Point in Time Recovery

2004-07-13 Thread Zeugswetter Andreas SB SD


 The starting a new timeline thought works for xlogs, but not for clogs.
 No matter how far you go into the future, there is a small (yet
 vanishing) possibility that there is a yet undiscovered committed
 transaction in the future. (Because transactions are ordered in the clog
 because xids are assigned sequentially at txn start, but not ordered in
 the xlog where they are recorded in the order the txns complete).

Won't ExtendCLOG take care of this with ZeroCLOGPage ? Else the same problem
would arise at xid wraparound, no ?

Andreas

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings

Re: [HACKERS] Point in Time Recovery

2004-07-13 Thread Simon Riggs

On Tue, 2004-07-13 at 13:18, Zeugswetter Andreas SB SD wrote:
  The starting a new timeline thought works for xlogs, but not for clogs.
  No matter how far you go into the future, there is a small (yet
  vanishing) possibility that there is a yet undiscovered committed
  transaction in the future. (Because transactions are ordered in the clog
  because xids are assigned sequentially at txn start, but not ordered in
  the xlog where they are recorded in the order the txns complete).
 
 Won't ExtendCLOG take care of this with ZeroCLOGPage ? Else the same problem
 would arise at xid wraparound, no ?
 

Sounds like a good start...

When PITR ends, we need to stop mid-way through a file. Does that handle
that situation?

Simon Riggs


---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org

Re: [HACKERS] Point in Time Recovery

2004-07-13 Thread Tom Lane

Simon Riggs [EMAIL PROTECTED] writes:
 Please tell me that we can ignore the state of the clog,

We can.

The reason that keeping track of timelines is interesting for xlog is
simply to take pity on the poor DBA who needs to distinguish the various
archived xlog files he's got laying about, and so that we can detect
errors like supplying inconsistent sets of xlog segments during restore.

This does not apply to clog because it's not archived.  It's no more
than a data file.  If you think you have trouble recreating clog then
you have the same issues recreating data files.

regards, tom lane

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster

Re: [HACKERS] Point in Time Recovery

2004-07-13 Thread Simon Riggs

On Tue, 2004-07-13 at 15:29, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  Please tell me that we can ignore the state of the clog,
 
 We can.
 

In general, you are of course correct.

 The reason that keeping track of timelines is interesting for xlog is
 simply to take pity on the poor DBA who needs to distinguish the various
 archived xlog files he's got laying about, and so that we can detect
 errors like supplying inconsistent sets of xlog segments during restore.
 
 This does not apply to clog because it's not archived.  It's no more
 than a data file.  If you think you have trouble recreating clog then
 you have the same issues recreating data files.

I'm getting carried away with the improbablebut this is the rather
strange, but possible scenario I foresee:

A sequence of times...
1. We start archiving xlogs
2. We take a checkpoint
3. we commit an important transaction
4. We take a backup
5. We take a checkpoint

As stands currently, when we restore the backup, controlfile says that
last checkpoint was at 2, so we rollforward from 2 THRU 4 and continue
on past 5 until end of logs. Normally, end of logs isn't until after
4...

When we specify a recovery target, it is possible to specify the
rollforward to complete just before point 3. So we use the backup taken
at 4 to rollforward to a point in the past (from the backups
perspective). The backup taken at 4 may now have data and clog records
written by bgwriter.

Given that time between checkpoints is likely to be longer than
previously was the case...this becomes a non-zero situation.

I was trying to solve this problem head on, but the best way is to make
sure we never allow ourselves such a muddled situation:

ISTM the way to avoid this is to insist that we always rollforward
through at least one checkpoint to guarantee that this will not occur. 

...then I can forget I ever mentioned the ** clog again.

I'm ignoring this issue for nowwhether it exists or not!

Best Regards, Simon Riggs


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])

Re: [HACKERS] Point in Time Recovery

2004-07-13 Thread Tom Lane

Simon Riggs [EMAIL PROTECTED] writes:
 I'm getting carried away with the improbablebut this is the rather
 strange, but possible scenario I foresee:

 A sequence of times...
 1. We start archiving xlogs
 2. We take a checkpoint
 3. we commit an important transaction
 4. We take a backup
 5. We take a checkpoint

 When we specify a recovery target, it is possible to specify the
 rollforward to complete just before point 3.

No, it isn't possible.  The recovery *must* proceed at least as far as
wherever the end of the log was at the time the backup was completed.
Otherwise everything is broken, not only clog, because you may have disk
blocks in your backup that postdate where you stopped log replay.

To have a consistent recovery at all, you must replay the log starting
from a checkpoint before the backup began and extending to the time that
the backup finished.  You only get to decide where to stop after that
point.

regards, tom lane

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html

Re: [HACKERS] Point in Time Recovery

2004-07-13 Thread Simon Riggs

On Tue, 2004-07-13 at 22:19, Tom Lane wrote:

 To have a consistent recovery at all, you must replay the log starting
 from a checkpoint before the backup began and extending to the time that
 the backup finished.  You only get to decide where to stop after that
 point.
 

So the situation is: 
- You must only stop recovery at a point in time (in the logs) after the
backup had completed.

No way to enforce that currently, apart from procedurally. Not exactly
frequent, so I think I just document that and move on, eh?

Thanks for your help,

Best regards, Simon Riggs


---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly

Re: [HACKERS] Point in Time Recovery

2004-07-13 Thread Tom Lane

Simon Riggs [EMAIL PROTECTED] writes:
 So the situation is: 
 - You must only stop recovery at a point in time (in the logs) after the
 backup had completed.

Right.

 No way to enforce that currently, apart from procedurally. Not exactly
 frequent, so I think I just document that and move on, eh?

The procedure that generates a backup has got to be responsible for
recording both the start and stop times.  If it does not do so then
it's fatally flawed.  (Note also that you had better be careful to get
the time as seen on the server machine's clock ... this could be a nasty
gotcha if the backup is run on a different machine, such as an NFS
server.)

regards, tom lane

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings

Re: [HACKERS] Point in Time Recovery

2004-07-13 Thread Simon Riggs

On Tue, 2004-07-13 at 23:42, Bruce Momjian wrote:
 Simon Riggs wrote:
  On Tue, 2004-07-13 at 22:19, Tom Lane wrote:
  
   To have a consistent recovery at all, you must replay the log starting
   from a checkpoint before the backup began and extending to the time that
   the backup finished.  You only get to decide where to stop after that
   point.
   
  
  So the situation is: 
  - You must only stop recovery at a point in time (in the logs) after the
  backup had completed.
  
  No way to enforce that currently, apart from procedurally. Not exactly
  frequent, so I think I just document that and move on, eh?
 
 If it happens, could you use your previous full backup and the PITR logs
 from before stop stopped logging, and then after?  

Yes.

 Is there a period
 where they could not restore reliably?

Good question. No is the answer. 

The situation is that the backup isn't timestamped with respect to the
logs, so its possible to attempt to use the wrong backup for recovery.

The solution is procedural - make sure you timestamp your backup files,
so you know which ones to recover with...

Best Regards, Simon Riggs


---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org

Re: [HACKERS] Point in Time Recovery

2004-07-13 Thread Bruce Momjian

Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  So the situation is: 
  - You must only stop recovery at a point in time (in the logs) after the
  backup had completed.
 
 Right.
 
  No way to enforce that currently, apart from procedurally. Not exactly
  frequent, so I think I just document that and move on, eh?
 
 The procedure that generates a backup has got to be responsible for
 recording both the start and stop times.  If it does not do so then
 it's fatally flawed.  (Note also that you had better be careful to get
 the time as seen on the server machine's clock ... this could be a nasty
 gotcha if the backup is run on a different machine, such as an NFS
 server.)

OK, but procedurally, how do you correlate the start/stop time of the
tar backup with the WAL numeric file names?

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings

Re: [HACKERS] Point in Time Recovery

2004-07-13 Thread Tom Lane

Bruce Momjian [EMAIL PROTECTED] writes:
 OK, but procedurally, how do you correlate the start/stop time of the
 tar backup with the WAL numeric file names?

Ideally the procedure for making a backup would go something like:

1. Inquire of the server its current time and the WAL position of the
most recent checkpoint record (which is what you really need).

2. Make the backup.

3. Inquire of the server its current time and the current end-of-WAL
position.

4. Record items 1 and 3 along with the backup itself.

I think the current theory was you could fake #1 by copying pg_control
before everything else, but this doesn't really help for step #3, so
it would probably be better to add some server functions to get this
info.

regards, tom lane

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster

Re: [HACKERS] Point in Time Recovery

2004-07-13 Thread Simon Riggs

On Wed, 2004-07-14 at 00:28, Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  OK, but procedurally, how do you correlate the start/stop time of the
  tar backup with the WAL numeric file names?
 
 Ideally the procedure for making a backup would go something like:
 
 1. Inquire of the server its current time and the WAL position of the
 most recent checkpoint record (which is what you really need).
 
 2. Make the backup.
 
 3. Inquire of the server its current time and the current end-of-WAL
 position.
 
 4. Record items 1 and 3 along with the backup itself.
 
 I think the current theory was you could fake #1 by copying pg_control
 before everything else, but this doesn't really help for step #3, so
 it would probably be better to add some server functions to get this
 info.
 

err...I think at this point we should review the PITR patch

The recovery mechanism doesn't rely upon you knowing 1 or 3. The
recovery reads pg_control (from the backup) and then attempts to
de-archive the appropriate xlog segment file and then starts rollforward
from there. Effectively, restore assumes it has access to an infinite
timeline of logswhich clearly isn't the case, but its up to *you* to
check that you have the logs that go with the backups. (Or put another
way, if this sounds hard, buy some software that administers the
procedure for you). That's the mechanism that allows infinite
recovery.

In brief, the code path is as identical as possible to the current crash
recovery situation...archive recovery restores the files from archive
when they are needed, just as if they had always been in pg_xlog, in a
way that ensures pg_xlog never runs out of space.

Recovery ends when: it reaches the recovery target you specified, or it
runs out of xlogs (first it runs out of archived xlogs, then tries to
find a more recent local copy if there is one).

I think the current theory was you could fake #1 by copying pg_control
 before everything else, but this doesn't really help for step #3, so
 it would probably be better to add some server functions to get this
 info.

Not sure what you mean by fake

Best Regards, Simon Riggs


---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match

Re: [HACKERS] Point in Time Recovery

2004-07-13 Thread Simon Riggs

PITR Patch v5_1 just posted has Point in Time Recovery working

Still some rough edgesbut we really need some testers now to give
this a try and let me know what you think.

Klaus Naumann and Mark Wong are the only [non-committers] to have tried
to run the code (and let me know about it), so please have a look at
[PATCHES] and try it out.

Many thanks,

Simon Riggs


---(end of broadcast)---
TIP 8: explain analyze is your friend

Re: [HACKERS] Point in Time Recovery

2004-07-13 Thread Bruce Momjian

Simon Riggs wrote:
 On Tue, 2004-07-13 at 22:19, Tom Lane wrote:
 
  To have a consistent recovery at all, you must replay the log starting
  from a checkpoint before the backup began and extending to the time that
  the backup finished.  You only get to decide where to stop after that
  point.
  
 
 So the situation is: 
 - You must only stop recovery at a point in time (in the logs) after the
 backup had completed.
 
 No way to enforce that currently, apart from procedurally. Not exactly
 frequent, so I think I just document that and move on, eh?

If it happens, could you use your previous full backup and the PITR logs
from before stop stopped logging, and then after?  Is there a period
where they could not restore reliably?

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings

Re: [HACKERS] Point in Time Recovery

2004-07-13 Thread Simon Riggs

On Wed, 2004-07-14 at 00:01, Bruce Momjian wrote:
 Tom Lane wrote:
  Simon Riggs [EMAIL PROTECTED] writes:
   So the situation is: 
   - You must only stop recovery at a point in time (in the logs) after the
   backup had completed.
  
  Right.
  
   No way to enforce that currently, apart from procedurally. Not exactly
   frequent, so I think I just document that and move on, eh?
  
  The procedure that generates a backup has got to be responsible for
  recording both the start and stop times.  If it does not do so then
  it's fatally flawed.  (Note also that you had better be careful to get
  the time as seen on the server machine's clock ... this could be a nasty
  gotcha if the backup is run on a different machine, such as an NFS
  server.)
 
 OK, but procedurally, how do you correlate the start/stop time of the
 tar backup with the WAL numeric file names?

No need. You just correlate the recovery target with the backup file
times. Mostly, you'll only ever use your last backup and won't need to
fuss with the times.

Backup should begin with a CHECKPOINT...then wait for that to complete,
just to make the backup as current as possible.

If you want to start purging your archives of old archived xlogs, you
can use the filedate (assuming you preserved that on your copy to
archive - but even if not, they'll be fairly close).

Best regards, Simon Riggs


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html

Re: [HACKERS] Point in Time Recovery

2004-07-13 Thread Christopher Kings-Lynne

Can you give us some suggestions of what kind of stuff to test?  Is 
there a way we can artificially kill the backend in all sorts of nasty 
spots to see if recovery works?  Does kill -9 simulate a 'power off'?

Chris
Simon Riggs wrote:
PITR Patch v5_1 just posted has Point in Time Recovery working
Still some rough edgesbut we really need some testers now to give
this a try and let me know what you think.
Klaus Naumann and Mark Wong are the only [non-committers] to have tried
to run the code (and let me know about it), so please have a look at
[PATCHES] and try it out.
Many thanks,
Simon Riggs
---(end of broadcast)---
TIP 8: explain analyze is your friend
---(end of broadcast)---
TIP 8: explain analyze is your friend

Re: [HACKERS] Point in Time Recovery

2004-07-10 Thread Jan Wieck

On 7/6/2004 3:58 PM, Simon Riggs wrote:
On Tue, 2004-07-06 at 08:38, Zeugswetter Andreas SB SD wrote:
  - by time - but the time stamp on each xlog record only specifies to the
 second, which could easily be 10 or more commits (we hope)
 
 Should we use a different datatype than time_t for the commit timestamp,
 one that offers more fine grained differentiation between checkpoints?

Imho seconds is really sufficient. If you know a more precise position
you will probably know it from backend log or an xlog sniffer. With those
you can easily use the TransactionId way.
TransactionID and timestamp is only sufficient if the transactions are 
selected by their commit order. Especially in read committed mode, 
consider this execution:

xid-1: start
xid-2: start
xid-2: update field x
xid-2: commit
xid-1: update field y
xid-1: commit
In this case, the update done by xid-1 depends on the row created by 
xid-2. So logically xid-2 precedes xid-1, because it made its changes 
earlier.

So you have to apply the log until you find the commit record of the 
transaction you want apply last, and then stamp all transactions that 
where in progress at that time as aborted.

Jan

OK, thanks. I'll just leave the time_t datatype just the way it is.
Best Regards, Simon Riggs
---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #
---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
 joining column's datatypes do not match

Re: [HACKERS] Point in Time Recovery

2004-07-10 Thread Simon Riggs

On Sat, 2004-07-10 at 15:17, Jan Wieck wrote:
 On 7/6/2004 3:58 PM, Simon Riggs wrote:
 
  On Tue, 2004-07-06 at 08:38, Zeugswetter Andreas SB SD wrote:
- by time - but the time stamp on each xlog record only specifies to the
   second, which could easily be 10 or more commits (we hope)
   
   Should we use a different datatype than time_t for the commit timestamp,
   one that offers more fine grained differentiation between checkpoints?
  
  Imho seconds is really sufficient. If you know a more precise position
  you will probably know it from backend log or an xlog sniffer. With those
  you can easily use the TransactionId way.
 
 TransactionID and timestamp is only sufficient if the transactions are 
 selected by their commit order. Especially in read committed mode, 
 consider this execution:
 
  xid-1: start
  xid-2: start
  xid-2: update field x
  xid-2: commit
  xid-1: update field y
  xid-1: commit
 
 In this case, the update done by xid-1 depends on the row created by 
 xid-2. So logically xid-2 precedes xid-1, because it made its changes 
 earlier.
 
 So you have to apply the log until you find the commit record of the 
 transaction you want apply last, and then stamp all transactions that 
 where in progress at that time as aborted.
 

Agreed.

I've implemented this exactly as you say

This turns out to be very easy because:
- when looking where to stop we only ever stop at commit or aborts -
these are the only records that have timestamps anyway...
- any record that isn't specifically committed is not updated in the
clog and therefore not visible. The clog starts in indeterminate state,
0 and is then updated to either committed or aborted. Aborted and
indeterminate are handled similarly in the current code, to allow for
crash recovery - PITR doesn't change anything there.
So, PITR doesn't do anything that crash recovery doen't already do.
Crash recovery makes no attempt to keep track of in-progress
transactions and doesn't make a special journey to the clog to
specifically mark them as aborted - they just are by default.

So, what we mean by stop at a transactionId is stop applying redo at
the commit/abort record for that transactionId. It has to be an exact
match, not a greater than, for exactly the reason you mention. That
means that although we stop at the commit record of transactionId X, we
may also have applied records for transactions with later transactionIds
e.g. X+1, X+2...etc (without limit or restriction).

(I'll even admit that as first, I did think we could get away with the
less than test that you are warning me about. Overall, I've spent more
time on theory/analysis than on coding, on the idea that you can improve
poor code, but wrong code just needs to be thrown away).

Timestamps are more vague...When time is used, there might easily be 10+
transactions whose commit/abort records have identical timestamp values.
So we either stop at the first or last record depending upon whether we
specified inclusive or exclusive on the recovery target value.

The hard bit, IMHO, is what we do with the part of the log that we have
chosen not to applywhich has been discussed on list in detail also.

Thanks for keeping an eye out for possible errors - this one is
something I'd thought through and catered for (there are comments in my
current latest published code to that effect, although I have not yet
finished coding the clean-up-after-stopping part). 

This implies nothing with regard to other possible errors or oversights
and so I very much welcome any questioning of this nature - I am as
prone to error as the next man. It's important we get this right.

Best regards, Simon Riggs


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]

Re: [HACKERS] Point in Time Recovery

2004-07-09 Thread spock

On Tue, 6 Jul 2004, Zeugswetter Andreas SB SD wrote:

  Should we use a different datatype than time_t for the commit timestamp,
  one that offers more fine grained differentiation between checkpoints?

 Imho seconds is really sufficient. If you know a more precise position
 you will probably know it from backend log or an xlog sniffer. With those
 you can easily use the TransactionId way.

I'd also think that seconds are absolutely sufficient. From my daily
experience I can say that you're normally lucky to know the time
on one minute basis.
If you need to get closer, a log sniffer is unavoidable ...

Greetings, Klaus

-- 
Full Name   : Klaus Naumann | (http://www.mgnet.de/) (Germany)
Phone / FAX : ++49/177/7862964  | E-Mail: ([EMAIL PROTECTED])

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match

Re: [HACKERS] Point in Time Recovery

2004-07-09 Thread spock

On Thu, 8 Jul 2004, Simon Riggs wrote:

 We don't need to mention timelines in the docs, nor do we need to alter
 pg_controldata to display it...just a comment in the code to explain why
 we add a large number to the LogId after each recovery completes.

I'd disagree on that. Knowing what exactly happens when recovering the
database is a must. It makes people feel more safe about the process. This
is because the software doesn't do things you wouldn't expect.

On Oracle e.g. you create a new database incarnation when you recover a
database (incomplete). Which means your System Change Number and your Log
Sequence is reset to 0.
Knowing this is crucial because you otherwise would wonder Why is it
doing that? Is this a bug or a feature?

Just my 2ct :-))

Greetings, Klaus

-- 
Full Name   : Klaus Naumann | (http://www.mgnet.de/) (Germany)
Phone / FAX : ++49/177/7862964  | E-Mail: ([EMAIL PROTECTED])

---(end of broadcast)---
TIP 8: explain analyze is your friend

Re: [HACKERS] Point in Time Recovery

2004-07-08 Thread Simon Riggs

On Thu, 2004-07-08 at 07:57, [EMAIL PROTECTED] wrote:
 On Thu, 8 Jul 2004, Simon Riggs wrote:
 
  We don't need to mention timelines in the docs, nor do we need to alter
  pg_controldata to display it...just a comment in the code to explain why
  we add a large number to the LogId after each recovery completes.
 
 I'd disagree on that. Knowing what exactly happens when recovering the
 database is a must. It makes people feel more safe about the process. This
 is because the software doesn't do things you wouldn't expect.
 
 On Oracle e.g. you create a new database incarnation when you recover a
 database (incomplete). Which means your System Change Number and your Log
 Sequence is reset to 0.
 Knowing this is crucial because you otherwise would wonder Why is it
 doing that? Is this a bug or a feature?
 

OK, will do.

Best regards, Simon Riggs


---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org

Re: [HACKERS] Point in Time Recovery

2004-07-07 Thread Zeugswetter Andreas SB SD


 Well, Tom does seem to have something with regard to StartUpIds. I feel
 it is easier to force a new timeline by adding a very large number to
 the LogId IF, and only if, we have performed an archive recovery. That
 way, we do not change at all the behaviour of the system for people that
 choose not to implement archive_mode.

Imho you should take a close look at StartUpId, I think it is exactly this 
large number. Maybe you can add +2 to intentionally leave a hole.

Once you increment, I think it is very essential to checkpoint and double 
check pg_control, cause otherwise a crashrecovery would read the wrong xlogs.
 
 Should we implement timelines?

Yes :-)

Andreas

---(end of broadcast)---
TIP 8: explain analyze is your friend

Re: [HACKERS] Point in Time Recovery

2004-07-07 Thread Simon Riggs

On Wed, 2004-07-07 at 14:17, Zeugswetter Andreas SB SD wrote:
  Well, Tom does seem to have something with regard to StartUpIds. I feel
  it is easier to force a new timeline by adding a very large number to
  the LogId IF, and only if, we have performed an archive recovery. That
  way, we do not change at all the behaviour of the system for people that
  choose not to implement archive_mode.
 
 Imho you should take a close look at StartUpId, I think it is exactly this 
 large number. Maybe you can add +2 to intentionally leave a hole.
 
 Once you increment, I think it is very essential to checkpoint and double 
 check pg_control, cause otherwise a crashrecovery would read the wrong xlogs.

Thanks for your thoughts - you have made me rethink this over some
hours. Trouble is, on this occasion, the other suggestion still seems
the best one, IMVHO.

If we number timelines based upon StartUpId, then there is still some
potential for conflict and this is what we're seeking to avoid.

Simply adding  to the LogId puts the new timeline so far into the
previous timelines future that there isn't any problems. We only
increment the timeline when we recover, so we're not eating up the
numbers quickly. Simply adding a number means that there isn't any
conflict with any normal operations. The timelines aren't numbered
separately, so I'm not introducing a new version of
StartUpID...technically there isn't a new timeline, just a chunk of
geological time between them.

We don't need to mention timelines in the docs, nor do we need to alter
pg_controldata to display it...just a comment in the code to explain why
we add a large number to the LogId after each recovery completes.

Best regards, Simon Riggs



---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org

Re: [HACKERS] Point in Time Recovery

2004-07-06 Thread Zeugswetter Andreas SB SD

 
 - by time - but the time stamp on each xlog record only specifies to the
 second, which could easily be 10 or more commits (we hope)
 
 Should we use a different datatype than time_t for the commit timestamp,
 one that offers more fine grained differentiation between checkpoints?

Imho seconds is really sufficient. If you know a more precise position
you will probably know it from backend log or an xlog sniffer. With those
you can easily use the TransactionId way.

 - when we stop, keep reading records until EOF, just don't apply them.
 When we write a checkpoint at end of recovery, the unapplied
 transactions are buried alive, never to return.
 - stop where we stop, then force zeros to EOF, so that no possible
 record remains of previous transactions.
 I'm tempted by the first plan, because it is more straightforward and
 stands much less chance of me introducing 50 wierd bugs just before
 close.

But what if you restore because after that PIT everything went haywire
including the log ? Did you mean to apply the remaining changes but not 
commit those xids ? I think it is essential to only leave xlogs around 
that allow a subsequent rollforward from the same old full backup. 
Or is an instant new full backup required after restore ?

Andreas

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly

Re: [HACKERS] Point in Time Recovery

2004-07-06 Thread Simon Riggs

On Tue, 2004-07-06 at 08:38, Zeugswetter Andreas SB SD wrote:
   - by time - but the time stamp on each xlog record only specifies to the
  second, which could easily be 10 or more commits (we hope)
  
  Should we use a different datatype than time_t for the commit timestamp,
  one that offers more fine grained differentiation between checkpoints?
 
 Imho seconds is really sufficient. If you know a more precise position
 you will probably know it from backend log or an xlog sniffer. With those
 you can easily use the TransactionId way.
 

OK, thanks. I'll just leave the time_t datatype just the way it is.

Best Regards, Simon Riggs


---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly

Re: [HACKERS] Point in Time Recovery

2004-07-06 Thread Simon Riggs

On Mon, 2004-07-05 at 22:46, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:

  - when we stop, keep reading records until EOF, just don't apply them.
  When we write a checkpoint at end of recovery, the unapplied
  transactions are buried alive, never to return.
  - stop where we stop, then force zeros to EOF, so that no possible
  record remains of previous transactions.
 
 Go with plan B; it's best not to destroy data (what if you chose the
 wrong restart point the first time)?
 
 Actually this now reminds me of a discussion I had with Patrick
 Macdonald some time ago.  The DB2 practice in this connection is that
 you *never* overwrite existing logfile data when recovering.  Instead
 you start a brand new xlog segment file, which is given a new branch
 number so it can be distinguished from the future-time xlog segments
 that you chose not to apply.  I don't recall what the DB2 terminology
 was exactly --- not branch number I don't think --- but anyway the
 idea is that when you restart the database after an incomplete recovery,
 you are now in a sort of parallel universe that has its own history
 after the branch point (PITR stop point).  You need to be able to
 distinguish archived log segments of this parallel universe from those
 of previous and subsequent incarnations.  I'm not sure whether Vadim
 intended our StartUpID to serve this purpose, but it could perhaps be
 used that way, if we reflected it in the WAL file names.
 

Some more thoughts...focusing on the what do we do after we've finished
recovering. The objectives, as I see them, are to put the system into a
state, that preserves these features:
1. we never overwrite files, in case we want to re-run recovery
2. we never write files that MIGHT have been written previously
3. we need to ensure that any xlog records skipped at admins request (in
PITR mode) are never in a position to be re-applied to this timeline.
4. ensure we can re-recover, if we need to, without further problems

Tom's concept above, I'm going to call timelines. A timeline is the
sequence of logs created by the execution of a server. If you recover
the database, you create a new timeline. [This is because, if you've
invoked PITR you absolutely definitely want log records written to, say,
xlog15 to be different to those that were written to xlog15 in a
previous timeline that you have chosen not to reapply.]

Objective (1) is complex.
When we are restoring, we always start with archived copies of the xlog,
to make sure we don't finish too soon. We roll forward until we either
reach PITR stop point, or we hit end of archived logs. If we hit end of
logs on archive, then we switch to a local copy, if one exists that is
higher than those, we carry on rolling forward until either we reach
PITR stop point, or we hit end of that log. (Hopefully, there isn't more
than one local xlog higher than the archive, but its possible). 
If we are rolling forward on local copies, then they are our only
copies. We'd really like to archive them ASAP, but the archiver's not
running yet - we don't want to force that situation in case the archive
device (say a tape) is the one being used to recover right now. So we
write an archive_status of .ready for that file, ensuring that the
checkpoint won't remove it until it gets copied to archive, whenever
that starts working again. Objective (1) met.

When we have finished recovering we:
- create a new xlog at the start of a new ++timeline
- copy the last applied xlog record to it as the first record
- set the record pointer so that it matches
That way, when we come up and begin running, we never overwrite files
that might have been written previously. Objective (2) met.
We do the other stuff because recovery finishes up by pointing to the
last applied record...which is what was causing all of this extra work
in the first place.

At this point, we also reset the secondary checkpoint record, so that
should recovery be required again before next checkpoint AND the
shutdown checkpoint record written after recovery completes is
wrong/damaged, the recovery will not autorewind back past the PITR stop
point and attempt to recover the records we have just tried so hard to
reverse/ignore. Objective (3) met. (Clearly, that situation seems
unlikely, but I feel we must deal with it...a newly restored system is
actually very fragile, so a crash again within 3 minutes or so is very
commonplace, as far as these things go).

Should we need to re-recover, we can do so because the new timeline
xlogs are further forward than the old timeline, so never get seen by
any processes (all of which look backwards). Re-recovery is possible
without problems, if required. This means you're a lot safer from some
of the mistakes you might of made, such as deciding you need to go into
recovery, then realising it wasn't required (or some other painful
flapping as goes on in computer rooms at 3am).

How do we implement timelines?
The main presumption in the code is that xlogs are

Re: [HACKERS] Point in Time Recovery

2004-07-06 Thread Simon Riggs

On Tue, 2004-07-06 at 20:00, Richard Huxton wrote:
 Simon Riggs wrote:
  On Mon, 2004-07-05 at 22:46, Tom Lane wrote:
  
 Simon Riggs [EMAIL PROTECTED] writes:
 
 Should we use a different datatype than time_t for the commit timestamp,
 one that offers more fine grained differentiation between checkpoints?
 
 Pretty much everybody supports gettimeofday() (time_t and separate
 integer microseconds); you might as well use that.  Note that the actual
 resolution is not necessarily microseconds, and it'd still not be
 certain that successive commits have distinct timestamps --- so maybe
 this refinement would be pointless.  You'll still have to design a user
 interface that allows selection without the assumption of distinct
 timestamps.
  
  
  Well, I agree, though without the desired-for UI now, I think some finer
  grained mechanism would be good. This means extending the xlog commit
  record by a couple of bytes...OK, lets live a little.
 
 At the risk of irritating people, I'll repeat what I suggested a few 
 weeks ago...
 

All feedback is good. Thanks.

 Add a table: pg_pitr_checkpt (pitr_id SERIAL, pitr_ts timestamptz, 
 pitr_comment text)
 Let the user insert rows in transactions as desired. Let them stop the 
 restore when a specific (pitr_ts,pitr_comment) gets inserted (or on 
 pitr_id if they record it).
 

It's a good plan, but the recovery is currently offline recovery and no
SQL is possible. So no way to insert, no way to access tables until
recovery completes. I like that plan and probably would have used it if
it was viable.

 IMHO time is seldom relevant, event boundaries are.
 

Agreed, but time is the universally agreed way of describing two events
as being simultaneous. No other way to say recover to the point when
the message queue went wild.

As of last post to Andreas, I've said I'll not bother changing the
granularity of the timestamp.

 If you want to add special syntax for this, fine. If not, an INSERT 
 statement is a convenient way to do this anyway.

The special syntax isn't hugely important - I did suggest a kind of
SQL-like syntax previously, but thats gone now. Invoking recovery via a
command file IS, so we are able to tell the system its not in crash
recovery AND that when you've finished I want you to respond to crashes
without re-entering archive recovery.

Thanks for your comments. I'm not making this more complex than needs
be; in fact much of the code is very simple - its just the planning
that's complex.

Best regards, Simon Riggs


---(end of broadcast)---
TIP 8: explain analyze is your friend

[HACKERS] Point in Time Recovery

2004-07-05 Thread Simon Riggs

Taking advantage of the freeze bubble allowed us... there are some last
minute features to add. 

Summarising earlier thoughts, with some detailed digging and design from
myself in last few days - we're now in a position to add Point-in-Time
Recovery, on top of whats been achieved.

The target for the last record to recover to can be specified in 2 ways:
- by transactionId - not that useful, unless you have a means of
identifying what has happened from the log, then using that info to
specify how to recover - coming later - not in next few days :( 
- by time - but the time stamp on each xlog record only specifies to the
second, which could easily be 10 or more commits (we hope)

Should we use a different datatype than time_t for the commit timestamp,
one that offers more fine grained differentiation between checkpoints?
If we did, would that be portable?
Suggestions welcome, because I know very little of the details of
various *nix systems and win* on that topic.

Only COMMIT and ABORT records have timestamps, allowing us to circumvent
any discussion about partial transaction recovery and nested
transactions.

When we do recover, stopping at the timestamp is just half the battle.
We need to leave the xlog in which we stop in a state from which we can
enter production smoothly and cleanly. To do this, we could:
- when we stop, keep reading records until EOF, just don't apply them.
When we write a checkpoint at end of recovery, the unapplied
transactions are buried alive, never to return.
- stop where we stop, then force zeros to EOF, so that no possible
record remains of previous transactions.
I'm tempted by the first plan, because it is more straightforward and
stands much less chance of me introducing 50 wierd bugs just before
close.

Also, I think it is straightforward to introduce control file duplexing,
with a second copy stored and maintained in the pg_xlog directory. This
would provide additional protection for pg_control, which takes on more
importance now that archive recovery is working. pg_xlog is a natural
home, since on busy systems it's on its own disk away from everything
else, ensuring that at least one copy survives. I can't see a downside
to that, but others might... We can introduce user specifiable
duplexing, in later releases.

For later, I envisage an off-line utility that can be used to inspect
xlog records. This could provide a number of features:
- validate archived xlogs, to check they are sound.
- produce summary reports, to allow identification of transactionIds and
the effects of particular transactions
- performance analysis to allow decisions to be made about whether group
commit features could be utilised to good effect
(Not now...)

Best regards, Simon Riggs



---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org

1 2 >

1 - 100 of 101 matches

Mail list logo