Hi - can anyone help me please?
I've set up a PostgreSQL database to both archive WAL files and run a
nightly base backup to S3. All the expected files are present in S3. I'm
testing the backups, and the restore fails with 'invalid checkpoint record'.
My setup is as follows:
1 server running Ubuntu 14.04 in EC2, PostgreSQL 9.3, wal-e 0.7.3
/etc/postgresql/9.3/main/postgresql.conf contains:

archive_command = 'envdir /etc/wal-e.d/env /usr/local/bin/wal-e wal-push /var/lib/postgresql/9.3/main/%p'
archive_mode = 'on'
archive_timeout = '60'
data_directory = '/var/lib/postgresql/9.3/main'
wal_level = 'archive'
postgres' crontab contains:

45 4 * * * envdir /etc/wal-e.d/env /usr/local/bin/wal-e backup-push /var/lib/postgresql/9.3/main || logger -p local2.emerg 'ERROR Full Postgres backup has failed. Immediate action required.'
The wal-e env dir contains the S3 keys and WALE_S3_PREFIX =
s3://<bucket>/live/postgresql/svr04, where <bucket> is the S3 bucket name.
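(The env directory follows the one-file-per-variable layout from the wal-e
docs; roughly:

/etc/wal-e.d/env/AWS_ACCESS_KEY_ID
/etc/wal-e.d/env/AWS_SECRET_ACCESS_KEY
/etc/wal-e.d/env/WALE_S3_PREFIX    # this file contains s3://<bucket>/live/postgresql/svr04
)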
I've been running this setup for a few days now, so I have multiple full
backups and WAL archives all present. I'm testing that I can restore
from this backup.
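For reference, this is the kind of check I mean when I say the files are
present (a rough check, assuming the aws CLI is installed; wal-e keeps WAL
segments under a wal_005/ prefix next to basebackups_005/):

sudo -u postgres psql -c "SELECT pg_switch_xlog();"            # force a segment switch (9.3 spelling)
aws s3 ls s3://<bucket>/live/postgresql/svr04/wal_005/ | tail  # confirm the segment arrived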
To test restore I do the following:
- run a Chef postgres cookbook using Test Kitchen on a Vagrant/VirtualBox
VM on my laptop; this recreates the live environment.
- log into the Vagrant VM
- stop postgres
- update the wal-e S3 access key and secret to a pair that has read
permission (the backup key can't read; both keys can list)
- remove the contents of /var/lib/postgresql/9.3/main completely.
- su - postgres
- envdir /etc/wal-e.d/env /usr/local/bin/wal-e backup-fetch /var/lib/postgresql/9.3/main LATEST
  This reports:
  wal_e.worker.s3.s3_worker INFO MSG: beginning partition download
  DETAIL: The partition being downloaded is part_00000000.tar.lzo.
  HINT: The absolute S3 key is live/postgresql/svr04/basebackups_005/base_000000010000001B0000006C_00000040/tar_partitions/part_00000000.tar.lzo.
  STRUCTURED: time=2015-02-11T12:00:50.299401-00 pid=27943
- create a recovery.conf file as per the wal-e documentation, containing:
  restore_command = 'envdir /etc/wal-e.d/env wal-e wal-fetch "%f" "%p"'
  I've tried with and without a recovery_target_time equal to the time of
  the backup; it makes no difference (full file sketched after this list).
- start postgres
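For completeness, the recovery.conf referenced above is just the one line,
plus the optional target-time variant I mentioned (timestamp taken from the
recovery log below):

restore_command = 'envdir /etc/wal-e.d/env wal-e wal-fetch "%f" "%p"'
# the variant I also tried:
# recovery_target_time = '2015-02-11 04:45:01+00'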
Postgres won't start; here are the logs:
Feb 11 10:53:10 vagrant postgres[26779]: [2-1] 2015-02-11 10:53:10 GMT LOG: database system was interrupted; last known up at 2015-02-11 04:45:01 GMT
Feb 11 10:53:10 vagrant postgres[26779]: [3-1] 2015-02-11 10:53:10 GMT LOG: starting point-in-time recovery to 2015-02-11 04:45:01+00
Feb 11 10:53:10 vagrant postgres[26779]: [4-1] 2015-02-11 10:53:10 GMT LOG: invalid checkpoint record
Feb 11 10:53:10 vagrant postgres[26779]: [5-1] 2015-02-11 10:53:10 GMT FATAL: could not locate required checkpoint record
Feb 11 10:53:10 vagrant postgres[26779]: [5-2] 2015-02-11 10:53:10 GMT HINT: If you are not restoring from a backup, try removing the file "/var/lib/postgresql/9.3/main/backup_label".
Feb 11 10:53:10 vagrant postgres[26778]: [2-1] 2015-02-11 10:53:10 GMT LOG: startup process (PID 26779) exited with exit code 1
Feb 11 10:53:10 vagrant postgres[26778]: [3-1] 2015-02-11 10:53:10 GMT LOG: aborting startup due to startup process failure
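Given the HINT about backup_label, here is a sketch of how to see which
checkpoint record recovery is hunting for (the file is part of the base
backup, so backup-fetch restores it into the data directory):

cat /var/lib/postgresql/9.3/main/backup_label

Its START WAL LOCATION and CHECKPOINT LOCATION lines name the WAL segment
and checkpoint that the startup process needs to find.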
If I do exactly the same procedure with two Vagrant VMs (i.e. back up one
to S3 and restore to the other), everything goes swimmingly; I'm using the
same S3 keys, buckets, policies, etc.
Looking around, most comments on "invalid checkpoint record" seem to relate
to WAL files not being archived during the backup, but as far as I can see
that's fine here. I assume that if Postgres called the restore_command to
fetch a WAL segment and the fetch failed, I'd see the call to wal-e in the
logs.
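One way to double-check that assumption is to run the restore_command by
hand for the segment named in the base backup key above (the /tmp
destination is just an example):

sudo -u postgres envdir /etc/wal-e.d/env /usr/local/bin/wal-e wal-fetch 000000010000001B0000006C /tmp/000000010000001B0000006C

If that fetch works with the read key, the credentials and prefix on the
restore side should be fine.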
I'd be grateful if anyone could help, or even just give me a more detailed
description of what 'invalid checkpoint record' means.
Daniel