Hi - can anyone help me please?
I've set up a PostgreSQL database to both archive WAL files and run a
nightly base backup to S3. All the expected files are present in S3. I'm
testing the backups, and the restore fails with 'invalid checkpoint record'.
My setup is as follows:
1 server running Ubuntu 14.04 in EC2, PostgreSQL 9.3, wal-e 0.7.3
/etc/postgresql/9.3/main/postgresql.conf contains:

archive_command = 'envdir /etc/wal-e.d/env /usr/local/bin/wal-e wal-push /var/lib/postgresql/9.3/main/%p'
archive_mode = 'on'
archive_timeout = '60'
data_directory = '/var/lib/postgresql/9.3/main'
wal_level = 'archive'
postgres' crontab contains:

45 4 * * * envdir /etc/wal-e.d/env /usr/local/bin/wal-e backup-push /var/lib/postgresql/9.3/main || logger -p local2.emerg 'ERROR Full Postgres backup has failed. Immediate action required.'
The wal-e env dir contains the S3 keys and WALE_S3_PREFIX =
s3://<bucket>/live/postgresql/svr04, where <bucket> is the S3 bucket name.
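(The env directory follows the one-file-per-variable layout from the wal-e
docs; roughly:

/etc/wal-e.d/env/AWS_ACCESS_KEY_ID
/etc/wal-e.d/env/AWS_SECRET_ACCESS_KEY
/etc/wal-e.d/env/WALE_S3_PREFIX    # this file contains s3://<bucket>/live/postgresql/svr04
)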
I've been running this setup for a few days now, so I have multiple full
backups and WAL archives all present. I'm testing that I can restore
from this backup.
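For reference, this is the kind of check I mean when I say the files are
present (a rough check, assuming the aws CLI is installed; wal-e keeps WAL
segments under a wal_005/ prefix next to basebackups_005/):

sudo -u postgres psql -c "SELECT pg_switch_xlog();"            # force a segment switch (9.3 spelling)
aws s3 ls s3://<bucket>/live/postgresql/svr04/wal_005/ | tail  # confirm the segment arrived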
To test restore I do the following:
- run a Chef postgres cookbook using Test Kitchen on a Vagrant/VirtualBox
VM on my laptop; this recreates the live environment.
- log into the Vagrant VM
- stop postgres
- update the wal-e S3 access key and secret to a pair that has read
permission (the backup key can't read; both keys can list)
- remove the contents of /var/lib/postgresql/9.3/main completely.
- su - postgres
- envdir /etc/wal-e.d/env /usr/local/bin/wal-e backup-fetch /var/lib/postgresql/9.3/main LATEST
  This reports:
  wal_e.worker.s3.s3_worker INFO MSG: beginning partition download
  DETAIL: The partition being downloaded is part_00000000.tar.lzo.
  HINT: The absolute S3 key is live/postgresql/svr04/basebackups_005/base_000000010000001B0000006C_00000040/tar_partitions/part_00000000.tar.lzo.
  STRUCTURED: time=2015-02-11T12:00:50.299401-00 pid=27943
- create a recovery.conf file as per the wal-e documentation, containing:
  restore_command = 'envdir /etc/wal-e.d/env wal-e wal-fetch "%f" "%p"'
  I've tried with and without a recovery_target_time equal to the time of
  the backup; it makes no difference (full file sketched after this list).
- start postgres
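For completeness, the recovery.conf referenced above is just the one line,
plus the optional target-time variant I mentioned (timestamp taken from the
recovery log below):

restore_command = 'envdir /etc/wal-e.d/env wal-e wal-fetch "%f" "%p"'
# the variant I also tried:
# recovery_target_time = '2015-02-11 04:45:01+00'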
Postgres won't start; here are the logs:
Feb 11 10:53:10 vagrant postgres[26779]: [2-1] 2015-02-11 10:53:10 GMT LOG: database system was interrupted; last known up at 2015-02-11 04:45:01 GMT
Feb 11 10:53:10 vagrant postgres[26779]: [3-1] 2015-02-11 10:53:10 GMT LOG: starting point-in-time recovery to 2015-02-11 04:45:01+00
Feb 11 10:53:10 vagrant postgres[26779]: [4-1] 2015-02-11 10:53:10 GMT LOG: invalid checkpoint record
Feb 11 10:53:10 vagrant postgres[26779]: [5-1] 2015-02-11 10:53:10 GMT FATAL: could not locate required checkpoint record
Feb 11 10:53:10 vagrant postgres[26779]: [5-2] 2015-02-11 10:53:10 GMT HINT: If you are not restoring from a backup, try removing the file "/var/lib/postgresql/9.3/main/backup_label".
Feb 11 10:53:10 vagrant postgres[26778]: [2-1] 2015-02-11 10:53:10 GMT LOG: startup process (PID 26779) exited with exit code 1
Feb 11 10:53:10 vagrant postgres[26778]: [3-1] 2015-02-11 10:53:10 GMT LOG: aborting startup due to startup process failure
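Given the HINT about backup_label, here is a sketch of how to see which
checkpoint record recovery is hunting for (the file is part of the base
backup, so backup-fetch restores it into the data directory):

cat /var/lib/postgresql/9.3/main/backup_label

Its START WAL LOCATION and CHECKPOINT LOCATION lines name the WAL segment
and checkpoint that the startup process needs to find.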
If I do exactly the same procedure with two Vagrant VMs (i.e. back up one
to S3 and restore to the other), everything goes swimmingly; I'm using the
same S3 keys, buckets, policies, etc.
Looking around, most comments on "invalid checkpoint record" seem to relate
to WAL files not being archived during the backup, but as far as I can see
that's fine here. I assume that if Postgres called the restore_command to
fetch a WAL segment and the fetch failed, I'd see the call to wal-e in the
logs.
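One way to double-check that assumption is to run the restore_command by
hand for the segment named in the base backup key above (the /tmp
destination is just an example):

sudo -u postgres envdir /etc/wal-e.d/env /usr/local/bin/wal-e wal-fetch 000000010000001B0000006C /tmp/000000010000001B0000006C

If that fetch works with the read key, the credentials and prefix on the
restore side should be fine.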
I'd be grateful if anyone could help, or even just give me a more detailed
description of what 'invalid checkpoint record' means.
Daniel