Hi,
After upgrading from 0.7.2 to 0.8c1, I've noticed that my typical run of
wal-e backup-fetch + wal-fetch in recovery.conf now fails unless I disable
prefetching. This ultimately means that PostgreSQL will fail to start after
a backup-fetch. The logs from a successful restore / recovery without
prefetch and a failed restore / recovery with it are included below.
So far as environment goes, both runs were done from a clean Ubuntu 12.04
VM with an empty PostgreSQL 9.3 cluster directory, and both runs pulled
from the same S3 endpoint. I initially suspected that the S3 endpoint was
to blame -- it's a local Ceph mirror of what's in AWS -- but the error also
occurred when pointing to the original bucket in AWS. I was also able to
verify that the WAL file in question (0000000300000042000000F5) does not
exist in S3.
So, apart from my own user error, my guess is that the prefetch is writing
out something it shouldn't have when trying (and necessarily failing) to
fetch the non-existing, N + 1th WAL file. But, that's just a guess. I need
to spend a little more time reading through the directories that prefetch
creates (and how PostgreSQL interacts with them) to offer any better
opinion. My apologies for not doing so before posting this -- especially
since fully understanding PostgreSQL's restore_command is Worth the Trouble.
Please let me know if this should be filed as a bug on Github, and also if
there are any more details I can provide about the state of this particular
WAL-E archive. It's private data, of course, but it is a static archive at
this point (i.e. no more WAL files are being written to this particular
prefix, nor have they been for 4-5 days), so I should at least be able to
provide consistent answers to any questions you might have. This WAL-E
restore from a fresh VM is also already automated, so it's easy to re-run
this again with extra logging or whatnot, if that helps.
Kind regards,
HJB
with default (`--prefetch 8`). recovery.conf:
restore_command = '/usr/bin/env AWS_ACCESS_KEY_ID=XXX \
AWS_SECRET_ACCESS_KEY=YYY WALE_S3_ENDPOINT=https+path://Z \
/opt/wal-e/virtualenv/bin/wal-e \
--s3-prefix s3://BUCKET/PREFIX/ wal-fetch "%f" "%p"'
log:
wal_e.operator.backup INFO MSG: promoted prefetched wal segment
STRUCTURED: time=2014-10-21T23:43:31.730119-00 pid=15828
action=wal-fetch
key=s3://BUCKET/PREFIX/wal_005/0000000300000042000000EF.lzo prefix=PREFIX/
seg=0000000300000042000000EF
2014-10-21 23:43:31 UTC LOG: restored log file
"0000000300000042000000EF" from archive
wal_e.main INFO MSG: starting WAL-E
DETAIL: The subcommand is "wal-fetch".
STRUCTURED: time=2014-10-21T23:43:32.319833-00 pid=15854
wal_e.operator.backup INFO MSG: promoted prefetched wal segment
STRUCTURED: time=2014-10-21T23:43:32.365143-00 pid=15854
action=wal-fetch
key=s3://BUCKET/PREFIX/wal_005/0000000300000042000000F0.lzo prefix=PREFIX/
seg=0000000300000042000000F0
2014-10-21 23:43:32 UTC LOG: restored log file
"0000000300000042000000F0" from archive
wal_e.main INFO MSG: starting WAL-E
DETAIL: The subcommand is "wal-fetch".
STRUCTURED: time=2014-10-21T23:43:32.942442-00 pid=15872
wal_e.operator.backup INFO MSG: promoted prefetched wal segment
STRUCTURED: time=2014-10-21T23:43:32.986935-00 pid=15872
action=wal-fetch
key=s3://BUCKET/PREFIX/wal_005/0000000300000042000000F1.lzo prefix=PREFIX/
seg=0000000300000042000000F1
2014-10-21 23:43:33 UTC LOG: restored log file
"0000000300000042000000F1" from archive
wal_e.main INFO MSG: starting WAL-E
DETAIL: The subcommand is "wal-fetch".
STRUCTURED: time=2014-10-21T23:43:33.507799-00 pid=15893
wal_e.operator.backup INFO MSG: promoted prefetched wal segment
STRUCTURED: time=2014-10-21T23:43:33.549269-00 pid=15893
action=wal-fetch
key=s3://BUCKET/PREFIX/wal_005/0000000300000042000000F2.lzo prefix=PREFIX/
seg=0000000300000042000000F2
2014-10-21 23:43:33 UTC LOG: restored log file
"0000000300000042000000F2" from archive
wal_e.main INFO MSG: starting WAL-E
DETAIL: The subcommand is "wal-fetch".
STRUCTURED: time=2014-10-21T23:43:34.103761-00 pid=15907
wal_e.operator.backup INFO MSG: promoted prefetched wal segment
STRUCTURED: time=2014-10-21T23:43:34.155012-00 pid=15907
action=wal-fetch
key=s3://BUCKET/PREFIX/wal_005/0000000300000042000000F3.lzo prefix=PREFIX/
seg=0000000300000042000000F3
2014-10-21 23:43:34 UTC LOG: restored log file
"0000000300000042000000F3" from archive
wal_e.main INFO MSG: starting WAL-E
DETAIL: The subcommand is "wal-fetch".
STRUCTURED: time=2014-10-21T23:43:34.677411-00 pid=15935
wal_e.operator.backup INFO MSG: promoted prefetched wal segment
STRUCTURED: time=2014-10-21T23:43:34.726169-00 pid=15935
action=wal-fetch
key=s3://BUCKET/PREFIX/wal_005/0000000300000042000000F4.lzo prefix=PREFIX/
seg=0000000300000042000000F4
2014-10-21 23:43:34 UTC LOG: restored log file
"0000000300000042000000F4" from archive
wal_e.main INFO MSG: starting WAL-E
DETAIL: The subcommand is "wal-fetch".
STRUCTURED: time=2014-10-21T23:43:35.191866-00 pid=15947
wal_e.operator.backup INFO MSG: promoted prefetched wal segment
STRUCTURED: time=2014-10-21T23:43:35.257994-00 pid=15947
action=wal-fetch
key=s3://BUCKET/PREFIX/wal_005/0000000300000042000000F5.lzo prefix=PREFIX/
seg=0000000300000042000000F5
2014-10-21 23:43:35 UTC FATAL: archive file "0000000300000042000000F5"
has wrong size: 0 instead of 16777216
2014-10-21 23:43:35 UTC LOG: startup process (PID 15621) exited with
exit code 1
2014-10-21 23:43:35 UTC LOG: terminating any other active server
processes
with `--prefetch 0`. recovery.conf:
restore_command = '/usr/bin/env AWS_ACCESS_KEY_ID=XXX \
AWS_SECRET_ACCESS_KEY=YYY WALE_S3_ENDPOINT=https+path://Z \
/opt/wal-e/virtualenv/bin/wal-e \
--s3-prefix s3://BUCKET/PREFIX/ \
wal-fetch --prefetch 0 "%f" "%p"'
log:
STRUCTURED: time=2014-10-21T23:57:09.337312-00 pid=15932
action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/0000000300000042000000F5.l
zo prefix=PREFIX/ seg=0000000300000042000000F5 state=complete
2014-10-21 23:57:09 UTC LOG: WAL file is from different database
system: WAL file database system identifier is 6072803227874029436,
pg_control database system identifier is 5943550808455905278.
2014-10-21 23:57:09 UTC LOG: redo done at 42/F456F620
2014-10-21 23:57:09 UTC LOG: last completed transaction was at log
time 2014-10-16 05:30:02.567065+00
wal_e.main INFO MSG: starting WAL-E
DETAIL: The subcommand is "wal-fetch".
STRUCTURED: time=2014-10-21T23:57:09.533006-00 pid=15939
wal_e.operator.backup INFO MSG: begin wal restore
STRUCTURED: time=2014-10-21T23:57:09.563344-00 pid=15939
action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/0000000300000042000000F4.l
zo prefix=PREFIX/ seg=0000000300000042000000F4 state=begin
wal_e.blobstore.s3.s3_util INFO MSG: completed download and
decompression
DETAIL: Downloaded and decompressed
"s3://BUCKET/PREFIX/wal_005/0000000300000042000000F4.lzo" to
"pg_xlog/RECOVERYXLOG"
STRUCTURED: time=2014-10-21T23:57:10.137669-00 pid=15939
wal_e.operator.backup INFO MSG: complete wal restore
STRUCTURED: time=2014-10-21T23:57:10.138563-00 pid=15939
action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/0000000300000042000000F4.l
zo prefix=PREFIX/ seg=0000000300000042000000F4 state=complete
2014-10-21 23:57:10 UTC LOG: restored log file
"0000000300000042000000F4" from archive
wal_e.main INFO MSG: starting WAL-E
DETAIL: The subcommand is "wal-fetch".
STRUCTURED: time=2014-10-21T23:57:10.353051-00 pid=15959
wal_e.operator.backup INFO MSG: begin wal restore
STRUCTURED: time=2014-10-21T23:57:10.385350-00 pid=15959
action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/00000004.history.lzo prefi
x=PREFIX/ seg=00000004.history state=begin
lzop: <stdin>: not a lzop file
wal_e.blobstore.s3.s3_util WARNING MSG: could no longer locate object
while performing wal restore
DETAIL: The absolute URI that could not be located is
s3://BUCKET/PREFIX/wal_005/00000004.history.lzo.
HINT: This can be normal when Postgres is trying to detect what
timelines are available during restoration.
STRUCTURED: time=2014-10-21T23:57:10.514732-00 pid=15959
wal_e.operator.backup INFO MSG: complete wal restore
STRUCTURED: time=2014-10-21T23:57:10.515860-00 pid=15959
action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/00000004.history.lzo prefi
x=PREFIX/ seg=00000004.history state=complete
2014-10-21 23:57:10 UTC LOG: selected new timeline ID: 4
wal_e.main INFO MSG: starting WAL-E
DETAIL: The subcommand is "wal-fetch".
STRUCTURED: time=2014-10-21T23:57:10.728339-00 pid=15966
wal_e.operator.backup INFO MSG: begin wal restore
STRUCTURED: time=2014-10-21T23:57:10.761165-00 pid=15966
action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/00000003.history.lzo prefi
x=PREFIX/ seg=00000003.history state=begin
wal_e.blobstore.s3.s3_util INFO MSG: completed download and
decompression
DETAIL: Downloaded and decompressed
"s3://BUCKET/PREFIX/wal_005/00000003.history.lzo" to
"pg_xlog/RECOVERYHISTORY"
STRUCTURED: time=2014-10-21T23:57:10.988149-00 pid=15966
wal_e.operator.backup INFO MSG: complete wal restore
STRUCTURED: time=2014-10-21T23:57:10.989599-00 pid=15966
action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/00000003.history.lzo
prefix=PREFIX/ seg=00000003.history state=complete
2014-10-21 23:57:11 UTC LOG: restored log file "00000003.history" from
archive
2014-10-21 23:57:11 UTC LOG: archive recovery complete
2014-10-21 23:57:11 UTC LOG: database system is ready to accept
connections
--
You received this message because you are subscribed to the Google Groups
"wal-e" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
For more options, visit https://groups.google.com/d/optout.