Hi,

After upgrading from 0.7.2 to 0.8c1, I've noticed that my typical run of 
wal-e backup-fetch + wal-fetch in recovery.conf now fails unless I disable 
prefetching. This ultimately means that PostgreSQL will fail to start after 
a backup-fetch. The logs from a successful restore / recovery without 
prefetch and a failed restore / recovery with it are included below.

So far as environment goes, both runs were done from a clean Ubuntu 12.04 
VM with an empty PostgreSQL 9.3 cluster directory, and both runs pulled 
from the same S3 endpoint. I initially suspected that the S3 endpoint was 
to blame -- it's a local Ceph mirror of what's in AWS -- but the error also 
occurred when pointing to the original bucket in AWS. I was also able to 
verify that the WAL file in question (0000000300000042000000F5) does not 
exist in S3.

So, apart from my own user error, my guess is that the prefetch is writing 
out something it shouldn't have when trying (and necessarily failing) to 
fetch the non-existing, N + 1th WAL file. But, that's just a guess. I need 
to spend a little more time reading through the directories that prefetch 
creates (and how PostgreSQL interacts with them) to offer any better 
opinion. My apologies for not doing so before posting this -- especially 
since fully understanding PostgreSQL's restore_command is Worth the Trouble.

Please let me know if this should be filed as a bug on Github, and also if 
there are any more details I can provide about the state of this particular 
WAL-E archive. It's private data, of course, but it is a static archive at 
this point (i.e. no more WAL files are being written to this particular 
prefix, nor have they been for 4-5 days), so I should at least be able to 
provide consistent answers to any questions you might have. This WAL-E 
restore from a fresh VM is also already automated, so it's easy to re-run 
this again with extra logging or whatnot, if that helps.

Kind regards,

HJB

with default (`--prefetch 8`). recovery.conf:

    restore_command = '/usr/bin/env AWS_ACCESS_KEY_ID=XXX \
        AWS_SECRET_ACCESS_KEY=YYY WALE_S3_ENDPOINT=https+path://Z \
        /opt/wal-e/virtualenv/bin/wal-e \
        --s3-prefix s3://BUCKET/PREFIX/ wal-fetch "%f" "%p"'

log:

    wal_e.operator.backup INFO     MSG: promoted prefetched wal segment
            STRUCTURED: time=2014-10-21T23:43:31.730119-00 pid=15828 
action=wal-fetch 
key=s3://BUCKET/PREFIX/wal_005/0000000300000042000000EF.lzo prefix=PREFIX/ 
seg=0000000300000042000000EF
    2014-10-21 23:43:31 UTC LOG:  restored log file 
"0000000300000042000000EF" from archive
    wal_e.main   INFO     MSG: starting WAL-E
            DETAIL: The subcommand is "wal-fetch".
            STRUCTURED: time=2014-10-21T23:43:32.319833-00 pid=15854
    wal_e.operator.backup INFO     MSG: promoted prefetched wal segment
            STRUCTURED: time=2014-10-21T23:43:32.365143-00 pid=15854 
action=wal-fetch 
key=s3://BUCKET/PREFIX/wal_005/0000000300000042000000F0.lzo prefix=PREFIX/ 
seg=0000000300000042000000F0
    2014-10-21 23:43:32 UTC LOG:  restored log file 
"0000000300000042000000F0" from archive
    wal_e.main   INFO     MSG: starting WAL-E
            DETAIL: The subcommand is "wal-fetch".
            STRUCTURED: time=2014-10-21T23:43:32.942442-00 pid=15872
    wal_e.operator.backup INFO     MSG: promoted prefetched wal segment
            STRUCTURED: time=2014-10-21T23:43:32.986935-00 pid=15872 
action=wal-fetch 
key=s3://BUCKET/PREFIX/wal_005/0000000300000042000000F1.lzo prefix=PREFIX/ 
seg=0000000300000042000000F1
    2014-10-21 23:43:33 UTC LOG:  restored log file 
"0000000300000042000000F1" from archive
    wal_e.main   INFO     MSG: starting WAL-E
            DETAIL: The subcommand is "wal-fetch".
            STRUCTURED: time=2014-10-21T23:43:33.507799-00 pid=15893
    wal_e.operator.backup INFO     MSG: promoted prefetched wal segment
            STRUCTURED: time=2014-10-21T23:43:33.549269-00 pid=15893 
action=wal-fetch 
key=s3://BUCKET/PREFIX/wal_005/0000000300000042000000F2.lzo prefix=PREFIX/ 
seg=0000000300000042000000F2
    2014-10-21 23:43:33 UTC LOG:  restored log file 
"0000000300000042000000F2" from archive
    wal_e.main   INFO     MSG: starting WAL-E
            DETAIL: The subcommand is "wal-fetch".
            STRUCTURED: time=2014-10-21T23:43:34.103761-00 pid=15907
    wal_e.operator.backup INFO     MSG: promoted prefetched wal segment
            STRUCTURED: time=2014-10-21T23:43:34.155012-00 pid=15907 
action=wal-fetch 
key=s3://BUCKET/PREFIX/wal_005/0000000300000042000000F3.lzo prefix=PREFIX/ 
seg=0000000300000042000000F3
    2014-10-21 23:43:34 UTC LOG:  restored log file 
"0000000300000042000000F3" from archive
    wal_e.main   INFO     MSG: starting WAL-E
            DETAIL: The subcommand is "wal-fetch".
            STRUCTURED: time=2014-10-21T23:43:34.677411-00 pid=15935
    wal_e.operator.backup INFO     MSG: promoted prefetched wal segment
            STRUCTURED: time=2014-10-21T23:43:34.726169-00 pid=15935 
action=wal-fetch 
key=s3://BUCKET/PREFIX/wal_005/0000000300000042000000F4.lzo prefix=PREFIX/ 
seg=0000000300000042000000F4
    2014-10-21 23:43:34 UTC LOG:  restored log file 
"0000000300000042000000F4" from archive
    wal_e.main   INFO     MSG: starting WAL-E
            DETAIL: The subcommand is "wal-fetch".
            STRUCTURED: time=2014-10-21T23:43:35.191866-00 pid=15947
    wal_e.operator.backup INFO     MSG: promoted prefetched wal segment
            STRUCTURED: time=2014-10-21T23:43:35.257994-00 pid=15947 
action=wal-fetch 
key=s3://BUCKET/PREFIX/wal_005/0000000300000042000000F5.lzo prefix=PREFIX/ 
seg=0000000300000042000000F5
    2014-10-21 23:43:35 UTC FATAL:  archive file "0000000300000042000000F5" 
has wrong size: 0 instead of 16777216
    2014-10-21 23:43:35 UTC LOG:  startup process (PID 15621) exited with 
exit code 1
    2014-10-21 23:43:35 UTC LOG:  terminating any other active server 
processes

with `--prefetch 0`. recovery.conf:

    restore_command = '/usr/bin/env AWS_ACCESS_KEY_ID=XXX \
        AWS_SECRET_ACCESS_KEY=YYY WALE_S3_ENDPOINT=https+path://Z \
        /opt/wal-e/virtualenv/bin/wal-e \
        --s3-prefix s3://BUCKET/PREFIX/ \
        wal-fetch --prefetch 0 "%f" "%p"'


log:

            STRUCTURED: time=2014-10-21T23:57:09.337312-00 pid=15932 
action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/0000000300000042000000F5.l
    zo prefix=PREFIX/ seg=0000000300000042000000F5 state=complete
    2014-10-21 23:57:09 UTC LOG:  WAL file is from different database 
system: WAL file database system identifier is 6072803227874029436, 
pg_control database system identifier is 5943550808455905278.
    2014-10-21 23:57:09 UTC LOG:  redo done at 42/F456F620
    2014-10-21 23:57:09 UTC LOG:  last completed transaction was at log 
time 2014-10-16 05:30:02.567065+00
    wal_e.main   INFO     MSG: starting WAL-E
            DETAIL: The subcommand is "wal-fetch".
            STRUCTURED: time=2014-10-21T23:57:09.533006-00 pid=15939
    wal_e.operator.backup INFO     MSG: begin wal restore
            STRUCTURED: time=2014-10-21T23:57:09.563344-00 pid=15939 
action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/0000000300000042000000F4.l
    zo prefix=PREFIX/ seg=0000000300000042000000F4 state=begin
    wal_e.blobstore.s3.s3_util INFO     MSG: completed download and 
decompression
            DETAIL: Downloaded and decompressed 
"s3://BUCKET/PREFIX/wal_005/0000000300000042000000F4.lzo" to 
"pg_xlog/RECOVERYXLOG"
            STRUCTURED: time=2014-10-21T23:57:10.137669-00 pid=15939
    wal_e.operator.backup INFO     MSG: complete wal restore
            STRUCTURED: time=2014-10-21T23:57:10.138563-00 pid=15939 
action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/0000000300000042000000F4.l
    zo prefix=PREFIX/ seg=0000000300000042000000F4 state=complete
    2014-10-21 23:57:10 UTC LOG:  restored log file 
"0000000300000042000000F4" from archive
    wal_e.main   INFO     MSG: starting WAL-E
            DETAIL: The subcommand is "wal-fetch".
            STRUCTURED: time=2014-10-21T23:57:10.353051-00 pid=15959
    wal_e.operator.backup INFO     MSG: begin wal restore
            STRUCTURED: time=2014-10-21T23:57:10.385350-00 pid=15959 
action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/00000004.history.lzo prefi
    x=PREFIX/ seg=00000004.history state=begin
    lzop: <stdin>: not a lzop file
    wal_e.blobstore.s3.s3_util WARNING  MSG: could no longer locate object 
while performing wal restore
            DETAIL: The absolute URI that could not be located is 
s3://BUCKET/PREFIX/wal_005/00000004.history.lzo.
            HINT: This can be normal when Postgres is trying to detect what 
timelines are available during restoration.
            STRUCTURED: time=2014-10-21T23:57:10.514732-00 pid=15959
    wal_e.operator.backup INFO     MSG: complete wal restore
            STRUCTURED: time=2014-10-21T23:57:10.515860-00 pid=15959 
action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/00000004.history.lzo prefi
    x=PREFIX/ seg=00000004.history state=complete
    2014-10-21 23:57:10 UTC LOG:  selected new timeline ID: 4
    wal_e.main   INFO     MSG: starting WAL-E
            DETAIL: The subcommand is "wal-fetch".
            STRUCTURED: time=2014-10-21T23:57:10.728339-00 pid=15966
    wal_e.operator.backup INFO     MSG: begin wal restore
            STRUCTURED: time=2014-10-21T23:57:10.761165-00 pid=15966 
action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/00000003.history.lzo prefi
    x=PREFIX/ seg=00000003.history state=begin
    wal_e.blobstore.s3.s3_util INFO     MSG: completed download and 
decompression
            DETAIL: Downloaded and decompressed 
"s3://BUCKET/PREFIX/wal_005/00000003.history.lzo" to 
"pg_xlog/RECOVERYHISTORY"
            STRUCTURED: time=2014-10-21T23:57:10.988149-00 pid=15966
    wal_e.operator.backup INFO     MSG: complete wal restore
            STRUCTURED: time=2014-10-21T23:57:10.989599-00 pid=15966 
action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/00000003.history.lzo 
prefix=PREFIX/ seg=00000003.history state=complete
    2014-10-21 23:57:11 UTC LOG:  restored log file "00000003.history" from 
archive
    2014-10-21 23:57:11 UTC LOG:  archive recovery complete
    2014-10-21 23:57:11 UTC LOG:  database system is ready to accept 
connections

-- 
You received this message because you are subscribed to the Google Groups 
"wal-e" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.

Reply via email to