Daniel,

Thanks for writing. Setting standby_mode = on and then triggering end of
recovery with trigger_file does work, regardless of whether prefetch is
enabled. More logs follow.
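For reference, the recovery.conf under test looks roughly like the following (BUCKET/PREFIX stand in for real values, as in the logs below, and the trigger path is just an example):

```
restore_command = 'wal-e wal-fetch "%f" "%p"'
standby_mode = on
trigger_file = '/tmp/postgresql.trigger'
```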

I share your frustration with the "exit-code-as-interface" limitations of
PostgreSQL's restore_command. Of course, since we've daemonized
wal-prefetch, does it actually matter whether we exit non-zero? Couldn't we
just raise an exception if do_lzop_get() returns false?
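Something like this sketch is what I have in mind (the exception class and wrapper names are hypothetical; it just wraps the existing do_lzop_get call, which, per the quoted code, signals a 404 via its return value):

```python
class WalSegmentMissing(Exception):
    """Raised when the blobstore reports a WAL segment as missing (404)."""


def fetch_or_raise(do_lzop_get, creds, url, dest, decrypt):
    """Wrap do_lzop_get so a missing segment surfaces as an exception.

    Assumes do_lzop_get returns a falsy value on 404, as in the quoted
    wal_prefetch code; a daemonized prefetcher can then catch this
    instead of relying on process exit codes.
    """
    ret = do_lzop_get(creds, url, dest, decrypt, do_retry=False)
    if not ret:
        raise WalSegmentMissing(url)
    return ret
```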

-HJB

Logs:

2014-10-22 19:52:08 UTC LOG:  restored log file "0000000300000042000000F4"
from archive
wal_e.main   INFO     MSG: starting WAL-E
        DETAIL: The subcommand is "wal-fetch".
        STRUCTURED: time=2014-10-22T19:52:08.937034-00 pid=18312
wal_e.operator.backup INFO     MSG: begin wal restore
        STRUCTURED: time=2014-10-22T19:52:08.970987-00 pid=18312
action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/00000004.history.lzo
prefix=PREFIX/ seg=00000004.history state=begin
lzop: <stdin>: not a lzop file
wal_e.blobstore.s3.s3_util WARNING  MSG: could no longer locate object
while performing wal restore
        DETAIL: The absolute URI that could not be located is
s3://BUCKET/PREFIX/wal_005/00000004.history.lzo.
        HINT: This can be normal when Postgres is trying to detect what
timelines are available during restoration.
        STRUCTURED: time=2014-10-22T19:52:09.098063-00 pid=18312
wal_e.operator.backup INFO     MSG: complete wal restore
        STRUCTURED: time=2014-10-22T19:52:09.099036-00 pid=18312
action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/00000004.history.lzo
prefix=PREFIX/ seg=00000004.history state=complete
wal_e.main   INFO     MSG: starting WAL-E
        DETAIL: The subcommand is "wal-fetch".
        STRUCTURED: time=2014-10-22T19:52:09.290002-00 pid=18332
wal_e.operator.backup INFO     MSG: begin wal restore
        STRUCTURED: time=2014-10-22T19:52:09.323112-00 pid=18332
action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/00000005.history.lzo
prefix=PREFIX/ seg=00000005.history state=begin
lzop: <stdin>: not a lzop file
wal_e.blobstore.s3.s3_util WARNING  MSG: could no longer locate object
while performing wal restore
        DETAIL: The absolute URI that could not be located is
s3://BUCKET/PREFIX/wal_005/00000005.history.lzo.
        HINT: This can be normal when Postgres is trying to detect what
timelines are available during restoration.
        STRUCTURED: time=2014-10-22T19:52:09.457994-00 pid=18332
wal_e.operator.backup INFO     MSG: complete wal restore
        STRUCTURED: time=2014-10-22T19:52:09.458903-00 pid=18332
action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/00000005.history.lzo
prefix=PREFIX/ seg=00000005.history state=complete
2014-10-22 19:52:09 UTC LOG:  selected new timeline ID: 5
wal_e.main   INFO     MSG: starting WAL-E
        DETAIL: The subcommand is "wal-fetch".
        STRUCTURED: time=2014-10-22T19:52:09.649697-00 pid=18339
wal_e.operator.backup INFO     MSG: begin wal restore
        STRUCTURED: time=2014-10-22T19:52:09.682317-00 pid=18339
action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/00000003.history.lzo
prefix=PREFIX/ seg=00000003.history state=begin
wal_e.blobstore.s3.s3_util INFO     MSG: completed download and
decompression
        DETAIL: Downloaded and decompressed
"s3://BUCKET/PREFIX/wal_005/00000003.history.lzo" to
"pg_xlog/RECOVERYHISTORY"
        STRUCTURED: time=2014-10-22T19:52:09.915124-00 pid=18339
wal_e.operator.backup INFO     MSG: complete wal restore
        STRUCTURED: time=2014-10-22T19:52:09.916287-00 pid=18339
action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/00000003.history.lzo
prefix=PREFIX/ seg=00000003.history state=complete


On Wed, Oct 22, 2014 at 10:56 AM, Daniel Farina <[email protected]> wrote:

> On Tue, Oct 21, 2014 at 10:31 PM, Hunter Blanks <[email protected]>
> wrote:
> > Daniel,
> >
> > On Tue, Oct 21, 2014 at 10:12 PM, Daniel Farina <[email protected]>
> wrote:
> >>
> >> So, no standby_mode = on in recovery.conf?  Can you give that a try
> >> and use the "trigger file" to come out of recovery?
> >
> >
> > I'll give that a try in the AM and let you know how it goes. I'd
> > imagine it should work fine, although it probably doesn't fix the
> > root problem. For development environments, we almost always
> > automate WAL-E recovery up to the last checkpoint and then kick it
> > out of recovery. Requiring standby_mode = on makes it so the
> > provisioner has to figure out when to take the machine out of
> > recovery. Doing that right seems a little tricky.
>
> Yeah, looking at this more, I'm pretty sure I flubbed this:
>
>     def __exit__(self, exc_type, exc_val, exc_tb):
>         try:
>             if exc_type is None:
>                 # Success.  Mark the segment as complete.
>                 #
>                 # In event of a crash, this os.link() without an fsync
>                 # can leave a corrupt file in the prefetch directory,
>                 # but given Postgres retries corrupt archive logs
>                 # (because it itself makes no provisions to sync
>                 # them), that is assumed to be acceptable.
>                 os.link(self.tf.name, path.join(
>                     self.prefetch_dir.prefetched_dir, self.segment.name))
>         finally:
>             shutil.rmtree(self.prefetch_dir.seg_dir(self.segment))
>
> In combination with:
>
>     def wal_prefetch(self, base, segment_name):
>         url = '{0}://{1}/{2}'.format(
>             self.layout.scheme, self.layout.store_name(),
>             self.layout.wal_path(segment_name))
>         pd = prefetch.Dirs(base)
>         seg = WalSegment(segment_name)
>         pd.create(seg)
>         with pd.download(seg) as d:
>             logger.info(
>                 msg='begin wal restore',
>                 structured={'action': 'wal-prefetch',
>                             'key': url,
>                             'seg': segment_name,
>                             'prefix': self.layout.path_prefix,
>                             'state': 'begin'})
>
>             ret = do_lzop_get(self.creds, url, d.dest,
>                               self.gpg_key_id is not None, do_retry=False)
>
>             logger.info(
>                 msg='complete wal restore',
>                 structured={'action': 'wal-prefetch',
>                             'key': url,
>                             'seg': segment_name,
>                             'prefix': self.layout.path_prefix,
>                             'state': 'complete'})
>
>             return ret
>
> Note how the code immediately above uses do_lzop_get, which signals a
> 404 via its return code rather than an exception.  So the __exit__
> won't clean up as anticipated.
>
> In the prior code, on re-thinking, either the comment is wrong or
> Postgres has a bug: Postgres should never trust a RECOVERY_XLOG
> (semi-temporary file) in pg_xlog that is available a-priori, and
> should always run the restore_command once.  Whereas WAL-E, since it
> starts and exits so frequently, can't currently tell whether the
> system has been online continuously, so its promotion logic is liable
> to commit such a mistake.
>
> The fix that is apparent to me is to find a way to ensure continuous
> system operation even between executions, such as spitting out
> boot-time to the ".wal-e" directory somewhere.  This is slightly
> non-portable and a bit grotty but I don't have a better idea right
> now.
>

-- 
You received this message because you are subscribed to the Google Groups 
"wal-e" group.