On Tue, Oct 21, 2014 at 10:31 PM, Hunter Blanks <[email protected]> wrote:
> Daniel,
>
> On Tue, Oct 21, 2014 at 10:12 PM, Daniel Farina <[email protected]> wrote:
>>
>> So, no standby_mode = on in recovery.conf?  Can you give that a try
>> and use the "trigger file" to come out of recovery?
>
>
> I'll give that a try in the AM and let you know how it goes. I'd imagine it
> should work fine, although it probably doesn't fix the root problem. For
> development environments, we almost always automate WAL-E recovery up to the
> last checkpoint and then kick it out of recovery. Requiring standby_mode =
> on makes it so the provisioner has to figure out when to take the machine
> out of recovery. Doing that right seems a little tricky.
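
(For anyone following along, the arrangement under discussion looks
roughly like this in recovery.conf; the restore_command and trigger
path are illustrative, not from Hunter's setup:)

    # recovery.conf sketch
    restore_command = 'wal-e wal-fetch "%f" "%p"'
    standby_mode = 'on'
    trigger_file = '/tmp/postgresql.trigger'

The provisioner's problem above is deciding when to touch that
trigger file.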

Yeah, looking at this more, I'm pretty sure I flubbed this:

    def __exit__(self, exc_type, exc_val, exc_tb):
        try:
            if exc_type is None:
                # Success.  Mark the segment as complete.
                #
                # In event of a crash, this os.link() without an fsync
                # can leave a corrupt file in the prefetch directory,
                # but given Postgres retries corrupt archive logs
                # (because it itself makes no provisions to sync
                # them), that is assumed to be acceptable.
                os.link(self.tf.name, path.join(
                    self.prefetch_dir.prefetched_dir, self.segment.name))
        finally:
            shutil.rmtree(self.prefetch_dir.seg_dir(self.segment))

In combination with:

    def wal_prefetch(self, base, segment_name):
        url = '{0}://{1}/{2}'.format(
            self.layout.scheme, self.layout.store_name(),
            self.layout.wal_path(segment_name))
        pd = prefetch.Dirs(base)
        seg = WalSegment(segment_name)
        pd.create(seg)
        with pd.download(seg) as d:
            logger.info(
                msg='begin wal restore',
                structured={'action': 'wal-prefetch',
                            'key': url,
                            'seg': segment_name,
                            'prefix': self.layout.path_prefix,
                            'state': 'begin'})

            ret = do_lzop_get(self.creds, url, d.dest,
                              self.gpg_key_id is not None, do_retry=False)

            logger.info(
                msg='complete wal restore',
                structured={'action': 'wal-prefetch',
                            'key': url,
                            'seg': segment_name,
                            'prefix': self.layout.path_prefix,
                            'state': 'complete'})

            return ret

Note how the code immediately above uses do_lzop_get, which signals a
404 via its return code rather than by raising an exception.  So the
__exit__ won't clean up as anticipated: exc_type is None even when
the download failed.
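To make the failure mode concrete, here's a stripped-down
reproduction (hypothetical names, not the real WAL-E classes): a
context manager that links on "no exception" happily publishes the
bogus file when the download merely *returns* failure.

```python
import os
import shutil
import tempfile
from os import path


class Download(object):
    """Minimal stand-in for the prefetch download context manager."""

    def __init__(self, prefetched_dir, seg_dir, segment_name):
        self.prefetched_dir = prefetched_dir
        self.seg_dir = seg_dir
        self.segment_name = segment_name
        self.dest = path.join(seg_dir, segment_name)

    def __enter__(self):
        # Simulate a download that produced no usable data.
        open(self.dest, 'w').close()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        try:
            if exc_type is None:
                # Runs even when do_lzop_get merely *returned* a
                # failure code: no exception looks like success here.
                os.link(self.dest,
                        path.join(self.prefetched_dir, self.segment_name))
        finally:
            shutil.rmtree(self.seg_dir)


def fake_do_lzop_get():
    # 404 signalled via return value, not an exception.
    return False


base = tempfile.mkdtemp()
prefetched = path.join(base, 'prefetched')
seg_dir = path.join(base, 'seg')
os.makedirs(prefetched)
os.makedirs(seg_dir)

with Download(prefetched, seg_dir, '000000010000000000000001') as d:
    ret = fake_do_lzop_get()

# The empty file was linked into the prefetch directory even though
# the "download" failed.
print(path.exists(path.join(prefetched, '000000010000000000000001')))
```

The obvious repair is either to raise on a failed fetch inside the
with-block, or to have __exit__ key off an explicit success flag
rather than the absence of an exception.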

In the prior code, on re-thinking, the comment is probably wrong, or
Postgres has a bug: Postgres, if it doesn't already, should never
trust a RECOVERY_XLOG (a semi-temporary file) that is already present
in pg_xlog a-priori, and should always run the restore_command at
least once.  Whereas WAL-E, because it starts and exits so
frequently, currently has no way to tell whether the system has been
online continuously, so its promotion logic is liable to commit
exactly that mistake.

The fix that is apparent to me is to find a way to detect continuous
system operation across executions, such as writing the boot time
somewhere under the ".wal-e" directory.  This is slightly
non-portable and a bit grotty, but I don't have a better idea right
now.
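A rough sketch of that boot-time idea (the file layout and function
names are made up, and reading /proc/stat is the Linux-only, hence
"slightly non-portable", part):

```python
import os


def read_boot_time():
    # Linux-specific: the 'btime' line in /proc/stat is the kernel
    # boot timestamp in epoch seconds.
    with open('/proc/stat') as f:
        for line in f:
            if line.startswith('btime '):
                return int(line.split()[1])
    raise RuntimeError('no btime line in /proc/stat')


def same_boot(state_path, boot_time):
    """Record boot_time at state_path and report continuity.

    Returns True only if the previously recorded boot time matches
    the current one, i.e. the machine has not rebooted between WAL-E
    executions; otherwise cached prefetch state is suspect.
    """
    recorded = None
    if os.path.exists(state_path):
        with open(state_path) as f:
            recorded = int(f.read().strip())
    with open(state_path, 'w') as f:
        f.write(str(boot_time))
    return recorded == boot_time
```

Promotion would then only trust an already-present prefetched file
when same_boot(...) is True.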

-- 
You received this message because you are subscribed to the Google Groups 
"wal-e" group.