On Tue, Oct 21, 2014 at 10:31 PM, Hunter Blanks <[email protected]> wrote:
> Daniel,
>
> On Tue, Oct 21, 2014 at 10:12 PM, Daniel Farina <[email protected]> wrote:
>>
>> So, no standby_mode = on in recovery.conf? Can you give that a try
>> and use the "trigger file" to come out of recovery?
>
>
> I'll give that a try in the AM and let you know how it goes. I'd imagine it
> should work fine, although it probably doesn't fix the root problem. For
> development environments, we almost always automate WAL-E recovery up to the
> last checkpoint and then kick it out of recovery. Requiring standby_mode =
> on makes it so the provisioner has to figure out when to take the machine
> out of recovery. Doing that right seems a little tricky.

Yeah, looking at this more, I'm pretty sure I flubbed this:

    def __exit__(self, exc_type, exc_val, exc_tb):
        try:
            if exc_type is None:
                # Success.  Mark the segment as complete.
                #
                # In event of a crash, this os.link() without an fsync
                # can leave a corrupt file in the prefetch directory,
                # but given Postgres retries corrupt archive logs
                # (because it itself makes no provisions to sync
                # them), that is assumed to be acceptable.
                os.link(self.tf.name, path.join(
                    self.prefetch_dir.prefetched_dir, self.segment.name))
        finally:
            shutil.rmtree(self.prefetch_dir.seg_dir(self.segment))

In combination with:

    def wal_prefetch(self, base, segment_name):
        url = '{0}://{1}/{2}'.format(
            self.layout.scheme, self.layout.store_name(),
            self.layout.wal_path(segment_name))
        pd = prefetch.Dirs(base)
        seg = WalSegment(segment_name)
        pd.create(seg)
        with pd.download(seg) as d:
            logger.info(
                msg='begin wal restore',
                structured={'action': 'wal-prefetch',
                            'key': url,
                            'seg': segment_name,
                            'prefix': self.layout.path_prefix,
                            'state': 'begin'})
            ret = do_lzop_get(self.creds, url, d.dest,
                              self.gpg_key_id is not None, do_retry=False)
            logger.info(
                msg='complete wal restore',
                structured={'action': 'wal-prefetch',
                            'key': url,
                            'seg': segment_name,
                            'prefix': self.layout.path_prefix,
                            'state': 'complete'})
            return ret

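Note how the code immediately above uses do_lzop_get, which returns a
code to signify a 404 rather than raising an exception. Since no
exception reaches __exit__, exc_type is None and the success branch
runs even though nothing usable was downloaded. So __exit__ won't
clean up as anticipated.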
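
To make that failure mode concrete, here is a minimal, self-contained
sketch. It is not WAL-E code: Download, fake_get, prefetch, and the
mark_failed() hook are all hypothetical names made up for illustration,
and mark_failed() is only one possible way to let __exit__ see a
failure that was reported by return value.

```python
import os
import shutil
import tempfile


class Download(object):
    """Hypothetical stand-in for the prefetch download context manager."""

    def __init__(self, work_dir, done_dir, name):
        self.work_dir = work_dir
        self.done_dir = done_dir
        self.name = name
        self.dest = os.path.join(work_dir, name)
        self.failed = False

    def __enter__(self):
        return self

    def mark_failed(self):
        # Hypothetical hook: lets the body report a failure that was
        # signalled by a return value instead of an exception.
        self.failed = True

    def __exit__(self, exc_type, exc_val, exc_tb):
        try:
            # Without the "not self.failed" check, a False return from
            # the fetcher is indistinguishable from success here.
            if exc_type is None and not self.failed:
                os.link(self.dest, os.path.join(self.done_dir, self.name))
        finally:
            shutil.rmtree(self.work_dir)


def fake_get(dest, found):
    """Hypothetical fetcher: returns False on a 404 instead of raising."""
    with open(dest, 'wb') as f:
        if found:
            f.write(b'segment bytes')
    return found


def prefetch(base, name, found, honor_return_code):
    """Run one download; return True if the segment was promoted."""
    work_dir = os.path.join(base, 'running', name)
    done_dir = os.path.join(base, 'prefetched')
    os.makedirs(work_dir, exist_ok=True)
    os.makedirs(done_dir, exist_ok=True)
    with Download(work_dir, done_dir, name) as d:
        ret = fake_get(d.dest, found)
        if honor_return_code and not ret:
            d.mark_failed()
    return os.path.exists(os.path.join(done_dir, name))
```

With honor_return_code=False this mirrors the problem described above:
a 404 still promotes an empty file. Raising an exception from the body
on a falsy return would be another way to get the same effect.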

In the prior code, on re-thinking, the comment is probably wrong or
Postgres has a bug: Postgres, if it does not already, should never
trust a RECOVERYXLOG (semi-temporary file) in pg_xlog that is available
a priori, and should always run the restore_command once. WAL-E, by
contrast, starts and exits so frequently that it currently can't tell
whether the system has been online continuously, so its promotion
logic is liable to make exactly that mistake.

The fix that is apparent to me is to find a way to verify continuous
system operation even between executions, such as writing the boot
time to the ".wal-e" directory somewhere. This is slightly
non-portable and a bit grotty, but I don't have a better idea right
now.
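
A rough sketch of what such a marker could look like, under stated
assumptions: it is Linux-only, it uses the kernel's boot_id from
/proc/sys/kernel/random/boot_id rather than a literal boot time, and
the function names and the "boot-id" marker file name are invented
here. It only detects a reboot between executions, not a Postgres
restart on a machine that stayed up.

```python
import os


def read_boot_id(path='/proc/sys/kernel/random/boot_id'):
    """Linux-only: read the kernel's per-boot random identifier."""
    with open(path) as f:
        return f.read().strip()


def same_boot_as_last_run(state_dir, current_boot_id):
    """Return True if state_dir already records current_boot_id.

    On a mismatch (or a missing marker) the marker is rewritten and
    False is returned, so the caller can distrust anything prefetched
    before now.
    """
    marker = os.path.join(state_dir, 'boot-id')  # hypothetical file name
    try:
        with open(marker) as f:
            previous = f.read().strip()
    except FileNotFoundError:
        previous = None

    if previous == current_boot_id:
        return True

    os.makedirs(state_dir, exist_ok=True)
    tmp = marker + '.tmp'
    with open(tmp, 'w') as f:
        f.write(current_boot_id + '\n')
        f.flush()
        os.fsync(f.fileno())
    # Atomic rename so a crash never leaves a half-written marker.
    os.replace(tmp, marker)
    return False
```

A caller would do something like
same_boot_as_last_run(os.path.join(base, '.wal-e'), read_boot_id())
and throw away the prefetch directory's contents when it returns False.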