On Tue, May 20, 2014 at 4:47 PM, Dan Robinson <[email protected]> wrote:
> We're running wal-e 0.7.0 with postgresql 9.3.4.
>
> During base backups on some of our nodes, we're getting repeated "socket
> failure" errors, with the same 'timed out' error message every time. When
> this happens, wal-e will fail to upload more chunks. E.g.:
>
> wal_e.worker.upload INFO     MSG: Retrying send because of a socket error
>         DETAIL: The socket error's message is 'timed out'.  There have been
> 39 attempts to send the volume 325 so far.
>         STRUCTURED: time=2014-05-20T15:36:13.002438-00 pid=48972
>
>
> ... failing for the same chunk many times until the backup job times out.
> Our logs show this consistent failure pattern starting around 4 hours into
> the base backup. We get this behavior on a subset of our nodes, all of which
> are the same EC2 instance type and have comparable DB sizes.
>
> Is this something you've seen before? Is there any other info that might be
> helpful in diagnosing what's going on?

I've seen this, and some other errors such as 'connection reset by
peer'.  Have you tried renicing the process to give it more CPU?

I think I have a root cause for some of these stubborn problems in
general, but I have yet to figure out how to patch it (I'm keeping it
for a rainy day, or for an interested contributor).

One thing I've noticed is that WAL-E doesn't re-resolve DNS when
retrying after errors, at least if ltrace and strace are any
indication.  My hypothesis is that a bad S3 backend, encountered
during the many hours it takes to upload a database with hundreds of
partitions (300GB+), can become a stubborn problem for WAL-E: every
retry reconnects to the same stale address.
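As a cheap stand-in for hand-rolled dig queries, a few repeated
lookups from Python show whether the endpoint's DNS answer rotates.
This is just a diagnostic sketch; resolve is my own name for it, not
anything in WAL-E:

```python
import socket

def resolve(host, port=443):
    """Return the set of IPv4 addresses a fresh lookup yields for host.

    Hypothetical diagnostic helper, not part of WAL-E.
    """
    infos = socket.getaddrinfo(host, port, socket.AF_INET,
                               socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # the IP address is the first element of sockaddr.
    return {info[4][0] for info in infos}

# Sampling resolve("s3.amazonaws.com") a few minutes apart should show
# the answer set drifting on a large rotation; a retry loop pinned to
# its first answer will not.
```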

I did inspect the code, and reconnection appears to happen with
fairly clean state in WAL-E itself, or so I think; I suspect there is
some kind of caching or connection pooling deeper in the stack that
I'm not seeing.

Maybe a solution: force a re-resolution of the hostname.  I have yet
to put this under a microscope and simulate it, given that the only
workaround is to (painfully) try the backup again.
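A minimal sketch of what forced re-resolution could look like,
assuming one could swap in the connect path.  connect_fresh is a
hypothetical name; neither WAL-E nor boto exposes such a hook today:

```python
import socket

def connect_fresh(host, port, timeout=60):
    """Open a TCP connection, re-resolving host on every call so each
    retry is free to land on a different backend address.

    A sketch of the proposed fix, not a WAL-E API.
    """
    last_err = None
    for family, socktype, proto, _canon, sockaddr in socket.getaddrinfo(
            host, port, socket.AF_UNSPEC, socket.SOCK_STREAM):
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout)
        try:
            sock.connect(sockaddr)
            return sock
        except OSError as err:
            last_err = err
            sock.close()
    if last_err is not None:
        raise last_err
    raise OSError("no usable addresses for %r" % (host,))
```

Called once per retry attempt, this keeps the resolver in the loop
instead of reusing whatever address was picked hours earlier.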

Differently: bite the bullet and support a WAL-E state directory,
filled with backup-resume information, so that one can tear down and
set up the WAL-E process anew, mitigating this and other problems.

All in all:

See what "strace" says about which IP you are connecting to: whether
it stays stubbornly the same, and whether or not it matches recent
hand-rolled DNS queries against the S3 endpoint (e.g. via dig).  Then
consider patches that force variance in the resolved hostname:
connect() should be seen against many target IP addresses while
retrying, and so far that does not appear to be the case.
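The same comparison can be scripted rather than eyeballed from strace
output.  This hypothetical helper (again, not part of WAL-E) checks a
live socket's peer address against a fresh lookup:

```python
import socket

def peer_matches_fresh_dns(sock, host, port=443):
    """True if sock's connected peer IP is among the addresses a fresh
    DNS lookup returns for host -- the strace-versus-dig check in code.

    Hypothetical helper for diagnosis, not part of WAL-E.
    """
    peer_ip = sock.getpeername()[0]
    fresh = {info[4][0] for info in socket.getaddrinfo(
        host, port, socket.AF_INET, socket.SOCK_STREAM)}
    return peer_ip in fresh
```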

So, it's a little thorny, but I'd be grateful for outside
corroboration and assistance.  I've tabled the matter for myself for
now.

-- 
You received this message because you are subscribed to the Google Groups 
"wal-e" group.