I took another whack at this. I'm not seeing anything too damning in the
strace logs. (I'm running with trace=network,process,file,signal, which I
would imagine to be sufficient. Might I be missing something there?)
I tried this on a fresh bucket, so as to avoid any possible rate-limiting
issues from multiple nodes backing up to the same bucket. I'm seeing the
failures starting from the beginning this time. (In my previous report, I
noted that they only started a few hours in.)
Here's a cut from the logs:
wal_e.worker.upload INFO MSG: Retrying send because of a socket error
DETAIL: The socket error's message is '[Errno 104] Connection reset
by peer'. There have been 1 attempts to send the volume 0 so far.
STRUCTURED: time=2014-06-04T21:37:05.576854-00 pid=46822
wal_e.worker.upload INFO MSG: Retrying send because of a socket error
DETAIL: The socket error's message is 'timed out'. There have been
1 attempts to send the volume 3 so far.
STRUCTURED: time=2014-06-04T21:37:22.262263-00 pid=46822
wal_e.worker.upload INFO MSG: Retrying send because of a socket error
DETAIL: The socket error's message is '[Errno 32] Broken pipe'.
There have been 1 attempts to send the volume 1 so far.
STRUCTURED: time=2014-06-04T21:38:02.876949-00 pid=46822
...
wal_e.worker.upload INFO MSG: Retrying send because of a socket error
DETAIL: The socket error's message is 'timed out'. There have been
1 attempts to send the volume 4 so far.
STRUCTURED: time=2014-06-04T21:41:22.056998-00 pid=46822
wal_e.worker.upload INFO MSG: Retrying send because of a socket error
DETAIL: The socket error's message is 'timed out'. There have been
2 attempts to send the volume 3 so far.
STRUCTURED: time=2014-06-04T21:43:38.755493-00 pid=46822
wal_e.worker.upload INFO MSG: Retrying send because of a socket error
DETAIL: The socket error's message is 'timed out'. There have been
2 attempts to send the volume 0 so far.
STRUCTURED: time=2014-06-04T21:43:42.338737-00 pid=46822
wal_e.worker.upload INFO MSG: Retrying send because of a socket error
DETAIL: The socket error's message is 'timed out'. There have been
2 attempts to send the volume 1 so far.
STRUCTURED: time=2014-06-04T21:44:59.760043-00 pid=46822
So I'm seeing the failures starting from segment 0, but not on all of them.
Segment 2 uploaded without issue. Some of the segments that fail do so a
few times before eventually uploading successfully.
The network use is high while this is happening -- around 500 Mb/s with
occasional peaks at 750 Mb/s -- but I'm not sure if this is blocked on
network throughput. (If it were, wouldn't we see a consistent plateau
around at least 750 Mb/s?)
In any case, it doesn't appear to be an incorrectly cached DNS entry, as
these failures are happening from the beginning of the backup-push
operation.
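For reference, one quick cross-check on the DNS angle is to resolve the
endpoint repeatedly and see how many distinct backend IPs come back, then
compare against the addresses strace shows in connect(). A minimal sketch
(the hostname in the comment is just an example S3 endpoint, not
necessarily the one we use):

```python
import socket

def resolved_ips(host, samples=5):
    """Resolve `host` several times and collect the distinct
    IPv4 addresses the resolver returns."""
    ips = set()
    for _ in range(samples):
        for info in socket.getaddrinfo(host, 443, socket.AF_INET,
                                       socket.SOCK_STREAM):
            ips.add(info[4][0])  # sockaddr is (ip, port); keep the ip
    return ips

# e.g. resolved_ips('s3.amazonaws.com') -- if this returns many
# addresses while strace shows connect() pinned to a single one,
# something below wal-e is caching the resolution.
```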
Is there any other info I can fetch that might be helpful in diagnosing
what's going on?
-Dan
On Wednesday, May 21, 2014 12:21:28 AM UTC-7, Daniel Farina wrote:
>
> On Tue, May 20, 2014 at 4:47 PM, Dan Robinson <[email protected]> wrote:
> > We're running wal-e 0.7.0 with postgresql 9.3.4.
> >
> > During base backups on some of our nodes, we're getting repeated "socket
> > failure" errors, with the same 'timed out' error message every time. When
> > this happens, wal-e will fail to upload more chunks. E.g.:
> >
> > wal_e.worker.upload INFO MSG: Retrying send because of a socket error
> > DETAIL: The socket error's message is 'timed out'. There have been
> > 39 attempts to send the volume 325 so far.
> > STRUCTURED: time=2014-05-20T15:36:13.002438-00 pid=48972
> >
> >
> > ... failing for the same chunk many times until the backup job times out.
> > Our logs show this consistent failure pattern starting around 4 hours into
> > the base backup. We get this behavior on a subset of our nodes, all of which
> > are the same EC2 instance type and have comparable DB sizes.
> >
> > Is this something you've seen before? Is there any other info that might be
> > helpful in diagnosing what's going on?
>
> I've seen this and some other errors, such as connection reset by
> peer. Have you tried denicing the process?
>
> I think I have a root cause for some of these stubborn problems in
> general, but have yet to figure out how to patch it (I'm keeping it
> for a rainy day, or an interested contributor).
>
> One thing I've noticed is that WAL-E doesn't re-resolve DNS when
> retrying from errors, at least if ltrace and strace are any
> indication. My hypothesis is that a bad S3 backend encountered during
> the many hours it takes to upload a database with hundreds of
> partitions (300GB+) can become a stubborn problem for WAL-E.
>
> I did inspect the code, and re-connection appears to happen with
> fairly clean state in WAL-E, so I suspect there is some kind of
> caching or pooling deeper in the stack that I'm not seeing.
>
> Maybe a solution: force a re-resolution of that hostname. I have yet
> to put this under a microscope and simulate it, given that the only
> workaround is to (painfully) try the backup again.
>
> Differently: bite the bullet and support a wal-e state directory and
> fill that with backup-resume information so that one can tear down and
> set up the WAL-E process, mitigating this and other problems.
>
> All in all:
>
> See what "strace" says about which IP you are connecting to: whether
> it stays stubbornly the same, and whether or not it matches fresh
> hand-rolled DNS queries against the S3 endpoint (e.g. via dig). Then
> consider patches that force variance in the resolved hostname
> (connect() should be seen against many target IP addresses across
> retries; so far that appears not to be the case).
>
> So, it's a little thorny, but I'd be grateful for outside
> corroboration and assistance. I've tabled the matter for myself for
> now.
>
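
The forced re-resolution idea above could look something like this in a
retry path. This is a hypothetical sketch, not WAL-E's actual code -- the
real upload path goes through boto, which may pool connections deeper down:

```python
import socket

def fresh_connect(host, port, timeout=60):
    """Open a TCP connection, resolving the hostname anew on every
    call so a retry can land on a different backend IP.
    (Hypothetical helper; not part of WAL-E.)"""
    last_err = None
    for family, socktype, proto, _, addr in socket.getaddrinfo(
            host, port, socket.AF_UNSPEC, socket.SOCK_STREAM):
        try:
            sock = socket.socket(family, socktype, proto)
            sock.settimeout(timeout)
            sock.connect(addr)
            return sock
        except socket.error as err:
            last_err = err  # try the next resolved address
    raise last_err
```

Calling this per retry attempt, instead of reusing a pooled connection,
would make strace show connect() against varying target IPs.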
--
You received this message because you are subscribed to the Google Groups
"wal-e" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
For more options, visit https://groups.google.com/d/optout.