I took another whack at this. I'm not seeing anything too damning in the
strace logs. (I'm running with trace=network,process,file,signal, which I
would imagine to be sufficient. Might I be missing something there?)
I tried this on a fresh bucket, so as to avoid any possible rate-limiting
issues from multiple nodes backing up to the same bucket. I'm seeing the
failures starting from the beginning this time. (In my previous report, I
noted that they only started a few hours in.)
Here's a cut from the logs:
wal_e.worker.upload INFO MSG: Retrying send because of a socket error
DETAIL: The socket error's message is '[Errno 104] Connection reset
by peer'. There have been 1 attempts to send the volume 0 so far.
STRUCTURED: time=2014-06-04T21:37:05.576854-00 pid=46822
wal_e.worker.upload INFO MSG: Retrying send because of a socket error
DETAIL: The socket error's message is 'timed out'. There have been
1 attempts to send the volume 3 so far.
STRUCTURED: time=2014-06-04T21:37:22.262263-00 pid=46822
wal_e.worker.upload INFO MSG: Retrying send because of a socket error
DETAIL: The socket error's message is '[Errno 32] Broken pipe'.
There have been 1 attempts to send the volume 1 so far.
STRUCTURED: time=2014-06-04T21:38:02.876949-00 pid=46822
...
wal_e.worker.upload INFO MSG: Retrying send because of a socket error
DETAIL: The socket error's message is 'timed out'. There have been
1 attempts to send the volume 4 so far.
STRUCTURED: time=2014-06-04T21:41:22.056998-00 pid=46822
wal_e.worker.upload INFO MSG: Retrying send because of a socket error
DETAIL: The socket error's message is 'timed out'. There have been
2 attempts to send the volume 3 so far.
STRUCTURED: time=2014-06-04T21:43:38.755493-00 pid=46822
wal_e.worker.upload INFO MSG: Retrying send because of a socket error
DETAIL: The socket error's message is 'timed out'. There have been
2 attempts to send the volume 0 so far.
STRUCTURED: time=2014-06-04T21:43:42.338737-00 pid=46822
wal_e.worker.upload INFO MSG: Retrying send because of a socket error
DETAIL: The socket error's message is 'timed out'. There have been
2 attempts to send the volume 1 so far.
STRUCTURED: time=2014-06-04T21:44:59.760043-00 pid=46822
So I'm seeing the failures starting from segment 0, but not on all of them.
Segment 2 uploaded without issue. Some of the segments that fail do so a
few times before eventually uploading successfully.
The network use is high while this is happening -- around 500 Mb/s with
occasional peaks at 750 Mb/s -- but I'm not sure if this is blocked on
network throughput. (If it were, wouldn't we see a consistent plateau
around at least 750 Mb/s?)
In any case, it doesn't appear to be an incorrectly cached DNS entry, as
these failures are happening from the beginning of the backup-push
operation.
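For reference, one quick cross-check on the DNS angle is to resolve the
endpoint repeatedly and see how many distinct backend IPs come back, then
compare against the addresses strace shows in connect(). A minimal sketch
(the hostname in the comment is just an example S3 endpoint, not
necessarily the one we use):

```python
import socket

def resolved_ips(host, samples=5):
    """Resolve `host` several times and collect the distinct
    IPv4 addresses the resolver returns."""
    ips = set()
    for _ in range(samples):
        for info in socket.getaddrinfo(host, 443, socket.AF_INET,
                                       socket.SOCK_STREAM):
            ips.add(info[4][0])  # sockaddr is (ip, port); keep the ip
    return ips

# e.g. resolved_ips('s3.amazonaws.com') -- if this returns many
# addresses while strace shows connect() pinned to a single one,
# something below wal-e is caching the resolution.
```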
Is there any other info I can fetch that might be helpful in diagnosing
what's going on?
-Dan
On Wednesday, May 21, 2014 12:21:28 AM UTC-7, Daniel Farina wrote:
>
> On Tue, May 20, 2014 at 4:47 PM, Dan Robinson <[email protected]> wrote:
> > We're running wal-e 0.7.0 with postgresql 9.3.4.
> >
> > During base backups on some of our nodes, we're getting repeated "socket
> > failure" errors, with the same 'timed out' error message every time. When
> > this happens, wal-e will fail to upload more chunks. E.g.:
> >
> > wal_e.worker.upload INFO MSG: Retrying send because of a socket error
> > DETAIL: The socket error's message is 'timed out'. There have been
> > 39 attempts to send the volume 325 so far.
> > STRUCTURED: time=2014-05-20T15:36:13.002438-00 pid=48972
> >
> >
> > ... failing for the same chunk many times until the backup job times out.
> > Our logs show this consistent failure pattern starting around 4 hours into
> > the base backup. We get this behavior on a subset of our nodes, all of which
> > are the same EC2 instance type and have comparable DB sizes.
> >
> > Is this something you've seen before? Is there any other info that might be
> > helpful in diagnosing what's going on?
>
> I've seen this and some other errors, such as connection reset by
> peer. Have you tried denicing the process?
>
> I think I have a root cause for some of these stubborn problems in
> general, but have yet to figure out how to patch it (I'm keeping it
> for a rainy day, or an interested contributor).
>
> One thing I've noticed is that WAL-E doesn't re-resolve DNS when
> retrying from errors, at least if ltrace and strace are any
> indication. My hypothesis is that a bad S3 backend encountered during
> the many hours it takes to upload a database with hundreds of
> partitions (300GB+) can become a stubborn problem for WAL-E.
>
> I did inspect the code, and re-connection appears to happen with
> fairly clean state in WAL-E, so I suspect there is some kind of
> caching or pooling deeper in the stack that I'm not seeing.
>
> Maybe a solution: force a re-resolution of that hostname. I have yet
> to put this under a microscope and simulate it, given that the only
> workaround is to (painfully) try the backup again.
>
> Differently: bite the bullet and support a wal-e state directory and
> fill that with backup-resume information so that one can tear down and
> set up the WAL-E process, mitigating this and other problems.
>
> All in all:
>
> See what "strace" says about which IP you are connecting to: whether
> it stays stubbornly the same, and whether or not it matches fresh
> hand-rolled DNS queries against the S3 endpoint (e.g. via dig). Then
> consider patches that force variance in the resolved hostname
> (connect() should be seen against many target IP addresses across
> retries; so far that appears not to be the case).
>
> So, it's a little thorny, but I'd be grateful for outside
> corroboration and assistance. I've tabled the matter for myself for
> now.
>
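
The forced re-resolution idea above could look something like this in a
retry path. This is a hypothetical sketch, not WAL-E's actual code -- the
real upload path goes through boto, which may pool connections deeper down:

```python
import socket

def fresh_connect(host, port, timeout=60):
    """Open a TCP connection, resolving the hostname anew on every
    call so a retry can land on a different backend IP.
    (Hypothetical helper; not part of WAL-E.)"""
    last_err = None
    for family, socktype, proto, _, addr in socket.getaddrinfo(
            host, port, socket.AF_UNSPEC, socket.SOCK_STREAM):
        try:
            sock = socket.socket(family, socktype, proto)
            sock.settimeout(timeout)
            sock.connect(addr)
            return sock
        except socket.error as err:
            last_err = err  # try the next resolved address
    raise last_err
```

Calling this per retry attempt, instead of reusing a pooled connection,
would make strace show connect() against varying target IPs.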
--
You received this message because you are subscribed to the Google Groups
"wal-e" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
For more options, visit https://groups.google.com/d/optout.