On Tue, May 20, 2014 at 4:47 PM, Dan Robinson <[email protected]> wrote:
> We're running wal-e 0.7.0 with postgresql 9.3.4.
>
> During base backups on some of our nodes, we're getting repeated "socket
> failure" errors, with the same 'timed out' error message every time. When
> this happens, wal-e will fail to upload more chunks. E.g.:
>
>     wal_e.worker.upload INFO MSG: Retrying send because of a socket error
>         DETAIL: The socket error's message is 'timed out'. There have been
>         39 attempts to send the volume 325 so far.
>         STRUCTURED: time=2014-05-20T15:36:13.002438-00 pid=48972
>
> ... failing for the same chunk many times until the backup job times out.
> Our logs show this consistent failure pattern starting around 4 hours into
> the base backup. We get this behavior on a subset of our nodes, all of
> which are the same EC2 instance type and have comparable DB sizes.
>
> Is this something you've seen before? Is there any other info that might
> be helpful in diagnosing what's going on?
I've seen this and some other errors, such as "connection reset by peer."
Have you tried renicing the process?

I think I have a root cause for some of these stubborn problems in general,
but I have yet to figure out how to patch it (I'm keeping it for a rainy
day, or an interested contributor).

One thing I've noticed is that WAL-E doesn't re-resolve DNS when retrying
after errors, at least if ltrace and strace are any indication. My
hypothesis is that a bad S3 backend, encountered during the many hours it
takes to upload a database with hundreds of partitions (300GB+), can become
a stubborn problem for WAL-E: once the process is talking to a bad IP
address, it keeps reconnecting to that same address. I did inspect the
code, and reconnection appears to happen with fairly clean state in WAL-E,
or so I think, so I suspect there is some kind of caching or pooling deeper
in the stack that I'm not seeing.

A possible solution: force a re-resolution of that hostname. I have yet to
put this under a microscope and simulate it, given that the workaround is
to (painfully) try the backup again.

Alternatively: bite the bullet and support a WAL-E state directory, filled
with backup-resume information, so that one can tear down and set up the
WAL-E process from scratch, mitigating this and other problems.

All in all: see what strace says about which IP you are connecting to,
whether it stays stubbornly the same, and whether or not it matches fresh
hand-rolled DNS queries against the S3 endpoint (e.g. via dig). Then
consider patches that attempt to force variance in the resolved hostname
(connect() should be seen against many target IP addresses while retrying;
so far that appears not to be the case).

So, it's a little thorny, but I'd be grateful for outside corroboration
and assistance. I've tabled the matter for myself for now.

--
You received this message because you are subscribed to the Google Groups "wal-e" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
For more options, visit https://groups.google.com/d/optout.
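P.S. To illustrate the re-resolution idea, here is a minimal Python sketch
of what a patch might do: ask the resolver for the endpoint's addresses
fresh on every retry instead of reusing whatever connection the deeper
stack may have pooled. The hostname and helper names are hypothetical for
illustration; this is not WAL-E code.

```python
import random
import socket

def fresh_addresses(host, port=443):
    """Ask the resolver for the endpoint's current IPv4 addresses.

    Calling getaddrinfo on every retry (rather than reusing a cached or
    pooled connection) gives DNS a chance to steer us away from a bad
    backend.
    """
    infos = socket.getaddrinfo(host, port, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); the IP is
    # the first element of sockaddr.
    return sorted({sockaddr[0] for _, _, _, _, sockaddr in infos})

def connect_fresh(host, port=443, timeout=60):
    """Open a TCP connection to a freshly resolved, randomly chosen IP."""
    addr = random.choice(fresh_addresses(host, port))
    return socket.create_connection((addr, port), timeout=timeout)
```

With something along these lines in the retry path, strace should show
connect() spread across the endpoint's address set rather than pinned to a
single IP, which would confirm or refute the hypothesis above.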
