[wal-e] Socket Timeouts When Backing Up Large Instances

Dan Robinson Tue, 20 May 2014 16:48:07 -0700

We're running wal-e 0.7.0 with PostgreSQL 9.3.4.

During base backups on some of our nodes, we're getting repeated socket 
errors, each with the same 'timed out' message. Once this starts, wal-e 
fails to upload any further chunks. E.g.:


wal_e.worker.upload INFO     MSG: Retrying send because of a socket error
        DETAIL: The socket error's message is 'timed out'.  There have been 39 attempts to send the volume 325 so far.
        STRUCTURED: time=2014-05-20T15:36:13.002438-00 pid=48972


... failing for the same chunk many times until the backup job times out. 
Our logs show this consistent failure pattern starting around 4 hours into 
the base backup. We get this behavior on a subset of our nodes, all of 
which are the same EC2 instance type and have comparable DB sizes.
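
For anyone who wants to check their own logs for the same pattern, here is a 
minimal Python 3 sketch, not part of wal-e, that tallies these retry messages 
per volume. The function name summarize_retries and the regular expression are 
illustrative assumptions based only on the message text quoted above; other 
wal-e versions may word the message differently.

```python
import re

# Matches the DETAIL text of the retry message quoted above. \s+ tolerates
# the line wrapping that mail clients or log viewers may introduce.
# Assumption: the wording is exactly as in the wal-e 0.7.0 excerpt.
RETRY_RE = re.compile(
    r"The socket error's message is '(?P<msg>[^']*)'\.\s+"
    r"There have been\s+(?P<attempts>\d+)\s+attempts to send the volume\s+"
    r"(?P<volume>\d+)"
)


def summarize_retries(log_text):
    """Return {volume: (highest attempt count seen, set of error messages)}."""
    summary = {}
    for match in RETRY_RE.finditer(log_text):
        volume = int(match.group("volume"))
        attempts = int(match.group("attempts"))
        prev_attempts, messages = summary.get(volume, (0, set()))
        messages.add(match.group("msg"))
        summary[volume] = (max(prev_attempts, attempts), messages)
    return summary


# Usage with the excerpt from this message (wrapped as it appeared in the mail).
sample = (
    "wal_e.worker.upload INFO     MSG: Retrying send because of a socket error\n"
    "        DETAIL: The socket error's message is 'timed out'.  There have been \n"
    "39 attempts to send the volume 325 so far.\n"
)
for volume, (attempts, messages) in sorted(summarize_retries(sample).items()):
    print(f"volume {volume}: up to {attempts} attempts, errors: {sorted(messages)}")
```

Running something like this over the full log from each node would show whether 
the retries are confined to a single volume or spread across several, and 
whether 'timed out' is really the only error message involved.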

Is this something you've seen before? Is there any other info that might be 
helpful in diagnosing what's going on?

