We're running wal-e 0.7.0 with PostgreSQL 9.3.4.
During base backups on some of our nodes, we're getting repeated socket
errors, with the same 'timed out' error message every time. Once this
starts, wal-e fails to upload any further chunks. E.g.:
wal_e.worker.upload INFO MSG: Retrying send because of a socket error
DETAIL: The socket error's message is 'timed out'. There have been
39 attempts to send the volume 325 so far.
STRUCTURED: time=2014-05-20T15:36:13.002438-00 pid=48972
... and so on: the send keeps failing for the same volume until the backup
job times out.
Our logs show this consistent failure pattern starting around 4 hours into
the base backup. We get this behavior on a subset of our nodes, all of
which are the same EC2 instance type and have comparable DB sizes.
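
To quantify the pattern across nodes, a minimal sketch like the one below
could tally the highest retry count per volume from a captured log. It
assumes the DETAIL wording matches the excerpt above exactly; the function
name and the idea of feeding it a whole log dump are hypothetical and not
part of wal-e.

```python
import re
from collections import defaultdict

# Matches the DETAIL text shown in the excerpt above. \s+ is used between
# words because the message may be wrapped across lines in the log.
ATTEMPT_RE = re.compile(
    r"There have been\s+(\d+)\s+attempts to send the volume\s+(\d+)"
)


def max_attempts_per_volume(log_text):
    """Return {volume_number: highest attempt count seen} for a log dump.

    Hypothetical helper for illustration only; not a wal-e API.
    """
    attempts = defaultdict(int)
    for match in ATTEMPT_RE.finditer(log_text):
        count = int(match.group(1))
        volume = int(match.group(2))
        attempts[volume] = max(attempts[volume], count)
    return dict(attempts)


if __name__ == "__main__":
    sample = (
        "wal_e.worker.upload INFO MSG: Retrying send because of a socket error\n"
        "DETAIL: The socket error's message is 'timed out'. There have been\n"
        "39 attempts to send the volume 325 so far.\n"
    )
    print(max_attempts_per_volume(sample))
```

Running this over the logs from an affected and an unaffected node would
show whether the retries are confined to a single volume or spread across
several.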
Is this something you've seen before? Is there any other info that might be
helpful in diagnosing what's going on?