On Mon, Aug 18, 2014 at 11:59 AM, <[email protected]> wrote:
>
> On Wednesday, February 5, 2014 10:31:25 AM UTC-8, Daniel Farina wrote:
>>
>> On Wed, Feb 5, 2014 at 3:28 AM, Dan Fairs <[email protected]> wrote:
>> >
>> > [snip]
>> >
>> >> It therefore looks (with exactly 2 data points...) like boto could be
>> >> the culprit - it *seems* like it may be possible for it to corrupt files
>> >> in the face of connection problems. We don't have anything but
>> >> circumstantial evidence for this, but if it happens again, it's the
>> >> first place we'll look.
>> >
>> > It's also worth mentioning that both our Riak CS-based system and WAL-E
>> > use boto in conjunction with gevent (the Riak system uses gevent 1.0).
>>
>> That seems a bit scary. One of my colleagues, Greg Stark, has been
>> looking into removing gevent, although not for this reason (the
>> reasons are simplicity and performance). Perhaps as a windfall his
>> patch can be used to test your hypothesis.
>>
>> Another project that has some interest is implementing checksumming
>> manifests on the upload side that could be re-checked during download,
>> which would maybe also help pin down the problem.
>
> Has any of the above been done, or is there any advancement on understanding
> this problem at all?

There is a patch that does some download validation. It works with S3
alone, which is why it is not committed to main-line. It's what I use
at Heroku:

https://github.com/fdr/wal-e/tree/heroku-hacks-v0.8

However, I have not even once seen this defect myself.

> I'm running into this with a large ~1TB database that takes the better part
> of a day to run each of a backup-push and a backup-fetch, and has for the
> past several days not been able to catch up despite a constant processing of
> WAL files with _tons_ of these errors about transaction 0 interspersed.
>
> If I remove the recovery_command, I can start the database and communicate
> with it, so it seems odd that this would be corruption in the backup-push or
> fetch, but I certainly haven't tried to access or manipulate all of the
> records, so it's possible.

What version of WAL-E? Also, modern v0.8 versions sport what are
nominally much faster parallel and pipelined WAL download routines.
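
For anyone wanting to experiment with the "checksumming manifests" idea quoted above, here is a minimal sketch of the general technique. It is not WAL-E code and not what the heroku-hacks branch does. The function names, the choice of SHA-256, and the manifest layout (a plain dict of file name to hex digest) are all assumptions made for illustration, and it only touches local files, so storing the manifest next to the uploaded objects is left out.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(paths):
    """Upload side (hypothetical helper): record a checksum per file name."""
    return {os.path.basename(p): sha256_of(p) for p in paths}


def verify_against_manifest(manifest, directory):
    """Download side (hypothetical helper): list names that are missing
    from `directory` or whose content does not match the manifest."""
    bad = []
    for name, expected in sorted(manifest.items()):
        path = os.path.join(directory, name)
        if not os.path.isfile(path) or sha256_of(path) != expected:
            bad.append(name)
    return bad
```

The manifest would be built before upload and re-checked after download. A non-empty result from verify_against_manifest would at least show whether a file changed somewhere between push and fetch, though not which layer changed it.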
