On Wed, Feb 5, 2014 at 3:28 AM, Dan Fairs <[email protected]> wrote:
>>>>
>>>> Right, that's what I was afraid of. I'm currently restoring from a pg_dump
>>>> just to check that we can recover. I suspect the next step will be to take
>>>> another server, restore a pg_dump'd backup on it, and try a WAL-E setup on
>>>> that one. If that works, then I expect we'll have to dump and reload our
>>>> production server. Frustrating, as this all worked smoothly in our test
>>>> environments! That's life, I guess...
>>>
>>> Yeah. Testing backups is still a struggle -- even superficially
>>> starting up the cluster is not enough. Some extra checking or
>>> monitoring integration will probably be seen in WAL-E over time,
>>> particularly with regard to Postgres checksums and figuring out how to
>>> deal with file system failures for those using checksummed file
>>> systems, but that is a ways off.
>
> [snip]
>
>> It therefore looks (with exactly 2 data points...) like boto could be the
>> culprit - it *seems* like it may be possible for it to corrupt files in the
>> face of connection problems. We don't have anything but circumstantial
>> evidence for this, but if it happens again, it's the first place we'll look.
>
>
> It's also worth mentioning that both our Riak CS-based system and WAL-E use
> boto in conjunction with gevent (the Riak system uses gevent 1.0).

That seems a bit scary. One of my colleagues, Greg Stark, has been looking into removing gevent, although not for this reason (his reasons are simplicity and performance). Perhaps, as a windfall, his patch can be used to test your hypothesis.

Another project that has attracted some interest is implementing checksum manifests on the upload side that could be re-checked during download, which might also help pin down the problem.
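
As a rough illustration of the manifest idea, here is a minimal sketch. It is not WAL-E code, and nothing in this thread says how such a feature would actually be built: the function names (`sha256_of`, `build_manifest`, `verify_manifest`) are hypothetical, SHA-256 is an assumed choice of digest, and it works on local directories rather than on anything fetched through boto.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory):
    """Upload side (hypothetical): map each relative file path to its digest."""
    root = Path(directory)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(manifest, directory):
    """Download side (hypothetical): list names that are missing or differ."""
    root = Path(directory)
    bad = []
    for name, expected in sorted(manifest.items()):
        target = root / name
        if not target.is_file() or sha256_of(target) != expected:
            bad.append(name)
    return bad
```

The manifest would be built before upload and stored alongside the backup. After a restore, any name returned by `verify_manifest` points at a file that changed somewhere between upload and download, which would narrow down whether corruption happens in transit.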
