On Wednesday, February 5, 2014 10:31:25 AM UTC-8, Daniel Farina wrote:
>
> On Wed, Feb 5, 2014 at 3:28 AM, Dan Fairs <[email protected]> wrote:
> > [snip]
> >
> >> It therefore looks (with exactly 2 data points...) like boto could be
> >> the culprit - it *seems* like it may be possible for it to corrupt files in
> >> the face of connection problems. We don't have anything but circumstantial
> >> evidence for this, but if it happens again, it's the first place we'll
> >> look.
> >
> > It's also worth mentioning that both our Riak CS-based system and WAL-E
> > use boto in conjunction with gevent (the Riak system uses gevent 1.0).
>
> That seems a bit scary. One of my colleagues, Greg Stark, has been
> looking into removing gevent, although not for this reason (the
> reasons are simplicity and performance). Perhaps as a windfall his
> patch can be used to test your hypothesis.
>
> Another project that has some interest is implementing checksumming
> manifests on the upload side that could be re-checked during download,
> which would maybe also help pin down the problem.

Has any of the above been done, or has there been any progress on
understanding this problem at all?

I'm running into this with a large (~1 TB) database. A backup-push and a
backup-fetch each take the better part of a day, and for the past several
days the database has not been able to catch up, despite constantly
processing WAL files, with _tons_ of these errors about transaction 0
interspersed.

If I remove the recovery_command, I can start the database and communicate
with it, so it seems odd that this would be corruption in the backup-push or
backup-fetch. I certainly haven't tried to access or manipulate all of the
records, though, so it's possible.
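
To make the checksumming-manifest idea quoted above concrete, here is a rough sketch of what it could look like. This is not WAL-E code and not how WAL-E or boto work today. The function names (file_sha256, build_manifest, verify_manifest) are hypothetical, and the choice of SHA-256 and of a plain path-to-digest mapping are assumptions made only for illustration.

```python
import hashlib
import os


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Upload side (hypothetical): map each relative file path to its digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = file_sha256(full)
    return manifest


def verify_manifest(root, manifest):
    """Download side (hypothetical): list files that are missing or differ."""
    bad = []
    for rel, expected in manifest.items():
        full = os.path.join(root, rel)
        if not os.path.isfile(full) or file_sha256(full) != expected:
            bad.append(rel)
    return sorted(bad)
```

The manifest would be built before the push and stored next to the backup, then re-checked after the fetch. An empty list from verify_manifest would suggest the files survived the round trip intact, which would at least help separate transfer corruption from other causes.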
