On Mon, Aug 18, 2014 at 11:59 AM,  <[email protected]> wrote:
>
>
> On Wednesday, February 5, 2014 10:31:25 AM UTC-8, Daniel Farina wrote:
>>
>> On Wed, Feb 5, 2014 at 3:28 AM, Dan Fairs <[email protected]> wrote:
>>
>> >
>> > [snip]
>> >
>> >> It therefore looks (with exactly 2 data points...) like boto could be
>> >> the culprit - it *seems* like it may be possible for it to corrupt
>> >> files in the face of connection problems. We don't have anything but
>> >> circumstantial evidence for this, but if it happens again, it's the
>> >> first place we'll look.
>> >
>> >
>> > It's also worth mentioning that both our Riak CS-based system and WAL-E
>> > use boto in conjunction with gevent (the Riak system uses gevent 1.0).
>>
>> That seems a bit scary.  One of my colleagues, Greg Stark, has been
>> looking into removing gevent, although not for this reason (the
>> reasons are simplicity and performance).  Perhaps as a windfall his
>> patch can be used to test your hypothesis.
>>
>> Another project that has some interest is implementing checksumming
>> manifests on the upload side that could be re-checked during download,
>> which would maybe also help pin down the problem.
>
>
> Has any of the above been done, or is there any advancement on understanding
> this problem at all?

There is a patch that works with S3 alone (which is why it is not
committed to main-line) that does some download validation.  It's what
I use at Heroku:

https://github.com/fdr/wal-e/tree/heroku-hacks-v0.8

However, I have not even once seen this defect myself.
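
To make the checksumming-manifest idea quoted above a little more concrete,
here is a minimal Python sketch.  It is not the code in that branch and not
WAL-E's API; the function names, the JSON manifest layout, and the choice of
SHA-256 are all assumptions made purely for illustration.

```python
import hashlib
import json


def sha256_hex(data):
    """Return the hex SHA-256 digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(segments):
    """Upload side: record a checksum per segment.

    `segments` is a hypothetical dict of segment name -> bytes; a real
    tool would stream from disk rather than hold everything in memory.
    """
    checksums = {name: sha256_hex(data) for name, data in segments.items()}
    return json.dumps(checksums, sort_keys=True)


def verify_download(manifest_json, name, data):
    """Download side: True only if `data` matches the recorded checksum."""
    expected = json.loads(manifest_json).get(name)
    return expected is not None and expected == sha256_hex(data)


if __name__ == "__main__":
    manifest = build_manifest({"seg-0001": b"wal bytes"})
    print(verify_download(manifest, "seg-0001", b"wal bytes"))
    print(verify_download(manifest, "seg-0001", b"corrupted"))
```

A mismatch on the download side would at least show whether the bytes
changed somewhere between upload and fetch, which is the question the
boto/gevent hypothesis turns on.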

> I'm running into this with a large ~1TB database that takes the better part
> of a day to run each of a backup-push and a backup-fetch, and has for the
> past several days not been able to catch up despite a constant processing of
> WAL files with _tons_ of these errors about transaction 0 interspersed.
>
> If I remove the restore_command, I can start the database and communicate
> with it, so it seems odd that this would be corruption in the backup-push or
> fetch, but I certainly haven't tried to access or manipulate all of the
> records, so it's possible.

What version of WAL-E?  Also, modern v0.8 versions sport what are
nominally much faster parallel and pipelined WAL download routines.

