>>> 
>>> Right, that's what I was afraid of. I'm currently restoring from a pg_dump 
>>> just to check that we can recover. I suspect the next step will be to take 
>>> another server, restore a pg_dump'd backup on it, and try a WAL-E setup on 
>>> that one. If that works, then I expect we'll have to dump and reload our 
>>> production server. Frustrating, as this all worked smoothly in our test 
>>> environments! That's life, I guess...
>> 
>> Yeah.  Testing backups is still a struggle -- even superficially
>> starting up the cluster is not enough.  Some extra checking or
>> monitoring integration will probably be seen in WAL-E over time,
>> particularly with regard to Postgres checksums and figuring out how to
>> deal with file system failures for those using checksummed file
>> systems, but that is a ways off.

[snip]

> It therefore looks (with exactly 2 data points...) like boto could be the 
> culprit - it *seems* like it may be possible for it to corrupt files in the 
> face of connection problems. We don't have anything but circumstantial 
> evidence for this, but if it happens again, it's the first place we'll look.


It's also worth mentioning that both our Riak CS-based system and WAL-E use 
boto in conjunction with gevent (the Riak system uses gevent 1.0).
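
If we do end up chasing this, one cheap way to catch a corrupted transfer would be to record a digest of each file before upload and compare it after download. This is only a minimal sketch of that idea, not something WAL-E or boto does for us: the function names are hypothetical, and it assumes the expected digest was stored somewhere trustworthy at upload time.

```python
import hashlib


def file_sha256(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of the file at `path`.

    The file is read in chunks so that large backup segments do not
    have to fit in memory.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256):
    """Return True if the local file matches the digest recorded at upload.

    `expected_sha256` is assumed to have been computed from the original
    file before it was sent to the object store.
    """
    return file_sha256(path) == expected_sha256.strip().lower()
```

A mismatch would not tell us whether boto, gevent, or the network was at fault, but it would at least flag the bad file at transfer time rather than at restore time.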

Cheers,
Dan
--
Dan Fairs | [email protected] | @danfairs | secondsync.com
