>>>>> I've got a restore running currently from a new base backup which just
>>>>> finished. I'll let that finish, and fully recover (hopefully!) just to
>>>>> satisfy myself that this setup is basically working; after that, I'll try
>>>>> again from the failing base backup. It takes quite a while to do all this,
>>>>> so don't worry if you don't hear from me for a short while!
>>>>
>>>> I'm grateful to hear anything at any time. WAL-E has grown up into
>>>> long-haul software -- it'll still be here if you can find the time later.
>>>>
>>>> Well - unfortunately my second attempt with a newer base backup also
>>>> failed. This is a bit of a concern now - I'd like to dig deeper into
>>>> this. Should we take this back on the list?
>>>
>>> Sure. Pity to say, the more times this fails, particularly with fresh
>>> base backups, the more likely it seems to me you've been hit by
>>> corruption. WAL-E has a bit too much empirical reliability to be
>>> easily implicated in successive defects on upload or download sides.
>>
>> Right, that's what I was afraid of. I'm currently restoring from a pg_dump
>> just to check that we can recover. I suspect the next step will be to take
>> another server, restore a pg_dump'd backup on it, and try a WAL-E setup on
>> that one. If that works, then I expect we'll have to dump and reload our
>> production server. Frustrating, as this all worked smoothly in our test
>> environments! That's life, I guess...
>
> Yeah. Testing backups is still a struggle -- even superficially
> starting up the cluster is not enough. Some extra checking or
> monitoring integration will probably be seen in WAL-E over time,
> particularly with regard to Postgres checksums and figuring out how to
> deal with file system failures for those using checksummed file
> systems, but that is a ways off.
Just an update on this. Before committing to a long night dumping and
restoring our production database to get rid of apparent corruption, I
thought I'd give this one final shot today - with a base backup from 5am
this morning, and then a few hours' worth of 5-minute WAL files (PG 9.3,
WAL-E 0.6.6). It worked fine.

We actually have another product internally, which uses boto to talk to a
Riak CS cluster in S3 compatibility mode. I was chatting to the guy who
runs this project, and he mentioned that he thought he'd experienced file
corruption when fetching files through boto from Riak CS while the
cluster was having load problems and some connections were failing. If
you recall, my original logs had several connection failures and retries.

It therefore looks (with exactly 2 data points...) like boto could be the
culprit - it *seems* like it may be possible for it to corrupt files in
the face of connection problems. We don't have anything but circumstantial
evidence for this, but if it happens again, it's the first place we'll
look.

Cheers,
Dan

--
Dan Fairs | [email protected] | @danfairs | secondsync.com
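For anyone who wants to check this kind of thing, here's a minimal sketch of the sort of verification we had in mind: compare a downloaded file's MD5 against the object's ETag, which for non-multipart S3/Riak CS uploads is the MD5 of the body. The function names here are ours, not part of boto or WAL-E, and multipart ETags (the `...-N` form) can't be checked this way.

```python
# Hedged sketch: detect a corrupted download by comparing the local
# file's MD5 to the object's ETag. Assumes a non-multipart upload,
# where the ETag is the plain MD5 of the object body.
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Hash the file in chunks so a large base backup needn't fit in RAM."""
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            md5.update(chunk)
    return md5.hexdigest()


def etag_matches(path, etag):
    """Return True/False if the ETag is a plain MD5, or None if it is a
    multipart-style ETag ('<md5>-<parts>') that cannot be verified."""
    etag = etag.strip('"')  # S3-style ETags arrive wrapped in quotes
    if '-' in etag:
        return None
    return file_md5(path) == etag
```

With boto 2 this would slot in right after the fetch, e.g. `key.get_contents_to_filename(local)` followed by `etag_matches(local, key.etag)`, retrying the download if it comes back False.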
