On Thu, Jan 30, 2014 at 1:44 AM, Dan Fairs <[email protected]> wrote:
> (Bringing back on list)
>
>>>>> Sad to say, I did not see what I was looking for -- which is a retry
>>>>> failing to process, ugly messages aside.
>>>>>
>>>>> If you can reproduce this defect from the same base backup
>>>>> in-triplicate, it's almost certainly corruption of some kind. If not
>>>>> -- e.g. sometimes it works -- it could be a WAL-E bug on the fetch side.
>>>>>
>>>>> I'd be very grateful if you'd give it a try. I haven't been able to
>>>>> produce this defect on restoring a base backup myself.
>>>>>
>>>>
>>>> I've got a restore running currently from a new base backup which just
>>>> finished. I'll let that finish, and fully recover (hopefully!) just to
>>>> satisfy myself that this setup is basically working; after that, I'll try
>>>> again from the failing base backup. It takes quite a while to do all this,
>>>> so don't worry if you don't hear from me for a short while!
>>>
>>> I'm grateful to hear anything at any time. WAL-E has grown up into
>>> long-haul software -- it'll still be here if you can find the time later.
>>>
>>> Well - unfortunately my second attempt with a newer base backup also failed.
>>> This is a bit of a concern now - I'd like to dig deeper into this. Should we
>>> take this back on the list?
>>
>> Sure. Pity to say, the more times this fails, particularly with fresh
>> base backups, the more likely it seems to me you've been hit by
>> corruption. WAL-E has a bit too much empirical reliability to be
>> easily implicated in successive defects on upload or download sides.
>
>
> Right, that's what I was afraid of. I'm currently restoring from a pg_dump
> just to check that we can recover. I suspect the next step will be to take
> another server, restore a pg_dump'd backup on it, and try a WAL-E setup on
> that one. If that works, then I expect we'll have to dump and reload our
> production server. Frustrating, as this all worked smoothly in our test
> environments! That's life, I guess...

Yeah. Testing backups is still a struggle -- even superficially
starting up the cluster is not enough. Some extra checking or
monitoring integration will probably be seen in WAL-E over time,
particularly with regard to Postgres checksums and figuring out how to
deal with file system failures for those using checksummed file
systems, but that is a ways off.
