Re: [zfs-discuss] Data loss by memory corruption?

Jim Klimov Thu, 19 Jan 2012 03:18:38 -0800

2012-01-18 20:36, Nico Williams wrote:

On Wed, Jan 18, 2012 at 4:53 AM, Jim Klimov<jimkli...@cos.ru>  wrote:

2012-01-18 1:20, Stefan Ring wrote:

I don’t care too much if a single document gets corrupted – there’ll
always be a good copy in a snapshot. I do care however if a whole
directory branch or old snapshots were to disappear.


Well, insofar as this problem "relies" on random memory corruption,
you don't get to choose whether it is your document that gets broken
or some low-level part of the metadata tree ;)


Other filesystems tend to be much more tolerant of bit rot of all
types precisely because they have no block checksums.

But I'd rather have ZFS -- *with* redundancy, of course, and with ECC.

It might be useful to have a way to recover from checksum mismatches
by involving a human.  I'm imagining a tool that tests whether
accepting a block's actual contents results in making data available
that the human thinks checks out, and if so, then rewriting that
block.  Some bit errors might simply result in meaningless metadata,
but in some cases this can be corrected (e.g., ridiculous block
addresses).  But if ECC takes care of the problem then why waste the
effort?
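
A toy Python sketch of a simpler, fully automated cousin of that idea (this is not the human-in-the-loop tool described above, and not ZFS code): when a block fails its checksum, try every single-bit flip and see whether one of them makes the checksum match. SHA-256 stands in for the block checksum here, and the names `checksum` and `repair_single_bit` are made up for illustration. It only helps if the stored checksum itself is intact and exactly one bit of the block went bad.

```python
import hashlib


def checksum(block: bytes) -> bytes:
    """Stand-in block checksum (assumption: SHA-256 over the raw bytes)."""
    return hashlib.sha256(block).digest()


def repair_single_bit(block: bytes, expected: bytes):
    """Return a block matching `expected`, or None if no single-bit flip works.

    Brute force: flip each bit in turn, recompute the checksum, undo on failure.
    """
    if checksum(block) == expected:
        return block  # nothing to repair
    buf = bytearray(block)
    for i in range(len(buf)):
        for bit in range(8):
            buf[i] ^= 1 << bit          # flip one bit
            if checksum(bytes(buf)) == expected:
                return bytes(buf)
            buf[i] ^= 1 << bit          # undo the flip
    return None


# Simulate one flipped bit in an otherwise good block.
good = b"some metadata block " * 8
expected = checksum(good)
damaged = bytearray(good)
damaged[17] ^= 0x04
repaired = repair_single_bit(bytes(damaged), expected)

# Two flipped bits are beyond what this search can fix.
damaged2 = bytearray(good)
damaged2[3] ^= 0x01
damaged2[90] ^= 0x80
unrepairable = repair_single_bit(bytes(damaged2), expected)
```

The cost grows with the block size (one hash per bit), and anything beyond a single flipped bit needs either redundancy or a human judging the contents, which is where the tool imagined above would come in.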


Because RAM ECC only decreases the probability of one type of
corruption?

You still have CPUs (e.g. overclocked and overheated ones, as is
likely in enthusiast systems or in laptops with blocked vents,
which sometimes generate random garbage).

Many other parts are not SPoFs in a good design: e.g. noise
on the wire and bugs in HBA and HDD firmware can be mitigated by
some hardware redundancy (multipathing, mixed vendors) in
higher-end systems, and by ZFS's own mechanisms in other systems,
such as ditto copies for metadata and vdev redundancy; but
these faults can still corrupt copies=1 data (e.g. on single-disk
laptops without an explicit copies=2).
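
To make the copies=1 versus copies=2 difference concrete, here is a toy Python sketch (not real ZFS code; `read_with_self_heal` and the in-memory "copies" are invented for illustration, and SHA-256 stands in for the block checksum). A read returns the first copy whose checksum matches and rewrites any bad copies from it; with a single copy there is nothing to fall back on.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    """Stand-in block checksum (assumption: SHA-256 over the raw bytes)."""
    return hashlib.sha256(data).digest()


def read_with_self_heal(copies, expected: bytes) -> bytes:
    """Return verified data from a list of bytearray copies.

    The first copy that matches `expected` wins; every copy that does not
    match is overwritten with the good data. Raises OSError if none match.
    """
    for candidate in copies:
        if checksum(bytes(candidate)) == expected:
            good = bytes(candidate)
            for other in copies:
                if checksum(bytes(other)) != expected:
                    other[:] = good     # repair the bad copy in place
            return good
    raise OSError("checksum mismatch on every copy")


payload = b"account balance: 1234.56"
expected = checksum(payload)

# copies=2: one copy is damaged, the other is used and the damage is healed.
bad = bytearray(payload)
bad[0] ^= 0xFF
two_copies = [bad, bytearray(payload)]
data = read_with_self_heal(two_copies, expected)

# copies=1: the only copy is damaged, so the read can only fail.
only = bytearray(payload)
only[5] ^= 0x10
try:
    read_with_self_heal([only], expected)
    single_copy_failed = False
except OSError:
    single_copy_failed = True
```

Note that none of this helps if the data was already garbage in RAM or in the CPU before the checksum was computed: then every copy is "correctly" checksummed garbage.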


> (Partial answer: because it'd be a very neat GSoC type project!)

Good point for at least one motivator ;)

"I don't care how it is done - but it should be!
This time you may even use sorcery, I'll not ask questions!" ;)

Besides, what if that document you don't care about is your account's
entry in a banking system (as if they had no other redundancy and
double-checks)? And suddenly you "don't exist" because of some EIOIO,
or your balance is zeroed (or worse, highly negative)? ;)


This is why we have paper trails, logs, backups, redundancy at various
levels, ...


As if any of them were 100% good, reliable and readily
accessible/available ;)

//Jim
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
