>>>>> "nw" == Nicolas Williams <nicolas.willi...@sun.com> writes:

    nw> Your thesis is that all corruption problems observed with ZFS
    nw> on SANs are: a) phantom writes that never reached the rotating
    nw> rust, b) not bit rot, corruption in the I/O paths, ...
    nw> Correct?

yeah.  

by ``all'' I mean the several single-LUN pools that were recovered by
using an older set of ueberblocks.  Of course I don't mean ``all'' as
in all pools imaginable, including this one 10 years ago on an unnamed
Major Vendor's RAID shelf that gave you a scar just above the ankle.

But so far it really sounds like just one major problem with
single-LUN ZFS pools on SANs?  Or am I wrong, and there are lots of
pools which can't be recovered with old ueberblocks?

Remember the problem is losing pools.  It is not, ``for weeks I kept
losing files.  I would get errors reported in 'zpool status', and it
would tell me the filename 'blah' has uncorrectable errors.  This went
on for a while, then one day we lost the whole pool.''  I've heard
zero reports like that.

    nw> Some of the earlier problems of type (2) were triggered by
    nw> checksum verification failures on pools with no redundancy,

but checksum failures aren't caused just by bitrot in ZFS.  I get
hundreds of them after half of my iSCSI mirror bounces because of the
incomplete-resilvering bug.  

I don't know the on-disk format well, but maybe the checksum was wrong
because the label pointed to a block that wasn't an ueberblock.  Maybe
the checksum is functioning in lieu of a commit sector: maybe all four
ueberblocks were written incompletely because there is some bug or
missing workaround in the way ZFS flushes and schedules the ueberblock
writes, so with some sectors written and some unwritten the overall
block checksum is wrong.

Maybe this is a downside to the filesystem-level checksum.  For
integrity it's an upside, but the NetApp block-level checksum, where
you checksum just the data plus the block number at the RAID layer,
should narrow checksum failures down to disk bit flips only, and thus
be better for tracking down problems and building statistics
comparable with other systems.  We already know the 'zpool status'
CKSUM column isn't so selective, and can catch out-of-date data too.
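For contrast, here's a toy model of that block-level scheme: the
checksum lives beside each block at the RAID layer and covers the data
plus the block number, so a flipped bit or a block read back from the
wrong address both fail right where the disk is.  Again a sketch, not
NetApp's actual format; SHA-256 is a stand-in:

```python
import hashlib
import struct

def seal(block_no: int, data: bytes) -> bytes:
    """Checksum covering the data plus the block number it belongs to."""
    return hashlib.sha256(struct.pack(">Q", block_no) + data).digest()

class RaidLayer:
    """Toy RAID layer storing a per-block checksum beside each block."""
    def __init__(self):
        self.blocks = {}  # block number -> (data, checksum)

    def write(self, block_no, data):
        self.blocks[block_no] = (data, seal(block_no, data))

    def read(self, block_no):
        data, stored = self.blocks[block_no]
        if seal(block_no, data) != stored:
            raise IOError("checksum mismatch at block %d" % block_no)
        return data

d = RaidLayer()
d.write(7, b"some data")
assert d.read(7) == b"some data"

# A bit flip on the platter is caught at this layer, not the filesystem:
data, cksum = d.blocks[7]
d.blocks[7] = (b"some dbta", cksum)
try:
    d.read(7)
    assert False, "flip should have been caught"
except IOError:
    pass

# So is a block read back from the wrong address, because the block
# number is part of what was checksummed:
d.write(7, b"some data")
d.blocks[9] = d.blocks[7]
try:
    d.read(9)
    assert False, "misdirected block should have been caught"
except IOError:
    pass
```

Because a failure here can only mean the disk (or the path to it)
mangled this block, the statistics it produces are attributable in a
way the CKSUM column isn't.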

The overall point, what I'd rather have as my ``thesis,'' is that you
can't allow ZFS to exonerate itself with an error message.  Losing the
whole pool in a situation where UFS would (or _might_; it's not even
proven beyond doubt that it _would_) have corrupted a bit of data
isn't an advantage just because ZFS can printf a warning that says
``loss of entire pool detected.  must be corruption outside ZFS!''

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss