On Jan 15, 2012, at 7:04 AM, Jim Klimov wrote:

> "Does raidzN actually protect against bitrot?"
> That's a kind of radical, possibly offensive, question formula
> that I have lately.

Simple answer: no. raidz provides data protection. Checksums verify
data is correct. Two different parts of the storage solution.

> Reading up on theory of RAID5, I grasped the idea of the write
> hole (where one of the sectors of the stripe, such as the parity
> data, doesn't get written - leading to invalid data upon read).
> In general, I think the same applies to bitrot of data that was
> written successfully and corrupted later - either way, upon
> reading all sectors of the stripe, we don't have a valid result
> (for the XOR-parity example, XORing all bytes does not produce
> a zero).
> The way I get it, RAID5/6 generally has no mechanism to detect
> *WHICH* sector was faulty, if all of them got read without
> error reports from the disk. Perhaps it won't even test whether
> parity matches and bytes zero out, as long as there were no read
> errors reported. In this case having a dead drive is better than
> having one with a silent corruption, because when one sector is
> known to be invalid or absent, its contents can be reconstructed
> thanks to other sectors and parity data.
> I've seen statements (do I have to scavenge for prooflinks?)
> that raidzN {sometimes or always?} has no means to detect
> which drive produced bad data either. In this case in output
> of "zpool status" we see zero CKSUM error-counts on leaf disk
> levels, and non-zero counts on raidzN levels.

raidz uses an algorithm to try permutations of data and parity to
verify against the checksum. Once the checksum matches, repair
can begin.
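
As a rough illustration of that idea (a sketch, not the actual ZFS code), the following assumes a single-parity stripe with plain XOR parity and uses SHA-256 as a stand-in for the block checksum. The names `xor_blocks`, `checksum`, and `read_with_repair` are hypothetical. Each data column in turn is assumed bad and rebuilt from the other columns plus parity, and the first combination that matches the stored checksum wins:

```python
import hashlib
from functools import reduce
from typing import List, Optional, Tuple


def xor_blocks(blocks: List[bytes]) -> bytes:
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def checksum(data: bytes) -> bytes:
    # Assumption: SHA-256 stands in for whatever checksum the pool uses.
    return hashlib.sha256(data).digest()


def read_with_repair(
    columns: List[bytes], parity: bytes, expected: bytes
) -> Tuple[bytes, Optional[int]]:
    """Return (data, index_of_bad_column); the index is None if data was clean."""
    data = b"".join(columns)
    if checksum(data) == expected:
        return data, None

    # No disk reported an error, so guess: assume each column in turn is
    # bad, rebuild it from the remaining columns plus parity, and re-check.
    for i in range(len(columns)):
        others = columns[:i] + columns[i + 1:]
        rebuilt = xor_blocks(others + [parity])
        trial = b"".join(columns[:i] + [rebuilt] + columns[i + 1:])
        if checksum(trial) == expected:
            return trial, i  # column i was the silent offender

    raise IOError("no combination of data and parity matches the checksum")


# Usage: silently corrupt one column and let the checksum identify it.
cols = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(cols)
expected = checksum(b"".join(cols))
data, bad = read_with_repair([cols[0], b"BxBB", cols[2]], parity, expected)
```

With one parity column, this can only recover from a single bad column per stripe; if two columns are damaged, no trial matches and the read fails.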

> Opposed to that, on mirrors (which are used in examples of
> ZFS's on-the-fly data repairs in all presentations), we do
> always know the faulty source of data and can repair it
> with a verifiable good source, if present.

Mirrors are no different: ZFS tries each side of the mirror until it finds
data that matches the checksum.
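
The mirror case can be sketched the same way. This is a minimal illustration, not the ZFS implementation: `mirror_read` and `checksum` are hypothetical names, and SHA-256 again stands in for the block checksum.

```python
import hashlib
from typing import List, Tuple


def checksum(data: bytes) -> bytes:
    # Assumption: SHA-256 stands in for whatever checksum the pool uses.
    return hashlib.sha256(data).digest()


def mirror_read(sides: List[bytes], expected: bytes) -> Tuple[bytes, List[int]]:
    """Try each mirror side in order; return good data plus the sides that failed."""
    failed: List[int] = []
    for i, copy in enumerate(sides):
        if checksum(copy) == expected:
            # The sides listed in `failed` could now be rewritten from `copy`.
            return copy, failed
        failed.append(i)
    raise IOError("no mirror side matches the checksum")


# Usage: side 0 is silently corrupted, side 1 is intact.
good = b"hello"
data, to_repair = mirror_read([b"hellp", good], checksum(good))
```

The point is that the mirror itself does not know which side is wrong; the checksum stored with the parent block pointer decides.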

> In a real-life example, on my 6-disk raidz2 pool I see some
> irreparable corruptions as well as several "repaired" detected
> errors. So I have a set of questions here, outlined below...
>  (DISCLAIMER: I haven't finished reading through on-disk
>  format spec in detail, but that PDF document is 5 years
>  old anyway and I've heard some things have changed).
> 1) How does raidzN protect against bit-rot without known full
>   death of a component disk, if it at all does?
>   Or does it only help against "loud corruption" where the
>   disk reports a sector-access error or dies completely?

raidz cannot be separated from the ZFS checksum verification
in this answer.

> 2) Do the "leaf blocks" (on-disk sectors or ranges of sectors
>   that belong to a raidzN stripe) have any ZFS checksums of
>   their own? That is, can ZFS determine which of the disks
>   produced invalid data and reconstruct the whole stripe?

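No, the leaf blocks have no checksums of their own. Yes, ZFS can
determine which disk produced invalid data and reconstruct the stripe.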

> 2*) How are the sector-ranges on-physical-disk addressed by
>   ZFS? Are there special block pointers with some sort of
>   physical LBA addresses in place of DVAs and with checksums?
>   I think there should be (claimed end-to-end checksumming)
>   but wanted to confirm.


> 2**) Alternatively, how does raidzN get into situation like
>   "I know there is an error somewhere, but don't know where"?
>   Does this signal simultaneous failures in different disks
>   of one stripe?
>   How *do* some things get fixed then - can only dittoed data
>   or metadata be salvaged from second good copies on raidZ?

No. See the seminal blog on raidz.

> 3) Is it true that in recent ZFS the metadata is stored in
>   a mirrored layout, even for raidzN pools? That is, does
>   the raidzN layout only apply to userdata blocks now?
>   If "yes":

Yes, for Solaris 11. No, for all other implementations, at this time.

> 3*)  Is such mirroring applied over physical VDEVs or over
>   top-level VDEVs? For certain 512/4096 bytes of a metadata
>   block, are there two (ditto-mirror) or more (ditto over
>   raidz) physical sectors of storage directly involved?

It is done in the top-level vdev. For more information, see "What's New
in ZFS?" in the Oracle Solaris ZFS Administration Guide.

> 3**) If small blocks, sized 1-or-few sectors, are fanned out
>   in incomplete raidz stripes (i.e. 512b parity + 512b data)
>   does this actually lead to +100% overhead for small data,
>   double that (200%) for dittoed data/copies=2?

The term "incomplete" does not apply here. The stripe written is 
complete: data + parity.
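
To put numbers on the small-block case, here is a simplified model of how much space a raidz vdev allocates for one block. It is a sketch under assumptions: the function name `raidz_asize` and its parameters are hypothetical, and the rounding rule (pad the allocation to a multiple of nparity + 1 sectors) should be checked against the implementation in use.

```python
def raidz_asize(psize: int, ndisks: int, nparity: int, ashift: int = 9) -> int:
    """Bytes allocated on a raidz vdev for a block of psize bytes (model only)."""
    data_cols = ndisks - nparity
    # Data sectors needed, rounding the block up to whole sectors.
    sectors = ((psize - 1) >> ashift) + 1
    # Each row of up to data_cols data sectors carries nparity parity sectors.
    rows = (sectors + data_cols - 1) // data_cols
    sectors += nparity * rows
    # Assumption: allocations are padded to a multiple of (nparity + 1) sectors.
    unit = nparity + 1
    sectors = ((sectors + unit - 1) // unit) * unit
    return sectors << ashift


def overhead_pct(psize: int, ndisks: int, nparity: int, ashift: int = 9) -> float:
    """Space used beyond the data itself, as a percentage of the data size."""
    return 100.0 * (raidz_asize(psize, ndisks, nparity, ashift) - psize) / psize


# One 512-byte block on a 6-disk pool with 512-byte sectors:
small_raidz1 = overhead_pct(512, ndisks=6, nparity=1)
small_raidz2 = overhead_pct(512, ndisks=6, nparity=2)
# A 128 KiB block on the same raidz2 layout, for comparison:
large_raidz2 = overhead_pct(128 * 1024, ndisks=6, nparity=2)
```

Under this model a single-sector block costs 100% extra on raidz1 and 200% on raidz2, while a 128 KiB block on the same 6-disk raidz2 costs 50%. Setting copies=2 doubles the allocated space in each case.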

>   Does this apply to metadata in particular? ;)

I lost the context here. For non-Solaris 11 implementations, metadata is
no different from data with copies=[23] (that is, copies=2 or copies=3).

>   Does this large factor apply to ZVOLs with fixed block
>   size being defined "small" (i.e. down to the minimum 512b/4k
>   available for these disks)?

NB, there are a few slides in my ZFS tutorials where we talk about this.

> 3***) In fact, for the considerations above, what is metadata? :)
>   Is it only the tree of blockpointers, or is it all the two
>   or three dozen block types except userdata (ZPL file, ZVOL
>   block) and unallocated blocks?

It is metadata; there is quite a variety. For example, there is the MOS,
zpool history, DSL configuration, etc.

> I do hope to see answers from the gurus on the list to these
> and other questions I posed recently.
> One frequently announced weakness in ZFS is the relatively small
> pool of engineering talent knowledgeable enough to hack ZFS and
> develop new features (i.e. the ex-Sunnites and very few determined
> other individuals): "We might do this, but we have few resources
> and already have other more pressing priorities".

There is quite a bit of activity going on under the illumos umbrella. In fact,
at the illumos meetup last week, there were several presentations about
upcoming changes and additions (awesome stuff!) See Deirdre's videos at

> I think there is a lot more programming talent in the greater
> user/hacker community around ZFS, including active askers on this
> list, Linux/BSD porters, and probably many more people who just
> occasionally hit upon our discussions here by googling up their
> questions. I mean programmers ready to dedicate some time to ZFS,
> which are held back by not fully understanding the architecture,
> and just do not start their developing (so as not to make matters
> worse). And the knowledge barrier to start coding is quite high.
> I do hope that instead of spending weeks to make a new feature,
> development gurus could spend a day writing replies to questions
> like mine (and many others') and then someone in the community
> would come up with a reasonable POC or finished code for new
> features and improvements.
> It is like education. Say, math: many talented mathematicians
> have spent thousands of man-years developing and refining the
> theory which now we learn over 3 or 6 years in a university.
> Maybe we're skimming overheads on lectures, but we gain enough
> understanding to deepen into any more specific subject ourselves.
> Likewise with opensource: yes, the code is there. A developer
> might read into it and possibly comprehend some in a year or so.
> Or he could spend a few days midway (when he knows enough to
> pose hard questions not googlable in some FAQ yet) in yes-no
> question sessions with the more knowledgeable people, and become
> ready to work in just a few weeks from start. Wouldn't that be
> wonderful for ZFS in general? :)

Agree 110%
 -- richard

ZFS Performance and Training
