Re: [zfs-discuss] Does raidzN actually protect against bitrot? If yes - how?

2012-01-16 Thread Jim Klimov

Thanks again for answering! :)

2012-01-16 10:08, Richard Elling wrote:

On Jan 15, 2012, at 7:04 AM, Jim Klimov wrote:


"Does raidzN actually protect against bitrot?"
That's a rather radical, possibly offensive, question that
I have been asking lately.


Simple answer: no. raidz provides data protection. Checksums verify
data is correct. Two different parts of the storage solution.


Meaning that a data-block checksum mismatch lets ZFS detect an error,
and afterwards trying raidz permutations until one matches the checksum
lets it fix the error (if enough redundancy is available)? Right?


raidz uses an algorithm to try permutations of data and parity to
verify against the checksum. Once the checksum matches, repair
can begin.
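
If I'm reading that right, the permutation trick boils down to
something like this toy sketch - single-parity XOR and a stand-in
checksum, nothing like the real code, just to check my understanding
(all the names and sizes here are made up):

  # Toy model of raidz1-style combinatorial reconstruction (illustration only).
  import hashlib

  def xor(a, b):
      return bytes(x ^ y for x, y in zip(a, b))

  def checksum(data):
      # ZFS keeps fletcher4/sha256 in the parent block pointer; sha256 here.
      return hashlib.sha256(data).digest()

  def read_block(columns, parity, expected_cksum):
      """columns: data sectors as read from disks; parity: XOR parity sector."""
      if checksum(b"".join(columns)) == expected_cksum:
          return b"".join(columns), None          # everything matched
      # Assume each data column is bad in turn and rebuild it from parity.
      for i in range(len(columns)):
          rebuilt = parity
          for j, col in enumerate(columns):
              if j != i:
                  rebuilt = xor(rebuilt, col)
          candidate = b"".join(columns[:i] + [rebuilt] + columns[i+1:])
          if checksum(candidate) == expected_cksum:
              return candidate, i                 # i is the disk that lied
      # (a real raidz1 would also consider the parity column bad, and
      #  raidz2/3 would try pairs/triples of bad columns)
      return None, None                           # more errors than parity can fix

  cols = [b"AAAA", b"BBBB", b"CCCC"]              # three data columns
  par = xor(xor(cols[0], cols[1]), cols[2])       # parity column
  good_cksum = checksum(b"".join(cols))
  cols[1] = b"BBBX"                               # silent bit rot on disk 1
  data, bad_disk = read_block(cols, par, good_cksum)
  print(bad_disk)                                 # -> 1, and data is correct again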


Ok, nice to have this statement confirmed so many times now ;)

How do per-disk cksum errors get counted for raidz, then? Thanks
to the permutations that fix an error, can we detect which disk:sector
returned mismatching data? Likewise, for unfixable errors we can't
know the faulty disk - unless one had explicitly reported an error?

So, if my 6-disk raidz2 couldn't fix the error, it either occurred
on 3 disks' parts of one stripe, or in RAM/CPU (a SPOF) before the
data and checksum were written to disk? In the latter case there is
definitely no single disk at fault for returning bad data, so the
per-disk cksum counters stay at zero? ;)



2*) How are the sector-ranges on-physical-disk addressed by
ZFS? Are there special block pointers with some sort of
physical LBA addresses in place of DVAs and with checksums?
I think there should be (claimed end-to-end checksumming)
but wanted to confirm.


No.


Ok, so basically there is the vdev_raidz_map_alloc() algorithm
to convert DVAs into leaf addresses, and it is always going to
be the same for all raidz's?

For example, such lack of explicit addressing would not let ZFS
reallocate one disk's bad media sector into another location -
the disk is always expected to do that reliably and successfully?
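
To make sure I understand that "pure arithmetic" nature, here is how
I picture it - a much-simplified toy, not the real vdev_raidz_map_alloc()
(which, as far as I can tell, lays each column out as a contiguous run on
its child disk, rotates parity between blocks, pads odd-sized maps and
works in ashift units). The point is just that the mapping is computed,
with no per-sector pointer table and no per-sector checksums:

  def raidz_map(offset_sectors, data_sectors, nchildren, nparity):
      """Guess where the sectors of one raidz block land (toy round-robin)."""
      total = data_sectors + nparity              # parity sectors listed first
      first_col = offset_sectors % nchildren
      rows_done = offset_sectors // nchildren
      layout = []
      for i in range(total):
          col = (first_col + i) % nchildren
          row = rows_done + (first_col + i) // nchildren
          role = "parity" if i < nparity else "data"
          layout.append((col, row, role))
      return layout

  # A 4-sector data block at vdev sector offset 100 on a 6-disk raidz2:
  for disk, sector, role in raidz_map(100, 4, 6, 2):
      print(f"child {disk}, sector {sector}: {role}")
  # -> parity on children 4 and 5, data on children 0-3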




2**) Alternatively, how does raidzN get into situation like
"I know there is an error somewhere, but don't know where"?
Does this signal simultaneous failures in different disks
of one stripe?
How *do* some things get fixed then - can only dittoed data
or metadata be salvaged from second good copies on raidZ?


No. See the seminal blog on raidz
http://blogs.oracle.com/bonwick/entry/raid_z



3) Is it true that in recent ZFS the metadata is stored in
a mirrored layout, even for raidzN pools? That is, does
the raidzN layout only apply to userdata blocks now?
If "yes":


Yes, for Solaris 11. No, for all other implementations, at this time.


Are there plans to do this for illumos, etc.?
I thought that my oi_148a's disks' IO patterns matched the
idea of mirrored metadata; now I'll have to explain that
data with some other theory ;)




3*) Is such mirroring applied over physical VDEVs or over
top-level VDEVs? For certain 512/4096 bytes of a metadata
block, are there two (ditto-mirror) or more (ditto over
raidz) physical sectors of storage directly involved?


It is done in the top-level vdev. For more information see the manual,


  What's New in ZFS? - Oracle Solaris ZFS Administration Guide
  docs.oracle.com/cd/E19963-01/html/821-1448/gbscy.html



3**) If small blocks, sized 1-or-few sectors, are fanned out
in incomplete raidz stripes (i.e. 512b parity + 512b data)
does this actually lead to +100% overhead for small data,
double that (200%) for dittoed data/copies=2?


The term "incomplete" does not apply here. The stripe written is
complete: data + parity.


Just to clear it up, I meant variable-width stripes as opposed
to "full-width stripe" writes in other RAIDs. That is, to
update one sector of data on a 6-disk RAID6 I'd need to
write a full 6-sector stripe; while on raidz2 I only need to
write three sectors (one data plus two parity).
No extra reply solicited here ;)


Does this apply to metadata in particular? ;)


Lost context here; for non-Solaris 11 implementations, metadata is
no different from data with copies=[23].


The question here was whether writes of metadata (assumed
to be a small number of sectors down to one per block)
incur writes of parity, of ditto copies, or of parity and
copies, increasing storage requirements by several times.

One background thought was that I wanted to make sense of
last year's experience with a zvol whose blocksize was
1 sector (4kb): the metadata overhead (consumption of
free space) was about the same as the userdata size. At that
time I thought it was because I had a 1-sector metadata
block to address each 1-sector data block of the volume;
but now I think the overhead would be closer to 400% of
the userdata size...
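
To put rough numbers on that guess, a back-of-the-envelope sketch of
the allocation as I understand it (only data + parity + padding; the
indirect/metadata blocks, which were probably the bulk of what I saw
on that zvol, are not counted here, and the padding rule is my reading
of the allocator, so treat the exact figures with suspicion):

  def raidz_alloc_sectors(data_sectors, ndisks, nparity, copies=1):
      """Rough sector count for one block on raidz (toy estimate)."""
      data_cols = ndisks - nparity
      rows = -(-data_sectors // data_cols)        # ceil(data / data columns)
      total = data_sectors + rows * nparity       # data plus per-row parity
      pad = (-total) % (nparity + 1)              # pad to a (nparity+1) multiple
      return (total + pad) * copies

  # 6-disk raidz2 with 4k sectors, one 4k data block (volblocksize=4k zvol):
  print(raidz_alloc_sectors(1, 6, 2))             # 3 sectors -> 200% overhead
  print(raidz_alloc_sectors(1, 6, 2, copies=2))   # 6 sectors when dittoed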




Does this large factor apply to ZVOLs with fixed block
size being defined "small" (i.e. down to the minimum 512b/4k
available for these disks)?


NB, there are a few slides in my ZFS tutorials where we talk about this.
http://www.slideshare.net/relling/useni

Re: [zfs-discuss] Does raidzN actually protect against bitrot? If yes - how?

2012-01-15 Thread Richard Elling
On Jan 15, 2012, at 8:49 PM, Bob Friesenhahn wrote:

> On Sun, 15 Jan 2012, Edward Ned Harvey wrote:
>> 
>> Such failures can happen undetected with or without ECC memory.  It's simply
>> less likely with ECC.  The whole thing about ECC memory...  It's just doing
>> parity.  It's a very weak checksum.  If corruption happens in memory, it's
> 
> I am beginning to become worried now.  ECC is more than "just doing parity".

It depends. ECC is a very generic term. Most "ECC memory" is SECDED, except
for the high-end servers and mainframes.
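
For the curious, here is a toy SECDED example over a 4-bit word - real
DIMMs typically do the same thing over 64 data bits with 8 check bits,
and the high-end parts do considerably more (chipkill and friends). It
corrects any single flipped bit and detects, but cannot correct, a
double flip:

  def encode(nibble):                             # 4 data bits -> 8 stored bits
      d = [(nibble >> i) & 1 for i in range(4)]
      bits = [0] * 8                              # bit 0 = overall parity
      bits[3], bits[5], bits[6], bits[7] = d
      bits[1] = bits[3] ^ bits[5] ^ bits[7]       # covers positions 3,5,7
      bits[2] = bits[3] ^ bits[6] ^ bits[7]       # covers positions 3,6,7
      bits[4] = bits[5] ^ bits[6] ^ bits[7]       # covers positions 5,6,7
      bits[0] = sum(bits[1:]) & 1                 # extended (overall) parity
      return bits

  def decode(bits):
      syndrome = 0
      for pos in range(1, 8):
          if bits[pos]:
              syndrome ^= pos
      parity_ok = (sum(bits) & 1) == 0
      if syndrome == 0 and parity_ok:
          status = "ok"
      elif not parity_ok:                         # single-bit error: fix it
          bits = bits[:]
          bits[syndrome] ^= 1                     # syndrome 0 means bit 0 itself
          status = "corrected"
      else:                                       # data below is not trustworthy
          status = "uncorrectable (double-bit error detected)"
      data = bits[3] | (bits[5] << 1) | (bits[6] << 2) | (bits[7] << 3)
      return status, data

  word = encode(0b1011)
  bad = word[:]; bad[6] ^= 1                      # one flipped bit
  print(decode(bad))                              # -> ('corrected', 11)
  bad[2] ^= 1                                     # a second flipped bit
  print(decode(bad))                              # -> detected, not corrected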

> http://en.wikipedia.org/wiki/Error-correcting_code
> http://en.wikipedia.org/wiki/ECC_memory
> 
> There have been enough sequential errors now (a form of corruption) that I
> think you should start doing research prior to posting.

I've been collecting a number of ZFS bit error reports (courtesy of fmdump -eV)
and I have never seen a single-bit error in a block. The errors appear to be of
the overwrite or stuck-at variety that impact multiple bits. This makes sense
because most disks already correct up to 8 bytes (or so) per sector.
 -- richard

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422



Re: [zfs-discuss] Does raidzN actually protect against bitrot? If yes - how?

2012-01-15 Thread Richard Elling
On Jan 15, 2012, at 7:04 AM, Jim Klimov wrote:

> "Does raidzN actually protect against bitrot?"
> That's a rather radical, possibly offensive, question that
> I have been asking lately.

Simple answer: no. raidz provides data protection. Checksums verify
data is correct. Two different parts of the storage solution.

> Reading up on theory of RAID5, I grasped the idea of the write
> hole (where one of the sectors of the stripe, such as the parity
> data, doesn't get written - leading to invalid data upon read).
> In general, I think the same applies to bitrot of data that was
> written successfully and corrupted later - either way, upon
> reading all sectors of the stripe, we don't have a valid result
> (for the XOR-parity example, XORing all bytes does not produce
> a zero).
> 
> The way I get it, RAID5/6 generally has no mechanism to detect
> *WHICH* sector was faulty, if all of them got read without
> error reports from the disk. Perhaps it won't even test whether
> parity matches and bytes zero out, as long as there were no read
> errors reported. In this case having a dead drive is better than
> having one with a silent corruption, because when one sector is
> known to be invalid or absent, its contents can be reconstructed
> thanks to other sectors and parity data.
> 
> I've seen statements (do I have to scavenge for prooflinks?)
> that raidzN {sometimes or always?} has no means to detect
> which drive produced bad data either. In this case in output
> of "zpool status" we see zero CKSUM error-counts on leaf disk
> levels, and non-zero counts on raidzN levels.

raidz uses an algorithm to try permutations of data and parity to
verify against the checksum. Once the checksum matches, repair
can begin.

> Opposed to that, on mirrors (which are used in examples of
> ZFS's on-the-fly data repairs in all presentations), we do
> always know the faulty source of data and can repair it
> with a verifiable good source, if present.

Mirrors are no different: ZFS tries each side of the mirror until it finds
data that matches the checksum.

> In a real-life example, on my 6-disk raidz2 pool I see some
> irreparable corruptions as well as several "repaired" detected
> errors. So I have a set of questions here, outlined below...
> 
>  (DISCLAIMER: I haven't finished reading through the on-disk
>  format spec in detail, but that PDF document is 5 years
>  old anyway and I've heard some things have changed).
> 
> 
> 1) How does raidzN protect against bit-rot without known full
>   death of a component disk, if it at all does?
>   Or does it only help against "loud corruption" where the
>   disk reports a sector-access error or dies completely?

raidz cannot be separated from the ZFS checksum verification
in this answer.

> 2) Do the "leaf blocks" (on-disk sectors or ranges of sectors
>   that belong to a raidzN stripe) have any ZFS checksums of
>   their own? That is, can ZFS determine which of the disks
>   produced invalid data and reconstruct the whole stripe?

No. Yes.

> 2*) How are the sector-ranges on-physical-disk addressed by
>   ZFS? Are there special block pointers with some sort of
>   physical LBA addresses in place of DVAs and with checksums?
>   I think there should be (claimed end-to-end checksumming)
>   but wanted to confirm.

No.

> 2**) Alternatively, how does raidzN get into situation like
>   "I know there is an error somewhere, but don't know where"?
>   Does this signal simultaneous failures in different disks
>   of one stripe?
>   How *do* some things get fixed then - can only dittoed data
>   or metadata be salvaged from second good copies on raidZ?

No. See the seminal blog on raidz
http://blogs.oracle.com/bonwick/entry/raid_z

> 
> 3) Is it true that in recent ZFS the metadata is stored in
>   a mirrored layout, even for raidzN pools? That is, does
>   the raidzN layout only apply to userdata blocks now?
>   If "yes":

Yes, for Solaris 11. No, for all other implementations, at this time.

> 3*)  Is such mirroring applied over physical VDEVs or over
>   top-level VDEVs? For certain 512/4096 bytes of a metadata
>   block, are there two (ditto-mirror) or more (ditto over
>   raidz) physical sectors of storage directly involved?

It is done in the top-level vdev. For more information see the manual,
What's New in ZFS? - Oracle Solaris ZFS Administration Guide
docs.oracle.com/cd/E19963-01/html/821-1448/gbscy.html

> 3**) If small blocks, sized 1-or-few sectors, are fanned out
>   in incomplete raidz stripes (i.e. 512b parity + 512b data)
>   does this actually lead to +100% overhead for small data,
>   double that (200%) for dittoed data/copies=2?

The term "incomplete" does not apply here. The stripe written is 
complete: data + parity.

>   Does this apply to metadata in particular? ;)

Lost context here; for non-Solaris 11 implementations, metadata is
no different from data with copies=[23].

>   Does this large factor apply to ZVOLs with fixed block
>   size being defined "small" (i.e. down to the 

Re: [zfs-discuss] Does raidzN actually protect against bitrot? If yes - how?

2012-01-15 Thread Bob Friesenhahn

On Sun, 15 Jan 2012, Edward Ned Harvey wrote:


Such failures can happen undetected with or without ECC memory.  It's simply
less likely with ECC.  The whole thing about ECC memory...  It's just doing
parity.  It's a very weak checksum.  If corruption happens in memory, it's


I am beginning to become worried now.  ECC is more than "just doing 
parity".


http://en.wikipedia.org/wiki/Error-correcting_code
http://en.wikipedia.org/wiki/ECC_memory

There have been enough sequential errors now (a form of corruption) 
that I think you should start doing research prior to posting.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/


Re: [zfs-discuss] Does raidzN actually protect against bitrot? If yes - how?

2012-01-15 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Gary Mills
> 
> There's actually no such thing as bitrot on a disk.  Each sector on
> the disk is accompanied by a CRC that's verified by the disk
> controller on each read.  It will either return correct data or report
> an unreadable sector.  There's nothing in between.

There is something in between:  corrupt data that happens to pass the
hardware checksum.  You said CRC, but CRC is a specific algorithm.  The
various disk manufacturers can implement different algorithms,
including parity, CRC, LDPC, or whatever.

Of all the algorithms they use inside the actual disk, CRC would be a
relatively strong one.  And if there were "nothing in between" absolute
accuracy and absolute error, then there would be no point in all the
stronger checksum algorithms, such as SHA256.  There would be no point for
ZFS to bother doing checksumming, if the disks could never silently return
corrupted data.



Re: [zfs-discuss] Does raidzN actually protect against bitrot? If yes - how?

2012-01-15 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Jim Klimov
> 
> 2012-01-15 19:38, Edward Ned Harvey wrote:
>  >> 1) How does raidzN protect against bit-rot without known full
>  >>death of a component disk, if it at all does?
>  > zfs can read disks 1,2,3,4...  Then read disks 1,2,3,5...
>  > Then read disks 1,2,4,5...  ZFS can figure out which disk
>  > returned the faulty data, UNLESS the disk actually returns
>  > correct data upon subsequent retries.
> 
> Makes sense, if ZFS does actually do that ;)
> 
> Counter-examples:
> 1) For several scrubs in a row, my pool consistently found two
> vdev errors and one pool error with zero per-disk errors
> (further leading to error in some object :<0x0>).
> If the disk-read errors were transient, sometimes returning
> correct data (i.e. bad sector relocation was successful in
> the background), ZFS would receive good blocks on further
> scrubs - shouldn't it?

I can't say this is the explanation for your situation, but I can offer it
as one possible explanation:

Suppose your system is in operation, and you get corruption in your CPU or
RAM, so it calculates the wrong cksum for the data that is about to be
written.  The data gets written, along with the wrong cksum.

Later, you come along and read that data.  You discover the cksum error,
it's unrecoverable, but there are no disk errors.

I have certainly experienced CPUs that perform incorrect calculations
before, and I have certainly encountered errant memory before.  Usually
when a component starts failing like that, it progressively gets worse (or
at least you can usually run some diag utils) and you can identify the
failing component.  But not always.

Such failures can happen undetected with or without ECC memory.  It's simply
less likely with ECC.  The whole thing about ECC memory...  It's just doing
parity.  It's a very weak checksum.  If corruption happens in memory, it's
FAR more likely that the error will go undetected by ECC as compared to the
Fletcher or SHA checksum that's being used by ZFS.

Even when you get down to the actual disk...  All disks store parity /
checksum information, using their FEC chip.  All disks will attempt to
detect and correct errors they encounter (this is even stronger than ECC
memory).  But nothing's perfect, not even SHA...  But the accuracy of
Fletcher or SHA is far, far greater than the ECC or FEC being used by your
memory and disks.
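
To illustrate with a toy example - plain even parity (one bit of
redundancy) versus a strong hash over the same buffer. Real ECC DIMMs
use SECDED rather than bare parity, so this overstates the weakness of
the memory side, but the "small code, small guarantee" point stands:

  import hashlib

  data = bytearray(b"some block of user data" * 16)
  parity = bin(int.from_bytes(data, "big")).count("1") & 1
  digest = hashlib.sha256(data).digest()

  data[5] ^= 0x01                                 # flip one bit...
  data[9] ^= 0x01                                 # ...and a second bit

  parity_after = bin(int.from_bytes(data, "big")).count("1") & 1
  print("parity still matches:", parity_after == parity)                   # True
  print("sha256 still matches:", hashlib.sha256(data).digest() == digest)  # False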



Re: [zfs-discuss] Does raidzN actually protect against bitrot? If yes - how?

2012-01-15 Thread Jim Klimov

2012-01-15 20:43, Gary Mills wrote:

On Sun, Jan 15, 2012 at 04:06:33PM +, Peter Tribble wrote:

On Sun, Jan 15, 2012 at 3:04 PM, Jim Klimov  wrote:

"Does raidzN actually protect against bitrot?"
That's a rather radical, possibly offensive, question that
I have been asking lately.


Yup, it does. That's why many of us use it.


There's actually no such thing as bitrot on a disk.  Each sector on
the disk is accompanied by a CRC that's verified by the disk
controller on each read.  It will either return correct data or report
an unreadable sector.


What about UBER (uncorrectable bit-error rates)?
For example, the small but non-zero chance of some other sector's contents
matching the CRC code (circa 10^-14 - 10^-16)?
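
Just to put a number on that spec-sheet figure, a quick
back-of-the-envelope with made-up but typical values (1e-14
unrecoverable errors per bit read, one full pass over a 2 TB disk):

  uber = 1e-14                      # unrecoverable errors per bit read (spec)
  bytes_read = 2e12                 # one full read of a 2 TB disk
  bits_read = bytes_read * 8

  expected_errors = bits_read * uber
  p_at_least_one = 1 - (1 - uber) ** bits_read
  print(f"expected unrecoverable errors: {expected_errors:.2f}")   # ~0.16
  print(f"chance of at least one error:  {p_at_least_one:.1%}")    # ~15%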

If hashes were perfect with zero collisions, they could be used
instead of original data and be much more compact, and lossless
compression algorithms would always return smaller data than
*any* random original stream ;)

Even ZFS dedup, with its 10^-77 collision chance, offers a
verify-on-write mode.

>  There's nothing in between.

Also "inbetween" there's cabling, contacts and dialog protocols.
AFAIK some protocols and/or implementations don't bother with the
on-wire CRC/ECC, perhaps the IDE (and maybe consumer SATA) protocols?


Thanks for replies,
//Jim Klimov


Re: [zfs-discuss] Does raidzN actually protect against bitrot? If yes - how?

2012-01-15 Thread Andrew Gabriel

Gary Mills wrote:

On Sun, Jan 15, 2012 at 04:06:33PM +, Peter Tribble wrote:
  

On Sun, Jan 15, 2012 at 3:04 PM, Jim Klimov  wrote:


"Does raidzN actually protect against bitrot?"
That's a rather radical, possibly offensive, question that
I have been asking lately.
  

Yup, it does. That's why many of us use it.



There's actually no such thing as bitrot on a disk.  Each sector on
the disk is accompanied by a CRC that's verified by the disk
controller on each read.  It will either return correct data or report
an unreadable sector.  There's nothing in between.
  


Actually, there are a number of disk firmware and cache faults
in between, which zfs has picked up over the years.



--
Andrew Gabriel


Re: [zfs-discuss] Does raidzN actually protect against bitrot? If yes - how?

2012-01-15 Thread Gary Mills
On Sun, Jan 15, 2012 at 04:06:33PM +, Peter Tribble wrote:
> On Sun, Jan 15, 2012 at 3:04 PM, Jim Klimov  wrote:
> > "Does raidzN actually protect against bitrot?"
> > That's a rather radical, possibly offensive, question that
> > I have been asking lately.
> 
> Yup, it does. That's why many of us use it.

There's actually no such thing as bitrot on a disk.  Each sector on
the disk is accompanied by a CRC that's verified by the disk
controller on each read.  It will either return correct data or report
an unreadable sector.  There's nothing in between.

Of course, if something outside of ZFS writes to the disk, then data
belonging to ZFS will be modified.  I've heard of RAID controllers or
SAN devices doing this when they modify the disk geometry or reserved
areas on the disk.

-- 
-Gary Mills--refurb--Winnipeg, Manitoba, Canada-


Re: [zfs-discuss] Does raidzN actually protect against bitrot? If yes - how?

2012-01-15 Thread Jim Klimov

2012-01-15 20:06, Peter Tribble wrote:

(Try writing over one
half of a zfs mirror with dd and watch it cheerfully repair your data
without an actual error in sight.)


Are you certain it always works?
AFAIK, mirror reads are round-robined (which leads to parallel
read performance boosts). Only if your read happens to hit the
mismatching copy would the block be reconstructed from the other copy.

And scrubs are one mechanism to force such reads of all copies
of all blocks and trigger reconstructions as needed.
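
The way I picture the mechanism (a toy sketch, not real ZFS code - just
the shape of a checksum-verified mirror read with self-healing, which a
scrub then forces for every block and every copy):

  import hashlib

  def checksum(buf):
      return hashlib.sha256(buf).digest()

  def read_mirror(sides, expected, prefer=0):
      """Read one copy; fall back and self-heal if it fails the checksum."""
      for i in range(len(sides)):
          side = (prefer + i) % len(sides)        # round-robin-ish choice
          if checksum(sides[side]) == expected:
              for j in range(len(sides)):         # rewrite any mismatching copy
                  if checksum(sides[j]) != expected:
                      sides[j] = sides[side]
              return sides[side]
      raise IOError("all copies fail the checksum")

  good = b"the original data"
  cksum = checksum(good)
  sides = [b"garbage written over side 0", good]
  print(read_mirror(sides, cksum, prefer=0) == good)   # True
  print(sides[0] == good)                              # True - side 0 healed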




1) How does raidzN protect against bit-rot without known full
   death of a component disk, if it at all does?
   Or does it only help against "loud corruption" where the
   disk reports a sector-access error or dies completely?

2) Do the "leaf blocks" (on-disk sectors or ranges of sectors
   that belong to a raidzN stripe) have any ZFS checksums of
   their own? That is, can ZFS determine which of the disks
   produced invalid data and reconstruct the whole stripe?


No, the checksum is against the whole stripe. And you do the
combinatorial reconstruction to work out which is bad.


Hmmm, in this case, how does ZFS precisely know which disks
contain which sector ranges of variable-width stripes?

In Max Bruning's weblog I saw a reference to the kernel routine
vdev_raidz_map_alloc(). Without a layer of pointers to the data
sectors on each physical vdev (a layer which, as I had hoped, might
contain checksums or ECCs), this seems like a fundamental,
unchangeable part of ZFS raidz. Is that true?




2**) Alternatively, how does raidzN get into situation like
   "I know there is an error somewhere, but don't know where"?
   Does this signal simultaneous failures in different disks
   of one stripe?


If you have raidz1, and two devices give bad data, then you don't
have enough redundancy to do the reconstruction. I've not seen this
myself for random bitrot, but it's the sort of thing that can give you
trouble if you lose a whole disk and then hit a bad block on another
device during resilver.

(Regular scrubs to identify and fix bad individual blocks before you have
to do a resilver are therefore a good thing.)


That's what I did more or less regularly. Then one nice scrub
gave me such a condition... :(




   How *do* some things get fixed then - can only dittoed data
   or metadata be salvaged from second good copies on raidZ?


You can recover anything you have enough redundancy for. Which
means everything, up to the redundancy of the vdev. Beyond that,
you may be able to recover dittoed data (of which metadata is just
one example) even if you've lost an entire vdev.


And, now, with my one pool-level error and two raidz-level
errors, is it correct to conclude that attempts to read
both dittoed copies of pool::<0x0>, whatever
that is, have failed?

In particular, shouldn't the metadata redundancy (mirroring
and/or copies=2 over raidz or over its component disks) point
to specific disks that contained the block and failed to
produce it correctly?..

Thanks all for the replies,
//Jim Klimov


Re: [zfs-discuss] Does raidzN actually protect against bitrot? If yes - how?

2012-01-15 Thread Bob Friesenhahn

On Sun, 15 Jan 2012, Jim Klimov wrote:


1) How does raidzN protect against bit-rot without known full
  death of a component disk, if it at all does?
  Or does it only help against "loud corruption" where the
  disk reports a sector-access error or dies completely?


Here is a layman's answer since I am not a zfs developer:

ZFS maintains a checksum for a full block, which is comprised of
multiple chunks stored on different disks.  If one of these chunks
proves to be faulty (i.e. the drive reports an error while reading
that chunk, or the overall block checksum fails) then zfs is able to
reconstruct the data using the 'N' redundancy chunks based on
distributed parity.  If the drive reports success but the data it
returned is wrong, then the zfs block checksum will fail.  Given a
sufficiently strong checksum algorithm, there is only one permutation
of reconstructed data which will satisfy the zfs block checksum, so zfs
can try the various reconstruction permutations (assuming that each
returned chunk is defective in turn) until it finds the permutation
which re-creates the correct data.  At this point, it knows which
chunk was bad and can rewrite it or take other recovery action.  If
the number of bad chunks exceeds the available redundancy level, then
the bad data can be detected, but can't be corrected.


The fly in the ointment is that system memory needs to operate 
reliably, which is why ECC memory is recommended/required in order for 
zfs to offer full data reliability assurance.



One frequently announced weakness in ZFS is the relatively small
pool of engineering talent knowledgeable enough to hack ZFS and
develop new features (i.e. the ex-Sunnites and very few determined
other individuals): "We might do this, but we have few resources
and already have other more pressing priorities".


One might assume that the pool of ZFS knowledge is smaller than that of
other popular filesystems, but it seems to be currently larger than for
UFS, FFS, EXT4, and XFS.  For example, Kirk McKusick is still the one fixing
reported bugs in BSD FFS.  As filesystems become more mature, the 
number of people fluent in how their implementation works grows 
smaller due to a diminishing need to fix problems.  Filesystems which 
are the brainchild of just one person (e.g. EXT4, Reiserfs) become 
subject to the abilities of that one person.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/


Re: [zfs-discuss] Does raidzN actually protect against bitrot? If yes - how?

2012-01-15 Thread Jim Klimov

2012-01-15 19:38, Edward Ned Harvey wrote:
>> 1) How does raidzN protect against bit-rot without known full
>>death of a component disk, if it at all does?
> zfs can read disks 1,2,3,4...  Then read disks 1,2,3,5...
> Then read disks 1,2,4,5...  ZFS can figure out which disk
> returned the faulty data, UNLESS the disk actually returns
> correct data upon subsequent retries.

Makes sense, if ZFS does actually do that ;)

Counter-examples:
1) For several scrubs in a row, my pool consistently found two
   vdev errors and one pool error with zero per-disk errors
   (further leading to error in some object :<0x0>).
   If the disk-read errors were transient, sometimes returning
   correct data (i.e. bad sector relocation was successful in
   the background), ZFS would receive good blocks on further
   scrubs - shouldn't it?

2) Even with one bad sector consistently in place, if ZFS can
   deduce the correct original block data, why report the error
   as uncorrectable at all (and especially so many times)?

This leaves me thinking of two on-disk errors, and/or lack of
checksums for leaf blocks, as the possible reasons for such
detected raidz errors with undetected faulty individual disks.

Any other options I overlooked?


> You know the open-source question in regards to ZFS is pretty much
> concluded, right?  What Oracle called zpool version 28 was the last open
> source version, currently in use on Nexenta, FreeBSD, and some others.  The
> illumos project has continued development, minimally.  If you think the
> development effort is resource-limited in Oracle working on ZFS, just try
> the open source illumos community...


I do try it. I also see some companies like Nexenta or Joyent
having discussed the NetApp problem and having moved on, betting
their work on open-sourced ZFS.

Also, Oracle's closed ZFS is actually of little relevance to me
or other SOHO users (laptops, home NASes, etc.).
Since Oracle doesn't deal with small customers, and people still
have problems buying or getting support for small-volume stuff,
or find Oracle's offerings prohibitively expensive, it is hard
to get Oracle to notice a bug/RFE report not backed by money.
There is nothing inherently bad about the business model; Sun
also had it (while being more open to suggestions). It's just
that in this model SOHO users have no influence on ZFS, and it
becomes a closed proprietary gadget like any other FS, with no
engineering interest in enhancing it. And this is coupled with
limited understanding of whether you have a right to use it
at all without getting sued by Oracle (e.g. for trying to put
Solaris 11 into production without paying the tax).

Over the past year I have proposed or discussed a number of
features for ZFS, and while there is little chance that illumos
developers would implement any of that soon, there is near-zero
chance that Oracle ever will. And there is a greater chance that
I or some other developer would dig into such RFEs and
publish a solution - especially if such a developer is helped
with the theory.

//Jim


Re: [zfs-discuss] Does raidzN actually protect against bitrot? If yes - how?

2012-01-15 Thread Peter Tribble
On Sun, Jan 15, 2012 at 3:04 PM, Jim Klimov  wrote:
> "Does raidzN actually protect against bitrot?"
> That's a rather radical, possibly offensive, question that
> I have been asking lately.

Yup, it does. That's why many of us use it.

> The way I get it, RAID5/6 generally has no mechanism to detect
> *WHICH* sector was faulty, if all of them got read without
> error reports from the disk.

It validates the checksum on every read. If it doesn't match, then
one of the devices (at least) is returning incorrect data. So it simply
tries reconstruction assuming each device in turn is bad until it gets
the right answer. That gives you the correct data, and tells you which
device was wrong, and then you write back the correct data to the
errant device.

> Perhaps it won't even test whether
> parity matches and bytes zero out, as long as there were no read
> errors reported.

Absolutely not. It always checks, regardless. (Try writing over one
half of a zfs mirror with dd and watch it cheerfully repair your data
without an actual error in sight.)

> 1) How does raidzN protect against bit-rot without known full
>   death of a component disk, if it at all does?
>   Or does it only help against "loud corruption" where the
>   disk reports a sector-access error or dies completely?
>
> 2) Do the "leaf blocks" (on-disk sectors or ranges of sectors
>   that belong to a raidzN stripe) have any ZFS checksums of
>   their own? That is, can ZFS determine which of the disks
>   produced invalid data and reconstruct the whole stripe?

No, the checksum is against the whole stripe. And you do the
combinatorial reconstruction to work out which is bad.

> 2**) Alternatively, how does raidzN get into situation like
>   "I know there is an error somewhere, but don't know where"?
>   Does this signal simultaneous failures in different disks
>   of one stripe?

If you have raidz1, and two devices give bad data, then you don't
have enough redundancy to do the reconstruction. I've not seen this
myself for random bitrot, but it's the sort of thing that can give you
trouble if you lose a whole disk and then hit a bad block on another
device during resilver.

(Regular scrubs to identify and fix bad individual blocks before you have
to do a resilver are therefore a good thing.)

>   How *do* some things get fixed then - can only dittoed data
>   or metadata be salvaged from second good copies on raidZ?

You can recover anything you have enough redundancy for. Which
means everything, up to the redundancy of the vdev. Beyond that,
you may be able to recover dittoed data (of which metadata is just
one example) even if you've lost an entire vdev.

-- 
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/


Re: [zfs-discuss] Does raidzN actually protect against bitrot? If yes - how?

2012-01-15 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Jim Klimov
> 
> 1) How does raidzN protect against bit-rot without known full
> death of a component disk, if it at all does?
> Or does it only help against "loud corruption" where the
> disk reports a sector-access error or dies completely?

Whenever raidzN encounters a cksum error, it will read the redundant copies
until it finds one that passes the cksum.  The only ways you get an
unrecoverable cksum error are (a) more than N disks are failing, or (b) the
storage subsystem - i.e. HBA, bus, memory, cpu, etc. - is failing.

Let's suppose one disk in a raidz (1) returns corrupt data silently.  Recall
that there is enough parity redundancy to recover from *any* complete disk
failure.  That means zfs can read disks 1,2,3,4...  Then read disks
1,2,3,5... Then read disks 1,2,4,5...  ZFS can figure out which disk
returned the faulty data, UNLESS the disk actually returns correct data upon
subsequent retries.


> How *do* some things get fixed then - can only dittoed data
> or metadata be salvaged from second good copies on raidZ?

dittoed data is a layer of redundancy over and above your sysadmin-chosen
level of redundancy (the raidz/mirror level).


> One frequently announced weakness in ZFS is the relatively small
> pool of engineering talent knowledgeable enough to hack ZFS and
> develop new features (i.e. the ex-Sunnites and very few determined
> other individuals): "We might do this, but we have few resources
> and already have other more pressing priorities".

While that may be true, compare it to everything else.  I prefer ZFS over
OnTap any day of the week.  And although btrfs will be good someday, it's
just barely suitable now for *any* production purposes.  

I don't see the competition developing faster than ZFS.


> Likewise with opensource: yes, the code is there. A developer
> might read into it and possibly comprehend some in a year or so.
> Or he could spend a few days midway (when he knows enough to
> pose hard questions not googlable in some FAQ yet) in yes-no
> question sessions with the more knowledgeable people, and become
> ready to work in just a few weeks from start. Wouldn't that be
> wonderful for ZFS in general? :)

You know the open-source question in regards to ZFS is pretty much
concluded, right?  What Oracle called zpool version 28 was the last open
source version, currently in use on Nexenta, FreeBSD, and some others.  The
illumos project has continued development, minimally.  If you think the
development effort is resource-limited in Oracle working on ZFS, just try
the open source illumos community...  Since close-sourcing a little over a
year ago, Oracle has continued developing and releasing new features...
They're now at ... what, zpool version 37 or something?  illumos has
continued developing too...  Much less.

Yes, it would be nice to see more open source development, but I think the
main obstacle is the COW patent issue.  Oracle is now immune from the NetApp
lawsuit over it, but if you want somebody ... for example perhaps Apple ...
to contribute development resources to the open source branch, they'll have
to duke it out with NetApp using their own legal resources.  So far, the
obstacle is just large enough that we don't see any other organizations
contributing significantly.

Linux is going with btrfs.  MS has their own thing.  Oracle continues with
ZFS closed source.  Apple needs a filesystem that doesn't suck, but they're
not showing inclinations toward ZFS or anything else that I know of.
