Re: [zfs-discuss] Idea: ZFS and on-disk ECC for blocks

2012-01-12 Thread David Magda
On Wed, January 11, 2012 11:40, Nico Williams wrote:

 I don't find this terribly attractive, but maybe I'm just not looking
 at it the right way.  Perhaps there is a killer enterprise feature for
 ECC here: stretching MTTDL in the face of a device failure in a mirror
 or raid-z configuration (but if failures are typically of whole drives
 rather than individual blocks, then this wouldn't help).  But without
 a good answer for where to store the ECC for the largest blocks, I
 don't see this happening.

Not so much for blocks, but talking more with sectors, there's the T10
(SCSI) Data Integrity Field (DIF):

http://www.usenix.org/event/lsf07/tech/petersen.pdf

This is a controller-drive specification. For host-controller
communication, the Data Integrity Extensions (DIX) have been defined:

http://oss.oracle.com/~mkp/docs/ols2008-petersen.pdf

It's a pity that the field is only eight bytes; if it were larger, a
useful cryptographic [HCUG]MAC could be stored there by disk-encryption
software. Perhaps with 4K-sector Advanced Format drives a similar,
larger field will be defined.
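
For reference, the eight bytes break down roughly like this (a sketch of
the Type 1 layout described in the papers above, not code lifted from any
particular stack):

  #include <stdint.h>

  /* T10 DIF tuple appended to each 512-byte sector (Type 1 protection);
   * the fields travel big-endian on the wire. */
  struct t10_dif_tuple {
      uint16_t guard_tag;   /* CRC-16 over the 512 bytes of sector data */
      uint16_t app_tag;     /* application tag, opaque to the target    */
      uint32_t ref_tag;     /* low 32 bits of the sector's LBA (Type 1) */
  } __attribute__((packed));

A two-byte CRC per sector is enough to catch corruption in flight, but,
as noted, nowhere near enough to hold a cryptographic MAC.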


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Idea: ZFS and on-disk ECC for blocks

2012-01-12 Thread Jim Klimov

I guess I have another practical rationale for a second
checksum, be it ECC or not: my scrubbing pool found some
unrecoverable errors. Luckily, for those files I still
have external originals, so I rsynced them over. Still,
there is one file whose broken prehistory is referenced
in snapshots, and properly fixing that would probably
require me to resend the whole stack of snapshots.
That's uncool, but a subject for another thread.

This thread is about checksums - namely, what are
our options when they do not match the data? As many
blog posts exploring ZDB have reported, there are cases
where the checksums are broken (e.g. bit rot in block
pointers, or rather in RAM while the checksum was being
calculated, so every ditto copy of the BP carries the
error) while the file data is in fact intact (extracted
from disk with ZDB or DD and compared against other copies).

For these cases bloggers asked (in vain): why is an
admin not allowed to confirm the validity of the end-user
data and have the system reconstruct (re-checksum) the
metadata for it? IMHO, that's a valid RFE.

While the system is scrubbing, I've been reading up on theory
and found a nice text, "Keeping Bits Safe: How Hard Can It Be?"
by David Rosenthal [1], where I stumbled upon an interesting
thought:
  The bits forming the digest are no different from the
  bits forming the data; neither is magically incorruptible.
  ...Applications need to know whether the digest has
  been changed.

In our case, where the original checksum in the block
pointer could have been corrupted in the (non-ECC) RAM of
my home NAS just before it was dittoed to disk, another
checksum - a copy of this same one, or one calculated
differently - could give ZFS a means to determine whether
it is the data or one of the checksums that got corrupted
(or all of them). Of course, this is not an absolute
protection method, but it could reduce the cases where
pools have to be destroyed, recreated and restored from tape.
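
Just to illustrate the decision I have in mind (a toy sketch, not ZFS
code; cksum() is a placeholder for whatever algorithm the pool uses, and
for simplicity I assume both stored copies use the same algorithm):

  #include <stddef.h>
  #include <stdint.h>
  #include <stdbool.h>

  typedef enum { DATA_OK, DATA_BAD, COPY_A_BAD, COPY_B_BAD, UNDECIDED } verdict_t;

  extern uint64_t cksum(const void *buf, size_t len);   /* placeholder */

  /* Given a data block and two independently stored copies of its checksum,
   * try to tell *what* rotted instead of only knowing that *something* did. */
  verdict_t arbitrate(const void *data, size_t len,
                      uint64_t stored_a, uint64_t stored_b)
  {
      uint64_t actual = cksum(data, len);
      bool a_ok = (actual == stored_a);
      bool b_ok = (actual == stored_b);

      if (a_ok && b_ok) return DATA_OK;      /* everything agrees           */
      if (a_ok)         return COPY_B_BAD;   /* data plus one copy agree... */
      if (b_ok)         return COPY_A_BAD;   /* ...so distrust the odd one  */
      return (stored_a == stored_b) ? DATA_BAD   /* copies agree, data not  */
                                    : UNDECIDED; /* everything disagrees    */
  }

The UNDECIDED outcome (and data that broke before any checksum was taken)
is why this is not absolute protection; the two COPY_*_BAD outcomes are
exactly the cases where today the good data is simply declared lost.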

It is my belief that using dedup contributed to my issue:
there is a lot more updating of block pointers and their
checksums, so it gradually becomes more likely that a
metadata (checksum) block gets broken (e.g. in non-ECC
RAM), while the written-once userdata remains intact...

--
[1] http://queue.acm.org/detail.cfm?id=1866298
While the text discusses what most ZFSers already know -
bit rot, MTTDL and such - it does so in great detail and
with many examples, and gave me a better understanding of
it all even though I have been dealing with this for
several years now. A good read; I recommend it to others ;)

//Jim Klimov
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Idea: ZFS and on-disk ECC for blocks

2012-01-12 Thread Jim Klimov

2012-01-13 2:34, Jim Klimov wrote:

I guess I have another practical rationale for a second
checksum, be it ECC or not: my scrubbing pool found some
unrecoverable errors.
...Applications need to know whether the digest has
been changed.


As Richard reminded me in another thread, both metadata
and DDT can contain checksums, hopefully of the same data
block. So for deduped data we may already have a means
to test whether the data or the checksum is incorrect...
Incidentally, the problem also seems more critical for
the deduped data ;)

Just a thought...
//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Idea: ZFS and on-disk ECC for blocks

2012-01-12 Thread Daniel Carosone
On Fri, Jan 13, 2012 at 04:48:44AM +0400, Jim Klimov wrote:
 As Richard reminded me in another thread, both metadata
 and DDT can contain checksums, hopefully of the same data
 block. So for deduped data we may already have a means
 to test whether the data or the checksum is incorrect...

It's the same checksum, calculated once - this is why turning dedup=on
implies setting checksum=sha256.

 Incidentally, the problem also seems more critical for
 the deduped data ;)

Yes.  Add this to the list of reasons to use ECC, and add 'have ECC'
to the list of constraints on the circumstances where using dedup is
appropriate.

--
Dan.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Idea: ZFS and on-disk ECC for blocks

2012-01-12 Thread Richard Elling
On Jan 12, 2012, at 2:34 PM, Jim Klimov wrote:

 I guess I have another practical rationale for a second
 checksum, be it ECC or not: my scrubbing pool found some
 unrecoverable errors. Luckily, for those files I still
 have external originals, so I rsynced them over. Still,
 there is one file whose broken prehistory is referenced
 in snapshots, and properly fixing that would probably
 require me to resend the whole stack of snapshots.
 That's uncool, but a subject for another thread.
 
 This thread is about checksums - namely, what are
 our options when they do not match the data? As many
 blog posts exploring ZDB have reported, there are cases
 where the checksums are broken (e.g. bit rot in block
 pointers, or rather in RAM while the checksum was being
 calculated, so every ditto copy of the BP carries the
 error) while the file data is in fact intact (extracted
 from disk with ZDB or DD and compared against other copies).

Metadata is at least doubly redundant and checksummed.
Can you provide links to posts that describe this failure mode?

 For these cases bloggers asked (in vain): why is an
 admin not allowed to confirm the validity of the end-user
 data and have the system reconstruct (re-checksum) the
 metadata for it? IMHO, that's a valid RFE.

Metadata is COW, too. Rewriting the data also rewrites the metadata.

 While the system is scrubbing, I've been reading up on theory
 and found a nice text, "Keeping Bits Safe: How Hard Can It Be?"
 by David Rosenthal [1], where I stumbled upon an interesting
 thought:
  The bits forming the digest are no different from the
  bits forming the data; neither is magically incorruptible.
  ...Applications need to know whether the digest has
  been changed.

Hence for ZFS, the checksum (digest) is kept in the parent metadata.

The condition described above can affect T10 DIF-style checksums, but not ZFS.
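
In other words (an illustrative sketch only, not the actual ZFS code path;
the names are made up):

  #include <stddef.h>
  #include <stdint.h>
  #include <stdbool.h>

  /* Every block is reached through a pointer, held by its parent, that
   * carries the expected digest of the child - all the way up to the
   * uberblock.  The digest never sits next to the data it covers. */
  struct ptr {
      uint64_t offset;   /* where the child block lives */
      uint64_t size;
      uint64_t digest;   /* checksum of the child block */
  };

  extern uint64_t cksum(const void *buf, size_t len);        /* placeholder */
  extern void    *read_raw(uint64_t offset, uint64_t size);  /* placeholder */

  void *read_verified(const struct ptr *p, bool *ok)
  {
      void *buf = read_raw(p->offset, p->size);
      *ok = (cksum(buf, p->size) == p->digest);  /* verified against parent */
      return buf;   /* on failure the reader moves on to a ditto copy */
  }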

 In our case, where the original checksum in the block
 pointer could have been corrupted in the (non-ECC) RAM of
 my home NAS just before it was dittoed to disk, another
 checksum - a copy of this same one, or one calculated
 differently - could give ZFS a means to determine whether
 it is the data or one of the checksums that got corrupted
 (or all of them). Of course, this is not an absolute
 protection method, but it could reduce the cases where
 pools have to be destroyed, recreated and restored from tape.

Nope.

 It is my belief that using dedup contributed to my issue:
 there is a lot more updating of block pointers and their
 checksums, so it gradually becomes more likely that a
 metadata (checksum) block gets broken (e.g. in non-ECC
 RAM), while the written-once userdata remains intact...
 
 --
 [1] http://queue.acm.org/detail.cfm?id=1866298
 While the text discusses what most ZFSers already know -
 bit rot, MTTDL and such - it does so in great detail and
 with many examples, and gave me a better understanding of
 it all even though I have been dealing with this for
 several years now. A good read; I recommend it to others ;)
 
 //Jim Klimov

 -- richard

-- 

ZFS and performance consulting
http://www.RichardElling.com
SCALE 10x, Los Angeles, Jan 20-22, 2012

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Idea: ZFS and on-disk ECC for blocks

2012-01-12 Thread Daniel Carosone
On Thu, Jan 12, 2012 at 05:01:48PM -0800, Richard Elling wrote:
  This thread is about checksums - namely, what are
  our options when they do not match the data? As many
  blog posts exploring ZDB have reported, there are cases
  where the checksums are broken (e.g. bit rot in block
  pointers, or rather in RAM while the checksum was being
  calculated, so every ditto copy of the BP carries the
  error) while the file data is in fact intact (extracted
  from disk with ZDB or DD and compared against other copies).
 
 Metadata is at least doubly redundant and checksummed.

The implication is that the original calculation of the checksum was
bad in RAM (undetected due to lack of ECC), and then written out
redundantly and fed as bad input to the rest of the Merkle construct.
The data blocks on disk are correct, but they fail to verify against
the bad metadata.
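
A toy model of that sequence (purely illustrative; none of these names
are ZFS's):

  #include <stddef.h>
  #include <stdint.h>

  extern uint64_t cksum(const void *buf, size_t len);   /* placeholder */

  struct bp_copy { uint64_t digest; };

  /* The digest is computed once, in (possibly non-ECC) RAM.  If a bit has
   * already flipped in the in-core value, the same wrong digest is stamped
   * into every ditto copy of the pointer, so metadata redundancy can't
   * catch it. */
  void write_block(const void *data, size_t len,
                   struct bp_copy ditto[], int ncopies)
  {
      uint64_t d = cksum(data, len);  /* suppose a RAM bit flips in 'd' here */
      for (int i = 0; i < ncopies; i++)
          ditto[i].digest = d;
      /* The data block itself lands on disk intact; every later read
       * recomputes the correct checksum, finds that it differs from every
       * stored copy, and the (good) data is reported as unrecoverable. */
  }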

The complaint appears to be that ZFS makes this 'worse' because the
(independently verified) valid data blocks are inaccessible. 

Worse than what? Corrupted file data that is then accurately
checksummed and readable as valid? Accurate data that is read without
any assertion of validity, in a traditional filesystem? There's
an inherent value judgement here that will vary by judge, but in each
case it's as much a judgement on the value of ECC and reliable
hardware, and your data and time enacting various kinds of recovery,
as it is the value of ZFS.

The same circumstance could, in principle, happen due to bad CPU even
with ECC.  In either case, the value of ZFS includes that an error has
been detected you would otherwise have been unaware of, and you get a
clue that you need to fix hardware and spend time. 

--
Dan.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Idea: ZFS and on-disk ECC for blocks

2012-01-12 Thread Jim Klimov

2012-01-13 5:30, Daniel Carosone wrote:

On Thu, Jan 12, 2012 at 05:01:48PM -0800, Richard Elling wrote:

This thread is about checksums - namely, what are
our options when they do not match the data? As many
blog posts exploring ZDB have reported, there are cases
where the checksums are broken (e.g. bit rot in block
pointers, or rather in RAM while the checksum was being
calculated, so every ditto copy of the BP carries the
error) while the file data is in fact intact (extracted
from disk with ZDB or DD and compared against other copies).


Metadata is at least doubly redundant and checksummed.


The implication is that the original calculation of the checksum was
bad in RAM (undetected due to lack of ECC), and then written out
redundantly and fed as bad input to the rest of the Merkle construct.
The data blocks on disk are correct, but they fail to verify against
the bad metadata.


The implication is correct - that was the scenario I outlined :)


The complaint appears to be that ZFS makes this 'worse' because the
(independently verified) valid data blocks are inaccessible.


Also correct - a frequent woe (generally raised in
discussions about the lack of a ZFS fsck, though many of
those discussions tend to descend into flame wars and/or
detailed descriptions of how COW and the transaction engine
keep the {meta}data intact - right up until some fatal bit
rot leaves recreating the pool as the only recovery option).


Worse than what?


Worse than not having a (relatively easy-to-use) way
to tell the system which part to trust - the data or the
checksum (which returns us to the subject of automating this
with ECC and/or other checksums). My data, my checks on it -
my word should be final in case of a dispute ;)

  Corrupted file data that is then accurately

checksummed and readable as valid? Accurate data that is read without
any assertion of validity, in a traditional filesystem?


If done by the ZFS machinery itself - without my being able
to intervene - then probably not. That would make ZFS no
better than the others.

 There's

an inherent value judgement here that will vary by judge, but in each
case it's as much a judgement on the value of ECC and reliable
hardware, and your data and time enacting various kinds of recovery,
as it is the value of ZFS.


Perhaps so. I might read through a text file to see whether
it is garbage or text. I might parse or display image files
and many other formats. I might compare against another copy,
if one is available. I just don't have a mechanism to do any
of that with ZFS.

Clearly, a view of the data as it appears to be, without
checksum enforcement, would speed up the process of comparison,
eye-reading and other methods of validation.

People do that with lost+found and similar directories
on other FSes, but usually after an irreversible attempt
at recovery, correct or not...

Heck, with ZFS I might have a snapshot-like view of my
recovery options (accessible to programs like image viewers)
without changing the on-disk data until I pick a variant.

Yes, okay, ZFS did inform me of some inconsistency
(even then it is not necessarily the data that is bad)
and perhaps prompted me to fix the hardware and find
other copies of the data. Kudos to the team, really!
But then it stops there, without providing me with
options to recover whatever is on disk (at my own risk).

As a Solaris example, admins are allowed to confirm
which part of a broken UFS+SVM mirror to trust, even
if there is no quorum among the metadb replicas.

This trust in the human is common in the industry, and
allows accounting for whatever could not be handled in the
software as a one-size-fits-all solution. Also, it should be
the user's final choice to kill or save the data, not the
programmer's, with whatever cryptic intentions he had.



The same circumstance could, in principle, happen due to bad CPU even
with ECC.  In either case, the value of ZFS includes that an error has
been detected you would otherwise have been unaware of, and you get a
clue that you need to fix hardware and spend time.


True, whenever that is possible.
Hardware will always be faulty; we can only reduce the
extent of it. Not every platform (see laptops and ECC RAM)
or budget can bring that down to reasonable levels, though.

Software must be the more resilient part, I guess - as long
as its error-detection algorithm can still execute on that CPU... :)

//Jim




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Idea: ZFS and on-disk ECC for blocks

2012-01-12 Thread Jim Klimov

2012-01-13 5:01, Richard Elling wrote:

On Jan 12, 2012, at 2:34 PM, Jim Klimov wrote:



Metadata is at least doubly redundant and checksummed.

True, and this helps if it is valid in the first place
(in RAM).

 As many blog posts exploring ZDB have reported,
 there are cases where the checksums are broken ...
 but the file data is in fact intact

Can you provide links to posts that describe this failure mode?


I'll try in another message. That would take some googling
time ;)

I think the most obvious ones are the tutorials on ZDB
where the authors poisoned their VDEVs in the sectors where
the metadata was (all copies of it), so that the file data
is factually intact but not accessible due to mismatching
checksums along the metadata path.

Right now I can't think of any other posts like that,
but nature can produce the same phenomena, and I think
it may well have been discussed online. I've read too much
during the past weeks :(




For these cases bloggers asked (in vain): why is an
admin not allowed to confirm the validity of the end-user
data and have the system reconstruct (re-checksum) the
metadata for it? IMHO, that's a valid RFE.


Metadata is COW, too. Rewriting the data also rewrites the metadata.


COW does not help much against mis-targeted hardware
writes, bit rot, solar storms, etc. that break
existing on-disk data.

Random bit errors can happen anywhere - in RAM buffers
and on committed disks alike.

It is a fact (ever since the first blog posts about ZDB
and ZFS internals by Marcelo Leal, Max Bruning, Ben Rockwood
and countless other kind Samaritans) that inquisitive
users - or those repairing their systems - can determine
the DVA and ultimately the LBA addresses of their data,
extract the userdata blocks and confirm (sometimes) that
the data is intact and the problem lies in the metadata path.
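
The arithmetic those walkthroughs use is simple enough (assuming a
plain single-disk or mirrored top-level vdev and 512-byte sectors;
raidz layouts take more work):

  #include <stdint.h>

  /* A DVA offset counts bytes from the start of the vdev's allocatable
   * area, which begins after the 4 MB reserved for the two front labels
   * and the boot block. */
  #define VDEV_LABEL_START_SIZE  (4ULL << 20)   /* 0x400000 */
  #define SECTOR_SIZE            512ULL

  static inline uint64_t dva_offset_to_lba(uint64_t dva_byte_offset)
  {
      return (dva_byte_offset + VDEV_LABEL_START_SIZE) / SECTOR_SIZE;
  }

With the LBA in hand one can pull the raw sectors off with dd and compare
them against a known-good copy of the file - which is exactly the
"userdata intact, metadata path broken" verdict described above.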




While the system is scrubbing, I've been reading up on theory
and found a nice text, "Keeping Bits Safe: How Hard Can It Be?"
by David Rosenthal [1], where I stumbled upon an interesting
thought:
  The bits forming the digest are no different from the
  bits forming the data; neither is magically incorruptible.
  ...Applications need to know whether the digest has
  been changed.


Hence for ZFS, the checksum (digest) is kept in the parent metadata.


But it can still rot. And for a while both are in the
same RAM, which might lie. Probably the one good effect
is that the checksum is stored away from the data, so
*likely* both won't be scratched at once by an HDD head
crash ;) Unless they were coalesced into storage near
each other...

Hm... so if the checksum in the metadata has bit-rotted
on disk, that metadata block would fail to match its
own parent (since it is, in turn, the parent's checksummed
data), and would cause a reread of a ditto copy.

But if the checksum got broken in RAM just before the
write, then both ditto blocks carry the bad checksum value -
yet they match their metadata parents - and currently the
data is considered bad :(

Granted, the data is larger, so there is seemingly a higher
chance that it would pick up a one-bit error; but as I wrote,
metadata blocks are rewritten more often - so in fact
they could suffer errors more frequently.

Does your practice or theory prove this statement of
mine fundamentally wrong?




The condition described above can affect T10 DIF-style checksums, but not ZFS.


In our case, where the original checksum in the block
pointer could have been corrupted in the (non-ECC) RAM of
my home NAS just before it was dittoed to disk, another
checksum - a copy of this same one, or one calculated
differently - could give ZFS a means to determine whether
it is the data or one of the checksums that got corrupted
(or all of them). Of course, this is not an absolute
protection method, but it could reduce the cases where
pools have to be destroyed, recreated and restored from tape.


Nope.


Maybe so... as I elaborate below, there are indeed some
scenarios involving several checksums of the same data where
we cannot unambiguously determine the correctness of either.

Say we have a data block D in RAM, which can always fail
(more probably without ECC - the likely situation on consumer
devices like laptops or home NASes). While preparing to write,
we produce two checksums, D' and then D'', with different
algorithms (these checksum values would go into all ditto
blocks). During this time a bit flipped, or some other
undetected (non-ECC) RAM failure happened at least once.
Variants:

1) Block D got broken before both checksum calculations -
we're out of luck: the checksums would probably match, but
the data is still wrong.

2) Block D got broken between the two checksum calculations -
one of the checksums (always D'') matches the data, the other
one (always D') doesn't.

3) Block D is okay, but one of the checksums broke - one
checksum matches the data, the other doesn't. This looks
about 50% like case (2).

4) Block D is okay, but both checksums broke - the block is
considered broken even though it is not...

The idea needs to be rethought, indeed ;)

Perhaps we can checksum or ECC the checksums, or a digest
of a 

Re: [zfs-discuss] Idea: ZFS and on-disk ECC for blocks

2012-01-12 Thread Jim Klimov

2012-01-13 5:30, Daniel Carosone wrote:
Corrupted file data that is then accurately
checksummed and readable as valid?


Speaking of which, is there currently any simple way to disable
checksum validation during data reads (and not cause a kernel
panic when reading garbage under the guise of metadata)?

Some posts suggested that I try setting checksum=off on the dataset.
It doesn't work: reads of files whose blocks mismatch their checksums
still return I/O errors ;)

Thanks,
//Jim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Idea: ZFS and on-disk ECC for blocks

2012-01-11 Thread Jim Klimov

Hello all, I have a new crazy idea of the day ;)

  Some years ago an idea was proposed in one of the ZFS
developers' blogs (maybe Jeff's? sorry, I can't find a link
to it now) that went somewhat along these lines:

   Modern disks have ECC/CRC codes for each sector and
   use them to verify read-in data. If the disk fails
   to produce a sector correctly, it tries harder to read
   it and reallocates the LBA from a spare-sector region,
   if possible. This leads to some extra random I/O for
   linearly-numbered LBA sectors, as well as to wasted
   disk space for spare sectors and checksums - at least
   in comparison with the better error detection and
   redundancy of ZFS checksums. Besides, attempts to
   re-read a faulty sector may succeed or may produce
   undetected garbage, and take some time (maybe seconds)
   if the retries fail consistently. Then the block is
   marked bad and the data is lost.

   The article went on to suggest: let's get an OEM vendor
   to give us the same disks without these kludges, and we'll
   get (20%?) more platter speed and volume, better used
   by the ZFS error-detection and repair mechanisms.

I've recently had a sort of an opposite thought: yes,
ZFS redundancy is good - but also expensive in terms
of raw disk space. This is especially bad for hardware
space-constrained systems like laptops and home-NASes,
where doubling the number of HDDs (for mirrors) or
adding tens of percent of storage for raidZ is often
not practical for whatever reason.

Current ZFS checksums allow us to detect errors, but
in order for recovery to actually work, there should be
a redundant copy and/or parity block available and valid.

Hence the question: why not put ECC info into ZFS blocks?
IMHO, pluggable ECC (like pluggable compression or the
varied checksums - in this case ECC algorithms allowing
recovery of 1 or 2 bits, for example) would be cheaper in
disk space than redundancy (a few % instead of 25-50% of
disk space), and would still allow recovery from certain
errors, such as on-disk or on-wire bit rot, even in
single-disk ZFS pools.
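
As a very rough sanity check of the "few %" hope (a back-of-the-envelope
sketch, not a proposal for any particular code): a single-error-correcting
Hamming code over n data bits needs the smallest r with 2^r >= n + r + 1
parity bits. Real designs would interleave codewords or use Reed-Solomon
to survive burst errors, which costs more, so treat these numbers as a
floor only:

  #include <stdint.h>
  #include <stdio.h>

  static unsigned sec_parity_bits(uint64_t data_bits)
  {
      unsigned r = 1;
      while (((uint64_t)1 << r) < data_bits + r + 1)
          r++;
      return r;
  }

  int main(void)
  {
      uint64_t sizes[] = { 512ULL * 8, 4096ULL * 8, 131072ULL * 8 };
      for (int i = 0; i < 3; i++)     /* 512 B, 4 KB, 128 KB blocks */
          printf("%8llu data bits -> %2u parity bits (%.5f%% overhead)\n",
                 (unsigned long long)sizes[i], sec_parity_bits(sizes[i]),
                 100.0 * sec_parity_bits(sizes[i]) / sizes[i]);
      return 0;
  }

That only fixes a single flipped bit per block, though; surviving
sector-sized damage needs proportionally more parity, so the real
overhead would be higher than this floor - but plausibly still a few
percent rather than 25-50%.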

This could be an inheritable per-dataset attribute
like compression, encryption, dedup or checksum
algorithms.

Relocation of recovered faulted blocks into currently
free space is already part of ZFS, except that now it
might have to track the notion of permanently-bad block
lists and of decreasing addressable space on each leaf
VDEV. There should also be a mechanism to retest and
clear such blocks, e.g. when a faulty drive or LUN is
replaced by a new one (perhaps by DD'ing the old hardware
drive to a new one and swapping it in while the pool is
offline) - probably as a special scrub-like command to
zpool, also invoked during a scrub.

This may be combined with the wish for OEM disks that
lack hardware ECC/spare sectors in return for more
performance; although I'm not sure how good that would
be in practice - the hardware maker's in-depth
knowledge of how to retry reading initially faulty
blocks, e.g. by changing voltages or platter speeds
or whatever, may be invaluable and not replaceable
by software.

What do you think? Doable? Useful? Why not, if not? ;)

Thanks,
//Jim Klimov
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Idea: ZFS and on-disk ECC for blocks

2012-01-11 Thread Nico Williams
On Wed, Jan 11, 2012 at 9:16 AM, Jim Klimov jimkli...@cos.ru wrote:
 I've recently had a sort of an opposite thought: yes,
 ZFS redundancy is good - but also expensive in terms
 of raw disk space. This is especially bad for hardware
 space-constrained systems like laptops and home-NASes,
 where doubling the number of HDDs (for mirrors) or
 adding tens of percent of storage for raidZ is often
 not practical for whatever reason.

Redundancy through RAID-Z and mirroring is expensive for home systems
and laptops, but mostly due to the cost of SATA/SAS ports, not the
cost of the drives.  The drives are cheap, but getting an extra disk
in a laptop is either impossible or expensive.  But that doesn't mean
you can't mirror slices or use ditto blocks.  For laptops just use
ditto blocks and either zfs send or external mirror that you
attach/detach.

 Current ZFS checksums allow us to detect errors, but
 in order for recovery to actually work, there should be
 a redundant copy and/or parity block available and valid.

 Hence the question: why not put ECC info into ZFS blocks?

RAID-Zn *is* an error correction system.  But what you are asking for
is a same-device error correction method that costs less than ditto
blocks, with error correction data baked into the blkptr_t.  Are there
enough free bits left in the block pointer for error correction codes
for large blocks?  (128KB blocks, but eventually ZFS needs to support
even larger blocks, so keep that in mind.)  My guess is: no.  Error
correction data might have to get stored elsewhere.

I don't find this terribly attractive, but maybe I'm just not looking
at it the right way.  Perhaps there is a killer enterprise feature for
ECC here: stretching MTTDL in the face of a device failure in a mirror
or raid-z configuration (but if failures are typically of whole drives
rather than individual blocks, then this wouldn't help).  But without
a good answer for where to store the ECC for the largest blocks, I
don't see this happening.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Idea: ZFS and on-disk ECC for blocks

2012-01-11 Thread Jim Klimov

2012-01-11 20:40, Nico Williams wrote:

On Wed, Jan 11, 2012 at 9:16 AM, Jim Klimov jimkli...@cos.ru wrote:

I've recently had a sort of an opposite thought: yes,
ZFS redundancy is good - but also expensive in terms
of raw disk space. This is especially bad for hardware
space-constrained systems like laptops and home-NASes,
where doubling the number of HDDs (for mirrors) or
adding tens of percent of storage for raidZ is often
not practical for whatever reason.


Redundancy through RAID-Z and mirroring is expensive for home systems
and laptops, but mostly due to the cost of SATA/SAS ports, not the
cost of the drives.  The drives are cheap, but getting an extra disk
in a laptop is either impossible or expensive.  But that doesn't mean
you can't mirror slices or use ditto blocks.  For laptops just use
ditto blocks and either zfs send or external mirror that you
attach/detach.


Yes, basically that's what we do now, and it halves the
available disk space and increases latency (extra seeks) ;)

I get (and share) your concern about the ECC entry size for
larger blocks. NOTE: I don't know the ECC algorithms
deeply enough to speculate about space requirements,
except that, as used in networking and RAM, the check bits
are a modest overhead (an ECC DIMM carries 8 check bits
per 64 bits of data, for example).

I'm reading the ZFS On-disk Format PDF (dated 2006 -
are there newer releases?), and on page 15 the blkptr_t
structure has 192 bits of padding before TXG. Can't that
be used for a reasonably large ECC code?

Besides, I see that blkptr_t is 128 bytes in size.
This leaves us with some slack space in a physical
sector, which can be abused without extra costs -
(512-128) or (4096-128) bytes worth of {ECC} data.
Perhaps the padding space (near TXG entry) could
be used to specify that the blkptr_t bytes are
immediately followed by ECC bytes (and their size,
probably dependent on data block length), so that
larger on-disk block pointer blocks could be used
on legacy systems as well (using several contiguous
512 byte sectors). After successful reads from disk,
this ECC data can be discarded to save space in
ARC/L2ARC allocation (especially if every byte of
memory is ECC protected anyway).
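
Putting my own numbers side by side (just arithmetic on the figures
above - the 192 bits is what I read from the 2006 spec, and 128 bytes
is the size of a block pointer):

  #include <stdio.h>

  int main(void)
  {
      int blkptr_bytes = 128;   /* size of one block pointer         */
      int padding_bits = 192;   /* padding per the 2006 on-disk spec */

      printf("padding inside blkptr_t : %d bits (%d bytes)\n",
             padding_bits, padding_bits / 8);
      printf("slack in a 512B sector  : %d bytes\n",  512 - blkptr_bytes);
      printf("slack in a 4K sector    : %d bytes\n", 4096 - blkptr_bytes);
      /* 24 bytes of in-pointer padding could hold a whole-block single-bit
       * Hamming syndrome (about 21 bits for 128 KB), but nothing
       * burst-tolerant; that would have to live in the per-sector slack
       * or elsewhere, as Nico suggested. */
      return 0;
  }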

Even if the ideas/storage above are not practical,
perhaps ECC codes could be used for smaller blocks (i.e.
{indirect} block pointer contents and metadata might
be guaranteed to be small enough). If nothing else,
this could save mechanical seek time when a CKSUM
error is detected, as is normal for ZFS reads, but the
built-in/referring block's ECC information is enough
to repair the block. In that case we don't need to
re-request data from another disk... and we get some
more error resiliency besides ditto blocks (already
enforced for metadata) or raidz/mirrors. While it is
(barely) possible that all ditto replicas are broken,
there's a non-zero chance that at least one is
recoverable :)
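
The read path I imagine would look something like this (a sketch only -
none of these functions exist; ecc_repair() stands for whatever code
the dataset property would select):

  #include <stddef.h>
  #include <stdbool.h>

  struct blk;                                           /* opaque here  */
  extern void *read_raw(const struct blk *b);           /* placeholder  */
  extern bool  cksum_ok(const struct blk *b, void *p);  /* placeholder  */
  extern bool  ecc_repair(const struct blk *b, void *p);       /* hypothetical */
  extern void *read_ditto_or_reconstruct(const struct blk *b); /* today's path */

  void *read_block(const struct blk *b)
  {
      void *buf = read_raw(b);
      if (cksum_ok(b, buf))
          return buf;                       /* the common case                 */
      if (ecc_repair(b, buf) && cksum_ok(b, buf))
          return buf;                       /* fixed in place, no extra seek   */
      return read_ditto_or_reconstruct(b);  /* fall back to ditto/raidz/mirror */
  }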





Current ZFS checksums allow us to detect errors, but
in order for recovery to actually work, there should be
a redundant copy and/or parity block available and valid.

Hence the question: why not put ECC info into ZFS blocks?


RAID-Zn *is* an error correction system.  But what you are asking for
is a same-device error correction method that costs less than ditto
blocks, with error correction data baked into the blkptr_t.  Are there
enough free bits left in the block pointer for error correction codes
for large blocks?  (128KB blocks, but eventually ZFS needs to support
even larger blocks, so keep that in mind.)  My guess is: no.  Error
correction data might have to get stored elsewhere.

I don't find this terribly attractive, but maybe I'm just not looking
at it the right way.  Perhaps there is a killer enterprise feature for
ECC here: stretching MTTDL in the face of a device failure in a mirror
or raid-z configuration (but if failures are typically of whole drives
rather than individual blocks, then this wouldn't help).  But without
a good answer for where to store the ECC for the largest blocks, I
don't see this happening.


Well, it is often mentioned that (by Murphy's Law if nothing
else) device failures in a RAID set are often not single-device
failures. Traditional RAID5 sets tended to die while rebuilding
a dead disk onto a spare, upon hitting a read error on one of
the remaining, no-longer-redundant disks.
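
The usual back-of-the-envelope for that effect (illustrative only,
assuming a vendor-quoted rate of one unrecoverable read error per 1e14
bits and independent errors):

  #include <math.h>
  #include <stdio.h>

  /* Probability of hitting at least one unrecoverable read error (URE)
   * during a rebuild, when every bit on the surviving disks must be read. */
  int main(void)
  {
      double ure_per_bit = 1e-14;
      double disk_bytes  = 2e12;      /* 2 TB drives           */
      int    survivors   = 4;         /* e.g. a 5-disk RAID-5  */
      double bits_read   = disk_bytes * 8 * survivors;
      double p_fail      = -expm1(bits_read * log1p(-ure_per_bit));
      printf("P(rebuild hits a URE) = %.2f\n", p_fail);   /* roughly 0.47 */
      return 0;
  }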

Per-block ECC could be used in this case to recover from
bit-rot errors on the remaining live disks when RAID-Zn or
a mirror can't help, decreasing the chance that tape backup
is the only remaining recovery option...

//Jim Klimov
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss