Re: [zfs-discuss] Idea: ZFS and on-disk ECC for blocks

Jim Klimov Wed, 11 Jan 2012 10:10:17 -0800

2012-01-11 20:40, Nico Williams wrote:

On Wed, Jan 11, 2012 at 9:16 AM, Jim Klimov<jimkli...@cos.ru>  wrote:

I've recently had a sort of an opposite thought: yes,
ZFS redundancy is good - but also expensive in terms
of raw disk space. This is especially bad for hardware
space-constrained systems like laptops and home-NASes,
where doubling the number of HDDs (for mirrors) or
adding tens of percent of storage for raidZ is often
not practical for whatever reason.


Redundancy through RAID-Z and mirroring is expensive for home systems
and laptops, but mostly due to the cost of SATA/SAS ports, not the
cost of the drives.  The drives are cheap, but getting an extra disk
in a laptop is either impossible or expensive.  But that doesn't mean
you can't mirror slices or use ditto blocks.  For laptops just use
ditto blocks and either zfs send or external mirror that you
attach/detach.


Yes, basically that's what we do now, and it halves the
available disk space and increases latency (extra seeks) ;)

I get (and share) your concern about ECC entry size for
larger blocks. NOTE: I don't know the ECC algorithms
deeply enough to speculate about space requirements,
except that as they are used in networking/RAM, an ECC
correction code for 4-8 bits of userdata is 1-2 bits long.
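
As a rough check of the check-bit counts, here is a minimal sketch
assuming a plain Hamming code (single-error correction), optionally
extended with one overall parity bit (SECDED). The function names are
made up for illustration; other ECC families have different overheads.
By this rule 64 data bits need 7+1 check bits, which matches the
72-bit words used by ECC memory.

```python
def hamming_parity_bits(data_bits):
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r


def secded_parity_bits(data_bits):
    """Hamming check bits plus one overall parity bit (double-error detection)."""
    return hamming_parity_bits(data_bits) + 1


if __name__ == "__main__":
    # Small words, an ECC-RAM word, and a 128 KiB ZFS block expressed in bits.
    for bits in (4, 8, 64, 128 * 1024 * 8):
        print(bits, hamming_parity_bits(bits), secded_parity_bits(bits))
```

Note that such a code only repairs a single flipped bit per protected
unit, however large the unit is.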

I'm reading the "ZFS On-disk Format" PDF (dated 2006 -
are there newer releases?), and on page 15 the blkptr_t
structure has 192 bits of padding before TXG. Can't that
be used for a reasonably large ECC code?

Besides, I see that blkptr_t is 128 bytes in size.
This leaves us with some slack space in a physical
sector, which can be "abused" without extra costs -
(512-128) or (4096-128) bytes worth of {ECC} data.
Perhaps the padding space (near TXG entry) could
be used to specify that the blkptr_t bytes are
immediately followed by ECC bytes (and their size,
probably dependent on data block length), so that
larger on-disk block pointer blocks could be used
on legacy systems as well (using several contiguous
512 byte sectors). After successful reads from disk,
this ECC data can be discarded to save space in
ARC/L2ARC allocation (especially if every byte of
memory is ECC protected anyway).
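
To put numbers on the slack space, a small sketch of the arithmetic.
It takes the premise above at face value (one 128-byte blkptr_t
followed by ECC bytes within one physical sector) and compares the
slack against a 128KB data block; the constant and function names
are only for illustration.

```python
BLKPTR_SIZE = 128          # bytes, size of blkptr_t as stated above
BLOCK_SIZE = 128 * 1024    # bytes, a 128KB data block


def slack_bytes(sector_size, blkptr_size=BLKPTR_SIZE):
    """Bytes left in one physical sector after a single block pointer."""
    return sector_size - blkptr_size


def slack_ratio(sector_size, block_size=BLOCK_SIZE):
    """Slack space as a fraction of the data block it would protect."""
    return slack_bytes(sector_size) / block_size


if __name__ == "__main__":
    for sector in (512, 4096):
        print(sector, slack_bytes(sector), f"{100 * slack_ratio(sector):.2f}%")
```

So the ECC budget would be 384 bytes per 512-byte sector or 3968 bytes
per 4KB sector, i.e. roughly 0.3% or 3% of a 128KB block.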

Even if the storage ideas above are not practical,
perhaps ECC codes can be used for smaller blocks (i.e.
{indirect} block pointer contents and metadata might
be "guaranteed" to be small enough). If nothing else,
this could save mechanical seek times if a CKSUM
error is detected as is normal for ZFS reads, but a
built-in/referring block's ECC code information is
enough to repair this block. In this case we don't
need to re-request data from another disk... and we
have some more error-resiliency beside ditto blocks
(already enforced for metadata) or raidz/mirrors.
While it is (barely) possible that all ditto replicas
are broken, there's a non-zero chance that at least
one is recoverable :)
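
The read path I have in mind could be sketched like this. It is a toy
under loud assumptions: a bit-list Hamming code stands in for whatever
ECC would really be used, SHA-256 stands in for the block checksum,
and every function name here is hypothetical (none of this is ZFS
code). The point is only the order of steps: verify the checksum, try
a local ECC repair, re-verify, and only then go to another copy.

```python
import hashlib


def syndrome(code):
    """XOR of the 1-based positions of all set bits; 0 for a valid codeword."""
    s = 0
    for pos, bit in enumerate(code):
        if bit:
            s ^= pos
    return s


def encode(data_bits):
    """Hamming-encode: data at non-power-of-two positions, parity at powers of two."""
    n_par = 0
    while (1 << n_par) < len(data_bits) + n_par + 1:
        n_par += 1
    total = len(data_bits) + n_par
    code = [0] * (total + 1)              # index 0 is unused padding
    it = iter(data_bits)
    for pos in range(1, total + 1):
        if pos & (pos - 1):               # not a power of two: data position
            code[pos] = next(it)
    syn = syndrome(code)
    for k in range(n_par):                # set parity bits so the syndrome becomes 0
        code[1 << k] = (syn >> k) & 1
    return code


def extract(code):
    """Return only the data bits (positions that are not 0 or a power of two)."""
    return [bit for pos, bit in enumerate(code) if pos & (pos - 1)]


def digest(bits):
    """Stand-in for the checksum kept in the parent block pointer."""
    return hashlib.sha256(bytes(bits)).digest()


def read_block(stored_code, expected_digest):
    """Checksum first, then one ECC repair attempt, then give up locally."""
    data = extract(stored_code)
    if digest(data) == expected_digest:
        return data, "clean"
    s = syndrome(stored_code)
    if 0 < s < len(stored_code):          # syndrome names the suspect bit position
        fixed = list(stored_code)
        fixed[s] ^= 1
        data = extract(fixed)
        if digest(data) == expected_digest:
            return data, "repaired"
    return None, "need redundant copy"
```

The checksum stays the final judge: a repair is accepted only if the
repaired data matches it, so a miscorrection (e.g. after two flipped
bits) falls through to the usual ditto/mirror/raidz recovery.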

Current ZFS checksums allow us to detect errors, but
in order for recovery to actually work, there should be
a redundant copy and/or parity block available and valid.

Hence the question: why not put ECC info into ZFS blocks?


RAID-Zn *is* an error correction system.  But what you are asking for
is a same-device error correction method that costs less than ditto
blocks, with error correction data baked into the blkptr_t.  Are there
enough free bits left in the block pointer for error correction codes
for large blocks?  (128KB blocks, but eventually ZFS needs to support
even larger blocks, so keep that in mind.)  My guess is: no.  Error
correction data might have to get stored elsewhere.

I don't find this terribly attractive, but maybe I'm just not looking
at it the right way.  Perhaps there is a killer enterprise feature for
ECC here: stretching MTTDL in the face of a device failure in a mirror
or raid-z configuration (but if failures are typically of whole drives
rather than individual blocks, then this wouldn't help).  But without
a good answer for where to store the ECC for the largest blocks, I
don't see this happening.


Well, it is often mentioned that (by Murphy's Law if nothing
else) device failures in RAID often are not single-device
failures. So traditional RAID5 arrays tended to die while
rebuilding a dead disk onto a spare, when an error was found
on one of the surviving disks that no longer had redundancy.

Per-block ECC could be used in this case to recover from
bit-rot errors on the surviving disks when RAID-Zn or
mirroring can't help, decreasing the chance that tape backup
is the only recovery option remaining...

//Jim Klimov
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
