Although I don't know for sure that most such errors are in fact single-bit in
nature, I can only surmise that they most likely are, going statistically
undetected otherwise. With the exception of error-corrected memory systems and
checksummed communication channels, each transition of data between hardware
interfaces at ever-increasing clock rates correspondingly increases the
probability of an otherwise undetectable soft single-bit error being injected
at these boundaries. Although the probability of any one occurrence is small
enough that it is not easily detected or classified as a hardware failure,
such errors can nonetheless occur often enough that over the course of
days/weeks/years and trillions of bits they will be observable, and should be
expected and planned for within reason.
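As a rough back-of-the-envelope sketch of that argument: even a tiny per-bit soft-error probability yields a non-negligible expected number of flips once trillions of bits are in motion. The error rate and transfer volume below are assumed figures chosen purely for illustration, not measurements.

```python
# Expected number of undetected single-bit flips, treating each bit
# transfer as an independent trial (so the expectation is simply
# bits * per-bit error rate).
def expected_bit_errors(bits_transferred, per_bit_error_rate):
    """Expected count of single-bit errors over a given volume of bits."""
    return bits_transferred * per_bit_error_rate

# Illustrative scenario (assumed numbers): one year of moving 1 TB/day
# across an interface with a hypothetical 1e-15 per-bit soft-error rate.
bits = 365 * (10**12 * 8)   # ~2.9e15 bits in a year
rate = 1e-15                # assumed per-bit error rate, for illustration
print(expected_bit_errors(bits, rate))  # roughly 3 expected flips/year
```

The point is not the specific numbers but the shape of the result: rates far too small to flag any single interface as faulty still produce errors you should plan to observe over time.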

Utilizing a strong error-correcting code, in combination with or in lieu of a
strong hash, would seem like a good way to more strongly warrant that data's
representation in memory at the time of its computation is resilient to
transmission and subsequent retrieval. But I suspect that through time, as
technology continues to push clock rates and corresponding data pool sizes
ever higher, some form of uniform data-integrity mechanism will need to be
incorporated into all the processing and communications-interface data paths
within systems, in order to improve data's resilience to transmission and
processing errors, however statistically small the risk to any single bit.
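To put a number on the ECC idea: a sketch of how many check bits a Hamming SECDED code would need for a block of n data bits. For a 1,048,576-bit (128 KB) block this lands at 22 bits, consistent with the "20-odd bits" figure in the quoted message below; the arithmetic here is a standard Hamming-bound calculation, not anything specific to ZFS's actual checksum layout.

```python
# Sketch: check bits required for Hamming SECDED over n data bits.
# A Hamming code needs the smallest r with 2**r >= n + r + 1 to locate
# any single flipped bit; one extra overall parity bit upgrades it to
# SECDED (single-error-correct, double-error-detect).
def secded_check_bits(data_bits):
    """Smallest r with 2**r >= data_bits + r + 1, plus one overall parity bit."""
    r = 0
    while 2**r < data_bits + r + 1:
        r += 1
    return r + 1

print(secded_check_bits(64))       # prints 8  (the classic 72,64 ECC-DIMM code)
print(secded_check_bits(1048576))  # prints 22 (Hamming r=21 + overall parity)
```

So carving ~22 bits of a 128-bit checksum out for correction would still leave over 100 bits of detection strength.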

> Anton B. Rang wrote:
> > That brings up another interesting idea.
> >
> > ZFS currently uses a 128-bit checksum for blocks of up to 1048576 bits.
> >
> > If 20-odd bits of that were a Hamming code, you'd have something
> > slightly stronger than SECDED, and ZFS could correct any single-bit
> > errors encountered.
> 
> Yes.  But I'm not convinced that we will see single bit errors, since
> there is already a large amount of single-bit-error detection and
> (often) correction capability in modern systems.  It seems that when
> we lose a block of data, we lose more than a single bit.
> 
> It should be relatively easy to add code to the current protection
> schemes which will compare a bad block to a reconstructed, good block
> and deliver this information for us. I'll add an RFE.
>  -- richard
 
 
This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
