On Mon, Apr 13, 2009 at 06:34:36PM -0700, paul wrote:
> Nice analysis. I understand you've found, on the
> processor/memory-subsystem architecture you've experimented with, that
> if the data being summed isn't already cached, its access will likely
> dominate its use in calculations:
> 
> - Although this makes sense (presuming no mechanism may be used to
> pre-cache data so that it is only check-summed once cache-resident),
> as processor/memory system architectures are still evolving, it may be
> prudent to account for that possibility; for example, future use of
> streaming vector/memory units, where multiple data streams may be
> summed in parallel, and/or checksum computation integrated into the
> I/O channel such that a block's checksum is computed as the data is
> being retrieved, without processor intervention, may be desirable.

True;  my experience with the performance was that any fix to fletcher2 to
make it stronger slowed things down by ~2-2.5x *when the data was in-cache*,
which is most of the way to fletcher4 (~3-3.5x slower).  And that was with
hand-optimized assembly, because the compiler's inline assembly has bugs.

(that said, the bugs are being fixed; even with them fixed, the algorithms
will be slower by a significant amount)

> - With respect to fletcher4: as the first two running sums
> (representing a traditional Fletcher checksum implementation) are
> sufficient to yield a Hamming distance of at least 3 (meaning all
> 1- and 2-bit errors are detectable, and thereby 1-bit errors also
> potentially correctable if ever desired) for block sizes somewhat
> larger than 256KB, which is larger than ZFS requires; I can't help but
> wonder if this may be sufficient, rather than worrying about trying to
> maintain two more running sums, each dependent on the previous,
> without potential overflow (or even potential implementation issues
> which may impede performance in the future: maintaining 4 sums leaves
> the latter 3 each dependent on its predecessor's sum, rather than
> having to schedule only 1 such dependency when maintaining just
> 2 sums, and presuming support for potentially more efficient data
> access on future platform implementations)?

I've verified by exhaustive search that fletcher4 has a Hamming distance
of at least 6 for 128-byte buffers, and my intuition is that it is much
higher.  The i*(i+1)/2 and i*(i+1)*(i+2)/6 weights in the third and fourth
accumulators make it much harder for cancellations to line up correctly,
compared to the first two.

As an aside, the way to model undetectable errors is to split the errors into
bit clears and bit sets, and make two mostly-zero data buffers:

        f_0..(n-1), with the "bit clears" bits set, and

        g_0..(n-1), with the "bit sets" bits set.

Since the fletcher checksums are linear in the data being checksummed, if
the two checksums are equal, then the error will be undetected.  You can
further refine this by noting:

        1. Since all words before the first error word are zero, they will
           have a checksum of zero and no effect on the result.

        2. If the error is undetected, the two checksums will be equal after
           the final error word is processed, and will remain equal through
           the end of the buffer.  If the error is detected, the checksums
           will not be equal after the final error word is processed, and
           will remain unequal throughout the remaining processing.

Since we only care whether or not the checksums will be equal, we can
shorten the buffers to just big enough to cover the errors without any loss of
generality.
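
To make the linearity argument concrete, a quick sketch (using a single
a/b lane of a fletcher2-style sum over 64-bit words; the real fletcher2
runs two interleaved lanes, but each lane behaves the same way):

```python
M = 1 << 64

def fletcher2_lane(words):
    """One a/b lane of a fletcher2-style checksum over 64-bit words."""
    a = b = 0
    for w in words:
        a = (a + w) % M
        b = (b + a) % M
    return (a, b)

# A corrupted buffer is data - f + g per word, where f holds the
# "bit clears" (a subset of data's set bits) and g holds the "bit sets"
# (a subset of data's clear bits), so the arithmetic equals the XOR.
data = [0xdeadbeef, 0xcafe, 0x1234]
f = [0x8, 0, 0]        # clear bit 3 of word 0
g = [0, 0, 0x8000]     # set bit 15 of word 2
corrupted = [(d - fi + gi) % M for d, fi, gi in zip(data, f, g)]

# Linearity: checksum(corrupted) = checksum(data) - checksum(f) + checksum(g)
lhs = fletcher2_lane(corrupted)
rhs = tuple((x - y + z) % M for x, y, z in
            zip(fletcher2_lane(data), fletcher2_lane(f), fletcher2_lane(g)))
assert lhs == rhs
# Hence the error is undetected exactly when checksum(f) == checksum(g).
```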

For fletcher2, it's fairly easy to show that at least three different error
words are required, and to derive the precise position and value
constraints.  For fletcher4, you need at least five different error words,
and the math quickly overwhelms my ability to do it with pen and paper.
Without any access to Maple/Mathematica/etc., I'm not going to make much
more progress.
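
Spelling out the constraint system (my notation, per lane; f and g are the
shortened clear/set buffers of length n from above, with i running 0..n-1):
an undetected error requires the first two congruences for fletcher2, and
all four for fletcher4:

```latex
\sum_i f_i \equiv \sum_i g_i, \qquad
\sum_i (n-i)\,f_i \equiv \sum_i (n-i)\,g_i, \qquad
\sum_i \binom{n-i+1}{2} f_i \equiv \sum_i \binom{n-i+1}{2} g_i, \qquad
\sum_i \binom{n-i+2}{3} f_i \equiv \sum_i \binom{n-i+2}{3} g_i
\pmod{2^{64}}
```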

> (As a caveat, it's not clear to me how much more resilient the
> checksum is made by maintaining 2 more sums than traditionally used;
> so when I previously observed that fletcher4 was fine, that was based
> on its first two terms not overflowing for the data sizes required,
> without regard to the possibility that the upper two terms may
> overflow, as I didn't consider them significant, or rather didn't
> rely on their significance.)

Understood.

> - However, in conclusion: although without pre-fetching, streaming,
> and/or channel integration, fletcher2 (corrected to use 32b data
> with 64b sums) may be no faster than the current implementation of
> fletcher4 (which is possibly not significantly more resilient than a
> corrected fletcher2, but which may be refined to ensure the upper
> two terms also do not overflow, thereby improving its resilience
> to some degree); I personally suspect it's best BOTH to refine
> fletcher4 to ensure the upper two sums do not overflow, by wrapping
> the upper bits of the upper two sums every N (pick your N) iterations,
> AND to fix fletcher2, because when fixed it has the potential to be
> significantly faster than the fixed fletcher4 on future or other
> platforms leveraging mechanisms to pre-fetch data and/or synchronize
> computations with access. (Both being relatively easy fixes, so why
> not?)

It's mainly a question of defaults and not giving dangerous options.  If
fletcher4 is around the same speed as our fixed fletcher2, and provides
better fault detection (and possibly correction), why offer them both?

Perhaps what we should do is:

        rename fletcher2 and fletcher4 to fletcher2-old and fletcher4-old.
        introduce new "fletcher2" and "fletcher4" algorithms, fixed as above.
        repoint "checksum=on" to "fletcher4".

So old data will continue to be checked using the old algorithms, new data
will use the new fletcher2 and 4, and we'll default to fletcher4.
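
For concreteness, the fixed fletcher2 along the lines paul describes
(32-bit input words, 64-bit sums) would look roughly like this -- a sketch
only, in Python rather than the C/assembly the real implementation would
use:

```python
import struct

def fletcher2_fixed(data: bytes):
    """Sketch of a 'fixed' fletcher2: 32-bit input words, 64-bit sums.

    For ZFS-sized blocks (<= 128K, i.e. <= 32768 words), the worst-case
    second-order sum is about 2**32 * 32768 * 32769 / 2 < 2**62, so the
    64-bit accumulators cannot overflow and no mid-stream wrapping is
    needed.
    """
    assert len(data) % 4 == 0
    a = b = 0
    for (w,) in struct.iter_unpack('<I', data):  # 32-bit LE words
        a += w
        b += a
    return (a & (1 << 64) - 1, b & (1 << 64) - 1)  # model 64-bit registers
```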

Thoughts?

- jonathan

