On 01.03.2017 05:17, Greg Stein wrote:
I really like this idea.

And we could take a copy of APR's sha1 code, and rejigger it to perform *both* hashes during the same scan of the raw bytes. I would expect the time taken to extend by (say) 1.1X rather than a full 2X. The inner loop might cost a bit more, but we'd only scan the bytes once. Very handy, when you're talking about megabytes in a stream-y environment.

(and medium-term, push this dual-sha1 computation back into APR)
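
To make that concrete, here is a minimal sketch of the single-scan idea,
assuming APR's apr_sha1.h interface and the every-other-byte second
digest from the tactic quoted below. The function name dual_sha1_file
and the buffer sizes are made up for the example; it simply feeds two
SHA-1 contexts from one read loop rather than fusing the two inner loops:

/* Sketch only: one pass over the raw bytes feeds two SHA-1 contexts,
 * the second one seeing only the bytes at even offsets. */
#include <stdio.h>
#include <apr_sha1.h>

static void
dual_sha1_file(FILE *fp,
               unsigned char full_digest[APR_SHA1_DIGESTSIZE],
               unsigned char strided_digest[APR_SHA1_DIGESTSIZE])
{
  apr_sha1_ctx_t full_ctx, strided_ctx;
  unsigned char buf[8192];
  unsigned char half[4096];
  size_t n, i, j;
  size_t offset = 0;  /* global byte position, tracks parity */

  apr_sha1_init(&full_ctx);
  apr_sha1_init(&strided_ctx);

  while ((n = fread(buf, 1, sizeof(buf), fp)) > 0)
    {
      /* The full digest sees every byte of this chunk. */
      apr_sha1_update_binary(&full_ctx, buf, (unsigned int)n);

      /* The second digest sees only bytes at even global offsets. */
      for (i = (offset & 1), j = 0; i < n; i += 2)
        half[j++] = buf[i];
      apr_sha1_update_binary(&strided_ctx, half, (unsigned int)j);

      offset += n;
    }

  apr_sha1_final(full_digest, &full_ctx);
  apr_sha1_final(strided_digest, &strided_ctx);
}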


On Sun, Feb 26, 2017 at 10:08 AM, Garance A Drosehn <dro...@rpi.edu> wrote:

    On 24 Feb 2017, at 15:46, Stefan Sperling wrote:
    >
    > I believe we should prepare a new working format for 1.10.0
    > which addresses this problem. I don't see a good way of fixing
    > it without a format bump. The bright side of this is that it
    > gives us a good reason to get 1.10.0 ready ASAP.
    >
    > We can switch to a better hash algorithm with a WC format
    > bump.

    One of the previous messages mentioned that better hash
    algorithms are more expensive.  So let me mention a tactic
    that I used many years ago, when MD5 was the best digest
    algorithm that I knew of, and I didn't trust it for the
    larger files I was working with at the time:

    Instead of going with a completely different hash algorithm,
    just double-down on the one you're using.  What I did was to
    calculate one digest the standard way, and then a second one
    which summed up every-other-byte (or every 3rd byte, or ...).
    So to get a collision, not only do two files have to get the
    same digest-result for all their data, but they have to also
    get the same digest-result when exactly half the data is
    skipped over.

    (I did this a long time ago, and forget the details.  What
    I may have done for performance reasons was every-other-word,
    not every-other-byte)

    My thinking was that *any* single algorithm which processes
    all the data is going to get collisions, eventually.  But it
    will be much harder for someone to generate a duplicate file
    where there will also be a collision when summing up only
    half of the data.

    I'm not claiming this is a great cure-all solution, but just
    that it's an alternate tactic which might be interesting.
    People could create repositories with just the one digest,
    or upgrade it to use multiple digests if they have the need.

    I found a few benchmarks which suggest that sha-256 is maybe
    twice as expensive as sha-1, so calculating two sha-1 digests
    might be a reasonable alternative.


That is also known as bit-slicing. The neat thing is that
you create N (e.g. 4) interleaved sub-streams whose
checksums can be calculated *concurrently*, e.g. using
SIMD instructions. So you end up being ~3 times *faster*
than calculating the normal checksum.

Because the interleaved streams may look quite similar
(think bitmaps), you probably want to "salt" them. A simple
rotate or XOR might do - but I'm not an expert on this.
The goal is to end up with 4 reasonably independent
streams, and hence 4 reasonably independent sub-hashes.

So, the full sequence would look something like this:

* Split text T into interleaved sub-streams T1, ..., T4
* Salt them S1 = salt(T1, 1), ..., S4 = salt(T4, 4)
* Calculate sub-stream hashes using bit-sliced code
  D1, ..., D4 = sha1_4x(S1, ..., S4) = sha1(S1), ..., sha1(S4)
* Calculate the final checksum D = sha2(D1|...|D4)
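
For illustration, a rough sketch of that sequence against apr_sha1.h,
with a plain per-byte loop standing in for a real bit-sliced sha1_4x
and SHA-1 standing in for the final sha2 combine (apr-util has no
SHA-256 routine at hand); the salt constant and the name
interleaved_hash are invented for the example:

/* Sketch only: split into 4 interleaved lanes, salt each lane with a
 * per-lane XOR, hash the lanes, then hash the concatenated digests. */
#include <stddef.h>
#include <apr_sha1.h>

#define LANES 4

static void
interleaved_hash(const unsigned char *data, size_t len,
                 unsigned char combined[APR_SHA1_DIGESTSIZE])
{
  apr_sha1_ctx_t lane_ctx[LANES];
  apr_sha1_ctx_t combine_ctx;
  unsigned char lane_digest[LANES][APR_SHA1_DIGESTSIZE];
  size_t i;
  int lane;

  for (lane = 0; lane < LANES; lane++)
    apr_sha1_init(&lane_ctx[lane]);

  /* Split: byte i goes to lane (i mod 4).  Salt: a per-lane XOR so
   * that similar sub-streams diverge (any per-lane transform would do). */
  for (i = 0; i < len; i++)
    {
      unsigned char salted
        = data[i] ^ (unsigned char)(0x5c * ((i % LANES) + 1));
      apr_sha1_update_binary(&lane_ctx[i % LANES], &salted, 1);
    }

  /* A bit-sliced implementation would advance the four SHA-1 states
   * concurrently with SIMD instead of one byte at a time. */
  for (lane = 0; lane < LANES; lane++)
    apr_sha1_final(lane_digest[lane], &lane_ctx[lane]);

  /* Combine: D = hash(D1 | D2 | D3 | D4). */
  apr_sha1_init(&combine_ctx);
  for (lane = 0; lane < LANES; lane++)
    apr_sha1_update_binary(&combine_ctx, lane_digest[lane],
                           APR_SHA1_DIGESTSIZE);
  apr_sha1_final(combined, &combine_ctx);
}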

Not only would that solve the current sha1 issue, but it
would also neatly address the fact that nowadays we can
read data from disk faster than we can checksum it.

-- Stefan^2.
