Heya, chiming in as summoned, but I might be missing some context :)

On Thu, Mar 02, 2017 at 11:14:05AM +0100, Philippe Ombredanne wrote:
> > So, this way you not only get the hash before keyword expansion is
> > done, you also get the hash for free since it's already known by the
> > VCS.
> >
> > The downside is that this internal hash is specific to the VCS, so
> > it only helps to identify the same file in other repos of the same
> > VCS. But for other VCS you could go with Philippe's option 2 and
> > calculate the file hash like Git does internally [2].
>
> This is an interesting and intriguing approach :)
> As far as I know this is also more or less the approach taken by
> Stefano "zack" Zacchiroli and team for software heritage... [4] [5]
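(For readers following along: the Git-internal file hash mentioned in [2] is SHA1 computed over the blob content prefixed with a fixed header containing the content length. A minimal Python sketch, purely illustrative and not taken from any of the projects discussed here:)

```python
import hashlib

def git_blob_sha1(content: bytes) -> str:
    """Git's internal blob hash: SHA1 over b"blob <length>\\0" + content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

data = b"hello world\n"
print(git_blob_sha1(data))             # same value `git hash-object` prints for these bytes
print(hashlib.sha1(data).hexdigest())  # plain SHA1 differs, since no header is prefixed
```

Because of the length baked into the header, two blobs of different sizes can never share a Git-style hash even if their plain SHA1s collided.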
Yes and no. In Software Heritage we currently compute 3 different hashes for each content blob we store: pure SHA1, "Git-style" SHA1 (which is in essence a "salted" SHA1, where you prefix a fixed header and the length to the blob content before computing SHA1), and SHA256. Having 3 hashes, we can do cross-checks to detect collisions in any one of them. To go unnoticed, a collision would need to happen *simultaneously* on all 3 hashes, and it must also not change the length of the content, as we store that too.

So while we can easily relate to the intrinsic Git-style checksums (because we compute them), we're not actually *relying* on them. It's just convenient to also have them, because many people out there do use those hashes, and can then quickly find content in our archive without having to recompute other hashes, which in some cases they simply can't, e.g., if they have the hashes but not the corresponding content. Of course, if you're doing lookups based on just *one* checksum at a time, you are prone to collisions on that checksum alone, but that is arguably the fault of the client asking, not of the archive answering. Finally, as an additional safeguard, we don't blindly trust the checksums computed by Git either: we recompute them before injecting objects into our archive.

In the wake of shattered.io, we're keeping an eye on the discussions in the Git community, to understand their future checksum choice (SHA3? Blake2? and at what length?), so that we can proactively start computing it for all our content. But again, that's just for the convenience of staying somewhat "compatible" with the most popular VCS out there; we have other means of collision detection.

Not sure if any of this is useful for your discussion, but feel free to ask for more details if it is relevant. And thanks to Philippe for getting me in the loop!

Cheers.
--
Stefano Zacchiroli . [email protected] . upsilon.cc/zack
Computer Science Professor . CTO Software Heritage
Former Debian Project Leader . OSI Board Director
« the first rule of tautology club is the first rule of tautology club »

_______________________________________________
Spdx-tech mailing list
[email protected]
https://lists.spdx.org/mailman/listinfo/spdx-tech
