Heya, chiming in as summoned, but I might be missing some context :)

On Thu, Mar 02, 2017 at 11:14:05AM +0100, Philippe Ombredanne wrote:
> > So, this way you not only get the hash before keyword expansion is
> > done, you also get the hash for free since it's already known by the
> > VCS.
> >
> > The downside is that this internal hash is specific to the VCS, so
> > it only helps to identify the same file in other repos of the same
> > VCS. But for other VCS you could go with Philippe's option 2 and
> > calculate the file hash like Git does internally [2].
>
> This is an interesting and intriguing approach :)
> As far as I know this is also more or less the approach taken by
> Stefano "zack" Zacchiroli and team for software heritage... [4] [5]
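(For readers following along: the Git-internal file hash mentioned in [2] is SHA1 computed over the blob content prefixed with a fixed header containing the content length. A minimal Python sketch, purely illustrative and not taken from any of the projects discussed here:)

```python
import hashlib

def git_blob_sha1(content: bytes) -> str:
    """Git's internal blob hash: SHA1 over b"blob <length>\\0" + content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

data = b"hello world\n"
print(git_blob_sha1(data))             # same value `git hash-object` prints for these bytes
print(hashlib.sha1(data).hexdigest())  # plain SHA1 differs, since no header is prefixed
```

Because of the length baked into the header, two blobs of different sizes can never share a Git-style hash even if their plain SHA1s collided.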
Yes and no. In Software Heritage we currently compute 3 different hashes for each content blob we store: pure SHA1, "Git-style" SHA1 (which is in essence a "salted" SHA1, where you prefix a fixed header and the length to the blob content before computing SHA1), and SHA256. Having 3 hashes, we can do cross-checks to detect collisions in any one of them. To go unnoticed, a collision would need to happen *simultaneously* on all 3 hashes, and it must also not change the length of the content, as we store that too.

So while we can easily relate to the intrinsic Git-style checksums (because we compute them), we're not actually *relying* on them. It's just convenient to also have them, because many people out there do use those hashes, and can then quickly find content in our archive without having to recompute other hashes, which in some cases they simply can't, e.g., if they have the hashes but not the corresponding content. Of course, if you're doing lookups based on just *one* checksum at a time, you are prone to collisions on that checksum alone, but that is arguably the fault of the client asking, not of the archive answering. Finally, as an additional safeguard, we don't blindly trust the checksums computed by Git either: we recompute them before injecting objects into our archive.

In the wake of shattered.io, we're keeping an eye on the discussions in the Git community, to understand their future checksum choice (SHA3? Blake2? and at what length?), so that we can proactively start computing it for all our content. But again, that's just for the convenience of staying somewhat "compatible" with the most popular VCS out there; we have other means of collision detection.

Not sure if any of this is useful for your discussion, but feel free to ask for more details if it is relevant. And thanks to Philippe for getting me in the loop!

Cheers.
--
Stefano Zacchiroli . [email protected] . upsilon.cc/zack
Computer Science Professor . CTO Software Heritage
Former Debian Project Leader . OSI Board Director
« the first rule of tautology club is the first rule of tautology club »

_______________________________________________
Spdx-tech mailing list
[email protected]
https://lists.spdx.org/mailman/listinfo/spdx-tech
