RE: Re-using VCS hashes as SPDX File Checksums

Schuberth, Sebastian Thu, 19 May 2016 00:17:10 -0700

Hi Daniel,

if your question is whether a set of distributed source files can be unambiguously 
assigned to a specific Git commit, I agree that this is usually not possible. 
Committing exactly the same files to two different repositories will lead to 
different commit SHA1s, as some metadata (commit time etc.) will 
differ. However, the SHA1s of the blobs (i.e. file contents) will be the 
same. It’s just that these are usually not visible to the user.
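
To illustrate the difference, here is a rough Python sketch of how Git derives 
object SHA1s. The file name, author identity, timestamps and commit message are 
made up for the example, and the commit body is reduced to the minimum fields, 
so treat it as an approximation rather than a reference implementation.

```python
import hashlib

def git_object_sha1(kind: str, body: bytes) -> str:
    # Git hashes "<kind> <size>\0" followed by the object body.
    header = f"{kind} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()

def commit_sha1(tree_hex: str, timestamp: int) -> str:
    # Hypothetical author/committer; only the timestamp varies between calls.
    who = f"A U Thor <author@example.com> {timestamp} +0000"
    body = f"tree {tree_hex}\nauthor {who}\ncommitter {who}\n\nInitial import\n"
    return git_object_sha1("commit", body.encode("utf-8"))

contents = b"int main(void) { return 0; }\n"

# The blob SHA1 depends only on the file contents.
blob = git_object_sha1("blob", contents)

# A tree entry is "<mode> <name>\0" followed by the raw 20-byte blob SHA1.
tree = git_object_sha1("tree", b"100644 main.c\0" + bytes.fromhex(blob))

# Same tree, different commit times: the commit SHA1s differ.
commit_a = commit_sha1(tree, 1463555000)
commit_b = commit_sha1(tree, 1463641000)

print("blob    ", blob)
print("commit A", commit_a)
print("commit B", commit_b)
```

Both commits point at the same tree and blob, yet their own SHA1s differ 
because the timestamp is part of the hashed commit body.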


That said, what you *can* easily do is to prove (or disprove) that a specific 
file from Git went into a distribution. This is a use-case that could be 
accelerated by having the SHA1GIT algorithm specified.
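
As a sketch of that use-case in Python: given the set of blob SHA1s known to a 
repository, each distributed file can be checked by hashing it the Git way. The 
function names, file names and contents below are invented for the example, and 
the set of repository blob SHA1s is assumed to have been collected beforehand.

```python
import hashlib

def git_blob_sha1(data: bytes) -> str:
    # SHA1 over "blob <size>\0" followed by the file contents.
    return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()

def split_by_origin(dist_files, repo_blob_ids):
    """Partition distributed files by whether the repository knows their blob SHA1."""
    found, missing = [], []
    for path, data in sorted(dist_files.items()):
        if git_blob_sha1(data) in repo_blob_ids:
            found.append(path)
        else:
            missing.append(path)
    return found, missing

# Hypothetical repository contents and distribution.
repo_blob_ids = {git_blob_sha1(b"test content\n")}
dist_files = {
    "a.txt": b"test content\n",   # unchanged from the repository
    "b.txt": b"patched locally\n" # not present in the repository
}

found, missing = split_by_origin(dist_files, repo_blob_ids)
print("from repo:", found)
print("unknown:  ", missing)
```

A match shows that the exact contents exist somewhere in the repository's 
history; it says nothing about which commit the file was taken from.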

Note that rebasing commits in Git changes the commit SHA1s, but not the blob 
SHA1s, so that would not be an issue. If the author deletes the Git repository, 
you have the same issue as with a regular SHA1: the original source of the files 
on which to run your checksum tool (be it “sha1sum” or “git hash-object”) is 
gone, so you cannot verify the files.

Finally, I agree that the commit SHA1 can be seen as a version identifier. But 
what I was talking about is the blob SHA1s, not the commit SHA1s.

Regards,
Sebastian


From: [email protected] [mailto:[email protected]] On Behalf Of dmg
Sent: Wednesday, May 18, 2016 20:01
To: Schuberth, Sebastian <[email protected]>
Cc: [email protected]
Subject: Re: Re-using VCS hashes as SPDX File Checksums

Dear Sebastian,

is the commit-id verifiable from the source code?  I think it would require 
extra work to make it verifiable, and, if the distribution contains fewer files 
than the repo (which is common), then it will never be verifiable. Also, what 
happens if the author decides to rebase or simply delete the repo?

The commit-id is really a replacement of the "version" identifier.



On Wed, May 18, 2016 at 4:29 AM, Schuberth, Sebastian 
<[email protected]<mailto:[email protected]>> wrote:
Hi,

nowadays most source code is stored in some sort of VCS. Particularly popular 
in the OSS world, but also in commercial software development, is Git as a 
DVCS. Git's internal data structures are based on simple hierarchies of SHA-1 
hashes: Contents of files ("blobs") are hashed, lists of blob entries are hashed 
into "trees", trees are hashed into "commits" etc.

So basically Git already knows the hashes of all its files, and there's usually 
no need to recalculate the hashes for the purpose of creating SPDX File 
Checksum entries. The only hitch is that Git's SHA1 of a blob is *slightly* 
different from the SHA1 of purely the file contents: Git prefixes the file 
contents with "blob <size>\0" where <size> is the size of the file. The "git 
hash-object <file>" command calculates this SHA1 on the contents of <file> with 
the prefix added, and the script at [1] illustrates how Git internally performs 
the calculation.
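
For illustration, the same calculation can be sketched in a few lines of 
Python. This is not the script at [1], just an approximation of it; the 
function names are made up, and it assumes that no line-ending conversion or 
other Git filters apply to the file.

```python
import hashlib

def plain_sha1(data: bytes) -> str:
    # What "sha1sum" computes: SHA1 over the file contents only.
    return hashlib.sha1(data).hexdigest()

def git_blob_sha1(data: bytes) -> str:
    # What Git computes for a blob: SHA1 over "blob <size>\0" plus the contents.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()

empty = b""
print(plain_sha1(empty))     # da39a3ee5e6b4b0d3255bfef95601890afd80709
print(git_blob_sha1(empty))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

sample = b"test content\n"
print(git_blob_sha1(sample))  # d670460b4b4aece5915caf5c68d12f560a9fe3e4
```

The two hashes differ even for an empty file, because the Git variant always 
hashes at least the "blob 0\0" prefix.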

In order to reuse Git's SHA1 of blobs when creating an SPDX file for files 
stored in Git, I'd like to propose a new "SHA1GIT" algorithm. The hash value 
for that algorithm must match the output of "git hash-object <file>". Having 
the Git-style SHA1 also allows easier matching of a given SPDX File Checksum to 
Git repositories by doing something like "git rev-list --objects --all | grep 
<sha1git>".
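
The lookup could also be scripted, roughly as in the Python sketch below. It 
indexes the text output of "git rev-list --objects --all", where each line is 
an object SHA1 optionally followed by a path. The function names and the sample 
listing are invented for the example; the commit and tree SHA1s in it are 
placeholders, not real objects.

```python
def index_rev_list(output: str) -> dict:
    """Map each object SHA1 to the paths listed for it (commits have none)."""
    index = {}
    for line in output.splitlines():
        sha, _, path = line.partition(" ")
        if sha:
            index.setdefault(sha, []).append(path)
    return index

def paths_for(index: dict, sha1git: str) -> list:
    """Return the paths under which a SHA1GIT value appears, if any."""
    return [path for path in index.get(sha1git, []) if path]

# Made-up listing in the shape of "git rev-list --objects --all" output.
sample_output = "\n".join([
    "1111111111111111111111111111111111111111",
    "2222222222222222222222222222222222222222 docs",
    "d670460b4b4aece5915caf5c68d12f560a9fe3e4 docs/test.txt",
])

index = index_rev_list(sample_output)
print(paths_for(index, "d670460b4b4aece5915caf5c68d12f560a9fe3e4"))
print(paths_for(index, "0" * 40))
```

In practice the listing would come from running the Git command inside the 
repository, for example through Python's subprocess module, instead of a 
hard-coded string.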

Benefitting the most from the new SHA1GIT algorithm would also require making 
the existing SHA1 algorithm non-mandatory. From a file consistency point of 
view it does not really make sense to compute both ("git hash-object <file>" 
also works on files not committed to Git), and neither does it from a 
performance point of view.

Please let me know what you think about this proposal.

[1] https://github.com/sschuberth/dev-scripts/blob/master/git/git-hash-blob.sh

Regards,
Sebastian


_______________________________________________
Spdx-tech mailing list
[email protected]<mailto:[email protected]>
https://lists.spdx.org/mailman/listinfo/spdx-tech



--
--dmg

---
Daniel M. German
http://turingmachine.org

