Hi Yev, I see how making the SHA1 algorithm non-mandatory would be a breaking change, and that we'd like to avoid that. But maybe we could at least allow SHA1GIT as an additional algorithm and add it to the spec.

WRT the use-case you're asking for: It's all about performance. In our case scanners actually *do* scan Git checkouts most of the time, as dependencies (be it build time or runtime) are usually included as Git submodules. When scanning these files, it does not make much sense to force the scanner to calculate the SHA1 on each file (in order to create valid SPDX) if the SHA1GIT is already known. However, I have to admit that getting the blob SHA1 for a given file name is a rather slow operation in Git, and for single small files (which are not uncommon among source code files) it might actually be faster to calculate the SHA1 instead of looking up the known SHA1GIT.

Finally, there's also the "reverse" use-case: Suppose you have an SPDX file with a bunch of File Checksums given, and you'd like to know which candidate Git commits these files can originate from. If only the SHA1s are given, you'd have to iterate over all eligible commits in your Git repository, check out the files, and calculate the SHA1 on them to see whether there's a match. With the SHA1GIT, on the other hand, you could directly search Git's object database to find the trees / commits that contain the given blobs.

I agree it probably is an edge-case, but maybe still enough reason to at least *allow* SHA1GIT as a File Checksum algorithm.

Regards,
Sebastian

> -----Original Message-----
> From: Yev Bronshteyn [mailto:[email protected]]
> Sent: Wednesday, May 18, 2016 16:34
> To: Schuberth, Sebastian <[email protected]>; spdx-[email protected]
> Subject: Re: Re-using VCS hashes as SPDX File Checksums
>
> Sebastian,
>
> Usually SPDX with files is produced by tools that have scanned the entire file contents of the project. These tools may not always scan git checkouts, because they’d also want to include dependencies pulled in by build tools.
>
> Making the existing sha1 non-mandatory would be a breaking change – consumers of prior versions of documents may rely on Sha1 being present.
>
> It should be pointed out that in SPDX 2.1, files themselves are not required, so if you’re a developer building up a bill of materials by hand or using an “SPDX Editor” rather than a file scanner, chances are, you won’t be including files in the first place.
>
> Do you have a particular use case in which using sha1 sums to identify files would be particularly difficult?
>
> Yev
>
> On 5/18/16, 7:29 AM, "[email protected] on behalf of Schuberth, Sebastian" <[email protected] on behalf of [email protected]> wrote:
>
> >Hi,
> >
> >nowadays most source code is stored in some sort of VCS. Particularly popular in the OSS world, but also in commercial software development, is Git as a DVCS. Git's internal data structures are based on simple hierarchies of SHA-1 hashes: Contents of files ("blobs") are hashed, entries of blobs are hashed to "trees", trees are hashed to "commits" etc.
> >
> >So basically Git already knows the hashes of all its files, and there's usually no need to recalculate the hashes for the purpose of creating SPDX File Checksum entries. The only hitch is that Git's SHA1 of a blob is *slightly* different from the SHA1 of purely the file contents: Git prefixes the file contents with "blob <size>\0" where <size> is the size of the file. The "git hash-object <file>" command calculates this SHA1 on the contents of <file> with the prefix added, and the script at [1] illustrates how Git internally performs the calculation.
> >
> >In order to reuse Git's SHA1 of blobs when creating an SPDX file for files stored in Git, I'd like to propose a new "SHA1GIT" algorithm. The hash value for that algorithm must match the output of "git hash-object <file>". Having the Git-style SHA1 also allows easier matching of a given SPDX File Checksum to Git repositories by doing something like "git rev-list --objects --all | grep <sha1git>".
> >
> >Benefitting from the new SHA1GIT algorithm the most would also require making the existing SHA1 algorithm non-mandatory. From a file consistency point of view it does not really make sense to compute both ("git hash-object <file>" also works on files not committed to Git), and neither does it from a performance point of view.
> >
> >Please let me know what you think about this proposal.
> >
> >[1] https://github.com/sschuberth/dev-scripts/blob/master/git/git-hash-blob.sh
> >
> >Regards,
> >Sebastian
> >
> >_______________________________________________
> >Spdx-tech mailing list
> >[email protected]
> >https://lists.spdx.org/mailman/listinfo/spdx-tech
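
For illustration, the difference between the plain SHA1 and the Git-style blob hash described in the quoted proposal can be sketched in Python. This is only a minimal sketch, not the script referenced at [1]: the function names sha1_plain and sha1_git are made up for this example, and it assumes the file contents are hashed as raw bytes, i.e. without any line-ending conversion or other filters that Git may be configured to apply.

```python
import hashlib


def sha1_plain(data: bytes) -> str:
    """SHA1 over the file contents only (the existing SPDX SHA1 checksum)."""
    return hashlib.sha1(data).hexdigest()


def sha1_git(data: bytes) -> str:
    """SHA1 over 'blob <size>\\0' followed by the contents (Git's blob hash).

    <size> is the length of the contents in bytes, written in decimal.
    Hypothetical helper for illustration; assumes no Git filters apply.
    """
    header = f"blob {len(data)}\0".encode("ascii")
    return hashlib.sha1(header + data).hexdigest()


if __name__ == "__main__":
    contents = b"hello world\n"
    print("SHA1:   ", sha1_plain(contents))
    print("SHA1GIT:", sha1_git(contents))
```

For empty contents, sha1_git returns e69de29bb2d1d6434b8b29ae775ad8c2e48c5391, which is the well-known ID of Git's empty blob, while sha1_plain returns da39a3ee5e6b4b0d3255bfef95601890afd80709. The two values differ for every input, which is why the proposal treats SHA1GIT as a separate algorithm rather than a drop-in replacement for SHA1.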
