Hi Sebastian,
Since we already have SHA256 and MD5 as optional checksums, adding
SHA1GIT is definitely worth considering.
I've opened https://bugs.linuxfoundation.org/show_bug.cgi?id=1356 to
track this.
Kate
On Thu, May 19, 2016 at 2:40 AM, Schuberth, Sebastian <
[email protected]> wrote:
> Hi Yev,
>
> I see how making the SHA1 algorithm non-mandatory would be a breaking
> change, and that we'd like to avoid that. But maybe we could at least allow
> SHA1GIT as an additional algorithm and add it to the spec.
>
> WRT the use-case you're asking for: It's all about performance. In our
> case scanners actually *do* scan Git checkouts most of the time, as
> dependencies (be it build time or runtime) are usually included as Git
> submodules. When scanning these files, it does not make much sense to force
> the scanner to calculate the SHA1 on each file (in order to create valid
> SPDX) if the SHA1GIT is already known. However, I have to admit that
> getting the blob SHA1 for a given file name is a rather slow operation in
> Git, and for single small files (which is not uncommon for source code
> files) it might actually be faster to calculate the SHA1 instead of looking
> up the known SHA1GIT.
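
For a scanner that reads each file anyway, the cost of emitting both checksums can be kept close to that of one. A minimal sketch in Python, assuming the file is read in chunks and fed to two hashers in a single pass; the function name `both_checksums` and the chunk size are made up for illustration:

```python
import hashlib
import os


def both_checksums(path, chunk_size=65536):
    """Return (plain SHA1, Git-style blob SHA1) for a file, reading it once."""
    plain = hashlib.sha1()
    with open(path, "rb") as f:
        # Git's blob header needs the size up front, so take it from the open file.
        size = os.fstat(f.fileno()).st_size
        git = hashlib.sha1(b"blob %d\x00" % size)
        for chunk in iter(lambda: f.read(chunk_size), b""):
            plain.update(chunk)
            git.update(chunk)
    return plain.hexdigest(), git.hexdigest()
```

This only addresses the I/O cost; it says nothing about how expensive a lookup of an already known blob ID in Git would be by comparison.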
>
> Finally, there's also the "reverse" use-case: Suppose you have an SPDX
> file with a bunch of File Checksums given, and you'd like to know which are
> the candidate Git commits these files can originate from. If only the SHA1s
> are given, you'd have to iterate over all eligible commits in your Git
> repository, check out the files, and calculate the SHA1 on them to see
> whether there's a match. With the SHA1GIT on the other hand, you could
> directly search Git's object database to find the trees / commits that
> contain the given blobs.
>
> I agree it probably is an edge-case, but maybe still enough reason to at
> least *allow* SHA1GIT as a File Checksum algorithm.
>
> Regards,
> Sebastian
>
>
> > -----Original Message-----
> > From: Yev Bronshteyn [mailto:[email protected]]
> > Sent: Wednesday, May 18, 2016 16:34
> > To: Schuberth, Sebastian <[email protected]>; spdx-
> > [email protected]
> > Subject: Re: Re-using VCS hashes as SPDX File Checksums
> >
> > Sebastian,
> >
> > Usually SPDX with files is produced by tools that have scanned the entire
> > file contents of the project. These tools may not always scan git
> > checkouts, because they’d also want to include dependencies pulled in by
> > build tools.
> >
> > Making the existing sha1 non-mandatory would be a breaking change –
> > consumers of prior versions of documents may rely on Sha1 being present.
> >
> > It should be pointed out that in SPDX 2.1, files themselves are not
> > required, so if you’re a developer building up a bill of materials by hand
> > or using an “SPDX Editor” rather than a file scanner, chances are, you
> > won’t be including files in the first place.
> >
> > Do you have a particular use case in which using sha1 sums to identify
> > files would be particularly difficult?
> >
> > Yev
> >
> > On 5/18/16, 7:29 AM, "[email protected] on behalf of
> > Schuberth, Sebastian" <[email protected] on behalf of
> > [email protected]> wrote:
> >
> > >Hi,
> > >
> > >nowadays most source code is stored in some sort of VCS. Particularly
> > >popular in the OSS world, but also in commercial software development,
> > >is Git as a DVCS. Git's internal data structures are based on simple
> > >hierarchies of SHA-1 hashes: Contents of files ("blobs") are hashed,
> > >entries of blobs are hashed to "trees", trees are hashed to "commits",
> > >etc.
> > >
> > >So basically Git already knows the hashes of all its files, and there's
> > >usually no need to recalculate the hashes for the purpose of creating
> > >SPDX File Checksum entries. The only hitch is that Git's SHA1 of a blob
> > >is *slightly* different from the SHA1 of purely the file contents: Git
> > >prefixes the file contents with "blob <size>\0" where <size> is the size
> > >of the file in bytes. The "git hash-object <file>" command calculates
> > >this SHA1 on the contents of <file> with the prefix added, and the script
> > >at [1] illustrates how Git internally performs the calculation.
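
The prefixing described above can be sketched in a few lines of Python. This is only an illustration, not the script at [1]; the function names `sha1git` and `sha1_plain` are made up, and it assumes the bytes are hashed exactly as stored, with no content filters such as line-ending conversion applied:

```python
import hashlib


def sha1git(data: bytes) -> str:
    """Return the Git-style blob SHA1 for the given file contents."""
    # Git hashes the header "blob <size>\0" followed by the raw contents.
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def sha1_plain(data: bytes) -> str:
    """Return the plain SHA1 of the contents alone."""
    return hashlib.sha1(data).hexdigest()


if __name__ == "__main__":
    contents = b"hello\n"
    print("SHA1:   ", sha1_plain(contents))
    print("SHA1GIT:", sha1git(contents))
```

For an empty file, `sha1git(b"")` yields e69de29bb2d1d6434b8b29ae775ad8c2e48c5391, the well-known ID of Git's empty blob, while the plain SHA1 of the same input is da39a3ee5e6b4b0d3255bfef95601890afd80709.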
> > >
> > >In order to reuse Git's SHA1 of blobs when creating an SPDX file for
> > >files stored in Git, I'd like to propose a new "SHA1GIT" algorithm. The
> > >hash value for that algorithm must match the output of
> > >"git hash-object <file>". Having the Git-style SHA1 also allows easier
> > >matching of a given SPDX File Checksum to Git repositories by doing
> > >something like "git rev-list --objects --all | grep <sha1git>".
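
The rev-list lookup mentioned above could be scripted roughly as follows. This is a sketch under assumptions: the helper names `list_objects` and `find_paths` are hypothetical, git must be on the PATH, and the parsing relies on "git rev-list --objects" printing one object per line as an ID optionally followed by a space and a path:

```python
import subprocess


def list_objects(repo_dir: str) -> str:
    """Run the command from the thread and return its raw output."""
    result = subprocess.run(
        ["git", "rev-list", "--objects", "--all"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    return result.stdout


def find_paths(rev_list_output: str, sha1git: str) -> list:
    """Return the paths printed next to the given object ID."""
    paths = []
    for line in rev_list_output.splitlines():
        # Commit lines carry only an ID; tree and blob lines carry "<id> <path>".
        object_id, _, path = line.partition(" ")
        if object_id == sha1git and path:
            paths.append(path)
    return paths
```

Something like `find_paths(list_objects("."), "<sha1git>")` would then return the path names under which the blob is reachable. Unlike a plain grep, it matches the ID column only, so a path that happens to contain the hash string is not reported.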
> > >
> > >Benefitting the most from the new SHA1GIT algorithm would also require
> > >making the existing SHA1 algorithm non-mandatory. From a file consistency
> > >point of view it does not really make sense to compute both ("git
> > >hash-object <file>" also works on files not committed to Git), and neither
> > >does it from a performance point of view.
> > >
> > >Please let me know what you think about this proposal.
> > >
> > >[1] https://github.com/sschuberth/dev-scripts/blob/master/git/git-hash-blob.sh
> > >
> > >Regards,
> > >Sebastian
> > >
> > >
> > >_______________________________________________
> > >Spdx-tech mailing list
> > >[email protected]
> > >https://lists.spdx.org/mailman/listinfo/spdx-tech
>
--
Kate Stewart
Sr. Director of Strategic Programs, The Linux Foundation
Mobile: +1.512.657.3669
Email / Google Talk: [email protected]