On Thu, Mar 2, 2017 at 10:49 AM, Schuberth, Sebastian
<[email protected]> wrote:
>>> So, instead of the hash processing a file with this text
>>>
>>>   $Id: foo.c 123456 2015-01-31 12:34:56 mdb $
>>>
>>> as is found in a file, it would instead process the above text as if
>>> it were written $Id$
>>>
>>> This would allow two files that are identical other than RCS Keyword
>>> values to have the same 'hash' for an SPDX report.
>>
>>Mark:
>>this is a problem with no simple answer. Luckily it should eventually go 
>>away since, as far as I know, Git does not support keyword expansion 
>>(IMHO for the better).
>
> While Git does not support keyword expansion directly, it can be achieved 
> using the more general clean / smudge filter approach.

Great! This makes it hard then, and I like it this way!
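
For reference, the normalization Mark describes (hashing a file as if every
expanded keyword were written in its bare form) could be sketched roughly as
below. The function names are made up for illustration, the keyword list is
the classic RCS set and is an assumption to adapt for other VCSs, and the
multi-line text that $Log$ expands to is not handled. Such a hash would sit
next to the plain SHA1 that the spec requires, not replace it.

```shell
# Hypothetical helper: collapse expanded RCS keywords such as
# "$Id: foo.c 123456 ... mdb $" back to their bare form "$Id$".
# Reads stdin, writes stdout. The keyword list is an assumption.
collapse_rcs_keywords() {
    sed -E 's/\$(Author|Date|Header|Id|Locker|Log|Name|RCSfile|Revision|Source|State)(:[^$]*)?\$/$\1$/g'
}

# Hypothetical helper: SHA1 of stdin after keyword collapsing.
keyword_neutral_sha1() {
    collapse_rcs_keywords | sha1sum | cut -d ' ' -f 1
}
```

Usage would be something like "keyword_neutral_sha1 < foo.c". Two files that
differ only in their keyword values should then produce the same digest.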


>>That said, there are various ways I have handled this practically:
> Note that the standard currently requires plain "SHA1" to be present. 
> Omitting that in favor of any other / custom hash would render your file 
> non-spec-compliant :-(


Agreed. But IMHO this is a part of the spec that should be flexible,
especially now that SHA1 has been shattered [3].
And IMHO such an LSH checksum should be easy to add to an SPDX document.


>>3. You use a non-crypto, "locality sensitive" checksum hash that you use for 
>>approximate file comparison.
> That option is very useful in general to have an indication about similarity 
> of files.
> Option 4: you simply use the hash under which the VCS stores the file 
> internally. This is similar to Philippe's option 2 in that it refers to the 
> file before keyword expansion. But instead of actually checking out the file 
> without keyword expansion, you simply query the VCS for its internal 
> hash of the file. Taking Git and the AUTHORS.rst file of ScanCode [1] as an 
> example, that would work like:
>
> $ ARRAY=( $(git ls-tree HEAD AUTHORS.rst) ) ; echo ${ARRAY[2]}
> d89c7ba9918d7fe249875ac44b8c61cb11cac4ac
>
> So, this way you not only get the hash before keyword expansion is done, you 
> also get the hash for free since it's already known by the VCS.
>
> The downside is that this internal hash is specific to the VCS, so it only 
> helps to identify the same file in other repos of the same VCS. But for other 
> VCS you could go with Philippe's option 2 and calculate the file hash like 
> Git does internally [2].
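
Calculating the hash "like Git does internally" amounts to a SHA1 over a
small header ("blob", the size in bytes, a NUL byte) followed by the file
content. A minimal sketch, assuming a repository that uses Git's SHA-1 object
format and that sha1sum and mktemp are available; the function name is made
up for illustration and this is not the script from [2]:

```shell
# Hypothetical helper: print the Git blob ID of whatever arrives on stdin.
# Git hashes "blob <size-in-bytes>\0" followed by the raw content.
git_blob_sha1() {
    # Buffer stdin so the byte count is known before hashing starts.
    blob_tmp=$(mktemp) || return 1
    cat > "$blob_tmp"
    blob_size=$(wc -c < "$blob_tmp" | tr -d '[:space:]')
    { printf 'blob %s\0' "$blob_size"; cat "$blob_tmp"; } | sha1sum | cut -d ' ' -f 1
    rm -f "$blob_tmp"
}
```

Running "git_blob_sha1 < AUTHORS.rst" on an unexpanded copy of the file should
print the same ID as "git hash-object AUTHORS.rst", without needing a Git
checkout.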


This is an interesting and intriguing approach :)
As far as I know this is also more or less the approach taken by
Stefano "zack" Zacchiroli and team for Software Heritage... [4] [5]
But this is practical only for Git and Hg, OR if you import into Git or Hg, right?


> [1] 
> https://github.com/nexB/scancode-toolkit/blob/bd424eae1dcdbb3f873169bbc01d252e4e20e4f4/AUTHORS.rst
> [2] https://github.com/sschuberth/dev-scripts/blob/master/git/git-hash-blob.sh


[3] https://shattered.it/
[4] https://www.softwareheritage.org/
[5] 
https://fosdem.org/2017/schedule/event/software_heritage/attachments/slides/1537/export/events/attachments/software_heritage/slides/1537/fosdem17_software_heritage.pdf

-- 
Cordially
Philippe Ombredanne
