Hey SPDX-Tech team,

I have a suggestion I want to promote. Instead of rigorously defining a
new "canonical" format for computation of hashes, I suggest to define an
algorithm which recursively computes an cryptographic hash, purely based
on the structure, and its values. Similar to the Package verification
code. See bellow for a short or [1] for a complete definition of such an
algorithm.

This resulting hash can then be used to validate that the content of the
SPDX is matching.


Advantages (in my opinion):

- it is easy to implement, just took me 2h to implement it in Haskell
  and Go (see [1]). It is especially easier than building another
  serializer that needs to escape strings correctly, might have
  performance issues and can not use existing libraries.

- it is easy to test, and a list of test cases can be provided (see [2])

- one less "format" to support / just works with JSON output

- the canonicalisation meeting would just have the task to define,
 at which point a JSON is canonical regarding different structures,
 representing the same SPDX data. This is still hard.

- it feels less fragile

- it can be computed from a stream and there is no 500MB string that
  needs to be created in memory



Example definition of such an algorithm:

> Lets say for an arbitrary JSON I compute its hash by defining:
>
>     Hashing of base values:
>             hash( null ) = sha256("null")
>             hash( true ) = sha256("true")
>             hash( false ) = sha256("false")
>             hash( 123 ) = sha256("123")
>             hash( 123e2 ) = sha256("12300")
>             hash( "some string" ) = sha256("\"some string\"")
>
> Recursive datatypes can be hashed, by composing a string that contains
> the hashes of all parts and hashing that:
>
>     Hashing of recursive datatypes, via recursion:
>             hash( [] ) = sha256("[]")
>             hash( [123, "abc", { "key": "value" }] ) =
>                    sha256("[" +
>                           hash( 123 ) +
>                           "," +
>                           hash( "abc" ) +
>                           "," +
>                           hash( { "key": "value" } ) +
>                           "]")
>
> The same can be done with objects, by hashing each key:value pair,
> sorting the hashes, and hashing the result similar to arrays.


There are also other similar algorithms published: [3], [4]

Best
Max

[1] https://github.com/maxhbr/recursiveHashing
[2] https://github.com/maxhbr/recursiveHashing/blob/main/testdata.csv
[3] https://github.com/marekventur/json-hash
[4] https://github.com/oyamist/merkle-json

​​​​​​--
Maximilian Huber * [email protected] * +49-174-3410223
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Dr. Robert Dahlke, Thomas Endres
Sitz: Unterföhring * Amtsgericht München * HRB 135082


-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#4519): https://lists.spdx.org/g/Spdx-tech/message/4519
Mute This Topic: https://lists.spdx.org/mt/91310709/21656
Group Owner: [email protected]
Unsubscribe: https://lists.spdx.org/g/Spdx-tech/unsub [[email protected]]
-=-=-=-=-=-=-=-=-=-=-=-


Reply via email to