I like this idea! It helps with sorting of the key-value pairs. However, I'm not sure how the group thinks about maintaining a library in different programming languages for this.

On 5/24/22 04:46, Maximilian Huber via lists.spdx.org wrote:
Hey SPDX-Tech team,

I have a suggestion I want to promote. Instead of rigorously defining a
new "canonical" format for computation of hashes, I suggest to define an
algorithm which recursively computes an cryptographic hash, purely based
on the structure, and its values. Similar to the Package verification
code. See bellow for a short or [1] for a complete definition of such an
algorithm.

This resulting hash can then be used to validate that the content of the
SPDX is matching.


Advantages (in my opinion):

- it is easy to implement, just took me 2h to implement it in Haskell
   and Go (see [1]). It is especially easier than building another
   serializer that needs to escape strings correctly, might have
   performance issues and can not use existing libraries.

- it is easy to test, and a list of test cases can be provided (see [2])

- one less "format" to support / just works with JSON output

- the canonicalisation meeting would just have the task to define,
  at which point a JSON is canonical regarding different structures,
  representing the same SPDX data. This is still hard.

- it feels less fragile

- it can be computed from a stream and there is no 500MB string that
   needs to be created in memory



Example definition of such an algorithm:

Lets say for an arbitrary JSON I compute its hash by defining:

     Hashing of base values:
             hash( null ) = sha256("null")
             hash( true ) = sha256("true")
             hash( false ) = sha256("false")
             hash( 123 ) = sha256("123")
             hash( 123e2 ) = sha256("12300")
             hash( "some string" ) = sha256("\"some string\"")

Recursive datatypes can be hashed, by composing a string that contains
the hashes of all parts and hashing that:

     Hashing of recursive datatypes, via recursion:
             hash( [] ) = sha256("[]")
             hash( [123, "abc", { "key": "value" }] ) =
                    sha256("[" +
                           hash( 123 ) +
                           "," +
                           hash( "abc" ) +
                           "," +
                           hash( { "key": "value" } ) +
                           "]")

The same can be done with objects, by hashing each key:value pair,
sorting the hashes, and hashing the result similar to arrays.

There are also other similar algorithms published: [3], [4]

Best
Max

[1] https://github.com/maxhbr/recursiveHashing
[2] https://github.com/maxhbr/recursiveHashing/blob/main/testdata.csv
[3] https://github.com/marekventur/json-hash
[4] https://github.com/oyamist/merkle-json

​​​​​​--
Maximilian Huber * [email protected] * +49-174-3410223
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Dr. Robert Dahlke, Thomas Endres
Sitz: Unterföhring * Amtsgericht München * HRB 135082





--
nisha



-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#4538): https://lists.spdx.org/g/Spdx-tech/message/4538
Mute This Topic: https://lists.spdx.org/mt/91310709/21656
Group Owner: [email protected]
Unsubscribe: https://lists.spdx.org/g/Spdx-tech/unsub [[email protected]]
-=-=-=-=-=-=-=-=-=-=-=-


Reply via email to