Re: [spdx-tech] An alternative to canonical serialization

David Kemp Tue, 24 May 2022 09:07:34 -0700

Max,

If it achieves the same results as defining a canonical series of bytes,
then it would work.  But I don't see how that is possible. When given two
inputs "Mary" and "Mary " (with trailing whitespace) does it produce the
same or different results?
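
A minimal Python sketch of the algorithm quoted below makes the question
concrete. The function names are hypothetical, and several details are
assumptions because the quoted text does not pin them down: UTF-8 encoding,
hex digests as the strings that get concatenated, the "key:value" pair
encoding for objects, and the rendering of numbers. It is not taken from
the implementation at [1].

```python
import hashlib
import json


def sha256_hex(text: str) -> str:
    # Assumption: strings are UTF-8 encoded and digests are lowercase hex.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def recursive_hash(value) -> str:
    """Hash a parsed JSON value from its structure and values only."""
    if value is None:
        return sha256_hex("null")
    if value is True:
        return sha256_hex("true")
    if value is False:
        return sha256_hex("false")
    if isinstance(value, (int, float)):
        # Assumption: integer-valued numbers are rendered without an
        # exponent or fraction, so 123e2 is hashed as "12300".
        if isinstance(value, float) and value.is_integer():
            value = int(value)
        return sha256_hex(str(value))
    if isinstance(value, str):
        # json.dumps adds the surrounding quotes and escapes the content.
        return sha256_hex(json.dumps(value, ensure_ascii=False))
    if isinstance(value, list):
        # Arrays keep their order: the element hashes are joined as given.
        return sha256_hex("[" + ",".join(recursive_hash(v) for v in value) + "]")
    if isinstance(value, dict):
        # Assumption: each pair is hashed as hash(key) + ":" + hash(value),
        # then the pair hashes are sorted so key order does not matter.
        pairs = sorted(
            sha256_hex(recursive_hash(k) + ":" + recursive_hash(v))
            for k, v in value.items()
        )
        return sha256_hex("{" + ",".join(pairs) + "}")
    raise TypeError(f"not a JSON value: {value!r}")


print(recursive_hash("Mary") == recursive_hash("Mary "))
print(recursive_hash({"a": 1, "b": 2}) == recursive_hash({"b": 2, "a": 1}))
```

Under these assumptions the two strings hash differently, while the two
objects hash identically. The algorithm by itself treats every character
of a string as significant.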


Canonicalization is synonymous with information modeling.  An information
model defines what is "significant" (information) and what is insignificant
(noise).  If whitespace is defined as insignificant, then the canonical
representations and hashes of inputs that differ only in the amount of
whitespace must be identical.

Whitespace is a trivial example; integers with or without leading zeroes
and unordered lists with different item orders are other typical
equivalences.

Bob, Fred

and

Fred, Bob

are semantically equivalent in an unordered list, but semantically
different in an ordered list. The information model, or canonicalization
algorithm, must be able to distinguish what is significant and what is not.
Canonicalization must be able to losslessly convert between representations
such as CBOR, JSON and XML, where "lossless" means semantically-meaningful
information is preserved and insignificant data is not required to be.
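
As an illustration only, here is a small Python sketch of a hypothetical
information model in which surrounding whitespace and leading zeroes are
noise and list order is significant only when the list is declared ordered.
The function names and the choice of compact JSON as the canonical byte
form are assumptions made for the example, not a proposal for SPDX.

```python
import hashlib
import json


def canonical_int(text: str) -> str:
    # Leading zeroes are insignificant: "007" and "7" are the same integer.
    return str(int(text))


def canonical_bytes(names: list[str], ordered: bool) -> bytes:
    # Surrounding whitespace is insignificant in this hypothetical model.
    items = [name.strip() for name in names]
    if not ordered:
        # Item order is noise in an unordered list, so pick one fixed order.
        items = sorted(items)
    # Assumption: compact JSON in UTF-8 serves as the canonical byte form.
    text = json.dumps(items, ensure_ascii=False, separators=(",", ":"))
    return text.encode("utf-8")


def canonical_hash(names: list[str], ordered: bool) -> str:
    return hashlib.sha256(canonical_bytes(names, ordered)).hexdigest()


a = ["Bob", "Fred"]
b = ["Fred", "Bob"]
print(canonical_hash(a, ordered=False) == canonical_hash(b, ordered=False))
print(canonical_hash(a, ordered=True) == canonical_hash(b, ordered=True))
print(canonical_hash(["Mary"], ordered=True) == canonical_hash(["Mary "], ordered=True))
```

The same two inputs compare equal or unequal depending on what the model
declares significant, which is the decision any hashing scheme still has to
make.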

Dave




On Tue, May 24, 2022 at 9:19 AM Maximilian Huber via lists.spdx.org
<[email protected]> wrote:

>
> Hey SPDX-Tech team,
>
> I have a suggestion I want to promote. Instead of rigorously defining a
> new "canonical" format for computation of hashes, I suggest defining an
> algorithm which recursively computes a cryptographic hash, purely based
> on the structure and its values, similar to the Package verification
> code. See below for a short definition or [1] for a complete definition
> of such an algorithm.
>
> The resulting hash can then be used to validate that the content of the
> SPDX document matches.
>
>
> Advantages (in my opinion):
>
> - it is easy to implement, just took me 2h to implement it in Haskell
>   and Go (see [1]). It is especially easier than building another
>   serializer that needs to escape strings correctly, might have
>   performance issues and can not use existing libraries.
>
> - it is easy to test, and a list of test cases can be provided (see [2])
>
> - one less "format" to support / just works with JSON output
>
> - the canonicalisation meeting would just have the task of defining
>   at which point a JSON is canonical regarding different structures
>   representing the same SPDX data. This is still hard.
>
> - it feels less fragile
>
> - it can be computed from a stream and there is no 500MB string that
>   needs to be created in memory
>
>
>
> Example definition of such an algorithm:
>
> > Let's say for an arbitrary JSON I compute its hash by defining:
> >
> >     Hashing of base values:
> >             hash( null ) = sha256("null")
> >             hash( true ) = sha256("true")
> >             hash( false ) = sha256("false")
> >             hash( 123 ) = sha256("123")
> >             hash( 123e2 ) = sha256("12300")
> >             hash( "some string" ) = sha256("\"some string\"")
> >
> > Recursive datatypes can be hashed, by composing a string that contains
> > the hashes of all parts and hashing that:
> >
> >     Hashing of recursive datatypes, via recursion:
> >             hash( [] ) = sha256("[]")
> >             hash( [123, "abc", { "key": "value" }] ) =
> >                    sha256("[" +
> >                           hash( 123 ) +
> >                           "," +
> >                           hash( "abc" ) +
> >                           "," +
> >                           hash( { "key": "value" } ) +
> >                           "]")
> >
> > The same can be done with objects, by hashing each key:value pair,
> > sorting the hashes, and hashing the result similar to arrays.
>
>
> There are also other similar algorithms published: [3], [4]
>
> Best
> Max
>
> [1] https://github.com/maxhbr/recursiveHashing
> [2] https://github.com/maxhbr/recursiveHashing/blob/main/testdata.csv
> [3] https://github.com/marekventur/json-hash
> [4] https://github.com/oyamist/merkle-json
>
> --
> Maximilian Huber * [email protected] * +49-174-3410223
> TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
> Geschäftsführer: Henrik Klagges, Dr. Robert Dahlke, Thomas Endres
> Sitz: Unterföhring * Amtsgericht München * HRB 135082
>
>
> 
>
>
>


