William,

We may be saying the same thing: defining what a "fundamental data type" is, and defining serialization rules for a canonical hashing of each type, is what canonicalization is. As for "being authentic to what the creator provided", canonicalization defines what is authentic (significant) and what is not.
To reiterate: is JSON pretty-printed with 2-space indentation "authentic" to JSON pretty-printed with 4-space indentation, and "authentic" to JSON minified with no line breaks or spaces? If they all have the identical hash, then whitespace is insignificant and the author's original spacing is not preserved in the canonical/hashed byte sequence. (It obviously is preserved if you copy the original.)

Dave

On Tue, May 24, 2022 at 12:33 PM William Bartholomew (CELA) <[email protected]> wrote:

> I agree with Maximilian and this is the same as my suggestion here:
> https://lists.spdx.org/g/Spdx-tech/message/4514.
>
> To David's comments, for each fundamental data type we would define the
> rules for canonicalization for hashing; to date we have only used a
> handful of data types. Given that we want to be as authentic to what the
> creator provided as possible (which assists with round-tripping), we
> would preserve as much as physically possible.
>
> *From:* [email protected] <[email protected]> *On Behalf Of*
> David Kemp via lists.spdx.org
> *Sent:* Tuesday, May 24, 2022 9:07 AM
> *To:* [email protected]
> *Cc:* SPDX-list <[email protected]>
> *Subject:* [EXTERNAL] Re: [spdx-tech] An alternative to canonical serialization
>
> Max,
>
> If it achieves the same results as defining a canonical series of bytes,
> then it would work. But I don't see how that is possible. When given two
> inputs "Mary" and "Mary " (with trailing whitespace), does it produce
> the same or different results?
>
> Canonicalization is synonymous with information modeling. An information
> model defines what is "significant" (information) and what is
> insignificant (noise). If whitespace is defined as insignificant, then
> the canonical representations and hashes of inputs that differ only in
> the amount of whitespace must be identical.
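[Editor's note: the distinction Dave draws here can be made concrete with a minimal Python sketch. It is not from the thread; the canonical form chosen (minified, key-sorted `json.dumps`) is one illustrative choice among many. The point it demonstrates: whitespace *between* JSON tokens is discarded by the parser and so cannot affect the hash, while whitespace *inside* a string value survives parsing and remains significant.]

```python
import hashlib
import json

def canonical_hash(text: str) -> str:
    """Hash a canonical re-serialization of the parsed JSON value.

    Inter-token whitespace is insignificant by construction: the parser
    discards it before the canonical bytes are produced. Whitespace
    inside string values is information, not noise, and is preserved.
    """
    value = json.loads(text)
    canonical = json.dumps(value, separators=(",", ":"), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Formatting differences vanish:
assert canonical_hash('{"name": "Mary"}') == canonical_hash('{ "name" : "Mary" }')
# But "Mary" and "Mary " (trailing space inside the string) differ:
assert canonical_hash('{"name": "Mary"}') != canonical_hash('{"name": "Mary "}')
```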
> Whitespace is a trivial example, but integers with or without leading
> zeroes and unordered lists with different item orders are typical
> equivalencies.
>
>     Bob, Fred
>
> and
>
>     Fred, Bob
>
> are semantically equivalent in an unordered list, but semantically
> different in an ordered list. The information model, or canonicalization
> algorithm, must be able to distinguish what is significant and what is
> not. Canonicalization must be able to losslessly convert between
> representations such as CBOR, JSON, and XML, where "lossless" means
> semantically meaningful information is preserved and insignificant data
> is not required to be.
>
> Dave
>
> On Tue, May 24, 2022 at 9:19 AM Maximilian Huber via lists.spdx.org
> <[email protected]> wrote:
>
> Hey SPDX-Tech team,
>
> I have a suggestion I want to promote. Instead of rigorously defining a
> new "canonical" format for computation of hashes, I suggest defining an
> algorithm which recursively computes a cryptographic hash, based purely
> on the structure and its values, similar to the package verification
> code. See below for a short definition, or [1] for a complete definition
> of such an algorithm.
>
> The resulting hash can then be used to validate that the content of the
> SPDX document matches.
>
> Advantages (in my opinion):
>
> - It is easy to implement; it took me just 2h to implement it in Haskell
>   and Go (see [1]). It is especially easier than building another
>   serializer that needs to escape strings correctly, might have
>   performance issues, and cannot use existing libraries.
> - It is easy to test, and a list of test cases can be provided (see [2]).
>
> - One less "format" to support; it just works with JSON output.
>
> - The canonicalisation meeting would just have the task of defining at
>   which point a JSON document is canonical with regard to different
>   structures representing the same SPDX data. This is still hard.
>
> - It feels less fragile.
>
> - It can be computed from a stream, and there is no 500MB string that
>   needs to be created in memory.
>
> Example definition of such an algorithm:
>
> Let's say for an arbitrary JSON document I compute its hash by defining:
>
> Hashing of base values:
>
>     hash( null ) = sha256("null")
>     hash( true ) = sha256("true")
>     hash( false ) = sha256("false")
>     hash( 123 ) = sha256("123")
>     hash( 123e2 ) = sha256("12300")
>     hash( "some string" ) = sha256("\"some string\"")
>
> Recursive data types can be hashed by composing a string that contains
> the hashes of all parts and hashing that:
>
> Hashing of recursive data types, via recursion:
>
>     hash( [] ) = sha256("[]")
>     hash( [123, "abc", { "key": "value" }] ) =
>         sha256("[" +
>                hash( 123 ) +
>                "," +
>                hash( "abc" ) +
>                "," +
>                hash( { "key": "value" } ) +
>                "]")
>
> The same can be done with objects, by hashing each key:value pair,
> sorting the hashes, and hashing the result similar to arrays.
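[Editor's note: one possible Python reading of the algorithm Max sketches above, added for illustration. The thread leaves composition details to [1]/[2], so the hex-digest encoding of child hashes, the `{...}` framing for objects, and the `:` joining of key/value hashes here are assumptions, not the thread's normative definition.]

```python
import hashlib
from typing import Any

def _sha256(s: str) -> str:
    # All intermediate hashes are composed as hex digests, an assumption.
    return hashlib.sha256(s.encode("utf-8")).hexdigest()

def recursive_hash(value: Any) -> str:
    """Recursively hash a parsed JSON value, structure plus values."""
    if value is None:
        return _sha256("null")
    if value is True:
        return _sha256("true")
    if value is False:
        return _sha256("false")
    if isinstance(value, (int, float)):
        # Normalize whole-number floats so 123e2 and 12300 hash alike.
        if isinstance(value, float) and value.is_integer():
            value = int(value)
        return _sha256(repr(value))
    if isinstance(value, str):
        return _sha256('"' + value + '"')
    if isinstance(value, list):
        # Order of array items is significant.
        return _sha256("[" + ",".join(recursive_hash(v) for v in value) + "]")
    if isinstance(value, dict):
        # Hash each key:value pair, sort the pair hashes, then hash the
        # whole; key order therefore does not matter.
        pair_hashes = sorted(
            _sha256(recursive_hash(k) + ":" + recursive_hash(v))
            for k, v in value.items()
        )
        return _sha256("{" + ",".join(pair_hashes) + "}")
    raise TypeError(f"unsupported JSON type: {type(value)!r}")

# Object key order is insignificant; array order is significant:
assert recursive_hash({"a": 1, "b": 2}) == recursive_hash({"b": 2, "a": 1})
assert recursive_hash([1, 2]) != recursive_hash([2, 1])
# Equivalent number spellings collapse:
assert recursive_hash(123e2) == recursive_hash(12300)
```

Because each container's hash depends only on its children's hashes, the computation can proceed bottom-up over a stream of parse events, which is the memory advantage Max notes.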
> There are also other similar algorithms published: [3], [4]
>
> Best
> Max
>
> [1] https://github.com/maxhbr/recursiveHashing
> [2] https://github.com/maxhbr/recursiveHashing/blob/main/testdata.csv
> [3] https://github.com/marekventur/json-hash
> [4] https://github.com/oyamist/merkle-json
>
> --
> Maximilian Huber * [email protected] * +49-174-3410223
> TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
> Geschäftsführer: Henrik Klagges, Dr. Robert Dahlke, Thomas Endres
> Sitz: Unterföhring * Amtsgericht München * HRB 135082

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#4523): https://lists.spdx.org/g/Spdx-tech/message/4523
Mute This Topic: https://lists.spdx.org/mt/91310709/21656
Group Owner: [email protected]
Unsubscribe: https://lists.spdx.org/g/Spdx-tech/unsub [[email protected]]
-=-=-=-=-=-=-=-=-=-=-=-
