This illustrates the problem that canonicalization solves. There are two protocol design philosophies in the IETF: 1) Postel's principle, "be liberal in what you accept and conservative in what you send", and 2) its converse, design and maintain protocols to reduce or eliminate the need to accept anything that people come up with: https://tools.ietf.org/id/draft-thomson-postel-was-wrong-03.html
Canonicalization means strictly defining what you send. But information modeling (robustness) means unambiguously defining what non-canonical data you accept. Selecting the output is only half of the canonicalization problem; defining the input is the other half. If we are going to say there is one and only one canonical representation of an SBOM, then we're done: everything that is not that one exact sequence of bytes is rejected. By saying we support more than one serialization we have accepted the robustness principle - the identical SBOM can be represented as JSON, JSON-LD, XML, XML/RDF, maybe tag:value, spreadsheets, RDF/Turtle, Protobuf, CBOR, and concise (machine-to-machine optimized) JSON.

Enumeration means "the act or process of making or stating a list of things one after another", also "the list itself". When we enumerate hash algorithms the result is an unambiguous list of those algorithms:

1 https://datatracker.ietf.org/doc/html/rfc3174
2 https://datatracker.ietf.org/doc/html/rfc3874
3 https://datatracker.ietf.org/doc/html/rfc4634#section-5.1, 256 bits
etc.

The canonicalization question is what we will produce and what we will accept as valid values for each item in the enumeration. The reason Alexios and I don't care about the produce part is that the important question is the accept part. Algorithm 3 could be identified as the integer 3, the string "3", the string "SHA256", the string "SHA-256", the string "sha256", the string "sha-256", the string "<algorithm rdf:resource="spdx:checksumAlgorithm_sha256"/>", and many others. *They all mean the same algorithm.*

It's fine to pick "sha256" as the canonical output for verbose JSON serialization of the RFC4634 256 bit algorithm. But do we want to reject or accept other non-canonical strings as designators of that algorithm? That is the question - what values do we want to accept for each serialization.
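To make the accept-versus-produce split concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration, not something the SPDX spec defines: the names `CANONICAL`, `parse_algorithm` and `emit_algorithm` are hypothetical, the table holds only the three algorithms listed above, and the lenient rule (fold case, drop hyphens, allow the numeric ID) is just one possible accept policy. The RDF form is left out.

```python
# Hypothetical equivalence table: enumeration ID -> canonical verbose-JSON name.
CANONICAL = {1: "sha1", 2: "sha224", 3: "sha256"}


def _fold(text):
    """Fold case and drop hyphens so "SHA-256" and "sha256" compare equal."""
    return text.strip().lower().replace("-", "")


# Accept table: folded canonical names plus the numeric IDs written as strings.
_ACCEPT = {}
for _id, _name in CANONICAL.items():
    _ACCEPT[_fold(_name)] = _id
    _ACCEPT[str(_id)] = _id


def parse_algorithm(value, strict=False):
    """Return the enumeration ID for a designator.

    strict=True accepts only the exact canonical string.
    strict=False also accepts the integer ID, the ID as a string,
    and case/hyphen variants of the canonical name.
    """
    if strict:
        for algo_id, name in CANONICAL.items():
            if value == name:
                return algo_id
        raise ValueError(f"not the canonical designator: {value!r}")
    if isinstance(value, bool):
        # bool is a subclass of int in Python; never treat True as algorithm 1.
        raise ValueError(f"unknown hash algorithm designator: {value!r}")
    if isinstance(value, int) and value in CANONICAL:
        return value
    if isinstance(value, str) and _fold(value) in _ACCEPT:
        return _ACCEPT[_fold(value)]
    raise ValueError(f"unknown hash algorithm designator: {value!r}")


def emit_algorithm(algo_id):
    """Produce the one canonical verbose-JSON string for an enumeration ID."""
    return CANONICAL[algo_id]
```

With the lenient policy, `parse_algorithm("SHA-256")` and `parse_algorithm(3)` both resolve to the same ID, and `emit_algorithm` always writes "sha256". With `strict=True` the same "SHA-256" input is rejected. Choosing between those two behaviors is the accept question.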
For concise JSON serialization the canonical value would be the string "3" and for CBOR the canonical value would be the integer 3 - there are no non-canonical values that satisfy the conciseness goal of those serializations. That is why I would prefer CBOR to be the canonical byte sequence to be hashed - it is the simplest way to minimize the set of non-canonical inputs that need to be accommodated. But that's not the question on the table today.

Dave

On Fri, Jun 24, 2022 at 10:54 AM Dick Brooks <[email protected]> wrote:

> +1 for strings. Strings can represent alphanumeric data.
>
> Thanks,
>
> Dick Brooks
>
> *Active Member of the CISA Critical Manufacturing Sector,*
> *Sector Coordinating Council – A Public-Private Partnership*
>
> *Never trust software, always verify and report!
> <https://reliableenergyanalytics.com/products>* ™
> http://www.reliableenergyanalytics.com
> Email: [email protected]
> Tel: +1 978-696-1788
>
> *From:* [email protected] <[email protected]> *On Behalf Of* Gary O'Neall
> *Sent:* Friday, June 24, 2022 10:53 AM
> *To:* 'David Kemp' <[email protected]>; 'SPDX-list' <[email protected]>
> *Subject:* Re: [spdx-tech] Canonicalization - enumerations
>
> One strong vote for strings. The spec clearly defines the string
> serializations already and introducing numbers is an unnecessary additional
> complexity for some tooling.
>
> Gary
>
> *From:* [email protected] <[email protected]> *On Behalf Of* David Kemp
> *Sent:* Friday, June 24, 2022 7:37 AM
> *To:* SPDX-list <[email protected]>
> *Subject:* [spdx-tech] Canonicalization - enumerations
>
> Sebastian called a vote on whether "the" canonical representation of
> enumerated lists such as hash algorithms and relationship types should be
> strings or numbers.
>
> My vote is "doesn't matter".
> I lean toward efficient serializations
> because they are more likely to be rigorously correct, but the critical
> requirement is that the model defines the equivalence tables for all
> enumerations:
>
> Hash Algorithms:
> 1 SHA1
> 2 SHA224
> 3 SHA256
>
> Software Purposes
> 1 APPLICATION
> 2 FRAMEWORK
> 3 LIBRARY
> etc.
>
> We can say today that the canonical serialization will use human readable
> values then work through the details of translating to and from concise
> serializations. At that point, when all translations are guaranteed to be
> lossless, our work is done.
>
> We could then throw the switch (using Sebastian's analogy) and say the
> canonical hash is computed over CBOR data and everything would still work
> perfectly, because any format can be converted into any other.
>
> Routers are designed to parse IP packets in optimized format
> (https://datatracker.ietf.org/doc/html/rfc791#section-3.1), but optimized
> data can be displayed to humans by tools like Wireshark. Routers could be
> designed to process data in human-readable format. They would be much less
> efficient, but they would work correctly as long as the semantic
> equivalence between efficient and human-readable data is precisely defined.
>
> If SBOMs become as ubiquitous in machine-to-machine operations as IP they
> will surely be processed in an efficient format, and humans will use tools
> like Wireshark to display/debug them. But for now, we can design canonical
> hashes to hash over the string "TCP" instead of the number 6
> (https://en.wikipedia.org/wiki/List_of_IP_protocol_numbers) for
> convenience.
>
> Dave

-=-=-=-=-=-=-=-=-=-=-=-
View/Reply Online (#4623): https://lists.spdx.org/g/Spdx-tech/message/4623
-=-=-=-=-=-=-=-=-=-=-=-
