[spdx-tech] Canonicalization - enumerations

David Kemp Sat, 25 Jun 2022 08:04:21 -0700

This illustrates the problem that canonicalization solves.  There are two
protocol design philosophies in the IETF:
1) Postel's principle: "be liberal in what you accept and conservative in
what you send", and its converse
2) design and maintain protocols to reduce or eliminate the need to accept
anything that people come up with:
https://tools.ietf.org/id/draft-thomson-postel-was-wrong-03.html

Canonicalization means strictly defining what you send.  But information
modeling (robustness) means unambiguously defining what non-canonical data
you accept.  Selecting the output is only half of the canonicalization
problem; defining the input is the other half.  If we are going to say
there is one and only one canonical representation of an SBOM, then we're
done: everything that is not that one exact sequence of bytes is rejected.
By saying we support more than one serialization, we have accepted the
robustness principle - the identical SBOM can be represented as JSON,
JSON-LD, XML, XML/RDF, maybe tag:value, spreadsheets, RDF/Turtle, Protobuf,
CBOR, and concise (machine-to-machine optimized) JSON.
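To make the "one exact sequence of bytes" point concrete, here is a minimal sketch in Python. It is not anything SPDX defines: the canonical form (sorted keys, no insignificant whitespace, UTF-8), the function names, and the toy document fields are all assumptions chosen for illustration. It shows two JSON texts for the same content hashing differently as raw bytes but identically once reduced to one canonical form.

```python
import hashlib
import json


def canonical_bytes(sbom: dict) -> bytes:
    # Hypothetical canonical form: sorted keys, compact separators, UTF-8.
    # This is an illustrative choice, not the SPDX canonicalization.
    text = json.dumps(sbom, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False)
    return text.encode("utf-8")


def canonical_hash(sbom: dict) -> str:
    # Hash the canonical byte sequence rather than the bytes received.
    return hashlib.sha256(canonical_bytes(sbom)).hexdigest()


# Two inputs with the same content but different key order and whitespace.
text_a = '{"name": "example", "checksum": {"algorithm": "sha256", "value": "abc"}}'
text_b = '{\n  "checksum": {"value": "abc", "algorithm": "sha256"},\n  "name": "example"\n}'

doc_a = json.loads(text_a)
doc_b = json.loads(text_b)

# Hashing the raw input bytes distinguishes the two texts ...
raw_equal = (hashlib.sha256(text_a.encode("utf-8")).hexdigest()
             == hashlib.sha256(text_b.encode("utf-8")).hexdigest())
# ... while hashing the canonical bytes does not.
canon_equal = canonical_hash(doc_a) == canonical_hash(doc_b)
```

The sketch only normalizes key order and whitespace; normalizing enumeration values is a separate step that has to happen before the canonical bytes are produced.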

Enumeration means "the act or process of making or stating a list of things
one after another", also "the list itself".  When we enumerate hash
algorithms the result is an unambiguous list of those algorithms.

1 https://datatracker.ietf.org/doc/html/rfc3174
2 https://datatracker.ietf.org/doc/html/rfc3874
3 https://datatracker.ietf.org/doc/html/rfc4634#section-5.1, 256 bits
   etc

The canonicalization question is what will we produce and what will we
accept as valid values for each item in the enumeration.  The reason
Alexios and I don't care about the produce part is that the important
question is the accept part.  Algorithm 3 could be identified as the
integer 3, the string "3", the string "SHA256", the string "SHA-256", the
string "sha256", the string "sha-256", the string "<algorithm
rdf:resource="spdx:checksumAlgorithm_sha256"/>", and many others.  *They all
mean the same algorithm.*  It's fine to pick "sha256" as the canonical
output for verbose JSON serialization of the RFC 4634 256-bit algorithm.  But
do we want to reject or accept other non-canonical strings as designators
of that algorithm?  That is the question - what values do we want to accept
for each serialization.
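The accept side can be sketched as a lookup table, shown here in Python. This is an illustration under assumptions, not a proposal for the spec: the function name `parse_hash_algorithm`, the `strict` flag, and the normalization rule (ignore case, hyphens and underscores) are all hypothetical. The numbers 1 to 3 follow the enumeration above.

```python
# Enumeration from the list above: number -> canonical verbose string.
HASH_ALGORITHMS = {1: "sha1", 2: "sha224", 3: "sha256"}


def _key(text: str) -> str:
    # Assumed normalization rule: ignore case, hyphens and underscores.
    return text.strip().lower().replace("-", "").replace("_", "")


# Table of accepted designators, built from the enumeration.
ACCEPTED = {}
for number, name in HASH_ALGORITHMS.items():
    ACCEPTED[str(number)] = number   # the string "3"
    ACCEPTED[_key(name)] = number    # "sha256", "SHA-256", ...


def parse_hash_algorithm(value, strict=False):
    """Return the enumeration number for a designator, or raise ValueError."""
    if isinstance(value, bool):
        raise ValueError(f"not a hash algorithm designator: {value!r}")
    if strict:
        # Strict mode accepts only the canonical verbose string.
        for number, name in HASH_ALGORITHMS.items():
            if value == name:
                return number
        raise ValueError(f"non-canonical designator rejected: {value!r}")
    if isinstance(value, int):
        if value in HASH_ALGORITHMS:
            return value
    elif isinstance(value, str):
        number = ACCEPTED.get(_key(value))
        if number is not None:
            return number
    raise ValueError(f"unknown hash algorithm designator: {value!r}")
```

With `strict=False` the integer 3 and the strings "3", "SHA256", "SHA-256", "sha256" and "sha-256" all resolve to algorithm 3; with `strict=True` only "sha256" does. Which of the two behaviours each serialization should have is exactly the open question.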

For concise JSON serialization the canonical value would be the string "3"
and for CBOR the canonical value would be the integer 3 - there are no
non-canonical values that satisfy the conciseness goal of those
serializations.  That is why I would prefer CBOR to be the canonical byte
sequence to be hashed - it is the simplest way to minimize the set of
non-canonical inputs that need to be accommodated.  But that's not the
question on the table today.
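As a rough sketch of the produce side, the Python below emits the value of one enumeration item for each of the three serializations mentioned. The serialization labels and function names are hypothetical. The CBOR helper hand-encodes only unsigned integers 0 to 23, which RFC 8949 represents as a single byte; a real implementation would use a CBOR library.

```python
import json

# Enumeration from the list above: number -> canonical verbose string.
HASH_ALGORITHMS = {1: "sha1", 2: "sha224", 3: "sha256"}


def cbor_small_uint(n: int) -> bytes:
    # RFC 8949: unsigned integers 0..23 are encoded as one byte (major type 0).
    if not 0 <= n <= 23:
        raise ValueError("this sketch only covers integers 0..23")
    return bytes([n])


def encode_algorithm(number: int, serialization: str) -> bytes:
    """Return the canonical bytes for one enumeration value."""
    if number not in HASH_ALGORITHMS:
        raise ValueError(f"unknown algorithm number: {number}")
    if serialization == "verbose-json":
        # Human-readable name as a JSON string, e.g. "sha256".
        return json.dumps(HASH_ALGORITHMS[number]).encode("utf-8")
    if serialization == "concise-json":
        # The number as a JSON string, e.g. "3".
        return json.dumps(str(number)).encode("utf-8")
    if serialization == "cbor":
        # The number as a CBOR unsigned integer.
        return cbor_small_uint(number)
    raise ValueError(f"unknown serialization: {serialization}")
```

For algorithm 3 this yields the eight bytes of `"sha256"` (quotes included) in verbose JSON, the three bytes of `"3"` in concise JSON, and the single byte 0x03 in CBOR.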

Dave

On Fri, Jun 24, 2022 at 10:54 AM Dick Brooks <
[email protected]> wrote:

> +1 for strings. Strings can represent alphanumeric data.
>
>
>
> Thanks,
>
>
>
> Dick Brooks
>
> *From:* [email protected] <[email protected]> *On Behalf Of
> *Gary O'Neall
> *Sent:* Friday, June 24, 2022 10:53 AM
> *To:* 'David Kemp' <[email protected]>; 'SPDX-list' <
> [email protected]>
> *Subject:* Re: [spdx-tech] Canonicalization - enumerations
>
>
>
> One strong vote for strings.  The spec clearly defines the string
> serializations already and introducing numbers is an unnecessary additional
> complexity for some tooling.
>
>
>
> Gary
>
>
>
> *From:* [email protected] <[email protected]> *On Behalf Of
> *David Kemp
> *Sent:* Friday, June 24, 2022 7:37 AM
> *To:* SPDX-list <[email protected]>
> *Subject:* [spdx-tech] Canonicalization - enumerations
>
>
>
> Sebastian called a vote on whether "the" canonical representation of
> enumerated lists such as hash algorithms and relationship types should be
> strings or numbers.
>
> My vote is "doesn't matter".  I lean toward efficient serializations
> because they are more likely to be rigorously correct, but the critical
> requirement is that the model defines the equivalence tables for all
> enumerations:
>
> Hash Algorithms:
>  1 SHA1
>  2 SHA224
>  3 SHA256
>
> Software Purposes:
>  1 APPLICATION
>  2 FRAMEWORK
>  3 LIBRARY
>  etc.
>
> We can say today that the canonical serialization will use human readable
> values then work through the details of translating to and from concise
> serializations.  At that point, when all translations are guaranteed to be
> lossless, our work is done.
>
> We could then throw the switch (using Sebastian's analogy) and say the
> canonical hash is computed over CBOR data and everything would still work
> perfectly, because any format can be converted into any other.
>
> Routers are designed to parse IP packets in optimized format (
> https://datatracker.ietf.org/doc/html/rfc791#section-3.1), but optimized
> data can be displayed to humans by tools like Wireshark.  Routers could be
> designed to process data in human-readable format.  They would be much less
> efficient, but they would work correctly as long as the semantic
> equivalence between efficient and human-readable data is precisely defined.
>
> If SBOMs become as ubiquitous in machine-to-machine operations as IP they
> will surely be processed in an efficient format, and humans will use tools
> like Wireshark to display/debug them.  But for now, we can design canonical
> hashes to hash over the string "TCP" instead of the number 6 (
> https://en.wikipedia.org/wiki/List_of_IP_protocol_numbers) for
> convenience.
>
> Dave
>

View/Reply Online (#4623): https://lists.spdx.org/g/Spdx-tech/message/4623