[spdx-tech] canonicalization

David Kemp Tue, 05 Jul 2022 17:32:16 -0700

Sebastian led a productive canonicalization meeting last Friday, dedicated
to the topic of URIs.  Henk introduced the CRI specification
https://www.ietf.org/archive/id/draft-ietf-core-href-10.html, which
identifies the individual components of a URI that can be modeled and
serialized:


This document defines the Constrained Resource Identifier (CRI) by
constraining URIs to a simplified subset and serializing their components
in Concise Binary Object Representation (CBOR) [RFC8949] instead of a
sequence of characters.


Why does this matter?  Even if URIs are always serialized as character
strings, the components have semantic meaning. An unambiguous model of
"scheme", "authority", "path", "query" and "fragment" allows each component
to be extracted and processed within applications, and supports determining
whether an arbitrary string is a valid URI.
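
As a rough illustration of the component model (a sketch only, using
Python's standard urllib.parse rather than anything from the CRI draft; the
helper name split_components and the "#top" fragment are made up for this
example):

```python
from urllib.parse import urlsplit


def split_components(uri: str) -> dict:
    """Split a URI string into the five generic components.

    Hypothetical helper for illustration. urlsplit is lenient, so this
    shows extraction of components, not strict validation.
    """
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "authority": parts.netloc,
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


components = split_components(
    "http://www.contoso.com/foo/bar/index.htm?date=today#top"
)
for name, value in components.items():
    print(f"{name:10} {value}")
```

Once the components are separate fields, an application can inspect or
compare them individually instead of pattern-matching on the whole string.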

There is also sentiment for aligning the information model with available
tools.  The .NET library models URIs with seven components
<https://docs.microsoft.com/en-us/dotnet/api/system.uricomponents?view=net-6.0#fields>:
Scheme, UserInfo, Host, Port, LocalPath, Query, and Fragment.  URI is
implemented as a class
<https://docs.microsoft.com/en-us/dotnet/api/system.uri?view=net-6.0> with
properties and methods; the implementation details are opaque but instances
can be parsed from serialized data, manipulated through a property and
method interface, and serialized back into data. This is a perfect example
of how information modeling works, and it is implemented in
widely-available libraries.  URI is a primitive type so its serialization
methods
<https://docs.microsoft.com/en-us/dotnet/api/system.runtime.serialization.iserializable?view=net-6.0>
are straightforward to write for most data formats. Compound types are more
complex but only need to be written once per information type for each data
format. The same string serialization would be used for all text data
formats, and perhaps even for machine-to-machine formats such as CBOR and
Protobuf. But the component structure makes it possible for machine formats
to optimize Host and Port components as integers instead of strings if that
seems worthwhile.
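
The parse / manipulate / serialize cycle is not specific to .NET.  A minimal
sketch of the same idea with Python's standard urllib.parse follows; the
port number 8080 and the replacement query are assumptions added for the
example, not values from the meeting:

```python
from urllib.parse import urlsplit, urlunsplit

# Parse serialized data into a component model.
parts = urlsplit("http://www.contoso.com:8080/foo/bar/index.htm?date=today")

# Components are exposed as properties; the port is already an integer,
# which is what a binary format could store instead of a string.
host = parts.hostname
port = parts.port

# Manipulate one component without touching the others
# (SplitResult is a named tuple, so _replace returns a modified copy).
updated = parts._replace(query="date=tomorrow")

# Serialize the model back into a character string.
serialized = urlunsplit(updated)
print(host, port, serialized)
```

The string form stays the common serialization, while the typed components
remain available to any format that wants to encode them differently.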

One topic we didn't discuss was separating a URI into namespace and local
parts.  Playing with the MakeRelativeUri
<https://docs.microsoft.com/en-us/dotnet/api/system.uri.makerelativeuri?view=net-6.0>
API reveals that the .NET library adjusts the boundary to a reserved
character, which can lead to unexpected results.

When applied to the URI *http://www.contoso.com/foo/bar/index.htm?date=today*,
the namespace *http://www.contoso.com/foo/bar/* yields the expected local
part *index.htm?date=today*.
But the namespace *http://www.contoso.com/foo/bar* (without trailing slash)
yields the bizarre local part *bar/index.htm?date=today*.

This is arguably a bug.  But even if it is intentional behavior, I suggest
that for ease of both understanding and implementation, SPDX define pure
string concatenation as the algorithm to separate and combine namespace and
local portions of a URI/IRI.
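
To make the suggestion concrete, here is a minimal sketch of what pure
string concatenation could look like.  The function names split_local and
join_local are hypothetical, not part of any SPDX specification.  For
contrast, the last lines use Python's urljoin, which applies RFC 3986
relative-reference resolution and drops the final path segment of a base
that has no trailing slash; I assume, but have not verified, that similar
rules are behind the .NET result above.

```python
from urllib.parse import urljoin


def split_local(uri: str, namespace: str) -> str:
    """Return the local part of uri by removing the namespace prefix."""
    if not uri.startswith(namespace):
        raise ValueError(f"{uri!r} does not start with {namespace!r}")
    return uri[len(namespace):]


def join_local(namespace: str, local: str) -> str:
    """Combine namespace and local part by plain concatenation."""
    return namespace + local


URI = "http://www.contoso.com/foo/bar/index.htm?date=today"
WITH_SLASH = "http://www.contoso.com/foo/bar/"
WITHOUT_SLASH = "http://www.contoso.com/foo/bar"

# Concatenation splits exactly at the end of the namespace string,
# wherever that happens to fall.
local_a = split_local(URI, WITH_SLASH)
local_b = split_local(URI, WITHOUT_SLASH)

# Splitting and joining round-trip for either namespace.
roundtrip_a = join_local(WITH_SLASH, local_a)
roundtrip_b = join_local(WITHOUT_SLASH, local_b)

# Reference resolution is a different operation: the last segment of the
# slash-less base ("bar") is replaced rather than extended.
resolved = urljoin(WITHOUT_SLASH, "index.htm?date=today")

print(local_a, local_b, resolved, sep="\n")
```

With concatenation the boundary is wherever the namespace string ends, so
the result does not depend on which characters are reserved.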

Dave


-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#4637): https://lists.spdx.org/g/Spdx-tech/message/4637
Mute This Topic: https://lists.spdx.org/mt/92197691/21656
Group Owner: [email protected]
Unsubscribe: https://lists.spdx.org/g/Spdx-tech/unsub [[email protected]]
-=-=-=-=-=-=-=-=-=-=-=-
