Sebastian led a productive canonicalization meeting last Friday, dedicated to the topic of URIs. Henk introduced the CRI specification <https://www.ietf.org/archive/id/draft-ietf-core-href-10.html>, which identifies the individual components of a URI that can be modeled and serialized:
"This document defines the Constrained Resource Identifier (CRI) by constraining URIs to a simplified subset and serializing their components in Concise Binary Object Representation (CBOR) [RFC8949] instead of a sequence of characters."

Why does this matter? Even if URIs are always serialized as character strings, the components have semantic meaning. An unambiguous model of "scheme", "authority", "path", "query" and "fragment" allows each component to be extracted and processed within applications, and supports determining whether an arbitrary string is a valid URI.

There is also sentiment for aligning the information model with available tools. The .NET library models URIs with seven components <https://docs.microsoft.com/en-us/dotnet/api/system.uricomponents?view=net-6.0#fields>: Scheme, UserInfo, Host, Port, LocalPath, Query, and Fragment. URI is implemented as a class <https://docs.microsoft.com/en-us/dotnet/api/system.uri?view=net-6.0> with properties and methods; the implementation details are opaque, but instances can be parsed from serialized data, manipulated through a property and method interface, and serialized back into data. This is a perfect example of how information modeling works, and it is implemented in widely available libraries.

URI is a primitive type, so its serialization methods <https://docs.microsoft.com/en-us/dotnet/api/system.runtime.serialization.iserializable?view=net-6.0> are straightforward to write for most data formats. Compound types are more complex but only need to be written once per information type for each data format. The same string serialization would be used for all text data formats, and perhaps even for machine-to-machine formats such as CBOR and Protobuf. But the component structure makes it possible for machine formats to encode components such as Host and Port as integers instead of strings if that seems worthwhile.

One topic we didn't discuss was separating a URI into namespace and local parts.
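As an aside on the component model described above, here is a minimal Python sketch of parsing a URI string into named components and serializing it back. It uses the standard urllib.parse module rather than the .NET class, and the example URI and the `components` dictionary are my own illustration, not anything taken from the CRI draft or from SPDX.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical example URI, chosen so that every component is present.
uri = "http://user@www.contoso.com:8080/foo/bar/index.htm?date=today#top"
parts = urlsplit(uri)

# Named components, roughly matching the seven .NET fields listed above.
components = {
    "scheme": parts.scheme,
    "userinfo": parts.username,   # user-name part of the userinfo only
    "host": parts.hostname,
    "port": parts.port,           # parsed into an int, not kept as a string
    "path": parts.path,
    "query": parts.query,
    "fragment": parts.fragment,
}

# Serializing the parsed components reproduces the original string.
round_trip = urlunsplit(parts)
```

The port comes back as an integer, which is the kind of component-level typing a binary format could take advantage of.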
Playing with the MakeRelativeUri <https://docs.microsoft.com/en-us/dotnet/api/system.uri.makerelativeuri?view=net-6.0> API reveals that the .NET library adjusts the boundary to a reserved character, which can lead to unexpected results. When applied to the URI

    http://www.contoso.com/foo/bar/index.htm?date=today

the namespace

    http://www.contoso.com/foo/bar/

yields the expected local part

    index.htm?date=today

But the namespace

    http://www.contoso.com/foo/bar

(without the trailing slash) yields the bizarre local part

    bar/index.htm?date=today

This is arguably a bug. But even if it is intentional behavior, I suggest that for ease of both understanding and implementation, SPDX define pure string concatenation as the algorithm to separate and combine the namespace and local portions of a URI/IRI.

Dave
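P.S. A minimal sketch of the pure string concatenation rule suggested above, in Python. The function names `split_uri` and `join_uri` are hypothetical and are not part of any SPDX specification or library. For contrast, the last line uses urllib.parse.urljoin, which resolves references along the lines of RFC 3986 instead of concatenating.

```python
from urllib.parse import urljoin


def split_uri(uri: str, namespace: str) -> str:
    """Return the local part of uri by removing the namespace prefix."""
    if not uri.startswith(namespace):
        raise ValueError(f"{uri!r} does not start with namespace {namespace!r}")
    return uri[len(namespace):]


def join_uri(namespace: str, local: str) -> str:
    """Combine namespace and local part by plain concatenation."""
    return namespace + local


uri = "http://www.contoso.com/foo/bar/index.htm?date=today"
with_slash = "http://www.contoso.com/foo/bar/"
without_slash = "http://www.contoso.com/foo/bar"

local_a = split_uri(uri, with_slash)      # "index.htm?date=today"
local_b = split_uri(uri, without_slash)   # "/index.htm?date=today"

# Reference resolution treats "bar" as a replaceable last path segment,
# so joining is not the inverse of a plain prefix split here.
resolved = urljoin(without_slash, "index.htm?date=today")
```

With concatenation, both namespaces round-trip through `join_uri` back to the original URI. The `resolved` value instead comes out as http://www.contoso.com/foo/index.htm?date=today, because resolution drops the final "bar" segment of a base that has no trailing slash.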
