As a starting point for the ID format, I would suggest the current SPDX 2.2 
convention, which uses the following:

 

<namespace>#SPDXRef-[idstring] for elements, or

<namespace>#LicenseRef-[idstring] for extractedLicenseInformation

 

[idstring] is a string, unique within the document, containing letters, 
numbers, "." and/or "-".

 

The only requirement for namespace is that it is unique and follows a URI 
format without “#”.
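
A minimal sketch of how the format above could be checked, offered only as an 
illustration. The function name parse_spdx_id and the example.com namespace are 
hypothetical, and the pattern assumes [idstring] is one or more letters, 
digits, "." or "-". It does not verify that the namespace is a well-formed URI, 
only that it contains no "#".

```python
import re

# Assumption: the namespace is any non-empty run of characters without "#"
# or whitespace; [idstring] is letters, digits, "." and "-" only.
ID_PATTERN = re.compile(
    r"(?P<namespace>[^#\s]+)#"
    r"(?P<prefix>SPDXRef|LicenseRef)-"
    r"(?P<idstring>[A-Za-z0-9.\-]+)"
)


def parse_spdx_id(full_id: str) -> dict:
    """Split a full ID into namespace, prefix, and idstring, or raise ValueError."""
    match = ID_PATTERN.fullmatch(full_id)
    if match is None:
        raise ValueError(f"not an SPDX 2.2 style ID: {full_id!r}")
    return match.groupdict()


if __name__ == "__main__":
    print(parse_spdx_id("https://example.com/doc1#SPDXRef-Package-1.0"))
```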

 

Reasoning:

*       Compatible with previous versions of SPDX
*       Meets most of the criteria Sean mentioned below
*       Lossless mapping of URI based ID to short IDs for various non-URI based 
serialization formats (e.g. YAML, Tag/Value)
*       Prefixing with SPDXRef and LicenseRef makes it easy to create 
non-conflicting IDs for objects outside the SPDX defined SBOM format (e.g., I 
would like to extend a document to include something not defined in SPDX but 
using the same document namespace)
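
The lossless mapping mentioned in the third bullet can be sketched roughly as 
follows. The helper names to_short_id and to_full_id and the example.com 
namespace are hypothetical. The sketch assumes a short ID never contains "#", 
which holds for the [idstring] alphabet described above, so the presence of "#" 
is enough to tell a full ID from a short one.

```python
def to_short_id(full_id: str, document_namespace: str) -> str:
    """Drop the namespace when it matches the document's own namespace."""
    namespace, sep, local = full_id.partition("#")
    if sep and namespace == document_namespace:
        return local
    # IDs from other namespaces are left in full form.
    return full_id


def to_full_id(short_or_full: str, document_namespace: str) -> str:
    """Rebuild the URI form; anything already containing "#" is kept as-is."""
    if "#" in short_or_full:
        return short_or_full
    return f"{document_namespace}#{short_or_full}"


if __name__ == "__main__":
    ns = "https://example.com/doc1"
    short = to_short_id(f"{ns}#SPDXRef-File-1", ns)
    print(short, to_full_id(short, ns))
```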

 

As I write this, I realize this violates the opaqueness principle which I 
agreed to previously.  I would still keep the [idstring] opaque, but require 
the prefixes of SPDXRef- and LicenseRef- for the reasons listed above.

 

I believe we can implement an SPDX 2.2 compatible ID naming convention and meet 
Sean’s points below.  If this is not the case, we should discuss further to 
make sure we don’t create something which isn’t usable for all serialization 
formats.

 

Gary

 

From: [email protected] <[email protected]> On Behalf Of David 
Kemp
Sent: Tuesday, July 20, 2021 3:40 PM
To: Sean Barnum <[email protected]>
Cc: [email protected]; SPDX-list <[email protected]>
Subject: Re: [EXT] Re: [spdx-tech] SPDX IDs - internal or meaningful?

 

Sean,

Would it be helpful to discuss the distinction between:
  *  "graph" - a knowledge instance that conforms to the SPDXv3 knowledge 
model, used within an environment

  *  "document" - a unit of interchange between environments.

A collection of documents must be able to losslessly communicate an entire 
graph from one environment to another. Wolfram Alpha is an example of an 
environment.  Not all of its knowledge must have existed in documents in the 
past, but in the future it must be possible to clone the entire knowledge graph 
from one environment into another environment using one or more documents.  Is 
this a correct statement (in principle, as a design goal)?

 

I agree with your bullets 2, 3, and partially 4.

Taking your last bullet first:

*       I don’t believe we ever fully decided on how we would explicitly denote 
which ids are using such a shorthand so a deserializer would know when to 
construct the full id

The UCO / STIX pattern does not allow that to be solved.  Making the 
non-namespaced id part of the URI path makes it impossible to tell where the 
namespace ends.  I suppose a heuristic could say that the last path component 
MUST BE the non-namespaced id and that the latter MUST NOT contain the path 
separator "/", but that is unsatisfying.  It would be more standard to treat 
the full URI path as the namespace and define id as a fragment within that 
namespace.  I'll use the <namespace>#<id> convention below; substitute "/" if 
you wish.
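
To illustrate the difference, here is a rough sketch of both splitting rules. 
The function names and the example.com URLs are hypothetical. The fragment 
split needs no extra rule, while the path split only works under the heuristic 
described above (the id is the last path component and contains no "/").

```python
def split_fragment_id(full_id: str) -> tuple[str, str]:
    """<namespace>#<id>: the first "#" is an unambiguous boundary."""
    namespace, sep, local_id = full_id.partition("#")
    if not sep or not namespace or not local_id:
        raise ValueError(f"expected <namespace>#<id>: {full_id!r}")
    return namespace, local_id


def split_path_id(full_id: str) -> tuple[str, str]:
    """<namespace>/<id>: a heuristic that assumes the id has no "/" in it."""
    namespace, sep, local_id = full_id.rpartition("/")
    if not sep or not namespace or not local_id:
        raise ValueError(f"expected <namespace>/<id>: {full_id!r}")
    return namespace, local_id


if __name__ == "__main__":
    print(split_fragment_id("https://example.com/sboms/acme#File-1394"))
    # Nothing in the string itself says whether the namespace ends at
    # ".../sboms" or ".../sboms/acme"; the heuristic simply picks the latter.
    print(split_path_id("https://example.com/sboms/acme/File-1394"))
```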


I would say:

*       the non-namespace portion of the identifier must uniquely identify an 
element within the namespace.  (The namespace is globally unique and the 
authority for a namespace is fully authoritative for the IDs within that 
namespace.)  At the strongest I'd say an authority MAY choose to use UUID v4 or 
v5 for its IDs, but there is no reason to prohibit other big IDs (such as 
vehicle VINs) or small IDs (such as one-up serial numbers). The following would 
all be valid full IDs if so designated by the namespace authority:

*       <namespace>#File-154c8aa4-98ba-4642-b84e-fe8e283299bc
*       <namespace>#1394
*       <namespace>#File-1394
*       <namespace>#1394-File
*       <namespace>#1394-BEERWARE-4.2
*       <namespace>#License-BEERWARE-4.2

These are all opaque IDs, meaning that the only property that can be counted on 
is that namespace is separated from non-namespace ID by "#".  I'm suggesting 
that we consider adding a third defined level:

*       "<namespace>#<uid>/<label>" where uid is guaranteed to be unique within 
the namespace and any "/<label>" MUST be ignored when matching Element IDs.

<uid>/<label> is a syntactically-valid RFC 3986 fragment, so the entire ID 
including namespace, uid, and label is a valid URI.

All of the examples above would be valid in the 3-part ID, in addition to the 
following:

*       <namespace>#154c8aa4-98ba-4642-b84e-fe8e283299bc/File
*       <namespace>#154c8aa4-98ba-4642-b84e-fe8e283299bc/File-ld.so.4
*       <namespace>#1394/File-ld.so.4
*       <namespace>#1394/License-BEERWARE-4.2
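
A small sketch of how matching could work under the three-part proposal, 
assuming the first "#" ends the namespace and the first "/" in the fragment 
ends the uid. The names split_element_id and same_element and the example.com 
namespace are hypothetical.

```python
def split_element_id(full_id: str) -> tuple[str, str, str]:
    """Return (namespace, uid, label); label is "" when no "/<label>" is present."""
    namespace, sep, fragment = full_id.partition("#")
    if not sep or not namespace or not fragment:
        raise ValueError(f"expected <namespace>#<uid>[/<label>]: {full_id!r}")
    uid, _, label = fragment.partition("/")
    return namespace, uid, label


def same_element(id_a: str, id_b: str) -> bool:
    """Compare namespace and uid only; any label is ignored when matching."""
    return split_element_id(id_a)[:2] == split_element_id(id_b)[:2]


if __name__ == "__main__":
    ns = "https://example.com/ns"
    print(same_element(f"{ns}#1394/File-ld.so.4", f"{ns}#1394"))
    print(same_element(f"{ns}#1394/File", f"{ns}#1395/File"))
```

With this rule a two-part ID such as <namespace>#File-1394 is simply a uid of 
"File-1394" with no label, so the earlier examples remain valid.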

In his email William stated that IDs should be opaque, which means that 
processors MUST NOT examine the content of  the ID and make decisions based on 
its content.  Element type (File, License, Identity, etc) must exist in a 
property other than ID in order to be visible to applications.

 

I agree with your serialization bullets, except:

*       Documents must specify the content of all elements that they define 
(inline) in the element property.  I don't know what it would mean for an 
"element" property value to be an ID reference.  References are used in 
properties other than element; element contains only the document's inline 
definitions.
*       I don't disagree that Elements specified in a document MUST use only 
the non-namespace portion of the ID.  But it seems potentially useful for a 
Document to contain a copy of element definitions from another namespace, as a 
convenience.

If the two or three part ID structure is defined by SPDXv3 as normative, 
serializers would use "#" and "/" as the separators between <namespace>, <uid>, 
and <label>.

Dave

 

 

On Tue, Jul 20, 2021 at 11:58 AM Sean Barnum <[email protected] 
<mailto:[email protected]> > wrote:

All,

 

Trying to get caught up from my absence and looking forward to getting back in 
the game.

 

I am concerned by some of the id discussions I am seeing here as it looks like 
they may not be based on the consensus and results of the very extensive 
3T-SBOM conversations that occurred regarding scoping of Elements and IDs.

 

Here is a very quick stab at outlining the results of our previous discussions:

*       Elements can be defined and referenced independent of the scope/context 
of any Document.
*       At the model level, Element ids uniquely identify an Element within the 
universe across all contexts and therefore MUST be globally unique, not simply 
unique within a given Document.
*       There is significant value in full Element ids being IRIs
*       Element ids should consist of a namespace (globally unique to some 
specific authority (e.g., a single producer/definer)) combined with an 
identifier that is globally unique within the id namespace.

*       The namespace could be a general namespace for a given 
producer/definer, could be tied to a specific Document, or could be something 
else
*       The non-namespace portion of the identifier would ideally contain a UUID

*       The UUID portion could be random (v4) or could be deterministic (v5)

*       I do not recall any discussion of intent that the identifier should be 
opaque and I do not see the value in having it so. I would concur with William 
that there is value in readable ids.
*       In UCO, STIX and other efforts I have been part of we have found the 
following structural pattern to be most effective: “<namespace>/<Element 
type>-<UUID>”

*       In serialization:

*       Elements can be expressed as a simple flat set
*       Any Element expressed outside of a Document MUST use its full Element ID
*       Documents, in their “element” property can either reference relevant 
Elements via their Element ID or can specify/express Elements inline
*       Any references inside the Document to Elements defined outside the 
Document MUST have an entry in the ExternalMap for the Document

*       All ExternalMap entries MUST use full Element Ids

*       Any Element specified/expressed inline within the Document MAY use 
only the non-namespace portion of its Element ID as its id

*       When such Elements are deserialized, their full Element ID is 
constructed by combining the Document namespace with the non-namespace 
portion of the Element ID being used as shorthand within the serialization
*       I don’t believe we ever fully decided on how we would explicitly denote 
which ids are using such a shorthand so a deserializer would know when to 
construct the full id
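
As a rough illustration of the id pattern and the shorthand expansion listed 
above, here is a sketch under stated assumptions. The function names and the 
example.com namespace are hypothetical. Because the way to mark shorthand ids 
was never decided, the sketch simply assumes that an id containing ":" already 
carries a URI scheme and is therefore full; that rule is a placeholder, not an 
agreed convention.

```python
import uuid
from typing import Optional


def mint_element_id(namespace: str, element_type: str, name: Optional[str] = None) -> str:
    """Build "<namespace>/<Element type>-<UUID>".

    Without a name the UUID is random (v4); with a name it is deterministic
    (v5). Deriving the v5 UUID from namespace + name is an assumption here.
    """
    if name is None:
        element_uuid = uuid.uuid4()
    else:
        element_uuid = uuid.uuid5(uuid.NAMESPACE_URL, f"{namespace}/{name}")
    return f"{namespace}/{element_type}-{element_uuid}"


def expand_id(element_id: str, document_namespace: str) -> str:
    """Turn a shorthand id into a full Element ID on deserialization."""
    # Placeholder rule (assumption): ":" means the id already has a scheme.
    if ":" in element_id:
        return element_id
    return f"{document_namespace}/{element_id}"


if __name__ == "__main__":
    ns = "https://example.com/doc1"
    print(mint_element_id(ns, "File", name="ld.so.4"))
    print(expand_id("File-1394", ns))
```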

 

I would consider these characteristics of Elements and Ids to be among the most 
important considerations for the practical feasibility of our efforts.

 

I am sure we can discuss further but wanted to make sure to get this concern 
out there.

 

sean

 




