I would suggest a starting point for the ID format to be the current SPDX 2.2 which uses the following:
<namespace>#SPDXRef-[idstring] for elements, or
<namespace>#LicenseRef-[idstring] for extractedLicenseInformation

[idstring] is a unique (to the document) string containing letters, numbers, . and/or -. The only requirement for namespace is that it is unique and follows a URI format without “#”.

Reasoning:
* Compatible with previous versions of SPDX
* Meets most of the criteria Sean mentioned below
* Lossless mapping of URI based ID to short IDs for various non-URI based serialization formats (e.g. YAML, Tag/Value)
* Prefixing with SPDXRef and LicenseRef makes it easy to create non-conflicting IDs for objects outside the SPDX defined SBOM format (e.g., I would like to extend a document to include something not defined in SPDX but using the same document namespace)

As I write this, I realize this violates the opaqueness principle which I agreed to previously. I would still keep the [idstring] opaque, but require the prefixes of SPDXRef- and LicenseRef- for the reasons listed above.

I believe we can implement an SPDX 2.2 compatible ID naming convention and meet Sean’s points below. If this is not the case, we should discuss further to make sure we don’t create something which isn’t usable for all serialization formats.

Gary

From: [email protected] <[email protected]> On Behalf Of David Kemp
Sent: Tuesday, July 20, 2021 3:40 PM
To: Sean Barnum <[email protected]>
Cc: [email protected]; SPDX-list <[email protected]>
Subject: Re: [EXT] Re: [spdx-tech] SPDX IDs - internal or meaningful?

Sean,

Would it be helpful to discuss the distinction between:
* "graph" - a knowledge instance that conforms to the SPDXv3 knowledge model, used within an environment
* "document" - a unit of interchange between environments. A collection of documents must be able to losslessly communicate an entire graph from one environment to another.

Wolfram Alpha is an example of an environment.
Not all of its knowledge must have existed in documents in the past, but in the future it must be possible to clone the entire knowledge graph from one environment into another environment using one or more documents. Is this a correct statement (in principle, as a design goal)?

I agree with your bullets 2, 3, and partially 4. Taking your last bullet first:
* I don’t believe we ever fully decided on how we would explicitly denote which ids are using such a shorthand so a deserializer would know when to construct the full id

The UCO / STIX pattern does not allow that to be solved. Making the non-namespaced id part of the URI path makes it impossible to tell where the namespace ends. I suppose a heuristic could say that the last path component MUST BE the non-namespaced id and that the latter MUST NOT contain the path separator "/", but that is unsatisfying. It would be more standard to treat the full URI path as the namespace and define id as a fragment within that namespace. I'll use the <namespace>#<id> convention below; substitute "/" if you wish.

I would say:
* the non-namespace portion of the identifier must uniquely identify an element within the namespace. (The namespace is globally unique and the authority for a namespace is fully authoritative for the IDs within that namespace.)

At the strongest I'd say an authority MAY choose to use UUID v4 or v5 for its IDs, but there is no reason to prohibit other big IDs (such as vehicle VINs) or small IDs (such as one-up serial numbers). The following would all be valid full IDs if so designated by the namespace authority:
* <namespace>#File-154c8aa4-98ba-4642-b84e-fe8e283299bc
* <namespace>#1394
* <namespace>#File-1394
* <namespace>#1394-File
* <namespace>#1394-BEERWARE-4.2
* <namespace>#License-BEERWARE-4.2

These are all opaque IDs, meaning that the only property that can be counted on is that namespace is separated from non-namespace ID by "#".
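
As a rough illustration of the two-part <namespace>#<id> convention, a minimal Python sketch could split a full ID at the first "#" and hand back the remainder untouched. The function name split_element_id and the example.org namespace are hypothetical and not part of any SPDX specification; the sketch also assumes the namespace itself never contains "#".

```python
from __future__ import annotations


def split_element_id(full_id: str) -> tuple[str, str]:
    """Split '<namespace>#<id>' into (namespace, id).

    Assumption: the namespace never contains '#', so the first '#'
    marks the boundary. The id part is treated as opaque: it is
    returned unchanged and never inspected.
    """
    namespace, sep, local_id = full_id.partition("#")
    if not sep or not namespace or not local_id:
        raise ValueError(f"not a '<namespace>#<id>' identifier: {full_id!r}")
    return namespace, local_id


# Hypothetical namespace, used only for illustration.
ns = "https://example.org/spdx/doc1"
for local in ("File-1394", "1394", "1394-File", "License-BEERWARE-4.2"):
    assert split_element_id(f"{ns}#{local}") == (ns, local)
```

Nothing here looks inside the id, so "File-1394" and "1394-File" are simply two different strings; any element type would have to come from a separate property.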
I'm suggesting that we consider adding a third defined level:
* "<namespace>#<uid>/<label>" where uid is guaranteed to be unique within the namespace and any "/<label>" MUST be ignored when matching Element IDs.

<uid>/<label> is a syntactically-valid RFC 3986 fragment, so the entire ID including namespace, uid, and label is a valid URI. All of the examples above would be valid in the 3-part ID, in addition to the following:
* <namespace>#154c8aa4-98ba-4642-b84e-fe8e283299bc/File
* <namespace>#154c8aa4-98ba-4642-b84e-fe8e283299bc/File-ld.so.4
* <namespace>#1394/File-ld.so.4
* <namespace>#1394/License-BEERWARE-4.2

In his email William stated that IDs should be opaque, which means that processors MUST NOT examine the content of the ID and make decisions based on its content. Element type (File, License, Identity, etc) must exist in a property other than ID in order to be visible to applications.

I agree with your serialization bullets, except:
* Documents must specify the content of all elements that they define (inline) in the element property. I don't know what it would mean for an "element" property value to be an ID reference. References are used in properties other than element; element contains only the document's inline definitions.
* I don't disagree that Elements specified in a document MUST use only the non-namespace portion of the ID. But it seems potentially useful for a Document to contain a copy of element definitions from another namespace, as a convenience.

If the two or three part ID structure is defined by SPDXv3 as normative, serializers would use "#" and "/" as the separators between <namespace>, <uid>, and <label>.

Dave

On Tue, Jul 20, 2021 at 11:58 AM Sean Barnum <[email protected]> wrote:

All,

Trying to get caught up from my absence and looking forward to getting back in the game.
I am concerned by some of the id discussions I am seeing here as it looks like they may not be based on the consensus and results of the very extensive 3T-SBOM conversations that occurred regarding scoping of Elements and IDs.

Here is a very quick stab at outlining the results of our previous discussions:
* Elements can be defined and referenced independent of the scope/context of any Document.
* At the model level, Element ids uniquely identify an Element within the universe across all contexts and therefore MUST be globally unique, not simply unique within a given Document.
* There is significant value in full Element ids being IRIs
* Element ids should consist of a namespace (globally unique to some specific authority (e.g., a single producer/definer)) combined with an identifier that is globally unique within the id namespace.
  * The namespace could be a general namespace for a given producer/definer, could be tied to a specific Document, or could be something else
  * The non-namespace portion of the identifier would ideally contain a UUID
    * The UUID portion could be random (v4) or could be deterministic (v5)
  * I do not recall any discussion of intent that the identifier should be opaque and I do not see the value in having it so. I would concur with William that there is value in readable ids.
* In UCO, STIX and other efforts I have been part of we have found the following structural pattern to be most effective: “<namespace>/<Element type>-<UUID>”
* In serialization:
  * Elements can be expressed as a simple flat set
  * Any Element expressed outside of a Document MUST use its full Element ID
  * Documents, in their “element” property can either reference relevant Elements via their Element ID or can specify/express Elements inline
  * Any references inside the Document to Elements defined outside the Document MUST have an entry in the ExternalMap for the Document
  * All ExternalMap entries MUST use full Element Ids
  * Any Elements specified/expressed inline within the Document MAY use only the non-namespace portion of its Element ID as its id
  * When such Elements are deserialized their full Element ID is constructed by combining the Document namespace with the non-namespace portion of its Element ID being used as shorthand within the serialization
  * I don’t believe we ever fully decided on how we would explicitly denote which ids are using such a shorthand so a deserializer would know when to construct the full id

I would consider these characteristics of Elements and Ids to be among the most important considerations for the practical feasibility of our efforts. I am sure we can discuss further but wanted to make sure to get this concern out there.

sean
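
The deserialization rule in Sean's outline (combine the Document namespace with a shorthand id) might be sketched as below. Because the thread notes that no way of marking shorthand ids was ever decided, this sketch simply assumes that an id without "#" is shorthand, borrowing the <namespace>#<id> convention from the replies above. That rule, the function name expand_element_id, and the example.org namespaces are all hypothetical.

```python
def expand_element_id(document_namespace: str, element_id: str) -> str:
    """Return the full Element ID for an id read from a serialized Document.

    Assumption (not settled in the thread): an id that contains '#' is
    already a full ID and is returned unchanged; anything else is treated
    as shorthand relative to the Document namespace.
    """
    if "#" in document_namespace:
        raise ValueError("document namespace must not contain '#'")
    if "#" in element_id:
        return element_id
    return f"{document_namespace}#{element_id}"


# Hypothetical namespaces, used only for illustration.
doc_ns = "https://example.org/spdx/doc1"

# Inline element written with a shorthand id.
assert expand_element_id(doc_ns, "SPDXRef-1394") == f"{doc_ns}#SPDXRef-1394"

# Reference to an element from another namespace, already a full ID.
external = "https://example.org/other#File-1394"
assert expand_element_id(doc_ns, external) == external
```

With the “<namespace>/<Element type>-<UUID>” pattern there is no such separator to test for, which is the ambiguity raised in the reply; a real format would need either the "#" convention or an explicit marker.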
