Re: [EXT] [spdx-tech] Element IDs

Gary O'Neall Wed, 04 Aug 2021 12:24:43 -0700

My concern is how to translate the ID’s to and from non-linked-data 
serialization formats.


 

The easiest approach would be to just include the entire ID String in formats 
like tag/value.

 

This would end up with something like:

 

               …

               SPDXID: http://spdx.org/spdxdocs/spdx-example-444504E0-4F89-41D3-9A0C-0305E82C3301#SPDXRef-File

               …

 

Rather than the current format of:

 

               DocumentNamespace: http://spdx.org/spdxdocs/spdx-example-444504E0-4F89-41D3-9A0C-0305E82C3301

               …

               SPDXID: SPDXRef-File

               …

 

Similar issues for the Spreadsheet and YAML formats.  We also have a 
non-linked-data JSON format which would also have the same ID issues.

 

If the above change is acceptable to those using the non-linked-data 
serialization formats, I would definitely go with the simpler approach.

 

If we want the ID’s to be short, however, we’ll need to introduce something 
like namespaces and have pre-defined rules to make it possible to reliably 
translate between the linked-data formats (which will always use URI’s) and the 
non-linked-data formats.

 

Here are the rules for SPDX-2.0:

*       The full URI is formed by concatenating the documentNamespace + ‘#’ + 
SPDXID in non-linked-data formats
*       Linked Data formats must include a default namespace in their 
serialization – this is the same namespace as the documentNamespace 
property used in the non-linked-data format, with ‘#’ appended
*       SPDX ID’s are restricted to the format SPDXRef-[idString] where 
idString is a unique string containing letters, numbers, ., and/or -.
*       Any ID’s not defined within the SPDX document use the format 
DocumentRef-[idString]:SPDXRef-[idString] for non-linked-data formats and use 
the external map to form the full URI
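
The SPDX-2.0 rules above can be sketched in Python roughly as follows. This is only an illustration: the function names, the `external_map` dictionary (DocumentRef id to external document namespace) and the way errors are reported are assumptions made for the example, not part of any SPDX library.

```python
import re

# SPDX-2.0 style local ID: SPDXRef-[idString], idString = letters, numbers, '.', '-'
SPDX_ID = re.compile(r"SPDXRef-[A-Za-z0-9.\-]+")
# Reference into another document: DocumentRef-[idString]:SPDXRef-[idString]
EXTERNAL_ID = re.compile(r"(DocumentRef-[A-Za-z0-9.\-]+):(SPDXRef-[A-Za-z0-9.\-]+)")


def to_uri(document_namespace, spdx_id, external_map=None):
    """Form the full URI for an ID found in a non-linked-data format."""
    external = EXTERNAL_ID.fullmatch(spdx_id)
    if external:
        doc_ref, local_id = external.groups()
        # external_map is a hypothetical dict: DocumentRef id -> namespace
        return (external_map or {})[doc_ref] + "#" + local_id
    if not SPDX_ID.fullmatch(spdx_id):
        raise ValueError(f"not a valid SPDX-2.0 ID: {spdx_id!r}")
    return document_namespace + "#" + spdx_id


def from_uri(document_namespace, uri):
    """Recover the short ID from a full URI in the document's own namespace."""
    prefix = document_namespace + "#"
    if not uri.startswith(prefix):
        raise ValueError("URI is outside the document namespace")
    return uri[len(prefix):]


NS = "http://spdx.org/spdxdocs/spdx-example-444504E0-4F89-41D3-9A0C-0305E82C3301"
full = to_uri(NS, "SPDXRef-File")
assert from_uri(NS, full) == "SPDXRef-File"  # round trip is lossless
```

The round trip only works because the ‘#’ separator and the restricted ID character set are fixed in advance.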

 

Note – there are similar rules for LicenseRef’s.

 

Sean raised a valid issue regarding the required use of ‘#’.

 

I have a proposed solution below:

 

In thinking about this, since we have the documentNamespace and XMLNS 
properties (for RDF/XML), we could relax this requirement and allow any valid 
URI namespace prefix.  This creates a minor incompatibility since we would need 
to append a ‘#’ to the documentNamespace property for any pre-3.0 SPDX 
documents.

 

I would still suggest restricting the characters available for SPDXRef’s to 
make it possible to parse the ID’s in the non-linked-data formats.  We could, 
however, extend some of the characters (e.g. add “/” as an allowed character).  
As per previous discussions, we could also remove the requirements for the 
SPDXRef- prefix.

 

This would solve some of the issues raised previously yet still allow support 
for both linked-data and non-linked data.

 

Here’s a proposed set of rules for 3.0:

*       The full URI is formed by concatenating the documentNamespace + SPDXID 
in non-linked-data formats.  The documentNamespace property would be optional.  
If the documentNamespace is not included, the SPDXID must be the full URI.
*       Linked Data formats may include a default namespace in their 
serialization – this is the same namespace used as the documentNamespace 
property used in the non-linked-data format
*       SPDX ID’s are restricted to be a unique (within the document) string 
containing only letters, numbers, ., /, and/or -.
*       Any ID’s not defined within the SPDX document use the format 
DocumentRef-[idString]:[idString] for non-linked-data formats (NOTE: the ‘:’ 
must not be an allowed character in the idString)
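
As a rough sketch of how the proposed 3.0 rules might look in code (the function names and the example namespaces in the usage note are hypothetical, and the character set is simply the one listed above):

```python
import re

# Proposed 3.0 idString: letters, numbers, '.', '/', '-' (':' deliberately excluded)
ID_STRING = r"[A-Za-z0-9./\-]+"
LOCAL_ID = re.compile(ID_STRING)
EXTERNAL_ID = re.compile(rf"DocumentRef-({ID_STRING}):({ID_STRING})")


def to_uri(spdx_id, document_namespace=None):
    """Proposed rule: URI = documentNamespace + SPDXID, with no implied '#'."""
    if document_namespace is None:
        # Without a documentNamespace, the SPDXID must already be the full URI.
        return spdx_id
    if not LOCAL_ID.fullmatch(spdx_id):
        raise ValueError(f"ID has characters outside the allowed set: {spdx_id!r}")
    return document_namespace + spdx_id


def split_external(ref):
    """Split DocumentRef-[idString]:[idString] on its single ':'."""
    match = EXTERNAL_ID.fullmatch(ref)
    if match is None:
        raise ValueError(f"not an external reference: {ref!r}")
    return match.group(1), match.group(2)
```

For example, `to_uri("File-1", "https://example.org/doc/")` yields `https://example.org/doc/File-1`; the namespace now has to carry its own trailing separator. Because ‘:’ is excluded from the idString, `split_external` can only split a reference one way.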

 

I would further propose some recommended practices:

*       Namespaces are used and must be unique
*       SPDX ID’s have a format that conveys information about the type (per 
previous conversations)
*       Namespaces do not include ‘#’, to make the URI’s more HTTP addressable 
(per Sean’s concern)

 

Variations on a theme:

*       We could introduce a separator character for the namespace that would 
be appended to the documentNamespace.  This would relax the requirement for an 
XMLNS property in the RDF serializations since we could then parse – although 
I’m not sure how reliable the parsing would be.
*       Require a namespace – this would make the tag/value more readable at 
the expense of flexibility

 

Let me know if this sounds reasonable.

 

Gary

 

 

 

From: [email protected] <[email protected]> On Behalf Of Alexios 
Zavras
Sent: Wednesday, August 4, 2021 10:02 AM
To: David Kemp <[email protected]>; Sean Barnum <[email protected]>
Cc: SPDX-list <[email protected]>
Subject: Re: [EXT] [spdx-tech] Element IDs

 

I am actually very conflicted about this…

On one hand splitting an idString into two things (namespace and 
in-namespace-id, if you excuse the awful wording) sounds natural and appeals to 
my design aesthetics, especially since we all agree that an id will use such a 
combination.

On the other hand, the worry expressed by Sean about complexity in the model is 
real and I am not sure that introducing such complexity is justifiable.

 

In case it wasn’t clearly understood: if we split, and an Element, instead of a 
single property “id”, has two properties (say “namespace” and “inNsId”) we lose 
the easy way of referring to an element. That means that *every* other class 
that points to an Element (or anything else, really), will have to specify both 
properties to refer to something.

As an example, we will not have a Relationship of type CONTAINS from Element-1 
to Element-2; we should have a Relationship of type CONTAINS from-namespace Ns1 
and from-innsid I1 to-namespace Ns2 and to-innsid I2.

Or a Document will have a rootElement-namespace and rootElement-innsid in order 
to point to something…

 

… which of course also contradicts the most basic principle of linked data: 
each thing should have a single URI that it can be addressed with.

 

Judging pros and cons, I therefore vote for a single property “id”.
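
To make the difference concrete, here is a small Python sketch of the two modelling options. The class and field names and the example.org URIs are made up for illustration; they are not taken from the SPDX model.

```python
from dataclasses import dataclass


# Option A: one opaque "id" (a URI); any reference is a single value.
@dataclass(frozen=True)
class Relationship:
    relationship_type: str
    from_id: str
    to_id: str


# Option B: split identifiers; every reference needs both parts.
@dataclass(frozen=True)
class SplitRelationship:
    relationship_type: str
    from_namespace: str
    from_in_ns_id: str
    to_namespace: str
    to_in_ns_id: str


single = Relationship(
    "CONTAINS",
    "https://example.org/ns1/I1",
    "https://example.org/ns2/I2",
)
split = SplitRelationship(
    "CONTAINS",
    "https://example.org/ns1/", "I1",
    "https://example.org/ns2/", "I2",
)
```

Both objects describe the same relationship, but with the split form every class that points at an Element doubles its reference fields.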

 

It will be a string, a URI, and it’s up to us to define how we join the 
namespace and in-namespace-id parts. We’ve already said ‘#’ and ‘/’ are not 
suitable.

A very simple encoding (a single “-“ or “_”, for example) may not be 
sufficient, because we want to be able to also do the reverse transformation: 
from a single string understand the namespace and in-namespace-id parts.

Back in the old days we were using non-printable characters to separate 
strings; “Unit Separator” (dec 31, hex 1F) is still in ASCII tables; we can use 
the three printable characters “%1F”. Or go for the section sign § and use 
“%A7”. Or a sequence of one or more “~” signs. Or…
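
A minimal sketch of the reversible join/split idea, assuming “%1F” is the chosen separator and that it is forbidden inside both parts (the function names are hypothetical):

```python
SEPARATOR = "%1F"  # percent-encoded ASCII Unit Separator, one option floated above


def join_id(namespace, in_ns_id):
    """Join the two parts into a single id string."""
    if SEPARATOR in namespace or SEPARATOR in in_ns_id:
        raise ValueError("separator must not occur in either part")
    return namespace + SEPARATOR + in_ns_id


def split_id(element_id):
    """Reverse transformation: recover (namespace, in-namespace-id)."""
    namespace, sep, in_ns_id = element_id.partition(SEPARATOR)
    if not sep:
        raise ValueError("no separator found")
    return namespace, in_ns_id
```

With a plain “-” as separator, an id such as `https://example.org/my-doc-file-1` could be split in several places, which is why the separator has to be something that cannot occur in either part.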

 

 

-- zvr

 

From: [email protected] <[email protected]> On Behalf Of David Kemp
Sent: Wednesday, 4 August, 2021 05:28
To: Sean Barnum <[email protected]>
Cc: SPDX-list <[email protected]>
Subject: Re: [EXT] [spdx-tech] Element IDs

 

My assertion on the call is that any presumption of “globally unique” based 
solely on the probability space of possible values is a poor general approach 
because it does not explicitly take into account the instantial value space 
where the number of objects may be very large and increase the probability of 
collisions. It does not deterministically prevent collisions. While extremely 
unlikely, it is possible to have a conflict with only two objects.

 

You should speak with a cryptographer.  For a 256 bit hash value, the chance of 
birthday collision is 1 / 2^128, or 1 / 3.4*10^38.  That's 10 with 38 zeros.  
For comparison, the chance of winning the Powerball lottery jackpot is 1 in 292 
million, or 1 / 3*10^8, so the chance of collision is about the same as the 
chance of winning a powerball jackpot 1,000,000,000,000,000,000,000,000,000,000 
times in a row. The age of the universe is 436,117,076,640,000,000 seconds, so 
you'd have to be running those lotteries at 1,000,000,000,000 times a second 
for the whole age of the universe before getting a 50% chance of a collision.

Compare that to the reliability of trying to deconflict namespaces using a 
global registration system.  "Extremely unlikely" is easy to say, but it 
doesn't come close to doing the mathematics justice. The chance of collision 
due to an error in a global managed system is infinitely greater (yes, that's 
hyperbole) than in a cryptographic system.
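
For anyone who wants to check the orders of magnitude, here is a rough Python sketch using the standard birthday approximation p ≈ 1 - exp(-n(n-1)/(2N)). It assumes uniformly distributed 256-bit values; the function and constant names are illustrative only.

```python
import math

HASH_BITS = 256
SPACE = 2 ** HASH_BITS  # number of possible 256-bit values


def collision_probability(n, space=SPACE):
    """Birthday approximation for the chance of any collision among n values."""
    return -math.expm1(-n * (n - 1) / (2 * space))


# Number of values needed for a roughly 50% collision chance:
# n ~= sqrt(2 * ln 2 * space), which is on the order of 2^128.
n_half = math.sqrt(2 * math.log(2) * SPACE)

print(f"2^128            = {2 ** 128:.3e}")
print(f"n for p ~ 0.5    = {n_half:.3e}")
print(f"p at 10^18 items = {collision_probability(10 ** 18):.3e}")
```

`math.expm1` is used instead of `1 - math.exp(...)` so that the tiny probabilities involved do not round to zero.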

 

So, to support use cases such as linked data we need namespaces to be URIs 
themselves.

 

Yes, that goes without saying, just as UUIDs are included in URIs.

I am avoiding using “local id” as it may imply that that portion of the 
identifier is only local to that namespace 


That's OK.  The identifier is local to the namespace, and since it can be 
anything within a namespace, nothing prevents many namespaces from using the 
same id.  The full "namespace:id" is different, meaning the Elements are 
different regardless of whether the ids are the same.  I think "component" is 
misleading because it implies that several namespaces using the same id are 
using it to refer to the same "thing"/component, which clearly is not required. 
 The id has no semantics, it is opaque, but I'm not going to quibble over what 
to call it.

 

I think it is important that we realize that the identifier (idString) is a 
valid URI that is composed of the namespace and the component id. It is not 
adequate to split these properties and store them in separate properties.

 

On the contrary, it is essential to recognize that the model represents 
semantics.  By using the words namespace and id we are assigning meaning within 
the compound identifier.  Pretending that that meaning doesn't exist, wishing 
it weren't real, and modeling an Element identifier as a lump without two 
components is the root cause of the discussion going around in circles for 
months.  Sebastian observed that the "#" character (or whatever other character 
we use) is not part of the semantics at all, it is part of the syntax.  Taking 
a namespace and an id and forming a single Element identifier (and putting that 
identifier in URI format) is by definition syntax, whenever and wherever it is 
done in any kind of application.  The Element identifier always has a namespace 
and an id, that's its semantic meaning across all applications, period.

 

Each serialization of Elements MUST maintain integrity and consistency of the 
fully composed identifier string during serialization and deserialization.

 

I fully agree.  You say that Elements don't have to be associated with any 
document.  If those Elements have integrity and consistency of their 
identifiers, and they aren't associated with a document, then wherever they 
come from they must have a proper namespace and id.   Gary suggested Elements 
could come from single-element documents that provided the namespace for that 
one Element; that's certainly possible.  I haven't seen your non-Document 
serialization, but I agree that it too must provide a proper compound 
identifier for its Element.

Dave






View/Reply Online (#4152): https://lists.spdx.org/g/Spdx-tech/message/4152
