Hi David,
Thanks for the analysis and reply – very good question and points. Responses inline below.

Gary

From: [email protected] <[email protected]> On Behalf Of David Kemp
Sent: Monday, December 13, 2021 9:24 AM
To: [email protected]
Cc: SPDX-list <[email protected]>
Subject: Re: [spdx-tech] SPDX Java tools update related to hasFiles and Contains property

Gary,

I agree that "Package with a File listed in the hasFiles property is semantically the same as Package has a CONTAINS relationship with File." And that "model store" is the set of deserialized data (in SPDXv3 the model store is the set of deserialized Elements.)

Question: In Java Tools, is the lifetime of the model store greater than what is necessary to process a single SPDX document? In other words, is the workflow

1. Create empty model store
2. Deserialize exactly one SPDX document
3. Perform operations on model store
4. Serialize SPDX document(s)
5. Delete model store

[G.O.] The design of the Java tools allows for the model store to persist beyond the serialization/deserialization of the documents but does not require it. For example, you could implement a model store backed by a relational database which is used to store multiple documents.

If the lifetime of the model store is one SPDX file, then the semantics are identical. But if the object store applies to two or more SPDX files, a difference could creep in:

* SPDX document 1: Package A hasFiles X and Y
* SPDX document 2: Relationship CONTAINS from Package A to Files X, Y, and Z

Documents 1 and 2 have different creation info, and Document 3 serialized from the object store would have a third different creation info, so there is no hard inconsistency. But in an environment where software OEMs create SPDX documents to describe their packages, the goal is to be able to re-use the OEM documents.

[G.O.] In the store interface, the SPDX Document URI is always used to distinguish unique elements, which I think solves this issue.
In the above example, Files X and Y would be external document references from document 2 to document 1 and Z would be defined locally to document 2. There would be no modifications to these documents while serializing Document 3.

A hasFiles property on the Package, once created by the OEM, implies that the list of files will remain static for that version of the package, while a CONTAINS relationship implies that later documents can modify the list of files in a given package.

[G.O.] A very interesting point. Using the hasFiles property would require someone amending the Package definition to include another file to create a new unique Package with the superset hasFiles property. The amended Package should have an AMENDS relationship to indicate the change. It would be much simpler to have a set of CONTAINS relationships where someone could just add an additional Relationship. In both cases, however, we would have the Document and Creation information to clearly separate which is the original and which was changed. BTW, the Document containing the amended Package should also have an AMENDS relationship to the original Document. Thinking through this, I think both approaches could reliably handle the situation, but using Relationships is simpler.

Tooling can of course enforce rules to detect and possibly disallow attempts to change a package's contents, but I think the error detection would be more obvious and intuitive if Java Tools followed Alternative "B: Translate all CONTAINS relationships to a hasFiles property in the model store when deserializing." So I vote for B.

[G.O.] Going with B would also solve the issue Simon raised with the inconsistency between RDF and JSON/YAML/XML. On a more general note, I think you pointed out a more general issue. If we make ANY changes to the properties or relationships, do we need to update the URI for the changed objects? Do we also need to update the creation information?
The proposal makes changes to either the hasFile properties, the CONTAINS relationships, or both. The RDF deserializer currently “upgrades” older versions of the spec on deserialization, so it already has this issue. I added this issue <https://github.com/spdx/spdx-java-rdf-store/issues/28> to track it.

Dave

On Mon, Dec 13, 2021 at 1:50 AM Simon Avery via lists.spdx.org <[email protected]> wrote:

Gary,

A few comments:

I agree with your assertion that a CONTAINS (or a CONTAINED_BY) relationship is semantically identical to an RDF/JSON/YAML/XML package that has the hasFile(s) property or a tag/value file where the files follow the package.

Option C looks good but I’m confused as to why you are excluding RDF. That’s the only format where hasFile is defined in the spec. If you're going to always use hasFile for JSON/YAML/XML then why not do the same for RDF?

Converting both ways of expressing the package/files relationship to be stored in a single consistent manner when serialized and deserialized is definitely A Good Thing to prevent duplication.

I think that you should have the check for duplication happen for all file types, not just Tag/Value.

The same issue exists for JSON files that could have both the documentDescribes property set and/or a relationship stating which packages are DESCRIBED_BY or the document DESCRIBES certain packages. How are your tools handling that situation?

Simon Avery

On Fri, Dec 10, 2021 at 1:52 PM Steve Winslow <[email protected]> wrote:

Hi Gary, a couple of quick initial reactions / thoughts:

For the Golang tools, I believe it currently handles things similarly to the way you described the Java tools. The in-memory representation of a Package has a "Files" property <https://github.com/spdx/tools-golang/blob/9813e3e9ab9528c405c798c153e2da336b37cec9/spdx/package.go#L251>, which maps SPDX IDs to File objects.
A File has a similar "Snippets" property <https://github.com/spdx/tools-golang/blob/9813e3e9ab9528c405c798c153e2da336b37cec9/spdx/file.go#L156>. When parsing a tag-value SPDX 2.2 or 2.1 document, Files and Snippets are added into those maps based on positioning in the document, as described in 5.2.3 <https://spdx.github.io/spdx-spec/composition-of-an-SPDX-document/#523-file-information-section> for the tag-value format. There is no assumed equivalence of CONTAINS Relationships; none are auto-generated, and such Relationships are only created if the document explicitly includes them.

I'm not saying this is the "correct" approach -- just describing how the Golang tools work today.

One other thing I'd highlight is the fact that CONTAINS Relationships can reference SPDX elements from separate SPDX documents. So, at least syntactically, I think you can have a situation like the following:

* Document A:
  * defines a File SPDXRef-FileA1
* Tag-value Document B:
  * defines a Package SPDXRef-PackageB
  * defines Files -B1 and -B2 immediately afterward (so PackageB implicitly CONTAINS them)
  * also states that PackageB CONTAINS FileA1 from Document A (so PackageB explicitly CONTAINS them, but from a different document)

I don't know why you _would_ do this, but I think you _could_ and it would be syntactically valid.

The reason I'm mentioning this is just that, at least for tag-value documents, we might think of the "list of Files following a Package" as being equivalent to "the Package CONTAINS the Files". But it's possible to contrive an example where some Files can be expressed as the latter but not as the former.
This might be a rabbit hole that isn't helpful or applicable, so feel free to disregard if so :)

Steve

On Fri, Dec 10, 2021 at 4:20 PM Gary O'Neall <[email protected]> wrote:

I would like to get some feedback from the community on some changes I’m making to the SPDX Java tools related to the hasFiles property in JSON and the CONTAINS relationship.

If you’re a user of the SPDX Java tools, please review the following since it may introduce an incompatibility with prior versions.

If you’re an implementer of tools that read or write SPDX, you may also want to review this and let us know if you agree with the approach.

If you’re working on the SPDX 3.0 spec, you may find this issue relevant to some upcoming topics related to serialization/deserialization.

I’d like to get feedback over the next week or two before I update the tools with the changes.

Problem statement: The SPDX Java tools are currently representing the relationships between the Package and the files contained in the Package in two possibly inconsistent ways – using a hasFile property and using the CONTAINS relationship between the Package and the File. This could lead to inconsistent results depending on how the SPDX file was serialized.
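
To make the problem statement concrete, here is a minimal Python sketch of how the two representations can drift apart. It is not taken from the Java tools: the `Package` and `Relationship` classes and both helper functions are hypothetical stand-ins, and the scenario assumes a document where two files are attached to the package through hasFile while a third is only named in an explicit CONTAINS relationship.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relationship:
    """Hypothetical stand-in for an SPDX Relationship."""
    source: str
    rel_type: str
    target: str


@dataclass
class Package:
    """Hypothetical stand-in for a Package carrying a hasFile property."""
    spdx_id: str
    has_file: set = field(default_factory=set)


def files_from_property(pkg):
    # A consumer that only looks at the hasFile property.
    return set(pkg.has_file)


def files_from_relationships(pkg, relationships):
    # A consumer that only looks at CONTAINS relationships.
    return {
        r.target
        for r in relationships
        if r.source == pkg.spdx_id and r.rel_type == "CONTAINS"
    }


# Files X and Y were stored as hasFile; File Z only as a CONTAINS relationship.
pkg = Package("SPDXRef-Package", has_file={"SPDXRef-FileX", "SPDXRef-FileY"})
relationships = [Relationship("SPDXRef-Package", "CONTAINS", "SPDXRef-FileZ")]

by_property = files_from_property(pkg)
by_relationship = files_from_relationships(pkg, relationships)
print(sorted(by_property))
print(sorted(by_relationship))
```

The two consumers report different file lists for the same package, which is the kind of inconsistency the alternatives further down are meant to remove.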
Current state of the SPDX Spec:

* The relationship CONTAINS is documented and can be used to describe a package CONTAINing a file in all supported serialization formats
* Section 5.2.3 <https://spdx.github.io/spdx-spec/composition-of-an-SPDX-document/#523-file-information-section> describes how the position of file and package declarations is used to denote which files belong to which package
* Section 5.2.3 <https://spdx.github.io/spdx-spec/composition-of-an-SPDX-document/#523-file-information-section> states “When implementing file information in RDF, the spdx:hasFile property is used to associate the package with the file.”
* The RDF OWL property hasFile is defined as “Indicates that a particular file belongs to a package.”
* The RDF OWL documentation for the CONTAINS relationship includes the comment “A Relationship of relationshipType_contains expresses that an SPDXElement contains the relatedSPDXElement. For example, a Package contains a File. (relationshipType_contains introduced in SPDX 2.0 deprecates property 'hasFile' from SPDX 1.2)”
  * Note that the comment in parentheses is inconsistent with the hasFile documentation in the OWL document (it is not deprecated) and also inconsistent with section 5.2.3
* The JSON schema defines a hasFiles property with the same definition as RDF

Current state of the Tools-Java version 1.0.3:

* The Model object SpdxPackage has a property “files” which is a collection based on a hasFile property in the underlying object store.
* When deserialized, Tag/Value, JSON, YAML, XML, and Spreadsheets will store any files contained by a package as a hasFile property in the underlying store and not as a CONTAINS relationship
* If a document states a CONTAINS relationship between a package and a file, it will be stored as a relationship (possibly duplicating information in hasFile)

I would assert that a Package with a File listed in the hasFiles property is semantically the same as Package has a CONTAINS relationship with File. This leads to the inconsistency described in the problem statement.

There are 3 alternatives I’ve looked at to resolve the inconsistency:

A. Leave the tools as is and live with the inconsistency.
B. Translate all CONTAINS relationships to a hasFiles property in the model store when deserializing.
C. Translate all hasFiles properties into CONTAINS relationships when deserializing and translate back to the hasFiles property in the JSON/YAML/XML formats (not in the Tag/Value or RDF formats)

I’ve taken approach C in large part due to the SPDX 3.0 discussions where we plan to allow more compact serializations and convert to Relationships when deserializing. If nothing else, this would be a good experiment to see how this approach works in practice.

Approach C has the following implications on the Java-Tools:

* Runtime model:
  * In the runtime model, any addition to the files collection for a package will automatically create a CONTAINS relationship between the package and the file
  * In the runtime model, any modification to the CONTAINS relationships between a package and file will be reflected in the files collection
  * There is no longer any possibility of duplication or inconsistencies between the CONTAINS relationship and the files collection for a package.
* Tag/Value:
  * When deserializing, a CONTAINS relationship between the package and the file will be created based on the position of the files and packages per the spec
  * A check will be made to make sure we don’t add any duplicate CONTAINS relationships
  * Files serialized will always include the CONTAINS relationships in addition to maintaining the proper relative positions of the packages and files
  * Note: I could remove these relationships in the serialization since they are redundant with the position; however, I personally think the resultant tag/value is clearer having the additional relationships. Feedback is welcome on this point.
* JSON/XML/YAML:
  * When deserializing, a CONTAINS relationship between the package and the file will be created for every element of the hasFiles list.
  * Files serialized will always use the hasFiles property for any CONTAINS relationship and not include the CONTAINS relationships.
* RDF/XML:
  * When deserializing, a CONTAINS relationship between the package and the file will be created for every <Package,hasFile,File> triple
  * When serializing, the CONTAINS relationships will be serialized.
  * Note: I’m quite interested in feedback if this translation to a Relationship makes it harder for semantic reasoners or other implementations using RDF

Thanks for reading through all this! Let me know any concerns, thoughts, questions.

Gary

-------------------------------------------------
Gary O'Neall
Principal Consultant
Source Auditor Inc.
Mobile: 408.805.0586
Email: [email protected]

CONFIDENTIALITY NOTE: The information transmitted, including attachments, is intended only for the person(s) or entity to which it is addressed and may contain confidential and/or privileged material. Any review, re-transmission, dissemination or other use of, or taking of any action in reliance upon this information by persons or entities other than the intended recipient is prohibited.
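
As a rough illustration of the runtime model described under approach C in the original proposal, the following Python sketch keeps CONTAINS relationships as the only stored form and exposes a package's files collection as a view over them. Everything here is an assumption made for illustration: `ModelStore`, `Package`, `to_json_package` and `from_json_package` are hypothetical names, not the API of the SPDX Java tools, and the JSON shape is reduced to just `SPDXID` and `hasFiles`.

```python
class ModelStore:
    """Hypothetical store that holds relationships as (source, type, target)."""

    def __init__(self):
        # A set rules out duplicate CONTAINS relationships by construction.
        self.relationships = set()

    def add_relationship(self, source, rel_type, target):
        self.relationships.add((source, rel_type, target))


class Package:
    """Hypothetical package whose files collection is derived, not stored."""

    def __init__(self, store, spdx_id):
        self.store = store
        self.spdx_id = spdx_id

    def add_file(self, file_id):
        # Adding to the files collection creates a CONTAINS relationship.
        self.store.add_relationship(self.spdx_id, "CONTAINS", file_id)

    @property
    def files(self):
        # Any CONTAINS relationship in the store shows up in the collection.
        return sorted(
            target
            for (source, rel_type, target) in self.store.relationships
            if source == self.spdx_id and rel_type == "CONTAINS"
        )


def to_json_package(pkg):
    # JSON/YAML/XML style output: emit hasFiles instead of relationships.
    return {"SPDXID": pkg.spdx_id, "hasFiles": pkg.files}


def from_json_package(store, data):
    # Deserializing: every hasFiles entry becomes a CONTAINS relationship.
    pkg = Package(store, data["SPDXID"])
    for file_id in data.get("hasFiles", []):
        pkg.add_file(file_id)
    return pkg


store = ModelStore()
pkg = Package(store, "SPDXRef-Package")
pkg.add_file("SPDXRef-FileA")
pkg.add_file("SPDXRef-FileA")  # duplicate, ignored by the set
store.add_relationship("SPDXRef-Package", "CONTAINS", "SPDXRef-FileB")

serialized = to_json_package(pkg)
restored = from_json_package(ModelStore(), serialized)
print(serialized)
```

Because there is a single source of truth, the collection and the relationships cannot disagree, and a round trip through the compact `hasFiles` form restores the same set of CONTAINS relationships.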
