Thanks, Gary, for the helpful commentary! Looks like I will be making several changes based on your suggestions.
I have indeed seen the class diagram; I found it useful. But again, not a one-to-one correspondence to how a relational DB would look (classic problem, I know...) hence my questions. I did look briefly at triplestores. I'll admit, I put off looking into them in detail since I'm very familiar with relational DBs and very unfamiliar with the technology surrounding RDF :) Certainly I'll have to get familiar with it, if only to properly support SPDX document generation in RDF format. For the record: the link I provided below was broken for a time as I was moving things around; it has since been corrected. Not to mention, it actually works with a specific DBMS now. (I went with Postgres specifically for the CHECK constraint support, and overall unsurprising behavior compared to MySQL :) I have also added (+ am adding) some additional documentation on some of the quirks of this schema, in case anyone finds it useful. Tom On Wed, May 27, 2015 at 09:07:48PM -0700, Gary O'Neall wrote: > Hi Tom, > > I agree it would be a challenge to store the SPDX data in a relational DB. > The spec was designed in an object oriented fashion and it can be a > challenge to map objects to relations (or at least I find it to be a > challenge). > > For me, it is easier to understand the spec with a visual. If you haven't > already, take a look at the class diagram: > http://wiki.spdx.org/view/Technical_Team/Model_2_0 > > Some responses inline below. > > > -----Original Message----- > > From: [email protected] [mailto:spdx-tech- > > [email protected]] On Behalf Of Thomas T Gurney > > Sent: Friday, May 22, 2015 6:33 PM > > To: [email protected] > > Subject: SPDX 2.0 database schema > > > > Hey all, > > > > Tom Gurney here, undergrad student from the open source research lab at > > University of Nebraska Omaha. > > > > I have been digging into SPDX 2.0 since its official release. In trying > > to build a relational database that will store SPDX 2.0 documents, I've > > realized it's a lot tougher to store 2.0 data in a relational form than > > 1.2 data. A _lot_ tougher. (At least, from my limited perspective, it > > is.) > > > > Here's my attempt at a schema (beware, I threw it together in an > > evening): > > https://github.com/ttgurney/spdx2.0-schema/blob/master/spdx2_schema.sql > > It's like SQL pseudocode in that no actual DBMS will accept it, but it > > should make sense. > > > > So here's what's thrown me for a loop, and resulted in some odd design > > choices: > > > > - SPDX identifiers that can be associated with a file, document or > > package, but > > must be unique within a document > [Gary] > [Gary] Correct > > - An SPDX document can describe files that are not part of any package > > (and > > it can contain multiple packages too? Not sure I'm reading the spec > > right) > [Gary] Correct > > > - Relationships between identifiers > [Gary] I think of it as relationships between SpdxElements which have > identifiers as a property, but having the relationship between ID's makes > sense to me for a relational DB. > > You can have external references to identifiers as well - they are made > unique by the use of the SPDX Document Namespace, so including the document > namespace or the document ID in the relationships table for the left and > right relationships would allow the database to properly map external > references and hold multiple SPDX documents. > > > - License expression syntax (I don't see a way to sensibly accomodate > > this > > in a relational DB) > [Gary] It wasn't easy to write in Java ;) > You could implement them as sets and operators (similar to the object > model), but it would be rather complex > > - Multiple checksum types supported (I stuck to just SHA1 for the above > > schema) > [Gary] If you want it highly normalized, you could create a separate table > which checksums and have a reference (foreign key) to the checksum table > from the file. The checksum table would have a value and algorithm columns > > - What can we say about a file from its checksum? If two files have the > > same > > checksum, can we say that they are the same file in every aspect, and > > thereby > > carry with them all the same SPDX metadata, regardless of what > > package each > > is in? I'm not sure. > [Gary] This has been debated and there are different opinions on this. As > far as the spec goes, we include the file name along with the checksum when > calculating the validation. My personal view is that the checksum states > the content is extremely likely to be the same (depending on the checksum > algorithm, I may even say the content is the same), but the placement of the > file itself may be relevant to how it is used and may impact the metadata. > > > > Has anyone run into similar difficulties? Ideas on how to overcome > > them? Or is the idea of using a relational database to store this type > > of data absolutely silly? Many thanks in advance. > [Gary] Not silly, but difficult. Our commercial application store license, > package, and file data in a RDMS and translates to/from SPDX without storing > any data outside the DB. That being said, we don't have to worry about all > possible SPDX documents - only the ones likely to be used in our > application. > > An interesting thing to research would be using a storage facility for RDF > (e.g. triplestore) since the RDF schema has already been created. > > > > Tom > > _______________________________________________ > > Spdx-tech mailing list > > [email protected] > > https://lists.spdx.org/mailman/listinfo/spdx-tech > > _______________________________________________ Spdx-tech mailing list [email protected] https://lists.spdx.org/mailman/listinfo/spdx-tech
