Thanks, Gary, for the helpful commentary! Looks like I will be making several
changes based on your suggestions.

I have indeed seen the class diagram; I found it useful. But again, not a
one-to-one correspondence to how a relational DB would look (classic problem,
I know...) hence my questions.

I did look briefly at triplestores. I'll admit, I put off looking into them in
detail since I'm very familiar with relational DBs and very unfamiliar with the
technology surrounding RDF :) Certainly I'll have to get familiar with it, if
only to properly support SPDX document generation in RDF format.

For the record: the link I provided below was broken for a time as I was moving
things around; it has since been corrected. Not to mention, it actually works
with a specific DBMS now. (I went with Postgres specifically for the CHECK
constraint support, and overall unsurprising behavior compared to MySQL :)

I have also added (+ am adding) some additional documentation on some of the
quirks of this schema, in case anyone finds it useful.

Tom

On Wed, May 27, 2015 at 09:07:48PM -0700, Gary O'Neall wrote:
> Hi Tom,
> 
> I agree it would be a challenge to store the SPDX data in a relational DB.
> The spec was designed in an object oriented fashion and it can be a
> challenge to map objects to relations (or at least I find it to be a
> challenge). 
> 
> For me, it is easier to understand the spec with a visual.  If you haven't
> already, take a look at the class diagram:
> http://wiki.spdx.org/view/Technical_Team/Model_2_0
> 
> Some responses inline below.
> 
> > -----Original Message-----
> > From: [email protected] [mailto:spdx-tech-
> > [email protected]] On Behalf Of Thomas T Gurney
> > Sent: Friday, May 22, 2015 6:33 PM
> > To: [email protected]
> > Subject: SPDX 2.0 database schema
> > 
> > Hey all,
> > 
> > Tom Gurney here, undergrad student from the open source research lab at
> > University of Nebraska Omaha.
> > 
> > I have been digging into SPDX 2.0 since its official release. In trying
> > to build a relational database that will store SPDX 2.0 documents, I've
> > realized it's a lot tougher to store 2.0 data in a relational form than
> > 1.2 data. A _lot_ tougher. (At least, from my limited perspective, it
> > is.)
> > 
> > Here's my attempt at a schema (beware, I threw it together in an
> > evening):
> > https://github.com/ttgurney/spdx2.0-schema/blob/master/spdx2_schema.sql
> > It's like SQL pseudocode in that no actual DBMS will accept it, but it
> > should make sense.
> > 
> > So here's what's thrown me for a loop, and resulted in some odd design
> > choices:
> > 
> > - SPDX identifiers that can be associated with a file, document or
> >   package, but must be unique within a document
> [Gary] Correct
> > - An SPDX document can describe files that are not part of any package
> >   (and it can contain multiple packages too? Not sure I'm reading the
> >   spec right)
> [Gary] Correct
> 
> > - Relationships between identifiers
> [Gary] I think of it as relationships between SpdxElements which have
> identifiers as a property, but having the relationship between IDs makes
> sense to me for a relational DB.
> 
> You can have external references to identifiers as well - they are made
> unique by the use of the SPDX Document Namespace, so including the document
> namespace or the document ID in the relationships table for the left and
> right relationships would allow the database to properly map external
> references and hold multiple SPDX documents. 
> 
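A minimal sketch of the relationships table described above, assuming SQLite
through Python's sqlite3 module (rather than Postgres) so that it runs
standalone. The table names, column names, and example namespaces are
hypothetical and are not taken from the linked schema; the point is only that
keying each side of a relationship on (document namespace, SPDX ID) lets the
same ID appear in more than one document.

```python
import sqlite3

# In-memory database; SQLite only enforces foreign keys when asked to.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE identifiers (
    document_namespace TEXT NOT NULL,
    spdx_id            TEXT NOT NULL,
    -- An SPDX ID is unique only within its own document.
    PRIMARY KEY (document_namespace, spdx_id)
);
CREATE TABLE relationships (
    left_namespace    TEXT NOT NULL,
    left_spdx_id      TEXT NOT NULL,
    relationship_type TEXT NOT NULL,
    right_namespace   TEXT NOT NULL,
    right_spdx_id     TEXT NOT NULL,
    FOREIGN KEY (left_namespace, left_spdx_id)
        REFERENCES identifiers (document_namespace, spdx_id),
    FOREIGN KEY (right_namespace, right_spdx_id)
        REFERENCES identifiers (document_namespace, spdx_id)
);
""")

# Hypothetical namespaces for two separate SPDX documents.
NS_A = "http://example.org/spdxdocs/doc-a"
NS_B = "http://example.org/spdxdocs/doc-b"

# The same SPDX ID can be stored once per document without colliding.
conn.executemany(
    "INSERT INTO identifiers VALUES (?, ?)",
    [(NS_A, "SPDXRef-File1"), (NS_B, "SPDXRef-File1")],
)

# A relationship whose right-hand side lives in another document.
# The relationship type here is only an illustrative value.
conn.execute(
    "INSERT INTO relationships VALUES (?, ?, ?, ?, ?)",
    (NS_A, "SPDXRef-File1", "COPY_OF", NS_B, "SPDXRef-File1"),
)

for row in conn.execute("SELECT * FROM relationships"):
    print(row)
```

Using the document ID instead of the namespace string as the first key column
would work the same way and keep the rows shorter.
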
> > - License expression syntax (I don't see a way to sensibly accommodate
> >   this in a relational DB)
> [Gary] It wasn't easy to write in Java ;)
> You could implement them as sets and operators (similar to the object
> model), but it would be rather complex
> > - Multiple checksum types supported (I stuck to just SHA1 for the above
> >   schema)
> [Gary] If you want it highly normalized, you could create a separate table
> which holds checksums and have a reference (foreign key) to the checksum
> table from the file.  The checksum table would have value and algorithm
> columns.
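A rough sketch of the separate checksum table, again assuming SQLite through
Python's sqlite3 module so that it runs standalone. All table and column names
are hypothetical. Two things here are assumptions rather than what was
suggested above: the foreign key points from the checksum row to the file, so
that one file can carry several checksums, and the list of allowed algorithms
in the CHECK constraint is only illustrative.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite

conn.executescript("""
CREATE TABLE files (
    file_id   INTEGER PRIMARY KEY,
    file_name TEXT NOT NULL
);
CREATE TABLE checksums (
    file_id   INTEGER NOT NULL REFERENCES files (file_id),
    algorithm TEXT NOT NULL,
    value     TEXT NOT NULL,
    -- At most one checksum per algorithm for each file.
    PRIMARY KEY (file_id, algorithm),
    -- Assumed algorithm list; each is paired with its hex digest length.
    CHECK (
        (algorithm = 'SHA1'   AND length(value) = 40) OR
        (algorithm = 'SHA256' AND length(value) = 64) OR
        (algorithm = 'MD5'    AND length(value) = 32)
    )
);
""")

# Placeholder file name and content, for illustration only.
content = b"example file content\n"
file_id = conn.execute(
    "INSERT INTO files (file_name) VALUES (?)", ("./src/example.c",)
).lastrowid

# Store two checksums of different types for the same file.
for algorithm, digest in (
    ("SHA1", hashlib.sha1(content).hexdigest()),
    ("SHA256", hashlib.sha256(content).hexdigest()),
):
    conn.execute(
        "INSERT INTO checksums VALUES (?, ?, ?)", (file_id, algorithm, digest)
    )

for row in conn.execute("SELECT algorithm, value FROM checksums ORDER BY algorithm"):
    print(row)
```

An unknown algorithm, or a value of the wrong length, is rejected by the CHECK
constraint at insert time instead of ending up in the table.
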
> > - What can we say about a file from its checksum? If two files have the
> >   same checksum, can we say that they are the same file in every aspect,
> >   and thereby carry with them all the same SPDX metadata, regardless of
> >   what package each is in? I'm not sure.
> [Gary] This has been debated and there are different opinions on this.  As
> far as the spec goes, we include the file name along with the checksum when
> calculating the validation.  My personal view is that the checksum states
> the content is extremely likely to be the same (depending on the checksum
> algorithm, I may even say the content is the same), but the placement of the
> file itself may be relevant to how it is used and may impact the metadata.
> > 
> > Has anyone run into similar difficulties? Ideas on how to overcome
> > them? Or is the idea of using a relational database to store this type
> > of data absolutely silly? Many thanks in advance.
> [Gary] Not silly, but difficult. Our commercial application stores license,
> package, and file data in an RDBMS and translates to/from SPDX without
> storing any data outside the DB.  That being said, we don't have to worry
> about all possible SPDX documents - only the ones likely to be used in our
> application.
> 
> An interesting thing to research would be using a storage facility for RDF
> (e.g. triplestore) since the RDF schema has already been created.
> > 
> > Tom
> > _______________________________________________
> > Spdx-tech mailing list
> > [email protected]
> > https://lists.spdx.org/mailman/listinfo/spdx-tech
> 
> 