Hi Tom,

I agree it would be a challenge to store the SPDX data in a relational DB.
The spec was designed in an object-oriented fashion, and it can be a
challenge to map objects to relations (or at least I find it to be a
challenge).

For me, it is easier to understand the spec with a visual.  If you haven't
already, take a look at the class diagram:
http://wiki.spdx.org/view/Technical_Team/Model_2_0

Some responses inline below.

> -----Original Message-----
> From: [email protected] [mailto:spdx-tech-
> [email protected]] On Behalf Of Thomas T Gurney
> Sent: Friday, May 22, 2015 6:33 PM
> To: [email protected]
> Subject: SPDX 2.0 database schema
> 
> Hey all,
> 
> Tom Gurney here, undergrad student from the open source research lab at
> University of Nebraska Omaha.
> 
> I have been digging into SPDX 2.0 since its official release. In trying
> to build a relational database that will store SPDX 2.0 documents, I've
> realized it's a lot tougher to store 2.0 data in a relational form than
> 1.2 data. A _lot_ tougher. (At least, from my limited perspective, it
> is.)
> 
> Here's my attempt at a schema (beware, I threw it together in an
> evening):
> https://github.com/ttgurney/spdx2.0-schema/blob/master/spdx2_schema.sql
> It's like SQL pseudocode in that no actual DBMS will accept it, but it
> should make sense.
> 
> So here's what's thrown me for a loop, and resulted in some odd design
> choices:
> 
> - SPDX identifiers that can be associated with a file, document or
> package, but
>   must be unique within a document
[Gary] Correct
> - An SPDX document can describe files that are not part of any package
> (and
>   it can contain multiple packages too? Not sure I'm reading the spec
> right)
[Gary] Correct

> - Relationships between identifiers
[Gary] I think of it as relationships between SpdxElements, which have
identifiers as a property, but having the relationship between IDs makes
sense to me for a relational DB.

You can have external references to identifiers as well. They are made
unique by the SPDX Document Namespace, so including the document namespace
(or the document ID) for both the left and right sides of each row in the
relationships table would allow the database to properly map external
references and hold multiple SPDX documents.
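
To make that concrete, here is a minimal sketch using Python's built-in
sqlite3 module. The table names, column names, and example.org namespaces
are all hypothetical (they come from neither the spec nor Tom's schema),
and relationship types are stored as plain text.

```python
import sqlite3

# Hypothetical document namespaces, used only for illustration.
DOC_A = "http://example.org/spdxdocs/doc-a"
DOC_B = "http://example.org/spdxdocs/doc-b"

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE element (
    doc_namespace TEXT NOT NULL,
    spdx_id       TEXT NOT NULL,
    kind          TEXT NOT NULL,              -- 'document', 'package' or 'file'
    PRIMARY KEY (doc_namespace, spdx_id)      -- IDs are unique per document only
);
CREATE TABLE relationship (
    left_namespace  TEXT NOT NULL,
    left_id         TEXT NOT NULL,
    rel_type        TEXT NOT NULL,
    right_namespace TEXT NOT NULL,            -- may name another document
    right_id        TEXT NOT NULL
);
""")

# The same SPDX ID appears in two documents without colliding.
conn.executemany("INSERT INTO element VALUES (?, ?, ?)", [
    (DOC_A, "SPDXRef-Package", "package"),
    (DOC_A, "SPDXRef-File1", "file"),
    (DOC_B, "SPDXRef-Package", "package"),
])
conn.executemany("INSERT INTO relationship VALUES (?, ?, ?, ?, ?)", [
    (DOC_A, "SPDXRef-Package", "CONTAINS", DOC_A, "SPDXRef-File1"),
    (DOC_A, "SPDXRef-Package", "STATIC_LINK", DOC_B, "SPDXRef-Package"),
])


def related(conn, namespace, spdx_id):
    """Return (rel_type, right_namespace, right_id) rows for one element."""
    rows = conn.execute(
        "SELECT rel_type, right_namespace, right_id FROM relationship "
        "WHERE left_namespace = ? AND left_id = ? ORDER BY rel_type",
        (namespace, spdx_id),
    )
    return rows.fetchall()


# Relationships whose right side lives in a different document.
external = [r for r in related(conn, DOC_A, "SPDXRef-Package") if r[1] != DOC_A]
```

The relationship table deliberately has no foreign key on the right side,
on the assumption that a referenced external document may not be loaded
into the database yet.
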

> - License expression syntax (I don't see a way to sensibly accommodate
> this
>   in a relational DB)
[Gary] It wasn't easy to write in Java ;)
You could implement them as sets and operators (similar to the object
model), but it would be rather complex.
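
One possible shape for this, sketched with Python's sqlite3 module: each
row is a node in an expression tree, either an operator (AND/OR) with
child rows or a leaf holding a license ID. The license_node table and the
render() helper are hypothetical names, and the sketch covers only AND
and OR, not the rest of the expression syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE license_node (
    id         INTEGER PRIMARY KEY,
    parent_id  INTEGER REFERENCES license_node(id),   -- NULL for the root
    operator   TEXT CHECK (operator IN ('AND', 'OR')),
    license_id TEXT,
    -- every row is either an operator node or a leaf, never both
    CHECK ((operator IS NULL) <> (license_id IS NULL))
)
""")

# Rows for the expression: (LGPL-2.1 OR MIT) AND BSD-3-Clause
conn.executemany("INSERT INTO license_node VALUES (?, ?, ?, ?)", [
    (1, None, "AND", None),
    (2, 1, "OR", None),
    (3, 2, None, "LGPL-2.1"),
    (4, 2, None, "MIT"),
    (5, 1, None, "BSD-3-Clause"),
])


def render(conn, node_id, top=True):
    """Rebuild the expression text for the subtree rooted at node_id."""
    operator, license_id = conn.execute(
        "SELECT operator, license_id FROM license_node WHERE id = ?",
        (node_id,),
    ).fetchone()
    if license_id is not None:
        return license_id
    children = conn.execute(
        "SELECT id FROM license_node WHERE parent_id = ? ORDER BY id",
        (node_id,),
    ).fetchall()
    text = f" {operator} ".join(
        render(conn, child_id, top=False) for (child_id,) in children
    )
    # Parenthesize nested sub-expressions, but not the whole expression.
    return text if top else f"({text})"


expression = render(conn, 1)
```

Even this small version needs one query per node to read an expression
back, which gives a feel for the complexity.
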
> - Multiple checksum types supported (I stuck to just SHA1 for the above
> schema)
[Gary] If you want it highly normalized, you could create a separate table
for checksums and have a reference (foreign key) to the checksum table
from the file.  The checksum table would have value and algorithm columns.
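
A minimal sketch of such a table, again with Python's sqlite3 module.
Table and column names are hypothetical. As a variation on the
description above, the foreign key here sits on the checksum row and
points at the file, an assumption made so that one file can carry several
checksums.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE file (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE checksum (
    file_id   INTEGER NOT NULL REFERENCES file(id),
    algorithm TEXT NOT NULL,
    value     TEXT NOT NULL,
    PRIMARY KEY (file_id, algorithm)   -- one value per algorithm per file
);
""")

# Made-up file name and content, used only to produce real digests.
content = b"example file content\n"
file_id = conn.execute(
    "INSERT INTO file (name) VALUES (?)", ("./src/example.c",)
).lastrowid

for algorithm in ("SHA1", "SHA256", "MD5"):
    digest = hashlib.new(algorithm.lower(), content).hexdigest()
    conn.execute(
        "INSERT INTO checksum VALUES (?, ?, ?)", (file_id, algorithm, digest)
    )


def checksums_for(conn, file_id):
    """Return {algorithm: hex digest} for one file."""
    rows = conn.execute(
        "SELECT algorithm, value FROM checksum WHERE file_id = ?", (file_id,)
    )
    return dict(rows.fetchall())


sums = checksums_for(conn, file_id)
```
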
> - What can we say about a file from its checksum? If two files have the
> same
>   checksum, can we say that they are the same file in every aspect, and
> thereby
>   carry with them all the same SPDX metadata, regardless of what
> package each
>   is in? I'm not sure.
[Gary] This has been debated and there are different opinions on this.  As
far as the spec goes, we include the file name along with the checksum when
calculating the validation.  My personal view is that the checksum states
the content is extremely likely to be the same (depending on the checksum
algorithm, I may even say the content is the same), but the placement of the
file itself may be relevant to how it is used and may impact the metadata.
> 
> Has anyone run into similar difficulties? Ideas on how to overcome
> them? Or is the idea of using a relational database to store this type
> of data absolutely silly? Many thanks in advance.
[Gary] Not silly, but difficult. Our commercial application stores license,
package, and file data in an RDBMS and translates to/from SPDX without
storing any data outside the DB.  That being said, we don't have to worry
about all possible SPDX documents - only the ones likely to be used in our
application.

An interesting thing to research would be using a storage facility for RDF
(e.g. a triplestore), since the RDF schema has already been created.
> 
> Tom
> _______________________________________________
> Spdx-tech mailing list
> [email protected]
> https://lists.spdx.org/mailman/listinfo/spdx-tech
