Hi Sam,
Thanks for sharing your own feedback on license templatization / regexes.
Here's mine.
a) have you used the existing markup for matching purposes?
No.
i) if no, why not?
In a nutshell: partly because our development work predates the SPDX
templatization rollout, and partly because of performance in the context of our
internal use case of scanning every single open source file (not just a handful
of applications).
Details:
Black Duck has a corpus of license variants extracted from our Knowledge Base
of around half a billion unique open source source/text and binary files.
Several years ago, prior to SPDX license templatization, we went through
multiple iterations of grouping license text variants, as well as license
name/nickname variants.
Our groupings were based on applying similarity algorithms, followed by human
review. Our methodology was in place for some time before SPDX rolled out its
templates / regular expressions.
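To illustrate the general idea of similarity-based grouping, here is a minimal Python sketch. It is not our production method: the use of difflib, the greedy grouping strategy, the 0.9 threshold, and the names normalize and group_variants are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def normalize(text):
    # Collapse whitespace and case so layout differences do not affect the score.
    return " ".join(text.lower().split())


def group_variants(texts, threshold=0.9):
    """Greedy grouping: a text joins the first group whose representative
    (the group's first member) is at least `threshold` similar to it."""
    groups = []
    for text in texts:
        norm = normalize(text)
        for group in groups:
            ratio = SequenceMatcher(None, normalize(group[0]), norm).ratio()
            if ratio >= threshold:
                group.append(text)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append([text])
    return groups


variants = [
    "distributed in the hope that it will be useful",
    "distributed in the hope that they will be useful",
    "Permission is hereby granted, free of charge, to any person",
]
groups = group_variants(variants)
```

In a workflow like ours, the groups produced by a step of this kind would then go to human review rather than being accepted automatically.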
We have encountered variations, such as the street address of an organization
or typos / word substitutions, that SPDX license templates might not cover.
This is one of the reasons we haven't yet gone through the exercise of checking
which license variants the SPDX templates would fail to 'match' to the license
IDs we've grouped them under.
We use our license scanner to scan (and rescan) every single open source file
to populate our Knowledge Base of file-level license data. Our scanner uses
multiple techniques to discover license references, but does not use regular
expressions because of performance concerns.
Because Black Duck does not provide legal advice, consumers of our tools are
able to review the license text which our tools highlight in order to make a final
determination. This fits well with the SPDX concept of separating discovered
and concluded information.
Outlook:
SPDX license templates / regexes aren't as useful or efficient as the other
matching techniques we have when automatically bulk scanning large codebases
to determine ‘LicenseFoundInFile’.
But when a human is in the loop producing a final SPDX Document with Concluded
License, SPDX license templates / regexes could be useful to focus legal review
on deviations which may or may not be significant.
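As a rough sketch of what "focusing legal review on deviations" could look like, the Python below compares a found text against a reference text word by word and reports only the spans that differ. The function name deviations and the use of difflib are assumptions for illustration; the sample text is the BSD-2-Clause variation from Sam's message.

```python
from difflib import SequenceMatcher


def deviations(reference, found):
    """Return (operation, reference_words, found_words) for each differing span."""
    ref_words = reference.split()
    found_words = found.split()
    matcher = SequenceMatcher(None, ref_words, found_words, autojunk=False)
    return [
        (op, " ".join(ref_words[i1:i2]), " ".join(found_words[j1:j2]))
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


reference = ("Redistributions of source code must retain the above "
             "copyright notice, this list of")
found = ("Redistributions of source code must retain the above "
         "copyright notice unmodified, this list of")

for op, ref_span, found_span in deviations(reference, found):
    print(op, repr(ref_span), "->", repr(found_span))
```

A reviewer would then only need to judge the reported span (here, the added word "unmodified") rather than re-read the whole license text.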
From: [email protected] on behalf of Sam Ellis
Date: Wednesday, September 16, 2015 at 7:00 AM
To: J Lovejoy, SPDX-legal
Cc: [email protected]
Subject: RE: SPDX Legal call this Thursday
3) License matching templates/markup:
We have a task to add markup to some of the standard headers and have also had
input to add/edit markup on existing licenses. As a result of the latter, it
has been raised that perhaps the markup could be improved. Before adding more
markup (to standard headers, license text or both), it seemed prudent to start
a discussion as to whether the existing markup is effective. Please ponder the
following questions:
a) have you used the existing markup for matching purposes?
i) if no, why not?
ii) if yes, has it been helpful/effective? Could it be
improved, and if so, how? (this will likely involve putting forward a proposal
for review)
Please also add thoughts (preferably in a new section or with your initials if
added to others) here: http://wiki.spdx.org/view/Legal_Team/Templatizing
I will share a few points from my experience in templatization. I currently use
a different templatization syntax that predates SPDX, but the principle of
using regular expressions embedded within the license text is similar.
The major barrier to me adopting the SPDX templates is insufficient
templatization within the existing licenses. The SPDX templates currently
encode what I perceive to be the ‘official’ variations, i.e. organization name,
person name, product name etc. However, real-world licenses contain many minor
variations that may be inconsequential from a legal perspective and therefore
do not warrant being split out as separate licenses. Here is an
example from the GPL-3.0 notice where it is common to see two variations in one
of the sentences:
distributed in the hope that it will be useful
distributed in the hope that they will be useful
The example above is fairly uncontroversial, I would hope. However, there are
plenty of other examples that border on having a legal impact. For example, in
these two BSD-2-Clause variations it is necessary to consider whether the
additional word constitutes an acceptable minor variation or warrants a
different classification altogether:
Redistributions of source code must retain the above copyright notice, this
list of…
Redistributions of source code must retain the above copyright notice
unmodified, this list of…
It is the grey cases like these that make expanding the use of templating
difficult. Inevitably it leads to having to make some judgements about the
impact of a particular word or phrase on the legal interpretation, something
that I am aware SPDX tries to avoid.
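To make the two examples above concrete, here is a minimal Python sketch of how such variations might be encoded as regular expressions embedded in the license text. The pattern names, the whitespace normalization, and the helper matches are assumptions for illustration, not SPDX template syntax. Whether the optional "unmodified" alternative should be accepted at all is exactly the judgement call described above.

```python
import re


def normalize(text):
    # Collapse line wraps and repeated spaces before matching.
    return " ".join(text.split())


# GPL-3.0 notice: accept either pronoun.
GPL_NOTICE = re.compile(r"distributed in the hope that (?:it|they) will be useful")

# BSD-2-Clause: treat the extra word as an optional group.
BSD_CLAUSE = re.compile(
    r"Redistributions of source code must retain the above copyright notice"
    r"(?: unmodified)?, this list of"
)


def matches(pattern, text):
    return pattern.search(normalize(text)) is not None
```

For example, matches(GPL_NOTICE, "distributed in the hope that they will be useful") is true, and removing the optional group from BSD_CLAUSE would make the "unmodified" variant fall out as a separate, unmatched text.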
Whether it is worth templating all the cases like these primarily depends on
the goals of the SPDX templates. If they are for human use to see what official
variations are permitted, then they are not necessary. On the other hand, if
they are to be used by automated license scanning tools, then covering these
cases is essential in order to have a tool that works effectively on real-world
code. So I think an important point is to gain clarity on the purpose of the
templates.
In terms of the current application of the templates, I have a technical
concern over the use of unbounded regular expressions, for example:
<<var;name=copyrightHolderAsIs;original=THE COPYRIGHT HOLDERS AND CONTRIBUTORS;match=.+>>
This is unbounded because it will match any number of characters for the
copyrightHolderAsIs field. The practical consequence of this is that regular
expression matching can explode in terms of time. I don’t have a concrete
example to hand, but my own experience with using the same unbounded regular
expressions on real-world licenses is that I have seen it take minutes just to
process one regular expression on a single file, and this does not scale well
when there are millions of files to process. Clearly, the English language
imposes no maximum length on a copyright statement. Using
an unbounded regular expression is therefore correct in theory but difficult to
use in practice. I have had to use size bounded regular expressions in order to
have a scanning tool that will complete in a reasonable time. The problem in
switching to bounded regular expressions is in deciding on what is an
acceptable upper bound on the size, and this can really only be judged by
experimentation against real-world licenses, and does then require on-going
tweaking as new license variations are discovered.
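The switch from unbounded to bounded matching can be sketched in Python as follows. This is an illustration under assumptions, not my actual tooling: the function template_to_regex, the simplified parsing of the var markup, and the bound of 40 characters are all hypothetical, and a real bound would have to come from experimentation as described above.

```python
import re

# Simplified parser for <<var;name=...;original=...;match=...>> fields.
VAR = re.compile(
    r"<<var;name=(?P<name>[^;]+);original=(?P<original>[^;]*);match=(?P<match>.+?)>>"
)


def template_to_regex(template, max_len=None):
    """Compile a template to a regex; if max_len is given, rewrite the
    unbounded patterns .+ and .* as size-bounded equivalents."""
    parts = []
    pos = 0
    for m in VAR.finditer(template):
        # Literal license text between fields is matched exactly.
        parts.append(re.escape(template[pos:m.start()]))
        pattern = m.group("match")
        if max_len is not None:
            if pattern == ".+":
                pattern = ".{1,%d}" % max_len
            elif pattern == ".*":
                pattern = ".{0,%d}" % max_len
        parts.append("(?:%s)" % pattern)
        pos = m.end()
    parts.append(re.escape(template[pos:]))
    return re.compile("".join(parts), re.DOTALL)


TEMPLATE = (
    "THIS SOFTWARE IS PROVIDED BY "
    "<<var;name=copyrightHolderAsIs;original=THE COPYRIGHT HOLDERS AND "
    "CONTRIBUTORS;match=.+>>"
    ' "AS IS"'
)

unbounded = template_to_regex(TEMPLATE)
bounded = template_to_regex(TEMPLATE, max_len=40)
```

With the bound in place, a field longer than 40 characters simply fails to match instead of letting the engine scan arbitrarily far, which is the trade-off between correctness in theory and usability in practice.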
Neither of these is a problem with templatization per se; both have more to do
with the extent to which, and the way in which, templates are currently applied.