Re: SPDX Legal call this Thursday

2015-09-17 Thread Bill Schineller
Hi Sam,
   Thanks for sharing your own feedback on license templatization / regexes.
   Here's mine.

a) have you used the existing markup for matching purposes?
No.
i) if no, why not?
In a nutshell, partly due to timing of our dev efforts ahead of SPDX 
templatization rollout, partly due to performance in context of our internal 
Use Case of scanning every single open source file (not just a handful of 
applications).

Details:
Black Duck has a corpus of license variants extracted from our Knowledge Base 
of around half a billion unique open source source/text and binary files.
Several years ago, prior to SPDX license templatization, we went through 
multiple iterations of grouping license text variants, as well as license 
name/nickname variants.
Our groupings were based on applying similarity algorithms, followed by human 
review.  Our methodology has been in place for some time prior to the templates 
/ regular expressions subsequently rolled out by SPDX.
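
As a rough illustration of the "similarity algorithm, then human review" idea above (not Black Duck's actual method, which is not described here), a minimal Python sketch might group variants greedily using the standard library's difflib. The function names and the 0.9 threshold are assumptions made for illustration only:

```python
from difflib import SequenceMatcher


def normalize(text):
    # Lowercase and collapse whitespace so layout differences do not count.
    return " ".join(text.lower().split())


def group_variants(texts, threshold=0.9):
    """Greedy grouping: a text joins the first group whose first member
    is at least `threshold` similar; otherwise it starts a new group."""
    groups = []
    for text in texts:
        norm = normalize(text)
        for group in groups:
            ratio = SequenceMatcher(None, normalize(group[0]), norm).ratio()
            if ratio >= threshold:
                group.append(text)
                break
        else:
            groups.append([text])
    return groups


variants = [
    "distributed in the hope that it will be useful",
    "distributed in the hope that they will be useful",
    "Permission is hereby granted, free of charge, to any person",
]
groups = group_variants(variants)  # groups would then go to human review
```

The threshold is the part a human reviewer would end up tuning: set it too low and legally distinct texts merge, set it too high and trivial variants split apart.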

Variations we've encountered, such as the street address of an organization, or 
typos / word substitutions that SPDX license templates might not have covered, 
are some of the reasons we haven't yet gone through the exercise to see which 
license variants the SPDX templates might not 'match' to the license IDs we've 
grouped them under.

We use our license scanner to scan (and rescan) every single open source file 
to populate our Knowledgebase of file-level license data. Our scanner uses 
multiple techniques to discover license references, but does not use regular 
expressions because of performance concerns.

Because Black Duck does not provide legal advice, consumers of our tools are 
able to review the license text which our tools highlight in order to make a final 
determination. This fits well with the SPDX concept of separating discovered 
and concluded information.

Outlook:
SPDX license templates / regexes aren't as useful or efficient as other 
matching techniques we have, when automatically bulk scanning large codebases 
to determine 'LicenseFoundInFile'.

But when a human is in the loop producing a final SPDX Document with Concluded 
License, SPDX license templates / regexes could be useful to focus legal review 
on deviations which may or may not be significant.





From: spdx-tech-boun...@lists.spdx.org on behalf of Sam Ellis
Date: Wednesday, September 16, 2015 at 7:00 AM
To: J Lovejoy, SPDX-legal
Cc: spdx-tech@lists.spdx.org
Subject: RE: SPDX Legal call this Thursday

3) License matching templates/markup:
We have a task to add markup to some of the standard headers and have also had 
input to add/edit markup on existing licenses.  As a result of the latter, it 
has been raised that perhaps the markup could be improved. Before adding more 
markup (to standard headers, license text or both), it seemed prudent to start 
a discussion as to whether the existing markup is effective.  Please ponder the 
following questions:
a) have you used the existing markup for matching purposes?
i) if no, why not?
ii) if yes, has it been helpful/effective?  Could it be 
improved, and if so, how? (this will likely involve putting forward a proposal 
for review)

Please also add thoughts (preferably in a new section or with your initials if 
added to others) here: http://wiki.spdx.org/view/Legal_Team/Templatizing


I will share a few points from my experience in templatization. I currently use 
a different templatization syntax that predates SPDX, but the principle of 
using regular expressions embedded within the license text is similar.


The major barrier to me adopting the SPDX templates is insufficient 
templatization within the existing licenses. The SPDX templates currently 
encode what I perceive to be the ‘official’ variations, i.e. organization name, 
person name, product name etc. However, real-world licenses contain many minor 
variations that may be inconsequential from a legal perspective, but 
nonetheless do not warrant separating out as separate licenses. Here is an 
example from the GPL-3.0 notice where it is common to see two variations in one 
of the sentences:

distributed in the hope that it will be useful
distributed in the hope that they will be useful

The example above is fairly uncontroversial, I would hope. However, there are 
plenty of other examples that border on having a legal impact. For example, in 
these two BSD-2-Clause variations it is necessary to consider whether the 
additional word constitutes an acceptable minor variation or warrants a 
different classification altogether:

Redistributions of source code must retain the above copyright notice, this 
list of…
Redistributions of source code must retain the above copyright notice 
unmodified, this list of…

It is the grey cases like these that make expanding the 

RE: SPDX Legal call this Thursday

2015-09-16 Thread Sam Ellis
3) License matching templates/markup:
We have a task to add markup to some of the standard headers and have also had 
input to add/edit markup on existing licenses.  As a result of the latter, it 
has been raised that perhaps the markup could be improved. Before adding more 
markup (to standard headers, license text or both), it seemed prudent to start 
a discussion as to whether the existing markup is effective.  Please ponder the 
following questions:
a) have you used the existing markup for matching purposes?
i) if no, why not?
ii) if yes, has it been helpful/effective?  Could it be 
improved, and if so, how? (this will likely involve putting forward a proposal 
for review)

Please also add thoughts (preferably in a new section or with your initials if 
added to others) here: http://wiki.spdx.org/view/Legal_Team/Templatizing


I will share a few points from my experience in templatization. I currently use 
a different templatization syntax that predates SPDX, but the principle of 
using regular expressions embedded within the license text is similar.


The major barrier to me adopting the SPDX templates is insufficient 
templatization within the existing licenses. The SPDX templates currently 
encode what I perceive to be the ‘official’ variations, i.e. organization name, 
person name, product name etc. However, real-world licenses contain many minor 
variations that may be inconsequential from a legal perspective, but 
nonetheless do not warrant separating out as separate licenses. Here is an 
example from the GPL-3.0 notice where it is common to see two variations in one 
of the sentences:

distributed in the hope that it will be useful
distributed in the hope that they will be useful

The example above is fairly uncontroversial, I would hope. However, there are 
plenty of other examples that border on having a legal impact. For example, in 
these two BSD-2-Clause variations it is necessary to consider whether the 
additional word constitutes an acceptable minor variation or warrants a 
different classification altogether:

Redistributions of source code must retain the above copyright notice, this 
list of…
Redistributions of source code must retain the above copyright notice 
unmodified, this list of…

It is the grey cases like these that make expanding the use of templating 
difficult. Inevitably it leads to having to make some judgements about the 
impact of a particular word or phrase on the legal interpretation, something 
that I am aware SPDX tries to avoid.

Whether it is worth templating all the cases like these primarily depends on 
the goals of the SPDX templates. If they are for human use to see what official 
variations are permitted, then they are not necessary. On the other hand, if 
they are to be used by automated license scanning tools, then covering these 
cases is essential in order to have a tool that works effectively on real-world 
code. So I think an important point is to gain clarity on the purpose of the 
templates.


In terms of the current application of the templates, I have a technical 
concern over the use of unbounded regular expressions, for example:

<<var;name=copyrightHolderAsIs;original=THE COPYRIGHT HOLDERS AND 
CONTRIBUTORS;match=.+>>

This is unbounded because it will match any number of characters for the 
copyrightHolderAsIs field. The practical consequence of this is that regular 
expression matching can explode in terms of time. I don’t have a concrete 
example to hand, but my own experience with using the same unbounded regular 
expressions on real-world licenses is that I have seen it take minutes just to 
process one regular expression on a single file, and this does not scale well 
when there are millions of files to process. Clearly, in terms of English 
language there is no maximum size on the length of a copyright statement. Using 
an unbounded regular expression is therefore correct in theory but difficult to 
use in practice. I have had to use size bounded regular expressions in order to 
have a scanning tool that will complete in a reasonable time. The problem in 
switching to bounded regular expressions is in deciding on what is an 
acceptable upper bound on the size, and this can really only be judged by 
experimentation against real-world licenses, and does then require on-going 
tweaking as new license variations are discovered.
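
To make the bounded-versus-unbounded point concrete, here is a minimal Python sketch. It assumes the `<<var;name=...;original=...;match=...>>` markup form and turns a template into a regular expression, swapping an unbounded `.+` or `.*` for a size-bounded equivalent. The helper names and the limit of 60 characters are illustrative assumptions, not part of any SPDX tool, and the sketch does not itself reproduce the slow matching described above:

```python
import re

# Assumed markup form: <<var;name=...;original=...;match=...>>
VAR = re.compile(
    r"<<var;name=(?P<name>[^;]+);original=(?P<original>[^;]*);match=(?P<match>.+?)>>"
)


def bound_pattern(match, max_len):
    # Swap an unbounded wildcard for a size-bounded one; leave other patterns alone.
    if match == ".+":
        return ".{1,%d}" % max_len
    if match == ".*":
        return ".{0,%d}" % max_len
    return match


def template_to_regex(template, max_len=200):
    """Escape the fixed text and splice in each variable's (bounded) pattern."""
    parts = []
    pos = 0
    for m in VAR.finditer(template):
        parts.append(re.escape(template[pos:m.start()]))
        parts.append("(?:" + bound_pattern(m.group("match"), max_len) + ")")
        pos = m.end()
    parts.append(re.escape(template[pos:]))
    return re.compile("".join(parts), re.DOTALL)


template = (
    "THIS SOFTWARE IS PROVIDED BY "
    "<<var;name=copyrightHolderAsIs;original=THE COPYRIGHT HOLDERS AND CONTRIBUTORS;match=.+>>"
    ' "AS IS"'
)
regex = template_to_regex(template, max_len=60)

short_ok = regex.fullmatch('THIS SOFTWARE IS PROVIDED BY EXAMPLE CORP "AS IS"')
too_long = regex.fullmatch('THIS SOFTWARE IS PROVIDED BY ' + "X" * 100 + ' "AS IS"')
```

The trade-off is visible in the last two lines: the bound keeps matching predictable, but a legitimate holder name longer than the chosen limit is rejected, which is why the limit needs tuning against real-world texts.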


Neither of these is a problem with templatization per se; both have more to 
do with the extent and way in which the templates are currently applied.



Re: SPDX Legal call this Thursday

2015-09-16 Thread Philippe Ombredanne
On Wed, Sep 16, 2015 at 2:33 AM, J Lovejoy  wrote:
> 3) License matching templates/markup:
> We have a task to add markup to some of the standard headers and have also
> had input to add/edit markup on existing licenses.  As a result of the
> latter, it has been raised that perhaps the markup could be improved. Before
> adding more markup (to standard headers, license text or both), it seemed
> prudent to start a discussion as to whether the existing markup is
> effective.  Please ponder the following questions:

> a) have you used the existing markup for matching purposes?

Yes and No:  ScanCode uses an SPDX-inspired/derived markup, but
instead of reusing the markup directly from the main license texts,
markup is transformed in a simpler {{mustache-like}} syntax added to
copies of these texts used only for detection purpose.

> i) if no, why not?

Because:
- adding more markup to a reference license text makes this eventually
no longer usable as a reference text and harder to read by humans
- the many variations found in the wild make it hard to put all in a
single template.
- the markup syntax implies eventually an implementation using regular
expressions. ScanCode does not use regex, but inverted indexes and
string alignments.
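
To illustrate the general idea of an inverted index for license detection (this is only a sketch of the technique, not ScanCode's implementation; all function names and sample texts are assumptions), a few lines of Python can shortlist candidate licenses by shared words before any detailed comparison:

```python
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def build_index(licenses):
    """Map each word to the set of license ids whose text contains it."""
    index = defaultdict(set)
    for license_id, text in licenses.items():
        for word in set(tokenize(text)):
            index[word].add(license_id)
    return index


def candidates(index, query, top=3):
    """Rank license ids by how many distinct query words they share."""
    hits = Counter()
    for word in set(tokenize(query)):
        for license_id in index.get(word, ()):
            hits[license_id] += 1
    return [license_id for license_id, _ in hits.most_common(top)]


# Tiny illustrative "library"; real texts would be the full licenses.
licenses = {
    "BSD-2-Clause": "Redistributions of source code must retain the above copyright notice",
    "GPL-3.0-notice": "distributed in the hope that it will be useful",
}
index = build_index(licenses)
best = candidates(index, "distributed in the hope that they will be useful")
```

Only the shortlisted candidates would then need a closer comparison, which avoids running every template against every file.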

> ii) if yes, has it been helpful/effective?  Could it be improved, and if so,
> how? (this will likely involve putting forward a proposal for review)

I think a simple markup is a very effective way to detect licenses
with minor text variations and still call this an exact match.
It is also a very effective way to indicate variations for humans.
I find it hard personally to mix the human readability and technical
detection concerns in the same file without compromises.

As food for thought, here are some examples of markup as used in ScanCode:

https://github.com/nexB/scancode-toolkit/blob/b37be4de78152fbd3ed54761627c960010ce26a3/src/licensedcode/data/rules/apache-1.1_38.RULE#L17
https://github.com/nexB/scancode-toolkit/blob/b37be4de78152fbd3ed54761627c960010ce26a3/src/licensedcode/data/rules/bzip2-libbzip-1.0.5_1.RULE#L1

The syntax is using double curly braces to enclose variable parts.
There is no regex involved.
Optionally a number can be used after the opening braces to indicate
the number of variable words, defaulting to 5 words.
For instance {{ Copyright (c) 2015 Myco }} would match up to 5 words
and {{ 10 Copyright (c) 2015 Myco inc.}} would match up to 10 words.
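
The double-curly-brace rules above can be sketched in Python without any regex. This is an illustration only, not ScanCode's code: the function names are hypothetical, "up to N words" is read as zero to N words, and a leading all-digit token inside the braces is always taken as the word count:

```python
def parse_rule(rule, default_gap=5):
    """Split a rule into ('words', [...]) and ('gap', max_words) items."""
    items = []
    pos = 0
    while True:
        start = rule.find("{{", pos)
        if start == -1:
            break
        end = rule.find("}}", start)
        if end == -1:
            raise ValueError("unclosed {{ in rule")
        literal = rule[pos:start].split()
        if literal:
            items.append(("words", literal))
        inner = rule[start + 2:end].split()
        # Assumption: a leading all-digit token is the maximum word count.
        gap = int(inner[0]) if inner and inner[0].isdigit() else default_gap
        items.append(("gap", gap))
        pos = end + 2
    tail = rule[pos:].split()
    if tail:
        items.append(("words", tail))
    return items


def matches(rule, text, default_gap=5):
    """True if the whole text fits the rule, with each gap taking 0..max words."""
    items = parse_rule(rule, default_gap)
    words = text.split()

    def match_from(i, w):
        if i == len(items):
            return w == len(words)
        kind, value = items[i]
        if kind == "words":
            n = len(value)
            return words[w:w + n] == value and match_from(i + 1, w + n)
        # Gap: try every allowed number of skipped words.
        return any(
            match_from(i + 1, w + k)
            for k in range(value + 1)
            if w + k <= len(words)
        )

    return match_from(0, 0)


rule5 = "{{ Copyright (c) 2015 Myco }} All rights reserved."
rule10 = "{{ 10 Copyright (c) 2015 Myco inc.}} All rights reserved."
five_words = "Copyright (c) 2014 Example Inc. All rights reserved."
six_words = "Copyright (c) 2014 The Example Authors All rights reserved."
```

With these inputs, `rule5` accepts `five_words` but not `six_words`, while `rule10` accepts both, mirroring the 5-word default and the explicit 10-word limit described above.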

I hope this helps even though this is a slightly different take.
-- 
Cordially
Philippe Ombredanne
___
Spdx-tech mailing list
Spdx-tech@lists.spdx.org
https://lists.spdx.org/mailman/listinfo/spdx-tech


RE: SPDX Legal call this Thursday

2015-09-16 Thread Gary O'Neall
Hi Sam,

Responses inline below:

Gary


From: spdx-tech-boun...@lists.spdx.org 
[mailto:spdx-tech-boun...@lists.spdx.org] On Behalf Of Sam Ellis
Sent: Wednesday, September 16, 2015 4:00 AM
To: J Lovejoy; SPDX-legal
Cc: spdx-tech@lists.spdx.org
Subject: RE: SPDX Legal call this Thursday

 

3) License matching templates/markup: 

We have a task to add markup to some of the standard headers and have also had 
input to add/edit markup on existing licenses.  As a result of the latter, it 
has been raised that perhaps the markup could be improved. Before adding more 
markup (to standard headers, license text or both), it seemed prudent to start 
a discussion as to whether the existing markup is effective.  Please ponder the 
following questions:

a) have you used the existing markup for matching purposes?

i) if no, why not?

ii) if yes, has it been helpful/effective?  Could it be 
improved, and if so, how? (this will likely involve putting forward a proposal 
for review)

[Gary] Yes - For the SourceAuditor commercial tools, the markup is used to 
validate that 2 licenses are equivalent per the matching guidelines. The open 
source SPDX tools use the markup in a number of ways.  The 
"compareSpdx" and "compareMultipleSpdx" commands use the markup to determine if 
the licenses are equivalent.  There are library methods implemented to compare 
license text and report if the license text matches any of the SPDX LicenseList.

 

In all cases above, the markup is used to compare two existing known license 
texts.  It is NOT used to match license text against a library of possible 
license matches.  In the commercial tool, a separate algorithm implements this 
functionality, as the markup language turned out to be too inefficient for this 
purpose - at least for the performance requirements of our application.

 

Note: When we originally discussed the markup language, we debated whether to 
cover the use case of searching a library of possible license matches and the 
decision was taken not to support this.

 

In my opinion, the markup works fine for matching two license texts.  If we 
wanted to support a searching use case, we would need to modify/extend the 
markup language to enable this to be efficient.

 

 

Please also add thoughts (preferably in a new section or with your initials if 
added to others) here: http://wiki.spdx.org/view/Legal_Team/Templatizing

 

 

I will share a few points from my experience in templatization. I currently use 
a different templatization syntax that predates SPDX, but the principle of 
using regular expressions embedded within the license text is similar.

 

 

The major barrier to me adopting the SPDX templates is insufficient 
templatization within the existing licenses. The SPDX templates currently 
encode what I perceive to be the ‘official’ variations, i.e. organization name, 
person name, product name etc. However, real-world licenses contain many minor 
variations that may be inconsequential from a legal perspective, but 
nonetheless do not warrant separating out as separate licenses. Here is an 
example from the GPL-3.0 notice where it is common to see two variations in one 
of the sentences:

 

distributed in the hope that it will be useful

distributed in the hope that they will be useful

 

The example above is fairly uncontroversial, I would hope. However, there are 
plenty of other examples that border on having a legal impact. For example, in 
these two BSD-2-Clause variations it is necessary to consider whether the 
additional word constitutes an acceptable minor variation or warrants a 
different classification altogether:

 

Redistributions of source code must retain the above copyright notice, this 
list of…

Redistributions of source code must retain the above copyright notice 
unmodified, this list of…

 

It is the grey cases like these that make expanding the use of templating 
difficult. Inevitably it leads to having to make some judgements about the 
impact of a particular word or phrase on the legal interpretation, something 
that I am aware SPDX tries to avoid.

 

Whether it is worth templating all the cases like these primarily depends on 
the goals of the SPDX templates. If they are for human use to see what official 
variations are permitted, then they are not necessary. On the other hand, if 
they are to be used by automated license scanning tools, then covering these 
cases is essential in order to have a tool that works effectively on real-world 
code. So I think an important point is to gain clarity on the purpose of the 
templates.

 

 

In terms of the current application of the templates, I have a technical 
concern over the use of unbounded regular expressions, for example:

 

<<var;name=copyrightHolderAsIs;original=THE COPYRIGHT HOLDERS AND 
CONTRIBUTORS;match=.+>>

 

This is unbounded because it wil

SPDX Legal call this Thursday

2015-09-15 Thread J Lovejoy
Hi All,

In preparation for Thursday’s call, please review the following items in 
advance for our agenda. 

Announcements and updates (#1) are here only for your information (nothing to 
discuss on call). We will focus primarily on #2 for purposes of this call and 
then #3 if we have time:

1) Announcements/updates:
a) formatting issue with standard headers on HTML pages for license 
list has now been fixed (thanks, Gary!)
b) LinuxCon Europe is in a few weeks: talks related to SPDX by Jilayne 
http://sched.co/3xVB  
and Phil Odence and Dave Marr - http://sched.co/4GGz 
Also, there will be a Supply Chain Mini-Summit on the Thursday, see more info 
here: 
http://events.linuxfoundation.org/events/linuxcon-europe/extend-the-experience/supply-chain-summit
 

c) Working on proposal for pull request process for license list 
templates (and possibly other aspects of changes to license list) - will submit 
a full proposal to legal team when something more concrete is ready (see 
http://wiki.spdx.org/view/Legal_Team/Minutes/2015-08-06 
 for initial 
discussion/reference)


2) SPDX License List v2.2 is scheduled to be released at the end of this month! 
 
a) got some answers back from Fedora on licenses on their list we 
wanted to add, but couldn’t find text for, etc. Can we add:

i)  Interbase Public License / Interbase - 
http://www.borland.com/devsupport/interbase/opensource/IPL.html - link broken, 
can’t find license. Does Fedora have it archived somewhere? Is this still used 
/ do we need to add to SPDX-LL?
ANSWER: Here is an archived copy:
https://web.archive.org/web/20060319014854/http://info.borland.com/devsupport/interbase/opensource/IPL.html
Firebird is still under this license, still used in Fedora.

ii) Sendmail License / Sendmail /  
http://www.sendmail.org/ftp/LICENSE  - 
link from Fedora site does not go to license. We intend to add, but wanted to 
confirm that we have the correct license that you meant due to broken link - 
can you confirm that this is the correct license here:
http://www.sendmail.com/pdfs/open_source/sendmail_license.pdf 
 
ANSWER: That is the correct sendmail license. We have updated our link.

iii) Crystal Stacker License / Crystal Stacker -  
https://fedoraproject.org/wiki/Licensing/CrystalStacker - license on
Fedora site does not match license in download. (full explanation was in 
previous email thread) - please review and see if you agree with the
recommendation at the end of the email.
ANSWER: Updated the Crystal Stacker entry in the Fedora license list to add the 
missing disclaimer text. License now matches license in download. I do not 
believe there is a different source license vs binary license here.
JL: further explanation re: previous email thread to be provided on call

b) To continue (pick back up) our momentum for adding license 
exceptions, please review the 5 license exceptions highlighted in light green 
here: 
https://docs.google.com/spreadsheets/d/11AKxLBoN_VXM32OmDTk2hKeYExKzsnPjAVM7rLstQ8s/edit?pli=1#gid=0
for potentially adding to v2.2 

c) Also, let’s discuss a couple items related to existing exceptions 
that we didn’t quite get to for v2.1:
i) WxWindows - the text in the exception we have versus what is 
on the OSI site is not the same!! The only differences are: we have "3.1" 
instead of "3.0" in the first clause; and "your" instead of "the user's" in the 
second clause. See http://opensource.org/licenses/WXwindows and 
http://spdx.org/licenses/WxWindows-exception-3.1.html - what we have is 
consistent with what is here: https://www.wxwidgets.org/about/licence/
· should we accommodate this difference somehow? If so, due to this 
already being on the license list, this seems like it should be a priority to 
resolve for v2.1 release
ii) Classpath-exception-2.0 - why do we have 2.0 and the note 
saying it’s typically used with GPL-2.0? The Fedora example has it being used 
with all GPL versions and there don’t seem to be other versions. Is it worth 
removing the “2.0” in the short identifier?


3) License matching templates/markup: 
We have a task to add markup to some of the standard headers and have also had 
input to add/edit markup on existing licenses.  As a result of the latter, it 
has been raised that perhaps the markup could be improved. Before adding more 
markup (to standard headers, license text or both), it seemed prudent to start 
a discussion as to whether the existing markup is effective.  Please ponder