RE: SPDX Legal call this Thursday
3) License matching templates/markup: We have a task to add markup to some of the standard headers and have also had input to add/edit markup on existing licenses. As a result of the latter, it has been raised that perhaps the markup could be improved. Before adding more markup (to standard headers, license text or both), it seemed prudent to start a discussion as to whether the existing markup is effective. Please ponder the following questions:

a) have you used the existing markup for matching purposes?
i) if no, why not?
ii) if yes, has it been helpful/effective? Could it be improved, and if so, how? (this will likely involve putting forward a proposal for review)

Please also add thoughts (preferably in a new section or with your initials if added to others) here: http://wiki.spdx.org/view/Legal_Team/Templatizing

I will share a few points from my experience in templatization. I currently use a different templatization syntax that predates SPDX, but the principle of using regular expressions embedded within the license text is similar.

The major barrier to me adopting the SPDX templates is insufficient templatization within the existing licenses. The SPDX templates currently encode what I perceive to be the ‘official’ variations, i.e. organization name, person name, product name etc. However, real-world licenses contain many minor variations that may be inconsequential from a legal perspective and that do not warrant being treated as separate licenses. Here is an example from the GPL-3.0 notice where it is common to see two variations in one of the sentences:

    distributed in the hope that it will be useful
    distributed in the hope that they will be useful

The example above is fairly uncontroversial, I would hope. However, there are plenty of other examples that border on having a legal impact.
For example, in these two BSD-2-Clause variations it is necessary to consider whether the additional word constitutes an acceptable minor variation or warrants a different classification altogether:

    Redistributions of source code must retain the above copyright notice, this list of…
    Redistributions of source code must retain the above copyright notice unmodified, this list of…

It is the grey cases like these that make expanding the use of templating difficult. Inevitably it leads to having to make some judgements about the impact of a particular word or phrase on the legal interpretation, something that I am aware SPDX tries to avoid.

Whether it is worth templating all the cases like these primarily depends on the goals of the SPDX templates. If they are for human use to see what official variations are permitted, then they are not necessary. On the other hand, if they are to be used by automated license scanning tools, then covering these cases is essential in order to have a tool that works effectively on real-world code. So I think an important point is to gain clarity on the purpose of the templates.

In terms of the current application of the templates, I have a technical concern over the use of unbounded regular expressions, for example:

    <>

This is unbounded because it will match any number of characters for the copyrightHolderAsIs field. The practical consequence of this is that regular expression matching can explode in terms of time. I don’t have a concrete example to hand, but my own experience with using the same unbounded regular expressions on real-world licenses is that I have seen it take minutes just to process one regular expression on a single file, and this does not scale well when there are millions of files to process.

Clearly, in terms of English language there is no maximum size on the length of a copyright statement. Using an unbounded regular expression is therefore correct in theory but difficult to use in practice.
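As a rough illustration of the difference between an unbounded and a size-bounded field, here is a minimal Python sketch. Both patterns are hypothetical stand-ins for a templated copyright line and are not taken from the SPDX license list; the 200-character limit is an arbitrary assumption that would need tuning against real files. The sketch shows only the difference in what each pattern accepts and makes no claim about timings.

```python
import re

# Hypothetical upper bound on the holder field; an assumption for this sketch.
MAX_HOLDER_CHARS = 200

# Unbounded: the holder field may span any number of characters (and lines).
UNBOUNDED = re.compile(r"Copyright \(c\) (.+) All rights reserved\.", re.DOTALL)

# Bounded: the holder field is limited to MAX_HOLDER_CHARS characters, so the
# engine gives up on a candidate start position after a fixed amount of work.
BOUNDED = re.compile(
    r"Copyright \(c\) (.{1,%d}?) All rights reserved\." % MAX_HOLDER_CHARS,
    re.DOTALL,
)


def holder(pattern, text):
    """Return the captured holder field, or None if the pattern does not match."""
    match = pattern.search(text)
    return match.group(1) if match else None


short_text = "Copyright (c) 2015 Example Corp All rights reserved."
# A file where the closing phrase only appears thousands of characters later.
long_text = "Copyright (c) 2015 Example Corp\n" + "x" * 10000 + " All rights reserved."
```

On `short_text` both patterns capture the same holder. On `long_text` the unbounded pattern captures the whole 10,000-character span as the "holder", while the bounded pattern reports no match; that lost match is the trade-off to weigh when choosing an upper bound.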
I have had to use size-bounded regular expressions in order to have a scanning tool that will complete in a reasonable time. The problem in switching to bounded regular expressions is in deciding on an acceptable upper bound on the size. This can really only be judged by experimentation against real-world licenses, and it then requires ongoing tweaking as new license variations are discovered.

Neither of these is a problem with templatization per se; both have more to do with the extent and the way in which the templates are currently applied.
Re: SPDX Legal call this Thursday
On Wed, Sep 16, 2015 at 2:33 AM, J Lovejoy wrote:
> 3) License matching templates/markup:
> We have a task to add markup to some of the standard headers and have also
> had input to add/edit markup on existing licenses. As a result of the
> latter, it has been raised that perhaps the markup could be improved. Before
> adding more markup (to standard headers, license text or both), it seemed
> prudent to start a discussion as to whether the existing markup is
> effective. Please ponder the following questions:
> a) have you used the existing markup for matching purposes?

Yes and No: ScanCode uses an SPDX-inspired/derived markup, but instead of reusing the markup directly from the main license texts, the markup is transformed into a simpler {{mustache-like}} syntax added to copies of these texts that are used only for detection purposes.

> i) if no, why not?

Because:
- adding more markup to a reference license text eventually makes it harder for humans to read and no longer usable as a reference text
- the many variations found in the wild make it hard to put them all in a single template.
- the markup syntax eventually implies an implementation using regular expressions. ScanCode does not use regex, but inverted indexes and string alignments.

> ii) if yes, has it been helpful/effective? Could it be improved, and if so,
> how? (this will likely involve putting forward a proposal for review)

I think a simple markup is a very effective way to detect licenses with minor text variations and still call this an exact match. It is also a very effective way to indicate variations for humans. Personally, I find it hard to mix the human readability and technical detection concerns in the same file without compromises.
As food for thought, here are some examples of markup as used in ScanCode:

https://github.com/nexB/scancode-toolkit/blob/b37be4de78152fbd3ed54761627c960010ce26a3/src/licensedcode/data/rules/apache-1.1_38.RULE#L17
https://github.com/nexB/scancode-toolkit/blob/b37be4de78152fbd3ed54761627c960010ce26a3/src/licensedcode/data/rules/bzip2-libbzip-1.0.5_1.RULE#L1

The syntax uses double curly braces to enclose variable parts. There is no regex involved. Optionally, a number can be used after the opening braces to indicate the maximum number of variable words, defaulting to 5 words. For instance

    {{ Copyright (c) 2015 Myco }}

would match up to 5 words and

    {{ 10 Copyright (c) 2015 Myco inc.}}

would match up to 10 words.

I hope this helps even though this is a slightly different take.

--
Cordially
Philippe Ombredanne
___
Spdx-tech mailing list
Spdx-tech@lists.spdx.org
https://lists.spdx.org/mailman/listinfo/spdx-tech
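To make the double-curly-brace syntax described in the message above more concrete, here is a minimal Python sketch of a word-level matcher. It is an illustration only and not ScanCode's implementation, which the message says relies on inverted indexes and string alignments. The function names, the whitespace tokenization, and the exact-match semantics are all assumptions made for this sketch. A regular expression is used solely to locate the `{{ ... }}` markers in the template; the comparison against the text is done word by word.

```python
import re

DEFAULT_GAP = 5  # default number of variable words, as stated in the message

# Locates a {{ ... }} marker; an optional leading integer sets the word limit.
_GAP_MARKER = re.compile(r"\{\{\s*(\d+)?[^}]*\}\}")


def parse_template(template):
    """Split a template into ('words', [...]) and ('gap', max_words) parts."""
    parts, pos = [], 0
    for marker in _GAP_MARKER.finditer(template):
        literal = template[pos:marker.start()].lower().split()
        if literal:
            parts.append(("words", literal))
        limit = int(marker.group(1)) if marker.group(1) else DEFAULT_GAP
        parts.append(("gap", limit))
        pos = marker.end()
    tail = template[pos:].lower().split()
    if tail:
        parts.append(("words", tail))
    return parts


def matches(template, text):
    """Return True if text matches template, each gap absorbing 0..N words."""
    parts = parse_template(template)
    words = text.lower().split()

    def walk(part_index, word_index):
        if part_index == len(parts):
            return word_index == len(words)
        kind, value = parts[part_index]
        if kind == "words":
            end = word_index + len(value)
            return words[word_index:end] == value and walk(part_index + 1, end)
        # Gap: try every allowed number of skipped words.
        return any(
            walk(part_index + 1, word_index + n)
            for n in range(value + 1)
            if word_index + n <= len(words)
        )

    return walk(0, 0)
```

With the template `{{ Copyright (c) 2015 Myco }} All rights reserved.`, a five-word copyright line followed by the fixed phrase matches, while an eight-word line does not unless the limit is raised, for example with `{{ 10 Copyright (c) 2015 Myco inc.}}`. One ambiguity of this sketch: sample text inside the braces that itself starts with a number would be read as the word limit.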
RE: SPDX Legal call this Thursday
Hi Sam,

Responses inline below:

Gary

From: spdx-tech-boun...@lists.spdx.org [mailto:spdx-tech-boun...@lists.spdx.org] On Behalf Of Sam Ellis
Sent: Wednesday, September 16, 2015 4:00 AM
To: J Lovejoy; SPDX-legal
Cc: spdx-tech@lists.spdx.org
Subject: RE: SPDX Legal call this Thursday

3) License matching templates/markup: We have a task to add markup to some of the standard headers and have also had input to add/edit markup on existing licenses. As a result of the latter, it has been raised that perhaps the markup could be improved. Before adding more markup (to standard headers, license text or both), it seemed prudent to start a discussion as to whether the existing markup is effective. Please ponder the following questions:

a) have you used the existing markup for matching purposes?
i) if no, why not?
ii) if yes, has it been helpful/effective? Could it be improved, and if so, how? (this will likely involve putting forward a proposal for review)

[Gary] Yes - For the SourceAuditor commercial tools, the markup is used to validate that 2 licenses are equivalent per the matching guidelines. The open source SPDX tools use the markup in a number of ways. The "compareSpdx" and "compareMultipleSpdx" commands use the markup to determine if the licenses are equivalent. There are library methods implemented to compare license text and report if the license text matches any of the SPDX LicenseList.

In all cases above, the markup is used to compare 2 existing, known license texts. It is NOT used to match license text against a library of possible license matches. In the commercial tool, a separate algorithm implements this functionality and the markup language turned out to be too inefficient for this purpose - at least for the performance requirements of our application.

Note: When we originally discussed the markup language, we debated whether to cover the use case of searching a library of possible license matches and the decision was taken not to support this.
In my opinion, the markup works fine for matching two license texts. If we wanted to support a searching use case, we would need to modify/extend the markup language to enable this to be efficient.

Please also add thoughts (preferably in a new section or with your initials if added to others) here: http://wiki.spdx.org/view/Legal_Team/Templatizing

I will share a few points from my experience in templatization. I currently use a different templatization syntax that predates SPDX, but the principle of using regular expressions embedded within the license text is similar.

The major barrier to me adopting the SPDX templates is insufficient templatization within the existing licenses. The SPDX templates currently encode what I perceive to be the ‘official’ variations, i.e. organization name, person name, product name etc. However, real-world licenses contain many minor variations that may be inconsequential from a legal perspective and that do not warrant being treated as separate licenses. Here is an example from the GPL-3.0 notice where it is common to see two variations in one of the sentences:

    distributed in the hope that it will be useful
    distributed in the hope that they will be useful

The example above is fairly uncontroversial, I would hope. However, there are plenty of other examples that border on having a legal impact. For example, in these two BSD-2-Clause variations it is necessary to consider whether the additional word constitutes an acceptable minor variation or warrants a different classification altogether:

    Redistributions of source code must retain the above copyright notice, this list of…
    Redistributions of source code must retain the above copyright notice unmodified, this list of…

It is the grey cases like these that make expanding the use of templating difficult. Inevitably it leads to having to make some judgements about the impact of a particular word or phrase on the legal interpretation, something that I am aware SPDX tries to avoid.
Whether it is worth templating all the cases like these primarily depends on the goals of the SPDX templates. If they are for human use to see what official variations are permitted, then they are not necessary. On the other hand, if they are to be used by automated license scanning tools, then covering these cases is essential in order to have a tool that works effectively on real-world code. So I think an important point is to gain clarity on the purpose of the templates. In terms of the current application of the templates, I have a technical concern over the use of unbounded regular expressions, for example: <> This is unbounded because it will match any number of characters for the