Hi Kris,
Excellent point on Excel - it really is difficult to do diffs (inside or outside of Git).

One more detail on the current workflow - we only maintain the license metadata in the spreadsheet (license ID, header, related URLs); the actual license text for the templates is already in individual text files - one per license. Hence the suggestion of focusing on the license body text (as long as we are OK living with the diff issue on Excel).

Gary

From: Kris.re [mailto:kris...@bbhmedia.com]
Sent: Tuesday, October 27, 2015 4:14 PM
To: Gary O'Neall; 'SPDX-legal'; spdx-t...@lists.spdx.org
Subject: RE: Markup proposal

Hey, thanks for the references. I'm sorry if I seem to be ignoring existing practices - a lot of the existing workflow and use cases I just don't have much of a conception of.

I very much agree that the human-maintained data should be kept as legible as possible, which is why I attempted to keep the markup to wrapping entities and succinct representations. Matching existing tags used for RDF-a makes sense; the tag names suggested should be taken very much as placeholders for another discussion entirely. The main goal was to get a grip on the whole of things in a way that would let us discuss what that meant for people maintaining the licenses and the rest of the workflow for producing products for people (and programs) to consume.

Speaking of workflow, I don't find Excel to be a very convenient platform for managing/editing large blocks of text. It could be that the people currently doing the job feel differently, but the larger advantage to keeping license-related data in a single, text-editable file per license is what that means for the ability to accept contributions (via pull requests) and track history/changes to a particular license over time. You can't really compare diffs of an Excel spreadsheet on GitHub, and a change to the spreadsheet could be a change to (m)any of the licenses it tracks.
Keeping files changed in lockstep with licenses changed has significant maintenance benefits - it is much harder to miss something by accident.

One other possible candidate for an "input format" is Markdown; it is useful and fairly widely known, but unless we hijack the "link" format to specify matching regular expressions (which feels kind of dirty but is totally possible) I'm not sure there's a straightforward way to supply the alternate/matching functionality. A downside to using Markdown is that every Markdown parser operates slightly differently.

One other point worth noting about XML is that it provides information in a form that other formats such as JSON cannot; this is because it is fundamentally a markup language, not a data transport format. This is a significant point in favor of XML as a "source" format rather than a "destination" format, but as was previously noted, the "fully marked up" templates will be awful for a human to maintain, so the stress on XML itself as the target "source" language is much less. The choice of something familiar and accessible is, however, important for that component.

Kris

From: Gary O'Neall [mailto:g...@sourceauditor.com]
Sent: Tuesday, October 27, 2015 3:24 PM
To: Kris.re <kris...@bbhmedia.com>; 'SPDX-legal' <spdx-legal@lists.spdx.org>; spdx-t...@lists.spdx.org
Subject: RE: Markup proposal

Hi Kris,

Thanks for writing up such a thorough proposal. This will make it much easier to discuss some of the specifics.

A couple of quick items - you mentioned that you could not find the syntax for the current text. It is in the SPDX specification PDF file (http://spdx.org/sites/spdx/files/SPDX-2.0.pdf), Appendix II, page 78. Also, in answer to your question about whether the variable text is used - the answer is yes. The variables are implemented by the SPDX tools and I do use that functionality myself. I would imagine there are others besides myself as well.

There are a lot of items I could respond to, but I think I'll save the details for our call.
I do have 4 high-level observations/opinions/suggestions:

- The markup proposed is a pretty big change in how we actually produce the licenses. If I understand the markup language maintained by the humans correctly, IMHO it puts a larger burden on the legal team than the current approach and may limit the adoption of the markup. I remember debates just on annotating the title, which is also in your proposal along with several other annotations. Before digging into the technical details too much, I want to make sure we discuss/understand the process changes and that the legal team is signed up to create and maintain the new markup.
- Some of the tags in your proposal are already defined and used in RDF-a, which is already machine parseable (e.g. license-identifier/licenseID and header). See the document Accessing SPDX Licenses at http://wiki.spdx.org/images/SPDX-TR-2014-2.v1.0.pdf for details. We could add another representation in XML (just like we are adding JSON format). If we do, I would propose using the same tags as in the spec. The input for these fields is in a spreadsheet maintained by the legal team, and a tool generates the HTML file for the website. If we went to these annotations as the source or input to the website creation tools, I would assume we would discard the spreadsheet and go strictly to the annotated files - which would be a change to the process. I would propose that we focus on the body of the license text, since that is where the current challenges are, and not revisit the other tags.
- I really like the idea of creating a normalized license format to make it easier for downstream tools to match. Having implemented a matcher just using the current rules, I can confirm it is not that easy and it is error prone. From reading through your proposal, you have a lot of good ideas on how to implement this. We could enhance the current tool that generates the website to produce normalized templates - it is just a SMOP (small matter of programming ;).
- In thinking about this a bit, we may want to separate the discussion into two parts:
  1) What should be the format for the human-maintained licenses? (e.g. the files which are fed into the tool that generates the website)
  2) What should be maintained as a format for providing matching by external tools? This could be produced by the tool that generates the website as long as the information to produce that format is present in the human-maintained licenses.

Gary

From: spdx-tech-boun...@lists.spdx.org [mailto:spdx-tech-boun...@lists.spdx.org] On Behalf Of Kris.re
Sent: Tuesday, October 27, 2015 9:02 AM
To: 'SPDX-legal'; spdx-t...@lists.spdx.org
Subject: Markup proposal

Introduction

As discussed on the last legal call, here's my proposal for enhancing the markup. There are two problems that need solving:

1) Maintaining a version of the licenses with just enough markup to aid in the computer-hard decisions
2) Making computer matching entirely logic-free by providing the required information to an implementation with markup

Since (2) may necessarily be difficult to maintain by hand, as well as error-prone (for example, finding *every* synonym instance in some license text), I suggest that a build process is necessary to produce the "enhanced" [item (2)] markup from the human-maintained "source" markup.

Categorization of matching guidelines

Items that belong in the source markup are items that are easy for a human to identify but difficult for a program to identify (or easy for multiple implementations to get "differently").
These items include:

- Headers (copyright declaration, license name preface)
- Bullets (specifically in the case of numbered bullets with roman numerals, it can be difficult to distinguish accurately from wrapped text ending a sentence)
- Optional sections (such as instructions on license application)
- References to the author / copyright holder

Items that belong in the enhanced markup are items that are easy for a program to identify, where the build process can create a canonical "correct" form of the text to match, thus 1) sparing implementations the need to write boilerplate parsing code and 2) ensuring that implementations produce the same results given the same sample text. These include:

- Varietal spellings
- Copyright symbol
- Punctuation (dashes and quotes)
- Whitespace
- Capitalization

The whitespace and capitalization items are somewhat questionable, but since there is already a build process it doesn't hurt us to add them, and it means that implementations will not need to perform any transformations on the matching text beyond XML parsing.

An important note: since whitespace between words is significant, this approach will not necessarily lead to the desired results if we encounter a need or desire to create *optional* sections of text. For example:

"James, while John had had <alt match="(had )*"/> had had a better effect on the teacher"

In this example sentence, if zero matches occur, we will be matching a sentence with two spaces in a row. I don't believe this will currently be a problem, because I don't know of any reason for us to mark up templates in this way; every matchable component should encompass the left and right word boundaries, so spacing should not be an issue and should be preserved exactly as produced in the enhanced markup.
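To make the two-spaces concern concrete, here is a small Python sketch. The regular expressions are my own hand-written renderings of the example sentence (lowercased, trimmed to the relevant words), not output of any existing tool, and the "boundary-aware" variant is just one assumed way of letting the optional group own its word boundary.

```python
import re

# Naive rendering of the example: the literal text on each side of the
# optional group keeps its own space, so zero repetitions still require
# two consecutive spaces in the candidate text.
NAIVE = re.compile(r"john had had (had )* had had a better effect")

# Boundary-aware rendering (an assumption, not part of the proposal):
# the optional group owns its leading space, so zero repetitions leave
# exactly one space between the neighbouring words.
BOUNDARY_AWARE = re.compile(r"john had had( had)* had had a better effect")

single_spaced = "john had had had had a better effect"
double_spaced = "john had had  had had a better effect"

print(NAIVE.fullmatch(single_spaced) is not None)           # False
print(NAIVE.fullmatch(double_spaced) is not None)           # True
print(BOUNDARY_AWARE.fullmatch(single_spaced) is not None)  # True
```

The naive form only ever accepts the double-spaced text, which is the failure described above; folding the space into the group avoids it without touching the normalization rules.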
Translation of current markup to proposal

Please correct me if I am mistaken - I appear to be unable to find documentation on the current markup - but I observe two elements in the current markup syntax: 1) entity replacement and 2) optional section blocks. These are quite readily translated into an XML form. I am uncertain if the variable function is used by any of the current tooling, but it can be left off until desired or added as-is.

Suggested transformations:

1) <<var; name=foo; original=Original Text; match=.+>>
   becomes:
   <alt match=".+" name="foo">Original Text</alt>

2) <<beginOptional; name=optionalIntro>>Some text here<<endOptional>>
   becomes:
   <section name="optionalIntro">Some text here</section>

This is a fairly direct translation and could even be performed programmatically on the existing data if we elect to use a generic "optional" tag; above I've used "section" as a placeholder for multiple candidate tags depending on the section being described.

While we can define all desired elements in terms of "alt" tags, I believe some of them at least deserve their own tags. Bullets, for example, I would suggest wrapping in "<b>"; this is primarily for brevity:

<b>1.</b> Some clause .
<b>2.</b> More items..
<b>a.</b> Sub-item.

The need for brevity comes as a result of trying to keep the source markup as human-manageable as possible. While it's somewhat tempting to designate hierarchical <ol> or <ul> items as in HTML, this doesn't actually provide any use for *matching* purposes, so it has been explicitly discarded.

Other items that might warrant their own tags actually become section wrappers, such as for copyright declarations:

<copyright>Copyright C <year> <owner>. All rights reserved.</copyright>

(This also points out the need to escape < and > symbols as entities, a task which can be performed programmatically in a conversion process as well.)

Other tags wrapping optional sections might include: <title>, <footer>, <header>, and <optional>.
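The two transformations above, plus the entity escaping, could be done programmatically along these lines. This is only a sketch in Python: the token grammar is inferred from the two examples in this mail rather than from the spec appendix, the function name to_xml is made up for illustration, and it assumes a match expression never contains ">>" and that optional blocks are properly paired.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Token grammar inferred from the two examples in this mail (an assumption).
TOKEN = re.compile(
    r"<<var;\s*name=(?P<vname>[^;]*);\s*original=(?P<original>[^;]*);"
    r"\s*match=(?P<match>.*?)>>"
    r"|<<beginOptional;\s*name=(?P<oname>[^;>]*)>>"
    r"|(?P<end><<endOptional>>)"
)


def to_xml(template: str) -> str:
    """Translate current-style template markup into the proposed XML tags."""
    out = []
    pos = 0
    for m in TOKEN.finditer(template):
        # Plain license text between tokens: escape <, > and & as entities.
        out.append(escape(template[pos:m.start()]))
        if m.group("vname") is not None:
            out.append("<alt match=%s name=%s>%s</alt>" % (
                quoteattr(m.group("match")),
                quoteattr(m.group("vname")),
                escape(m.group("original")),
            ))
        elif m.group("oname") is not None:
            # "section" is the placeholder tag name used in this mail.
            out.append("<section name=%s>" % quoteattr(m.group("oname")))
        else:
            out.append("</section>")
        pos = m.end()
    out.append(escape(template[pos:]))
    return "".join(out)


print(to_xml("<<var; name=foo; original=Original Text; match=.+>>"))
print(to_xml("<<beginOptional; name=optionalIntro>>Some text here<<endOptional>>"))
print(to_xml("Copyright C <year> <owner>. All rights reserved."))
```

The third call shows the escaping point: the angle brackets around year and owner come out as &lt; and &gt; so the result can sit inside a <copyright> element.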
Synonyms

One extra tag can support the enhanced markup in a useful way: for varietal/alternative spellings, an external file with some metadata can define the allowed synonyms for some string of text, allowing this to be updated without the need to modify the contents of the license templates themselves. These can be substituted into the document at build time to provide data-driven matching that supports these matching rules, e.g.:

"Neither the name of the <syn identifier="copyright-holder"/> nor the names."

(and, externally:)

<synonyms identifier="copyright-holder">
<synonym>Copyright holder</synonym>
<synonym>Copyright owner</synonym>
</synonyms>

While this approach can also be used to handle alternate values of dashes, quotes, etc., plain normalization of the data in the enhanced markup should suffice, a transformation which can be applied along with the lowercasing and whitespace removal: all varieties of dashes become a simple ASCII hyphen, all varieties of quotes become a simple ASCII quote, etc.

It may be a good idea to also "externalize" matching of copyright headers, bullets, dashes, quotes, etc. in this fashion, since their structure and meaning are independent of an actual license file. Providing an explicit list of valid characters or a regular expression for these items in the SPDX database metadata will ensure that implementations are kept consistent.

Structure of overall document

This is pretty straightforward:

<license identifier="SuchAndSo">
<title>The Such and So License</title>
<copyright>Copyright C 2015 Foo Bars</copyright>
<body>License text ..</body>
<footer>How to apply this license: ..</footer>
</license>

Whitespace can be formatted such that the concatenated text content of the XML file produces the original document, though that poses a minor problem if we desire to include other data streams, which I recommend we do.
One way to handle that would be to wrap the above example in a higher root element, which would allow us to include things such as optional clauses and license headers:

<SPDX>
<header>This file is licensed under the Such and So License</header>
<license identifier="SuchAndSo">(as above)</license>
<optional identifier="SuchAndSo-foos-exception">.</optional>
</SPDX>

Examples

Using the above ideas, here is a marked-up example of the BSD 3 clause license in the "source" format:

<SPDX>
<license identifier="BSD-3-Clause">
<copyright>Copyright (c) <year> <owner>. All rights reserved.</copyright>
<body>Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

<b>1.</b> Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

<b>2.</b> Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

<b>3.</b> Neither the name of <alt match=".+">the copyright holder</alt> nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY <alt match=".+">THE COPYRIGHT HOLDERS AND CONTRIBUTORS</alt> "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
IN NO EVENT SHALL <alt match=".+">THE COPYRIGHT HOLDER OR CONTRIBUTORS</alt> BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.</body>
</license>
</SPDX>

And again in the "enhanced" format:

<SPDX>
<license identifier="BSD-3-Clause">
<copyright><syn identifier="copyright"/> <syn identifier="copyright"/> <year> <owner>. all rights reserved.</copyright>
<body>redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

<b>1.</b> redistributions of source code must retain the above <syn identifier="copyright"/> notice, this list of conditions and the following disclaimer.

<b>2.</b> redistributions in binary form must reproduce the above <syn identifier="copyright"/> notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

<b>3.</b> neither the name of the <syn identifier="copyright-holder"/> nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

this software is provided by the <syn identifier="copyright-holder"/>s and contributors "as is" and any express or implied warranties, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose are disclaimed.
in no event shall the <syn identifier="copyright-holder"/> or contributors be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) however caused and on any theory of liability, whether in contract, strict liability, or tort (including negligence or otherwise) arising in any way out of the use of this software, even if advised of the possibility of such damage.</body>
</license>
</SPDX>

(please ignore curly quotes and other formatting added by my e-mail client!)

You'll note that the product of the enhanced format could now be quite readily translated into a regular expression, or parsed as an XML document to apply matching directly by some other means. It actually occurs to me that this transitory format may not be entirely necessary, but once we have reduced it to something like a regular expression, we remove the ability of a consumer to apply this data in other ways.

Aside: normalization

As with the matching guidelines, we should define an explicit process for normalizing candidate text to be matched against the SPDX data set, and should use that same approach for normalizing the data used to produce the enhanced markup. This includes *exactly* which Unicode characters count as dashes, quotes, etc. and how (or whether) we interpret hyphenated words vs dashes, especially with regard to whitespace, and so on.

Matching process

Following this proposal, the matching process for an implementer becomes straightforward and accurate:

(optionally, if we don't supply the data in regular expression form)
Build matcher from marked-up template:
1) Parse XML
2) Render body (and/or header, exception, etc.) contents to a regular expression:
   a. Replace "alt" tags with their "match" attribute
   b. Replace "syn" tags with a synonym list [e.g. (foo|bar|baz)]
   c. Replace "b" tags with a bullet matcher
   d. Replace dashes and quotes with dash/quote matchers
3) There's no need to render copyright, title, or optional/footer sections since the matching guidelines say they can be safely ignored; only the body counts (and any addons)

Apply matcher:
1) Identify candidate text
2) Normalize candidate text (lowercase, remove whitespace)
3) Apply matcher(s) to candidate text

As you can see, no part of this involves making determinations about what counts as a bullet, what counts as substantive text, etc., and implementation is very straightforward, utilizing only basic string manipulation and an XML parser (a tool which will be available in almost every language).

Conclusion

This turned out to be quite long but, I hope, thorough. I will be pasting it into the wiki too, for reference; please have a think on it and reply with any comments/suggestions/etc.

Kris
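For reference, the "Build matcher" and "Apply matcher" steps above could look roughly like the following in Python. Treat it as a sketch under assumptions: the synonym table, the bullet pattern and the one-line template are made up for illustration, normalization here collapses whitespace runs to a single space (one possible reading of "remove whitespace"), and step 2d (dash/quote matchers) is left out.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical synonym table; in the proposal this lives in an external file.
SYNONYMS = {"copyright-holder": ["copyright holder", "copyright owner"]}

# Hypothetical bullet matcher: digits, one letter, or roman numerals,
# followed by optional "." or ")".
BULLET = r"(?:\d+|[a-z]|[ivxlc]+)[.)]?"


def normalize(text):
    """Lowercase and collapse whitespace runs to single spaces (assumption)."""
    return " ".join(text.lower().split())


def literal(text):
    """Render literal template text as an escaped, lowercased regex fragment."""
    if not text:
        return ""
    return re.escape(re.sub(r"\s+", " ", text.lower()))


def render(element):
    """Steps 2a-2c: turn an element's mixed content into a regex string."""
    parts = [literal(element.text)]
    for child in element:
        if child.tag == "alt":
            parts.append("(?:%s)" % child.get("match"))
        elif child.tag == "syn":
            options = SYNONYMS[child.get("identifier")]
            parts.append("(?:%s)" % "|".join(re.escape(o) for o in options))
        elif child.tag == "b":
            parts.append(BULLET)
        else:
            parts.append(render(child))
        parts.append(literal(child.tail))
    return "".join(parts)


def build_matcher(xml_text):
    """Steps 1-3: parse the XML and render only the body to a regex."""
    body = ET.fromstring(xml_text).find(".//body")
    return re.compile(render(body))


TEMPLATE = (
    '<SPDX><license identifier="Example">'
    '<body><b>1.</b> neither the name of the <syn identifier="copyright-holder"/>'
    ' nor <alt match=".+">the authors</alt> may be used.</body>'
    "</license></SPDX>"
)

matcher = build_matcher(TEMPLATE)
candidate = "2) Neither the name of the Copyright   Owner nor\nACME Corp may be used."
print(matcher.fullmatch(normalize(candidate)) is not None)
```

Note that the implementer never decides what a bullet or a synonym is; those come from the shared data, which is the point of the proposal.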