Markup proposal

Kris . re Tue, 27 Oct 2015 09:19:18 -0700

Introduction

As discussed on the last legal call, here's my proposal for enhancing the 
markup.


There are two problems that need solving:


1)      Maintaining a version of the licenses with just enough markup to aid in 
the computer-hard decisions

2)      Make computer matching entirely logic-free by providing the required 
information to an implementation with markup

Since (2) may necessarily be difficult to maintain by hand, as well as be 
error-prone (for example, finding *every* synonym instance in some license 
text), I suggest that a build process is necessary to produce the "enhanced" 
[item (2)] markup from the human-maintained "source" markup.
Categorization of matching guidelines

Items that belong in the source markup are items that are easy for a human to 
identify but difficult for a program to identify (or easy for multiple 
implementations to get "differently"). These items include:


-          Headers (copyright declaration, license name preface)

-          Bullets (specifically in the case of numbered bullets with roman 
numerals, it can be difficult to distinguish accurately from wrapped text 
ending a sentence)

-          Optional sections (such as instructions on license application)

-          References to the author / copyright holder

Items that belong in the enhanced markup are items that are easy for a program 
to identify, where the build process can create a canonical "correct" form of 
the text to-match, thus 1) obviating implementations from the need to  write 
boilerplate parsing code and 2) ensuring that implementations produce the same 
results given the same sample text. These include:


-          Varietal spellings

-          Copyright symbol

-          Punctuation (dashes and quotes)

-          Whitespace

-          Capitalization

The whitespace and capitalization items are somewhat questionable, but since 
there is already a build process it doesn't hurt us to add them, and it means 
that implementations will not have a need to perform any transformations on the 
matching-text beyond XML parsing. An important note: whitespace between words 
being significant, this approach will not necessarily lead to the desired 
results if we encounter a need or desire to create *optional* sections of text. 
For example, "James, while John had had <alt match="(had )*"/> had had a better 
effect on the teacher" - in this example sentence, if zero matches occur, we 
will be matching a sentence with two spaces in a row. I don't believe this will 
currently be a problem, because I don't know of any reason for us to mark up 
templates in this way; every matchable component should encompass the left and 
right word boundaries, so spacing should not be an issue and should be 
preserved exactly as produced in the enhanced markup.
Translation of current markup to proposal

Please correct me if I am mistaken - I appear to be unable to find 
documentation on the current markup - but I observe two elements in the current 
markup syntax: 1) entity replacement and 2) optional section blocks. These are 
quite readily translated into an XML form. I am uncertain if the variable 
function is used by any of the current tooling, but it can be left off until 
desired or added as-is.

Suggested transformations:

1)      <<var; name=foo; original=Original Text; match=.+>>
becomes:
<alt match=".+" name="foo">Original Text</alt>

2)      <<beginOptional; name=optionalIntro>>Some text here<<endOptional>>
becomes:
<section name="optionalIntro">Some text here</section>

This is a fairly direct translation and could even be performed 
programmatically on the existing data if we elect to use a generic "optional" 
tag; above I've used "section" as a placeholder for multiple candidate tags 
depending on the section being described.

While we can define all desired elements in terms of "alt" tags, I believe some 
of them at least deserve their own tags. Bullets, for example, I would suggest 
wrapping in "<b>"; this is primarily for brevity:

<b>1.</b> Some clause ...
<b>2.</b> More items....
<b>a.</b> Sub-item...

The need for brevity comes as a result of trying to keep the source markup as 
human-manageable as possible. While it's somewhat tempting to designate 
hierarchical <ol> or <ul> items as in HTML, this doesn't actually provide any 
use for *matching* purposes, so has been explicitly discarded.

Other items that might warrant their own tags actually become section wrappers, 
such as for copyright declarations:

<copyright>Copyright (c) &lt;year&gt; &lt;owner&gt;. All rights 
reserved.</copyright>

(This also points out the need to escape < and > symbols as entities, a task 
which can be performed programmatically in a conversion process as well)

Other tags wrapping optional sections might include: <title>, <footer>, 
<header>, and <optional>.
Synonyms

One extra tag can support the enhanced markup in a useful way: for 
varietal/alternative spellings, an external file with some metadata can define 
the allowed synonyms for some string of text, allowing this to be updated 
without the need to modify the contents of the license templates themselves. 
These can be substituted into the document at build time to provide data-driven 
matching that supports these matching rules, e.g.:

"Neither the name of the <syn identifier="copyright-holder"/> nor the names..."

(and, externally:)

<synonyms identifier="copyright-holder">
  <synonym>Copyright holder</synonym>
  <synonym>Copyright owner</synonym>
</synonyms>

While this approach can also be used to handle alternate values of dashes, 
quotes, etc., plain normalization of the data in the enhanced markup should 
suffice, a transformation which can be applied along with the lowercasing and 
whitespace removal: all varieties of dashes become a simple ascii hyphen, all 
varieties of quotes become a simple ascii quote, etc.

It may be a good idea to also "externalize" matching of copyright headers, 
bullets, dashes, quotes, etc. in this fashion, since their structure and 
meaning is independent of an actual license file. Providing an explicit list of 
valid characters or a regular expression for these items in the SPDX database 
metadata will ensure that implementations are kept consistent.
Structure of overall document

This is pretty straightforward:

<license identifier="SuchAndSo">
  <title>The Such and So License</title>
  <copyright>Copyright (c) 2015 Foo Bars</copyright>

  <body>License text ....</body>

  <footer>How to apply this license: ....</footer>
</license>

Whitespace can be formatted such that the concatenated text content of the XML 
file produces the original document, though that poses a minor problem if we 
desire to include other data streams, which I recommend we do. One way to 
handle that would be to wrap the above example in a higher root element, which 
would allow us to include things such as optional clauses and license headers:

<SPDX>
  <header>This file is licensed under the Such and So License</header>
  <license identifier="SuchAndSo">(as above)</license>
  <optional identifier="SuchAndSo-foos-exception">...</optional>
</SPDX>
Examples

Using the above ideas, here is a marked-up example of the BSD 3 clause license 
in the "source" format:

<SPDX>
<license identifier="BSD-3-Clause">
<copyright>Copyright (c) &lt;year&gt; &lt;owner&gt;. All rights 
reserved.</copyright>

<body>Redistribution and use in source and binary forms, with or without 
modification, are permitted provided that the following conditions are met:

<b>1.</b> Redistributions of source code must retain the above copyright 
notice, this list of conditions and the following disclaimer.

<b>2.</b> Redistributions in binary form must reproduce the above copyright 
notice, this list of conditions and the following disclaimer in the 
documentation and/or other materials provided with the distribution.

<b>3.</b> Neither the name of <alt match=".+">the copyright holder</alt> nor 
the names of its contributors may be used to endorse or promote products 
derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY <alt match=".+">THE COPYRIGHT HOLDERS AND 
CONTRIBUTORS</alt> "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, 
BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A 
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL <alt match=".+">THE 
COPYRIGHT HOLDER OR CONTRIBUTORS</alt> BE LIABLE FOR ANY DIRECT, INDIRECT, 
INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT 
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR 
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF 
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE 
OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF 
ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.</body>
</license>
</SPDX>

And again in the "enhanced" format:

<SPDX>
<license identifier="BSD-3-Clause">
<copyright><syn identifier="copyright"/> <syn identifier="copyright"/> 
&lt;year&gt; &lt;owner&gt;. all rights reserved.</copyright>
<body>redistribution and use in source and binary forms, with or without 
modification, are permitted provided that the following conditions are met: 
<b>1.</b> redistributions of source code must retain the above <syn 
identifier="copyright"/> notice, this list of conditions and the following 
disclaimer. <b>2.</b> redistributions in binary form must reproduce the above 
<syn identifier="copyright"/> notice, this list of conditions and the following 
disclaimer in the documentation and/or other materials provided with the 
distribution. <b>3.</b> neither the name of the <syn 
identifier="copyright-holder"/> nor the names of its contributors may be used 
to endorse or promote products derived from this software without specific 
prior written permission. this software is provided by the <syn 
identifier="copyright-holder"/>s and contributors "as is" and any express or 
implied warranties, including, but not limited to, the implied warranties of 
merchantability and fitness for a particular purpose are disclaimed. in no 
event shall the <syn identifier="copyright-holder"/> or contributors be liable 
for any direct, indirect, incidental, special, exemplary, or consequential 
damages (including, but not limited to, procurement of substitute goods or 
services; loss of use, data, or profits; or business interruption) however 
caused and on any theory of liability, whether in contract, strict liability, 
or tort (including negligence or otherwise) arising in any way out of the use 
of this software, even if advised of the possibility of such damage.</body>
</license>
</SPDX>

(please ignore curly quotes and other formatting added my e-mail client!)

You'll note that the product of the enhanced format could now be quite readily 
translated into a regular expression, or parsed as an XML document to apply 
matching directly by some other means. It actually occurs to me that this 
transitory format may not be entirely necessary, but once we have reduced it to 
something like a regular expression, we remove the ability of a consumer to 
apply this data in other ways.
Aside: normalization

As with the matching guidelines, we should define an explicit process for 
normalizing candidate text to be matched against the SPDX data set, and should 
use that same approach for normalizing the data used to produce the enhanced 
markup. This includes *exactly* which Unicode characters count as dashes, 
quotes, etc. and how (or whether) we interpret hyphenated words vs dashes, 
especially with regards to whitespace, and so on.

Matching process

Following this proposal, the matching process for an implementer becomes 
straightforward and accurate:

(optionally, if we don't supply the data in regular expression form) Build 
matcher from marked up template:

1)      Parse XML

2)      Render body (and/or header, exception, etc.) contents to a regular 
expression:

a.       Replace "alt" tags with their "match" attribute

b.      Replace "syn" tags with a synonym list [e.g. (foo|bar|baz)]

c.       Replace "b" tags with a bullet matcher

d.      Replace dashes and quotes with dash/quote matchers

3)      There's no need to render copyright, title, or optional/footer sections 
since the matching guidelines say they can be safely ignored; only the body 
counts (and any addons)

Apply matcher:


1)      Identify candidate text

2)      Normalize candidate text (lowercase, remove whitespace

3)      Apply matcher(s) to candidate text

As you can see, no part of this involves making determinations about what 
counts as a bullet, what counts as substantive text, etc., and implementation 
is very straightforward, utilizing only basic string manipulation and an XML 
parser (a tool which will be available in most every language).
Conclusion

This turned out to be quite long but, I hope, thorough. I will be pasting it 
into the wiki too, for reference; please have a think on it and reply with any 
comments/suggestions/etc.

Kris

_______________________________________________
Spdx-tech mailing list
[email protected]
https://lists.spdx.org/mailman/listinfo/spdx-tech

Markup proposal

Reply via email to