RE: Markup proposal

Gary O'Neall Tue, 27 Oct 2015 16:54:12 -0700

Hi Kris,


Excellent point on Excel - it really is difficult to do diffs (inside or
outside of Git).

 

One more detail on the current workflow - we only maintain the license
metadata in the spreadsheet (license ID, header, related URLs); the actual
license text for the templates is already in individual text files - one
per license.  Hence the suggestion of focusing on the license body text (as
long as we are OK living with the diffs issue on Excel).

 

Gary

 

From: Kris.re [mailto:kris...@bbhmedia.com] 
Sent: Tuesday, October 27, 2015 4:14 PM
To: Gary O'Neall; 'SPDX-legal'; spdx-t...@lists.spdx.org
Subject: RE: Markup proposal

 

Hey, thanks for the references. I'm sorry if I seem to be ignoring existing
practices - a lot of the existing workflow and use cases I just don't have
much of a conception of. 

 

I very much agree that the human-maintained data should be kept as legible
as possible, which is why I attempted to keep the markup to wrapping
entities and succinct representations. Matching existing tags used for RDF-a
makes sense; the tag names suggested should be taken very much as
placeholders for another discussion entirely. The main goal was to get a
grip on the whole of things in a way that would let us discuss what that
meant for people maintaining the licenses and the rest of the workflow for
producing products for people (and programs) to consume.

 

Speaking of workflow, I don't find Excel to be a very convenient platform
for managing/editing large blocks of text, and it could be that the people
currently doing the job feel differently, but the larger advantage to
keeping license-related data in a single, text-editable file per-license is
what that means for the ability to allow contributions (via pull requests)
and track history/changes to a particular license over time. You can't
really compare diffs of an Excel spreadsheet on GitHub, and a change to the
spreadsheet could be a change to (m)any of the licenses it tracks. Keeping
files changed in lockstep with licenses changed has significant maintenance
benefits - it is much harder to miss something by accident.

 

One possible other candidate for an "input format" is Markdown; it is useful
and fairly widely known, but unless we hijack the "link" format to specify
matching regular expressions (which feels kind of dirty but is totally
possible) I'm not sure there's a straightforward way to supply the
alternate/matching functionality. A downside to utilizing Markdown is that
every Markdown parser operates slightly differently.

 

One other point worth noting about XML is that it provides information in a
form that other formats such as JSON cannot; this is because it is
fundamentally a markup language, not a data transport format. This is a
significant point in favor of XML as a "source" format rather than a
"destination" format, but as was previously noted, the "fully marked up"
templates will be awful for a human to maintain, so the stress on XML itself
as the target "source" language is much less. The choice of something
familiar and accessible is, however, important for that component.

 

Kris

 

 

From: Gary O'Neall [mailto:g...@sourceauditor.com] 
Sent: Tuesday, October 27, 2015 3:24 PM
To: Kris.re <kris...@bbhmedia.com>; 'SPDX-legal'
<spdx-legal@lists.spdx.org>; spdx-t...@lists.spdx.org
Subject: RE: Markup proposal

 

Hi Kris,


Thanks for writing up such a thorough proposal.  This will make it much
easier to discuss some of the specifics.

 

A couple quick items - you mentioned that you could not find the syntax for
the current text.  It is in the SPDX specification PDF file
(http://spdx.org/sites/spdx/files/SPDX-2.0.pdf) Appendix II page 78.

 

Also, in answer to your question if the variable text is used - the answer
is yes.  They are implemented by the SPDX tools and I do use that
functionality myself.  I would imagine there are others besides myself as
well.

 

There are a lot of items I could respond to, but I think I'll save the
details for our call.

 

I do have 4 high level observations/opinions/suggestions:

 

- The markup proposed is a pretty big change in how we actually produce the
licenses.  If I understand the markup language maintained by the humans,
IMHO it is putting a larger burden on the legal team than the current
approach and may limit the adoption of the markup.  I remember debates just
on annotating the title, which is also in your proposal + several other
annotations.  Before digging into the technical details too much, I want to
make sure we discuss/understand the process changes and that the legal team
is signed up to create and maintain the new markups.

 

- Some of the tags in your proposal are already defined and used in RDF-a
which is already machine parseable (e.g. license-identifier/licenseID and
header).  See the document Accessing SPDX Licenses at
http://wiki.spdx.org/images/SPDX-TR-2014-2.v1.0.pdf for details.  We could
add another representation in XML (just like we are adding JSON format).  If
we do, I would propose using the same tags as in the spec.  The input for
these fields is in a spreadsheet maintained by the legal team and a tool
generates the HTML file for the website.  If we went to these annotations as
the source or input to the website creation tools, I would assume we would
discard the spreadsheet and go strictly to the annotated files - which would
be a change to the process.  I would propose that we focus on the body of
the license text and not revisit the other tags since that is where the
current challenges are. 

 

- I really like the idea of creating a normalized license format to make it
easier for downstream tools to match.  Having implemented a matcher just
using the current rules, I can confirm it is not that easy and it is error
prone.  From reading through your proposal, you have a lot of good ideas on
how to implement this.  We could enhance the current tool that generates the
website to produce normalized templates - it is just a SMOP (small matter of
programming ;).

 

- In thinking about this a bit, we may want to separate the discussion into
two parts:

1) What should be the format for the human maintained licenses?  (e.g. the
files which are fed into the tool that generates the website)

2) What should be maintained as a format for providing matching by external
tools?  This could be produced by the tool that generates the website as
long as the information to produce that format is present in the human
maintained licenses.

 

Gary

 

 

From: spdx-tech-boun...@lists.spdx.org
[mailto:spdx-tech-boun...@lists.spdx.org] On Behalf Of Kris.re
Sent: Tuesday, October 27, 2015 9:02 AM
To: 'SPDX-legal'; spdx-t...@lists.spdx.org
Subject: Markup proposal

 

Introduction

As discussed on the last legal call, here's my proposal for enhancing the
markup.

 

There are two problems that need solving:

 

1)      Maintaining a version of the licenses with just enough markup to aid
in the computer-hard decisions

2)      Making computer matching entirely logic-free by providing the required
information to an implementation with markup

 

Since (2) may necessarily be difficult to maintain by hand, as well as be
error-prone (for example, finding *every* synonym instance in some license
text), I suggest that a build process is necessary to produce the "enhanced"
[item (2)] markup from the human-maintained "source" markup.


Categorization of matching guidelines


 

Items that belong in the source markup are items that are easy for a human
to identify but difficult for a program to identify (or easy for multiple
implementations to get "differently"). These items include:

 

-          Headers (copyright declaration, license name preface)

-          Bullets (specifically in the case of numbered bullets with roman
numerals, it can be difficult to distinguish accurately from wrapped text
ending a sentence)

-          Optional sections (such as instructions on license application)

-          References to the author / copyright holder

 

Items that belong in the enhanced markup are items that are easy for a
program to identify, where the build process can create a canonical
"correct" form of the text to match, thus 1) relieving implementations of
the need to write boilerplate parsing code and 2) ensuring that
implementations produce the same results given the same sample text. These
include:

 

-          Varietal spellings

-          Copyright symbol

-          Punctuation (dashes and quotes) 

-          Whitespace

-          Capitalization

 

The whitespace and capitalization items are somewhat questionable, but since
there is already a build process it doesn't hurt us to add them, and it
means that implementations will not have a need to perform any
transformations on the matching-text beyond XML parsing. An important note:
whitespace between words being significant, this approach will not
necessarily lead to the desired results if we encounter a need or desire to
create *optional* sections of text. For example, "James, while John had had
<alt match="(had )*"/> had had a better effect on the teacher" - in this
example sentence, if zero matches occur, we will be matching a sentence with
two spaces in a row. I don't believe this will currently be a problem,
because I don't know of any reason for us to mark up templates in this way;
every matchable component should encompass the left and right word
boundaries, so spacing should not be an issue and should be preserved
exactly as produced in the enhanced markup.
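
The double-space problem can be reproduced with a small Python sketch. Both
patterns below are hypothetical renderings of the example sentence (lowercased,
with the "alt" element replaced by its "match" attribute); they are not taken
from any existing tool.

```python
import re

# Naive rendering: the optional group sits between two literal spaces, so
# the pattern always demands two spaces in a row around it.
naive = re.compile(r"john had had (had )* had had a better effect")

# Rendering where the optional component carries its own left word boundary.
bounded = re.compile(r"john had had( had)* had had a better effect")

single_spaced = "john had had had had a better effect"
double_spaced = "john had had  had had a better effect"

print(bool(naive.fullmatch(single_spaced)))    # ordinary text is rejected
print(bool(naive.fullmatch(double_spaced)))    # only the odd spacing matches
print(bool(bounded.fullmatch(single_spaced)))  # boundary-aware form matches
```

The sketch only illustrates why a matchable component should include its
surrounding word boundary; how whitespace is ultimately normalized is a
separate decision.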


Translation of current markup to proposal


 

Please correct me if I am mistaken - I appear to be unable to find
documentation on the current markup - but I observe two elements in the
current markup syntax: 1) entity replacement and 2) optional section blocks.
These are quite readily translated into an XML form. I am uncertain if the
variable function is used by any of the current tooling, but it can be left
off until desired or added as-is.

 

Suggested transformations:

1)      <<var; name=foo; original=Original Text; match=.+>>
becomes:
<alt match=".+" name="foo">Original Text</alt>

2)      <<beginOptional; name=optionalIntro>>Some text here<<endOptional>>
becomes:
<section name="optionalIntro">Some text here</section>

 

This is a fairly direct translation and could even be performed
programmatically on the existing data if we elect to use a generic
"optional" tag; above I've used "section" as a placeholder for multiple
candidate tags depending on the section being described.
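
A minimal sketch of such a programmatic conversion in Python, assuming the
legacy syntax always looks like the two examples above (fields separated by
";", in the order shown). The function name "convert" and the regular
expressions are illustrative assumptions, not part of any SPDX tool, and
nested optional blocks or "match" values containing ">>" are not handled.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# One pattern for both legacy constructs. The field layout is inferred from
# the two examples in this proposal and is an assumption.
TOKEN_RE = re.compile(
    r"<<var;\s*name=(?P<vname>[^;>]+);\s*original=(?P<orig>[^;]*);"
    r"\s*match=(?P<match>.+?)>>"
    r"|<<beginOptional;\s*name=(?P<oname>[^;>]+)>>(?P<body>.*?)<<endOptional>>",
    re.DOTALL,
)


def convert(text):
    """Translate legacy <<...>> markup into the proposed XML form."""
    out = []
    pos = 0
    for m in TOKEN_RE.finditer(text):
        # Literal text between constructs must have <, > and & escaped.
        out.append(escape(text[pos:m.start()]))
        if m.group("vname") is not None:
            out.append("<alt match=%s name=%s>%s</alt>" % (
                quoteattr(m.group("match")),
                quoteattr(m.group("vname").strip()),
                escape(m.group("orig")),
            ))
        else:
            # "section" is the placeholder tag; the body may contain vars.
            out.append("<section name=%s>%s</section>" % (
                quoteattr(m.group("oname").strip()),
                convert(m.group("body")),
            ))
        pos = m.end()
    out.append(escape(text[pos:]))
    return "".join(out)


print(convert("<<var; name=foo; original=Original Text; match=.+>>"))
print(convert("<<beginOptional; name=optionalIntro>>Some text here<<endOptional>>"))
```

Escaping the literal text in the same pass also covers the "<year>" and
"<owner>" placeholders that appear in many copyright lines.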

 

While we can define all desired elements in terms of "alt" tags, I believe
some of them at least deserve their own tags. Bullets, for example, I would
suggest wrapping in "<b>"; this is primarily for brevity:


<b>1.</b> Some clause .
<b>2.</b> More items..

<b>a.</b> Sub-item.

 

The need for brevity comes as a result of trying to keep the source markup
as human-manageable as possible. While it's somewhat tempting to designate
hierarchical <ol> or <ul> items as in HTML, this doesn't actually provide
any use for *matching* purposes, so has been explicitly discarded.

 

Other items that might warrant their own tags actually become section
wrappers, such as for copyright declarations:

<copyright>Copyright C &lt;year&gt; &lt;owner&gt;. All rights
reserved.</copyright>

(This also points out the need to escape < and > symbols as entities, a task
which can be performed programmatically in a conversion process as well)

 

Other tags wrapping optional sections might include: <title>, <footer>,
<header>, and <optional>.


Synonyms


 

One extra tag can support the enhanced markup in a useful way: for
varietal/alternative spellings, an external file with some metadata can
define the allowed synonyms for some string of text, allowing this to be
updated without the need to modify the contents of the license templates
themselves. These can be substituted into the document at build time to
provide data-driven matching that supports these matching rules, e.g.:

"Neither the name of the <syn identifier="copyright-holder"/> nor the
names."

(and, externally:)

<synonyms identifier="copyright-holder">

  <synonym>Copyright holder</synonym>
  <synonym>Copyright owner</synonym>
</synonyms>
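
The build-time substitution could look roughly like the following Python
sketch. It assumes the external file has exactly the shape shown above and
that the license text is available as a plain string; "load_synonyms" and
"insert_syn_tags" are hypothetical helper names.

```python
import re
import xml.etree.ElementTree as ET

SYNONYMS_XML = """
<synonyms identifier="copyright-holder">
  <synonym>Copyright holder</synonym>
  <synonym>Copyright owner</synonym>
</synonyms>
"""


def load_synonyms(xml_text):
    """Return (identifier, list of variants) from one <synonyms> element."""
    root = ET.fromstring(xml_text.strip())
    variants = [s.text.strip() for s in root.findall("synonym")]
    return root.get("identifier"), variants


def insert_syn_tags(text, identifier, variants):
    """Replace every variant in text with a <syn identifier="..."/> tag."""
    # Longest variants first so a longer phrase wins over a shorter prefix.
    ordered = sorted(variants, key=len, reverse=True)
    alternation = "|".join(re.escape(v) for v in ordered)
    pattern = re.compile(r"\b(?:%s)\b" % alternation, re.IGNORECASE)
    return pattern.sub('<syn identifier="%s"/>' % identifier, text)


identifier, variants = load_synonyms(SYNONYMS_XML)
print(insert_syn_tags(
    "Neither the name of the copyright holder nor the names.",
    identifier, variants))
```

Because the variants live outside the templates, adding a new synonym only
requires rerunning the build, not editing each license.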

 

While this approach can also be used to handle alternate values of dashes,
quotes, etc., plain normalization of the data in the enhanced markup should
suffice, a transformation which can be applied along with the lowercasing
and whitespace removal: all varieties of dashes become a simple ASCII
hyphen, all varieties of quotes become a simple ASCII quote, etc.

 

It may be a good idea to also "externalize" matching of copyright headers,
bullets, dashes, quotes, etc. in this fashion, since their structure and
meaning are independent of an actual license file. Providing an explicit list
of valid characters or a regular expression for these items in the SPDX
database metadata will ensure that implementations are kept consistent.


Structure of overall document


 

This is pretty straightforward:

 

<license identifier="SuchAndSo">
  <title>The Such and So License</title>
  <copyright>Copyright C 2015 Foo Bars</copyright>


  <body>License text ..</body>

  <footer>How to apply this license: ..</footer>
</license>

 

Whitespace can be formatted such that the concatenated text content of the
XML file produces the original document, though that poses a minor problem
if we desire to include other data streams, which I recommend we do. One way
to handle that would be to wrap the above example in a higher root element,
which would allow us to include things such as optional clauses and license
headers:

<SPDX>

  <header>This file is licensed under the Such and So License</header>
  <license identifier="SuchAndSo">(as above)</license>
  <optional identifier="SuchAndSo-foos-exception">.</optional>

</SPDX>


Examples


 

Using the above ideas, here is a marked-up example of the BSD 3 clause
license in the "source" format:

<SPDX>
<license identifier="BSD-3-Clause">

<copyright>Copyright (c) &lt;year&gt; &lt;owner&gt;. All rights
reserved.</copyright>

 

<body>Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

 

<b>1.</b> Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer. 

 

<b>2.</b> Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution. 

 

<b>3.</b> Neither the name of <alt match=".+">the copyright holder</alt> nor
the names of its contributors may be used to endorse or promote products
derived from this software without specific prior written permission.

 

THIS SOFTWARE IS PROVIDED BY <alt match=".+">THE COPYRIGHT HOLDERS AND
CONTRIBUTORS</alt> "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING,
BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL <alt
match=".+">THE COPYRIGHT HOLDER OR CONTRIBUTORS</alt> BE LIABLE FOR ANY
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.</body>

</license>

</SPDX>

And again in the "enhanced" format:

 

<SPDX>
<license identifier="BSD-3-Clause">

<copyright><syn identifier="copyright"/> <syn identifier="copyright"/>
&lt;year&gt; &lt;owner&gt;. all rights reserved.</copyright>

<body>redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
<b>1.</b> redistributions of source code must retain the above <syn
identifier="copyright"/> notice, this list of conditions and the following
disclaimer. <b>2.</b> redistributions in binary form must reproduce the
above <syn identifier="copyright"/> notice, this list of conditions and the
following disclaimer in the documentation and/or other materials provided
with the distribution. <b>3.</b> neither the name of the <syn
identifier="copyright-holder"/> nor the names of its contributors may be
used to endorse or promote products derived from this software without
specific prior written permission. this software is provided by the <syn
identifier="copyright-holder"/>s and contributors "as is" and any express or
implied warranties, including, but not limited to, the implied warranties of
merchantability and fitness for a particular purpose are disclaimed. in no
event shall the <syn identifier="copyright-holder"/> or contributors be
liable for any direct, indirect, incidental, special, exemplary, or
consequential damages (including, but not limited to, procurement of
substitute goods or services; loss of use, data, or profits; or business
interruption) however caused and on any theory of liability, whether in
contract, strict liability, or tort (including negligence or otherwise)
arising in any way out of the use of this software, even if advised of the
possibility of such damage.</body>
</license>
</SPDX>

(please ignore curly quotes and other formatting added by my e-mail client!)

You'll note that the product of the enhanced format could now be quite
readily translated into a regular expression, or parsed as an XML document
to apply matching directly by some other means. It actually occurs to me
that this transitory format may not be entirely necessary, but once we have
reduced it to something like a regular expression, we remove the ability of
a consumer to apply this data in other ways.


Aside: normalization


 

As with the matching guidelines, we should define an explicit process for
normalizing candidate text to be matched against the SPDX data set, and
should use that same approach for normalizing the data used to produce the
enhanced markup. This includes *exactly* which Unicode characters count as
dashes, quotes, etc. and how (or whether) we interpret hyphenated words vs
dashes, especially with regards to whitespace, and so on.
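
One possible shape for such a normalization step, sketched in Python. The
character lists are illustrative assumptions only (the point of this section
is that the real lists still need to be agreed on), and whitespace runs are
collapsed to a single space here rather than removed entirely.

```python
import re

# Illustrative character sets; the authoritative lists are still undecided.
DASHES = "\u2010\u2011\u2012\u2013\u2014\u2015\u2212"
DOUBLE_QUOTES = "\u201c\u201d\u201e\u00ab\u00bb"
SINGLE_QUOTES = "\u2018\u2019\u201a`"

TABLE = {}
TABLE.update({ord(c): "-" for c in DASHES})
TABLE.update({ord(c): '"' for c in DOUBLE_QUOTES})
TABLE.update({ord(c): "'" for c in SINGLE_QUOTES})


def normalize(text):
    """Map dash/quote variants to ASCII, lowercase, collapse whitespace."""
    text = text.translate(TABLE).lower()
    return re.sub(r"\s+", " ", text).strip()


print(normalize("THIS  SOFTWARE \u201cAS IS\u201d \u2014 no\nwarranty"))
```

Running the same function over both the template data and the candidate text
is what keeps the two sides comparable.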

Matching process

 

Following this proposal, the matching process for an implementer becomes
straightforward and accurate:

 

(optionally, if we don't supply the data in regular expression form) Build
matcher from marked up template:

1)      Parse XML

2)      Render body (and/or header, exception, etc.) contents to a regular
expression:

a.       Replace "alt" tags with their "match" attribute

b.      Replace "syn" tags with a synonym list [e.g. (foo|bar|baz)]

c.       Replace "b" tags with a bullet matcher

d.      Replace dashes and quotes with dash/quote matchers

3)      There's no need to render copyright, title, or optional/footer
sections since the matching guidelines say they can be safely ignored; only
the body counts (and any addons)
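
The rendering steps above can be sketched in Python with the standard
library. The bullet pattern, the synonym table and the sample template are
assumptions made for illustration. Step (d) is left out because the sketch
assumes dashes and quotes were already normalized to ASCII on both sides.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical bullet matcher: digits or letters followed by "." or ")".
BULLET = r"(?:[0-9]+|[a-z]+)[.)]"

# Hypothetical synonym table, as it might be loaded from an external file.
SYNONYMS = {"copyright-holder": ["copyright holder", "copyright owner"]}


def literal(text):
    """Escape literal template text, collapsing whitespace runs."""
    if not text:
        return ""
    return re.escape(re.sub(r"\s+", " ", text))


def render(elem):
    """Render an element's mixed content to a regular expression string."""
    out = [literal(elem.text)]
    for child in elem:
        if child.tag == "alt":
            out.append("(?:%s)" % child.get("match"))
        elif child.tag == "syn":
            words = SYNONYMS[child.get("identifier")]
            out.append("(?:%s)" % "|".join(re.escape(w) for w in words))
        elif child.tag == "b":
            out.append(BULLET)
        else:
            out.append(render(child))
        # Text following the child element belongs to the parent's content.
        out.append(literal(child.tail))
    return "".join(out)


def build_matcher(xml_text):
    """Compile a matcher from the <body> of a template; the rest is ignored."""
    body = ET.fromstring(xml_text).find(".//body")
    return re.compile(render(body))


TEMPLATE = (
    '<SPDX><license identifier="Demo"><copyright>ignored</copyright>'
    '<body>neither the name of the <syn identifier="copyright-holder"/> nor '
    '<alt match=".+">the contributors</alt> may be used. <b>1.</b> done</body>'
    "</license></SPDX>"
)
matcher = build_matcher(TEMPLATE)
print(matcher.pattern)
```

Only "alt", "syn" and "b" get special handling; any other element inside the
body is rendered recursively as ordinary content.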

 

Apply matcher:

 

1)      Identify candidate text

2)      Normalize candidate text (lowercase, remove whitespace)

3)      Apply matcher(s) to candidate text
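
These three steps could be wired together roughly as follows. The two
matchers and their identifiers are made up for illustration; in practice they
would be built from the enhanced templates. Normalization here is reduced to
lowercasing and collapsing whitespace, which is an assumption.

```python
import re

# Hypothetical precompiled matchers keyed by license identifier.
MATCHERS = {
    "Demo-1": re.compile(r"permission is granted to (?:.+) to use this software\."),
    "Demo-2": re.compile(r"all rights reserved by (?:.+)\."),
}


def normalize(text):
    """Lowercase and collapse whitespace runs to a single space."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def identify(candidate):
    """Return the identifiers of all templates matching the candidate text."""
    normalized = normalize(candidate)
    return sorted(
        license_id
        for license_id, matcher in MATCHERS.items()
        if matcher.fullmatch(normalized)
    )


print(identify("Permission is granted\n  to ACME Corp to use this software."))
```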

 

As you can see, no part of this involves making determinations about what
counts as a bullet, what counts as substantive text, etc., and
implementation is very straightforward, utilizing only basic string
manipulation and an XML parser (a tool which will be available in most every
language).


Conclusion


 

This turned out to be quite long but, I hope, thorough. I will be pasting it
into the wiki too, for reference; please have a think on it and reply with
any comments/suggestions/etc.

 

Kris

_______________________________________________
Spdx-legal mailing list
Spdx-legal@lists.spdx.org
https://lists.spdx.org/mailman/listinfo/spdx-legal
