Richard, Jeff:

On Mon, Mar 11, 2019 at 10:32 PM Richard Fontana <[email protected]> wrote:
> Use of "LicenseRef" (not to mention something like
> NOASSERTION) is a nonstarter for the use cases we are most interested
> in. What we've actually done in some cases is use the nonstandard
> identifiers created by nexB.

Agreed. What I am trying to achieve here is to make these become "standard" and
known at SPDX. I think this is possible.

On Sun, Mar 10, 2019 at 12:44 PM Jeff McAffer
<[email protected]> wrote:
>> IMO the "ideal" here is that there is some automated way of
>> "fingerprinting" license texts such that two parties, given more or less
>> the same text, can independently come up with the same id. At that point
>> you would not need a registry, just a shared algorithm. When/if eventually
>> SPDX does recognize a given license and gives it a formal id, there could
>> be a relatively simple aliasing step where SPDX id "SomeCoolLicense-1.0"
>> is AKA "LicenseRef-43bdf298"

This ideal works in theory, but for the reasons I outline below it would be
too brittle in practice: you would get different fingerprints too often for
this to work. Running a full license detection is a better way to dedupe
things. That requires some form of centralization, but it could be fully
automated. The other thing is that IMO giving a name/id does matter a lot:
a license named 43bdf298 is not really human friendly.

Now even if license-text-fingerprint-as-id were to work out, the difficult part
is not so much the algorithm for computing these, but the content you feed to
the fingerprinting. And that part is not easy to automate:

 - For instance, is a copyright notice part of the license or not (I think
   not, but YMMV)?

 - Or what about statements around a license? For instance these two SPDX
   licenses may not really deserve a different id yet they have one:

   https://spdx.org/licenses/bzip2-1.0.6.html and
   https://spdx.org/licenses/bzip2-1.0.5.html

   The LICENSE file in the original code archives does not have the patent
   disclaimer footer seen in bzip2-1.0.5's SPDX license text. That footer is
   present only on the archive.org website. I would not treat it as part of
   the license, but it was treated as part of it here. This is a judgment
   call.

 - Or for instance, there are 6+ versions of the text of the GPL-2.0 which
   are really the same but would fingerprint differently.
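
To make the brittleness concrete, here is a minimal sketch. The fingerprint
used here is a hypothetical one picked for illustration (SHA-1 of the
lowercased, whitespace-collapsed text, truncated to 8 hex characters); it is
not something SPDX or scancode defines. Normalization absorbs reflowed text,
but as soon as a copyright line is fed in along with the license, the id
changes:

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Hypothetical fingerprint: SHA-1 of normalized text, first 8 hex chars."""
    # Lowercase and collapse all whitespace runs so line wrapping does not matter.
    normalized = re.sub(r"\s+", " ", text.lower()).strip()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()[:8]


base = "Permission is hereby granted, free of charge, to any person obtaining a copy."
# Same words, different line wrapping and indentation.
reflowed = "Permission is hereby granted,\n   free of charge, to any person\nobtaining a copy."
# Same license text, but the input also includes a (made-up) copyright line.
with_copyright = "Copyright (c) 2019 Example Author\n" + base

# Whitespace differences are normalized away, so these two ids agree.
print(fingerprint(base) == fingerprint(reflowed))
# The extra copyright line changes the hashed content, so these ids differ.
print(fingerprint(base) == fingerprint(with_copyright))
```

Every extra normalization rule (strip copyrights, strip footers, ignore
address changes) is one more judgment call that both parties would have to
make identically.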

Therefore a fingerprint algorithm would be hard to generalize: either it would
need many exceptions, or a simple one would be too brittle in too many cases.
Deduping is best achieved by license detection with a full diff (which
is what scancode does FWIW).
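
As a rough illustration of diff-based deduping, here is a sketch using
Python's standard difflib. This is a stand-in and not scancode's actual
detection: the function names and the 0.85 threshold are assumptions made up
for the example. Two GPL-2.0 notices that differ only in the FSF postal
address come out as the same license, while an unrelated text does not:

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Word-level similarity ratio between two texts, from 0.0 to 1.0."""
    tokens_a = a.lower().split()
    tokens_b = b.lower().split()
    return SequenceMatcher(None, tokens_a, tokens_b, autojunk=False).ratio()


def is_same_license(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two texts as the same license above an assumed threshold."""
    return similarity(a, b) >= threshold


notice = (
    "This program is free software; you can redistribute it and/or modify "
    "it under the terms of the GNU General Public License as published by "
    "the Free Software Foundation; either version 2 of the License, or "
    "(at your option) any later version. You should have received a copy "
    "of the GNU General Public License along with this program; if not, "
    "write to the Free Software Foundation, Inc., "
)
# Two variants of the same notice that differ only in the postal address.
gpl_old = notice + "59 Temple Place, Suite 330, Boston, MA 02111-1307 USA"
gpl_new = notice + "51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA"
# An unrelated license fragment for contrast.
mit = (
    "Permission is hereby granted, free of charge, to any person obtaining "
    "a copy of this software and associated documentation files"
)

print(is_same_license(gpl_old, gpl_new))  # address-only difference
print(is_same_license(gpl_old, mit))      # different license
```

A byte-exact fingerprint would give the two GPL notices different ids; a
diff tolerates the address change and still keeps the unrelated text apart.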

Let me follow up with my suggestion.
--
Cordially

Philippe Ombredanne
