Richard, Jeff:

On Mon, Mar 11, 2019 at 10:32 PM Richard Fontana <[email protected]> wrote:
> Use of "LicenseRef" (not to mention something like
> NOASSERTION) is a nonstarter for the use cases we are most interested
> in. What we've actually done in some cases is use the nonstandard
> identifiers created by nexB.

Agreed. What I am trying to achieve here is to make these become
"standard" and known at SPDX. I think this is possible.

On Sun, Mar 10, 2019 at 12:44 PM Jeff McAffer <[email protected]> wrote:
>> IMO the "ideal" here is that there is some automated way of
>> "fingerprinting" license texts such that two parties, given more or less
>> the same text, can independently come up with the same id. At that point
>> you would not need a registry, just a shared algorithm. When/if eventually
>> SPDX does recognize a given license and gives it a formal id, there could
>> be a relatively simple aliasing step where SPDX id "SomeCoolLicense-1.0"
>> is AKA "LicenseRef-43bdf298"

This ideal works in theory, but for the reasons I outline below it would
be too brittle in practice: you would get different fingerprints for what
is really the same license too often for this to work. Running a full
license detection is a better way to dedupe things. That requires some
form of centralization, but it could be fully automated.

The other thing is that IMO giving a name/id matters a lot: a license
named 43bdf298 is not really human friendly.

Now even if license-text-fingerprint-as-id were to work out, the difficult
part is not so much the algorithm for computing these, but the content you
feed for fingerprinting. And that part is not easy to automate:

- For instance, is a copyright notice part of the license or not (I think
  not, but YMMV)?

- Or what about statements around a license? For instance these two SPDX
  licenses may not really deserve a different id yet they have one:
  https://spdx.org/licenses/bzip2-1.0.6.html and
  https://spdx.org/licenses/bzip2-1.0.5.html
  The LICENSE file in the original code archives does not have the patent
  disclaimer footer seen in bzip2-1.0.5's SPDX license text. That footer
  is present on the archive.org website only. I would not treat this as
  part of the license, but it was treated as part of it here. This is a
  judgment call.

- Or for instance, there are 6+ versions of the text of the GPL-2.0 which
  are really the same but would fingerprint differently.

Therefore a fingerprint algorithm would be hard to generalize: either it
would need many exceptions, or a simple one would be too brittle in too
many cases. Deduping is best achieved by license detection with a full
diff (which is what scancode does FWIW).

Let me follow up with my suggestion.

-- 
Cordially
Philippe Ombredanne

View/Reply Online (#3668): https://lists.spdx.org/g/Spdx-tech/message/3668
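
P.S. To make the brittleness point concrete, here is a minimal sketch in
Python. Everything in it is an assumption for illustration: the
normalization rule, the truncated SHA-1 id scheme, the function names and
the sample texts are all made up, and the diff step uses the standard
library's difflib rather than anything scancode actually does. It only
shows that a hash-based id survives whitespace and case changes but flips
as soon as a footer is counted as part of the text, while a diff-based
comparison still reports the two texts as close.

```python
import hashlib
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse every run of non-alphanumerics to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def fingerprint(text):
    """Hypothetical id scheme: first 8 hex digits of SHA-1 of the normalized text."""
    digest = hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()
    return "LicenseRef-" + digest[:8]


def similarity(a, b):
    """Word-level similarity ratio between 0.0 and 1.0 (stand-in for a full diff)."""
    return SequenceMatcher(None, normalize(a).split(), normalize(b).split()).ratio()


# Invented sample texts, not taken from any real license.
base = (
    "Permission is granted to anyone to use this software for any purpose, "
    "including commercial applications, and to alter it and redistribute it freely."
)
reflowed = (
    "Permission is granted to anyone to use this software\n"
    "  for ANY purpose, including commercial applications,\n"
    " and to alter it and redistribute it freely."
)
with_footer = base + " PATENTS: to the best of my knowledge this software is patent free."

# Whitespace and case differences are absorbed by normalization.
print(fingerprint(base) == fingerprint(reflowed))
# A trailing statement around the license changes the fingerprint entirely.
print(fingerprint(base) == fingerprint(with_footer))
# The diff-based score still shows the two texts are mostly the same.
print(round(similarity(base, with_footer), 2))
```

A threshold on such a score would still need a human decision about what
counts as "the same license", which is the judgment call mentioned above.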
