Hi Philippe,

The question from Rose comes from my ticket in Tern: 
https://github.com/tern-tools/tern/issues/1188

It would be very good if you could help Rose implement a better solution than 
a single LicenseRef- with multiple licenses in it.

Best regards,

Marc-Etienne

-----Original Message-----
From: [email protected] <[email protected]> On Behalf Of Philippe 
Ombredanne via lists.spdx.org
Sent: Wednesday, December 7, 2022 4:20 PM
To: [email protected]
Cc: [email protected]; [email protected]
Subject: Re: [spdx-tech] Multiple Licenses in a single LicenseRef?

Hi Rose:
Welcome back!

On Fri, Dec 2, 2022 at 10:55 PM Rose Judge via lists.spdx.org 
<[email protected]> wrote:

> Tern is a tool that can generate SPDX documents for containers.
> When we are collecting license information for Debian packages inside 
> a container, we must scan the copyright files to gather any type of 
> license information for that package. We do this with the 
> Debian-inspector library; other package managers like apk or rpm can 
> provide a direct license for a package with a straightforward command.

First, thank you for using the debian-inspector library 
https://github.com/nexB/debian-inspector !

I have observed that RPM and Alpine seldom provide straightforward package 
licenses.
They each provide a brief license summary but based on extensive scan reviews 
my take is this:

- Alpine has a fair share of approximate license statements that are outdated 
and out of sync with the code.
- RPM license tags are heavily summarized and hide several details; any extra 
attached license texts need to be scanned to make sense of them.
- In contrast, Debian is extra verbose and lacks the summarization provided by 
these two.

Nothing is perfect in this lowly world.

> This means that licenses associated with a debian package typically 
> look something like this after scanning the copyright text:
> GPL-2, GPL-2+, GPL-3+, LGPL, LGPL-3+, MIT, public-domain
>
> Is it possible to create a LicenseRef of the entire string of multiple 
> licenses? I.e.:
> PackageLicenseDeclared: LicenseRef-123456
> LicenseID: LicenseRef-123456
> ExtractedText: <text>Original license: GPL-2, GPL-2+, GPL-3+, LGPL,
> LGPL-3+, MIT, public-domain</text>

You could of course do this, but this would create mostly harmless SPDX 
documents depleted of actionable information. I have seen perfectly valid SPDX 
documents created this way in the docfest using only local LicenseRef and they 
are mostly useless: they require full reprocessing to re-detect the licenses 
(with ScanCode).

> Or, does the spec require that we separate each license into a 
> separate LicenseRef?

Nothing in the spec prohibits you from happily creating a massive LicenseRef, 
with the major side effects I mentioned above and describe below.

> The issue with the latter option is I’m not sure choosing AND or OR to 
> join the various license refs is something Tern should be doing as 
> each infers a different compliance obligation.

You can combine these all with an AND because this is the meaning of what you 
get from a Debian copyright file.
But the caveat is that MIT and public-domain, as they appear in a Debian 
copyright file, are NOT license keys and not SPDX ids either. These are merely 
references in the style of a local LicenseRef and their actual meaning is 
entirely determined by the license or notice text that comes after them.
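
To make the AND approach concrete, here is a minimal Python sketch. It assumes 
that each Debian license name becomes its own local LicenseRef, and that the 
text following the name in the copyright file is what you would store as its 
ExtractedText. The helper names to_license_ref and build_declared_expression 
and the "debian-" prefix are made up for this illustration; they are not part 
of Tern, ScanCode or debian-inspector.

```python
import re


def to_license_ref(debian_name):
    """Turn a Debian license name into an SPDX LicenseRef id.

    SPDX idstrings only allow letters, digits, "." and "-", so "+" is
    spelled out as "-plus" and any other character becomes "-".
    """
    name = debian_name.strip().replace("+", "-plus")
    name = re.sub(r"[^A-Za-z0-9.\-]", "-", name)
    # The "debian-" prefix is an assumption, used to signal that this is
    # a Debian name and not a detected SPDX license id.
    return "LicenseRef-debian-" + name


def build_declared_expression(debian_names):
    """Join one LicenseRef per Debian name with AND.

    Input order is kept and duplicate names are dropped.
    """
    refs = []
    for debian_name in debian_names:
        ref = to_license_ref(debian_name)
        if ref not in refs:
            refs.append(ref)
    return " AND ".join(refs)


if __name__ == "__main__":
    names = ["GPL-2", "GPL-2+", "MIT", "public-domain"]
    print("PackageLicenseDeclared:", build_declared_expression(names))
```

This only fixes the shape of the expression. It does not tell you what each 
name actually means; that still requires detection on the text that follows.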

Sadly enough, existing tools all assume incorrectly that these Debian codes are 
license keys and end up doing a big disservice to their users with fairly 
inaccurate or misleading license data at scale. The devil is in getting the 
details right.

The solution is to use ScanCode. Since Tern already embeds it, you should look 
at the code we crafted to properly detect and make sense of Debian copyright 
files, whether they are structured machine-readable files or legacy 
unstructured ones.

ScanCode has about 2000 lines of Python code (on top of the debian-inspector 
code base) to process these. This gives a sense of the complexity of the task 
at hand. I have not heard of any other tool that can make sense of Debian 
copyright files the way ScanCode does.

The common Debian license symbols are listed here:
https://github.com/nexB/scancode-toolkit/blob/d64acdded0b1f9e760cb9a5e47aecffab814b811/src/packagedcode/debian_copyright.py#L987
But there are hundreds of others that are not reliably mappable to SPDX. You 
need a full ScanCode detection on the license text or notice that follows in 
the deb822 paragraph. There is also a notion of primary/default and secondary 
licenses that is not entirely trivial to capture, and ScanCode handles this too.
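
To show what "the text that follows in the deb822 paragraph" looks like in 
practice, here is a small Python sketch that pulls the short name and the 
attached text out of each License field. It is a deliberately simplified 
reading of the machine-readable format using only the standard library; the 
function name license_paragraph_fields and the sample text are hypothetical, 
and this is not how ScanCode or debian-inspector implement it.

```python
def license_paragraph_fields(copyright_text):
    """Yield (short_name, license_text) for each License field found.

    Simplifying assumptions: paragraphs are separated by one empty line,
    the first line of a License field holds the short name, and the
    indented continuation lines hold the license or notice text.
    """
    for paragraph in copyright_text.split("\n\n"):
        short_name = None
        text_lines = []
        in_license = False
        for line in paragraph.splitlines():
            if line.startswith("License:"):
                short_name = line[len("License:"):].strip()
                in_license = True
            elif in_license and line[:1] in (" ", "\t"):
                # A lone "." stands for an empty line inside the field.
                stripped = line.strip()
                text_lines.append("" if stripped == "." else stripped)
            else:
                # Any other field ends the License field.
                in_license = False
        if short_name is not None:
            yield short_name, "\n".join(text_lines)


SAMPLE = """Files: *
Copyright: 2020 Example Author
License: MIT

License: MIT
 Permission is hereby granted, free of charge, to any person
 .
 obtaining a copy of this software.
"""

if __name__ == "__main__":
    for name, text in license_paragraph_fields(SAMPLE):
        print(repr(name), "->", len(text), "characters of text to scan")
```

The short name is only a label that ties a Files paragraph to a standalone 
License paragraph. The extracted text is the part you would hand to a license 
detector to learn what the label really stands for.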

You can see the toolkit code here:
https://github.com/nexB/scancode-toolkit/blob/develop/src/packagedcode/debian_copyright.py
and here:
https://github.com/nexB/scancode-toolkit/blob/develop/src/packagedcode/debian.py

This is also used in ScanCode.io for scanning Debian/Ubuntu VM and Docker 
images: https://scancodeio.readthedocs.io/en/latest/

Please tell me how I can help make this easy enough for you to reuse in Tern.
--
Cordially
Philippe Ombredanne

+1 650 799 0949 | [email protected]
AboutCode - Open source for open source - https://www.aboutcode.org 
VulnerableCode - the open code and open data vulnerability database - 
https://github.com/nexb/vulnerablecode
ScanCode - scan your code, for origin/license/vulnerabilities, report SBOMs - 
https://github.com/nexB/scancode-toolkit
https://github.com/nexB/scancode.io
package-url - the mostly universal SBOM identifier for packages - 
https://github.com/package-url
DejaCode - What's in your code?! - http://www.dejacode.com
nexB Inc. - http://www.nexb.com






