In the spirit of “any suggestions and/or modifications will be very much 
appreciated”, I have inserted comments below.



From: [email protected] <[email protected]> On Behalf Of Smith 
Tanjong Agbor
Sent: Wednesday, June 17, 2020 12:32
To: [email protected]; [email protected]
Cc: Gary O'Neall <[email protected]>; [email protected]
Subject: Validate license cross references: New fields to be added



Hi all,



I am working on a Google Summer of Code project that emanates from this 
discussion/issue<https://github.com/spdx/LicenseListPublisher/issues/60#issuecomment-570511697>
concerning the validation of license cross references. Here is a link to my 
GSOC proposal<https://docs.google.com/document/d/10RlmmsnJ7suDudjgugHMZkOOa-1IsY2Bv_Ew_tgzpv4/edit>.



The focus is on improving the 
LicenseListPublisher<https://github.com/spdx/LicenseListPublisher>
repository so that the generated license 
data<https://github.com/spdx/license-list-data>
is updated with fields on the validity of the crossref, among others.



In order to do this, the structure of the crossref shall change in some cases 
(e.g. JSON), and in others there shall be additional tags. In general, the 
following fields shall be added to the crossrefs:



"isValid": true/false,

Indicates whether or not the crossref URL is a valid URL (e.g. not a local 
file link)

Must a valid URL be based on one of only two/three schemes: http, https, and 
ftp? Is http://localhost/ or https://127.0.0.1 valid?
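To make the question concrete, here is a minimal sketch of one possible answer. It assumes that only http, https, and ftp are accepted and that localhost, loopback, private, and link-local addresses count as invalid. Both the policy and the name is_valid_crossref are hypothetical, not something the proposal specifies.

```python
import ipaddress
from urllib.parse import urlsplit

# Assumption: the allow-list of schemes is exactly the three asked about above.
ALLOWED_SCHEMES = {"http", "https", "ftp"}


def is_valid_crossref(url: str) -> bool:
    """Hypothetical 'isValid' check: allowed scheme and a non-local host."""
    try:
        parts = urlsplit(url)
    except ValueError:
        # urlsplit raises ValueError for malformed input such as bad IPv6 brackets.
        return False
    if parts.scheme not in ALLOWED_SCHEMES:
        return False
    host = parts.hostname  # already lower-cased by urlsplit
    if not host:
        return False
    if host == "localhost" or host.endswith(".localhost"):
        return False
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal, so treat it as an ordinary DNS name.
        return True
    # Assumption: loopback, private, and link-local addresses are "local".
    return not (addr.is_loopback or addr.is_private or addr.is_link_local)
```

Under these assumptions both http://localhost/ and https://127.0.0.1 would be reported as invalid, as would any file: link.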


"isWayBackLink": true/false,

Indicates whether or not the URL is a link to a previous version (Wayback 
Machine) of the site where the license is located
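A minimal sketch of how such a flag might be computed, assuming Wayback Machine snapshots are served from web.archive.org under a /web/ path prefix. The host, the prefix test, and the name is_wayback_link are assumptions made for illustration.

```python
from urllib.parse import urlsplit

# Assumption: snapshots live on this host under the "/web/" path prefix.
WAYBACK_HOST = "web.archive.org"


def is_wayback_link(url: str) -> bool:
    """Hypothetical 'isWayBackLink' check based only on the URL's shape."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False
    return parts.hostname == WAYBACK_HOST and parts.path.startswith("/web/")
```

This only inspects the URL; it does not contact the archive.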


"extraText": true/false,

Indicates whether or not the license from the url has extra text in its 
description when compared to the license description in the current file.


"isMatch": true/false,

Indicates whether or not the license from the URL matches (perfectly) the 
license description in the current file.

Rather than true/false, perhaps allow the name of the matching algorithm:
verbatim
noassertion – if no test result is available (for invalid links perhaps)
todo – no match attempted
“” – no match asserted
…
verbatim2 – matches with \r == \r\n == \n
verbatim3 – matches reflowed text, “ignoring whitespace differences”
verbatim4 – matches ignoring decoration (comments, flower-boxes)
template – matches template verbatim (see ppalaga’s comment)
et cetera as they become available
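As a rough sketch of the idea, the first few levels could be tried from strictest to loosest, returning the name of the first one that succeeds. The function name match_level is hypothetical, and the normalization rules are my reading of the list above. verbatim4 and template are left out because they need more than string normalization.

```python
from typing import Optional


def normalize_newlines(text: str) -> str:
    """Treat \\r\\n and bare \\r as \\n (the 'verbatim2' rule)."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def collapse_whitespace(text: str) -> str:
    """Reduce every run of whitespace to one space (the 'verbatim3' rule)."""
    return " ".join(text.split())


def match_level(fetched: Optional[str], reference: str) -> str:
    """Return the strictest matching level, or "" when nothing matches."""
    if fetched is None:
        # Assumption: None means the page could not be retrieved.
        return "noassertion"
    if fetched == reference:
        return "verbatim"
    if normalize_newlines(fetched) == normalize_newlines(reference):
        return "verbatim2"
    if collapse_whitespace(fetched) == collapse_whitespace(reference):
        return "verbatim3"
    return ""
```

Reporting the strictest level that succeeds keeps the field a single string while still telling a consumer how much normalization was needed.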



"url": "http://landley.net/toybox/license.html",

This is the url of the license text/description


"isDead": true/false

Indicates whether or not the URL is a dead link (a link that returns a status 
other than HTTP 200, for example 400 Bad Request, 403 Forbidden, or 
404 Not Found)

Rather than true/false (since dead sites can be reanimated), how about a date 
for the most-recent HTTP-200 response? “dateMRHTTP200”: “UTC date”
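A minimal sketch of how that field could be maintained, assuming the crossref is held as a dict and the date is written as an ISO 8601 calendar date in UTC. The function name update_crossref and the date format are assumptions. The HTTP request itself is left out, so only the bookkeeping is shown.

```python
from datetime import datetime, timezone
from typing import Optional


def update_crossref(entry: dict, status_code: int,
                    now: Optional[datetime] = None) -> dict:
    """Return a copy of entry with dateMRHTTP200 refreshed on an HTTP 200.

    Any other status leaves the previously recorded date untouched, so a
    temporarily dead site keeps the date of its last successful check.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    updated = dict(entry)
    if status_code == 200:
        # Assumption: "UTC date" means YYYY-MM-DD in UTC.
        updated["dateMRHTTP200"] = now.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return updated
```

A consumer could then decide for itself how old a date must be before it treats the link as dead.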



Please consider this as a proposal and any suggestions and/or modifications 
will be very much appreciated.



Thanks,

Smith







