Hi Smith,

Thanks for your well-laid-out email and your GSoC proposal.

Thinking about this from the perspective of the LicenseListPublisher repository, I would imagine the validity and other status of links could change over time. Links can rot, HTTP 302 redirects can differ from one day to the next, and the license text presented in HTML at a specific URL could be, and sometimes is, altered -- either with or without explicitly versioning the license.

I think this necessitates some way of recording or representing validity information as a point-in-time value, at minimum with a lastChecked value (e.g., a UTC timestamp). There may be use cases for representing validity over periods of time, for example:
- Time-series (daily checks tagged with a UTC timestamp): valid-valid-valid-invalid-invalid-valid
- Last-known-modified: perhaps lastChecked and lastChanged, so that one could say "this was checked every week since X date and hasn't changed"
- Other: other time-related information that tooling providers might want

Then, I wasn't sure if isValid represented a valid regex-matchable URL (which presumably could be local, or more likely, on a corporate intranet), or one that is both validly formed according to a regex and accessible from [some place on] the global internet. In theory that might depend on DNS, firewall configurations, or both, which are subject to change or manipulation to, e.g., mitigate DDoS, find the physically closest webserver for a CDN, or block specific IPs sending malicious traffic.

When it comes down to the "bits on the wire," the server has the option whether and how to respond to a request, and the server can (and occasionally does) make its decision based on these types of connection metadata describing the "from" side of the connection. So in theory it may make sense to include things like the source IP address of the system performing the validation attempt. That raises privacy issues, although if it came from a Linux Foundation system (or something similar), then hiding the validating system's IP address wouldn't necessarily be a requirement.

So it may make sense to evaluate these kinds of contextual data points, along with clarifying in the isValid name or definition which validity check you mean for it to represent. At minimum, it's worth thinking through these things and how we would deal with the edge cases introduced by relying on DNS and HTTP to perform what is ultimately a connection-based point-in-time check.

Best,
Brad Edmondson

PS: Personally I am not in favor of SPDX tracking the validity of license-text links, but then again I am coming at this as a contributor on the SPDX-legal side of things, and not as a member of the SPDX tech team nor as a frequent user of tooling.
If the tech team is happy with this idea generally, and with fully owning the process and collected data on the LicenseListPublisher side, then I would have no objection from the legal side. (Also, of course, I only represent my own view and not the official or finalized position of the legal team.)

--
Brad Edmondson, *Esq.*
[email protected]

On Wed, Jun 17, 2020 at 6:31 AM Smith Tanjong Agbor <[email protected]> wrote:

> Hi all,
>
> I am working on a Google Summer of Code project that emanates from this
> discussion/issue
> <https://github.com/spdx/LicenseListPublisher/issues/60#issuecomment-570511697>
> concerning the validation of license cross references. Here is a link to
> my GSoC proposal
> <https://docs.google.com/document/d/10RlmmsnJ7suDudjgugHMZkOOa-1IsY2Bv_Ew_tgzpv4/edit>.
>
> The focus is on improving the LicenseListPublisher
> <https://github.com/spdx/LicenseListPublisher> repository to have
> generated license data <https://github.com/spdx/license-list-data> updated
> with fields on the validity of the crossref, among others.
>
> In order to do this, the structure of the crossref shall change (in some
> cases, e.g. JSON), and in others, there shall be additional tags. In general
> the following are fields which shall be added to the crossrefs:
>
> *"isValid": true/false,*
> Indicates whether or not the crossref url is a valid url (ex: not some
> local file link)
>
> *"isWayBackLink": true/false,*
> Indicates whether or not the url is a link from a previous version (wayback
> machine) of the site (where the license is located)
>
> *"extraText": true/false,*
> Indicates whether or not the license from the url has extra text in its
> description when compared to the license description in the current file.
>
> "isMatch": true/false,
> Indicates whether or not the license from the url link matches (perfectly)
> the license description in the current file.
>
> "url": "http://landley.net/toybox/license.html",
> This is the url of the license text/description
>
> *"isDead": true/false*
> Indicates whether or not the url is a dead link (a link that returns a page
> different from HTTP_200, could be bad request HTTP_400, not found
> HTTP_404, forbidden HTTP_403, etc)
>
> Please consider this as a proposal and any suggestions and/or
> modifications will be very much appreciated.
>
> Thanks,
> Smith
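
For illustration only, here is a minimal Python sketch of the point-in-time idea raised in the reply above: each check is stored with a UTC timestamp, the syntax-only notion of validity is kept separate from reachability, and lastChecked/lastChanged are derived from the history. The names `is_well_formed`, `Check`, `CrossRef`, `record`, `last_checked`, and `last_changed` are hypothetical and are not part of LicenseListPublisher; the reachability result is supplied by the caller, so no network request is made here. Treating lastChanged as "the start of the current run of identical results" is also an assumption, one of several possible readings.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional
from urllib.parse import urlparse


def is_well_formed(url: str) -> bool:
    """Syntax-only check: http(s) scheme plus a host. No network access."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


@dataclass(frozen=True)
class Check:
    """One point-in-time observation (hypothetical structure)."""
    checked_at: datetime        # when the check ran, in UTC
    well_formed: bool           # result of the syntax-only check
    reachable: Optional[bool]   # None if no request was attempted


@dataclass
class CrossRef:
    url: str
    history: List[Check] = field(default_factory=list)

    def record(self, reachable: Optional[bool],
               when: Optional[datetime] = None) -> Check:
        """Append an observation; the caller supplies the reachability result."""
        when = when or datetime.now(timezone.utc)
        check = Check(when, is_well_formed(self.url), reachable)
        self.history.append(check)
        return check

    @property
    def last_checked(self) -> Optional[datetime]:
        """Timestamp of the most recent check, or None if never checked."""
        return self.history[-1].checked_at if self.history else None

    @property
    def last_changed(self) -> Optional[datetime]:
        """Start of the current run of identical results (an assumed reading)."""
        if not self.history:
            return None
        latest = self.history[-1]
        changed_at = latest.checked_at
        for check in reversed(self.history):
            same = (check.well_formed, check.reachable) == (
                latest.well_formed, latest.reachable)
            if not same:
                break
            changed_at = check.checked_at
        return changed_at


# Usage: three daily checks, with made-up reachability results.
ref = CrossRef("http://landley.net/toybox/license.html")
for day, ok in [(15, True), (16, False), (17, False)]:
    ref.record(ok, datetime(2020, 6, day, tzinfo=timezone.utc))
```

Note that an intranet URL such as `http://intranet.local/license` passes `is_well_formed` even though it may be unreachable from the public internet, which is the ambiguity in isValid that the reply points out.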
