Re: [Rd] URL checks

2021-01-07 Thread Kirill Müller via R-devel
One other failure mode: SSL certificates that browsers trust but that 
are not installed on the check machine, e.g. the "GEANT Vereniging" 
certificate used by https://relational.fit.cvut.cz/.
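
For illustration, this failure mode can typically be reproduced from R 
with the curl package (a sketch; the exact error text depends on the 
TLS backend):

# Fails with a certificate verification error on machines whose trust
# store lacks the "GEANT Vereniging" chain, even though browsers that
# bundle that certificate open the same URL fine:
curl::curl_fetch_memory("https://relational.fit.cvut.cz/")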



K


On 07.01.21 12:14, Kirill Müller via R-devel wrote:

Hi


The URL checks in R CMD check test all links in the README and 
vignettes for broken or redirected URLs. In many cases this improves 
the documentation, but I see problems with this approach, which I have 
detailed below.


I'm writing to this mailing list because I think the change needs to 
happen in R's check routines. I propose to introduce an "allow-list" 
for URLs, to reduce the burden on both CRAN and package maintainers.


Comments are greatly appreciated.


Best regards

Kirill


# Problems with the detection of broken/redirected URLs

## 301 should often be 307 -- how to change?

Many web sites use a 301 redirection code where a 307 would probably 
be more appropriate: a 301 means "moved permanently" and invites 
clients to replace the stored URL, while a 307 marks the redirect as 
temporary. For example, https://www.oracle.com and 
https://www.oracle.com/ both redirect to 
https://www.oracle.com/index.html with a 301. I suspect the company 
still wants oracle.com to be recognized as the primary entry point of 
their web presence (to reserve the right to move the redirection to a 
different location later), although I haven't checked with their PR 
department. If that's true, the redirect should probably be a 307, 
which would have to be fixed by their IT department -- which I haven't 
contacted yet either.


$ curl -i https://www.oracle.com
HTTP/2 301
server: AkamaiGHost
content-length: 0
location: https://www.oracle.com/index.html
...
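
The same status code can be inspected from R with the curl package (a 
sketch; not necessarily what R CMD check does internally):

# Report the redirect itself instead of following it:
h <- curl::new_handle(followlocation = FALSE)
res <- curl::curl_fetch_memory("https://www.oracle.com", handle = h)
res$status_code
# [1] 301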

## User agent detection

twitter.com responds with a 400 error to requests that do not carry a 
user agent string identifying an accepted browser.


$ curl -i https://twitter.com/
HTTP/2 400
...
...Please switch to a supported browser..

$ curl -s -i https://twitter.com/ \
    -A "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0" \
    | head -n 1
HTTP/2 200
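
From R, a browser-like user agent can be supplied the same way (a 
sketch using the curl package; the user agent string is just an 
example):

h <- curl::new_handle(useragent =
  "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0")
curl::curl_fetch_memory("https://twitter.com/", handle = h)$status_code
# [1] 200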

# Impact

While the latter problem *could* be fixed by supplying a browser-like 
user agent string, the former problem is virtually unfixable: so many 
web sites should use 307 instead of 301 but don't. The above list is 
also incomplete -- think of unreliable links, plain HTTP links, and 
other failure modes...


This affects me as a package maintainer: I have the choice to either 
change the links to incorrect versions or remove them altogether.


I can also choose to explain each broken link to CRAN, but I think 
this subjects the team to undue burden. Submitting a package with 
NOTEs delays the release; for a package that I must release very soon 
to avoid having it pulled from CRAN, I'd rather not take that risk -- 
hence I need to remove the links and put them back later.


I'm aware of https://github.com/r-lib/urlchecker; this alleviates the 
problem but ultimately doesn't solve it.


# Proposed solution

## Allow-list

A file inst/URL that lists all URLs for which failures are allowed -- 
possibly with the HTTP codes that are accepted for each link.


Example:

https://oracle.com/ 301
https://twitter.com/drob/status/1224851726068527106 400
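
For illustration, a minimal sketch of how a check routine might parse 
such a file (the file format and the helper below are hypothetical):

read_url_allowlist <- function(path) {
  # One URL per line, optionally followed by the HTTP status codes
  # that are accepted for that URL; blank lines are ignored.
  lines <- trimws(readLines(path, warn = FALSE))
  parts <- strsplit(lines[nzchar(lines)], "[[:space:]]+")
  data.frame(
    url = vapply(parts, `[[`, character(1), 1),
    codes = I(lapply(parts, function(x) as.integer(x[-1])))
  )
}

# A check failure would then be suppressed if the observed status
# code is listed for that URL in the allow-list.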


Re: [Rd] Printing Unicode escapes with 6 digits may be problematic

2021-01-07 Thread Korpela Mikko (MML)
I see that this was only a passing issue: R-devel r79638 and later 
(also tested with r79801) print the six digits inside curly-brace 
delimiters, like so: "\U{016fe4}1" (using the example below). This 
ensures that printed output is valid as input again.
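
With the braces the escape is unambiguous, so the printed form 
round-trips (on r79638 or later, using the example below):

"\U{16FE4}1"
# [1] "\U{016fe4}1"
"\U{016fe4}1"
# [1] "\U{016fe4}1"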

- Mikko

-Original message-
From: R-devel  On behalf of Korpela Mikko (MML)
Sent: Monday, 14 December 2020 11:51
To: r-devel 
Subject: [Rd] Printing Unicode escapes with 6 digits may be problematic

A recent R-devel commit changes the way non-printable Unicode 
characters are shown as escape codes. Whereas large code points were 
previously printed using an escape code of 8 hexadecimal digits, with 
leading zeros, the present code (tested with R-devel r79623 on Ubuntu 
Linux) prints only 6 hex digits. I think this may be problematic: it is 
now possible that R prints a character string which is not valid when 
reused as input. See the following example.

"\U{16FE4}1"
# [1] "\U016fe41"
"\U016fe41"
# Error: invalid \U value 16fe41 (line 1)

Best regards,
- Mikko Korpela
