RE: [bug #58354] Wget doesn't parse URIs starting with http:/

2020-05-12 Thread Seymour J Metz
I've got code for parsing broken URLs at 
http://mason.gmu.edu/~smetz3/source/unobfuscate.zip if that's of any use to 
you. 


--
Shmuel (Seymour J.) Metz
http://mason.gmu.edu/~smetz3


From: Bug-wget [bug-wget-bounces+smetz3=gmu@gnu.org] on behalf of Luca 
Bernardi [invalid.nore...@gnu.org]
Sent: Tuesday, May 12, 2020 6:57 AM
To: Luca Bernardi; gscriv...@gnu.org; tim.rueh...@gmx.de; bug-wget@gnu.org; 
dar...@gnu.org
Subject: [bug #58354] Wget doesn't parse URIs starting with http:/

Follow-up Comment #1, bug #58354 (project wget):

P.S. This bug occurred while crawling a website that uses the default
WordPress template.

___

Reply to this item at:

  
<https://savannah.gnu.org/bugs/?58354>

___
  Message sent via Savannah
  
  https://savannah.gnu.org/






[bug #58354] Wget doesn't parse URIs starting with http:/

2020-05-12 Thread Daniel Stenberg
Follow-up Comment #3, bug #58354 (project wget):

Note that the browsers' specification, the WHATWG URL Spec (TWUS), allows that
kind of abomination, while RFC 3986 explicitly requires two slashes to be there.
(Listed in my "URL interop" page at
https://github.com/bagder/docs/blob/master/URL-interop.md)

In fact, TWUS allows an unlimited number of slashes (and backslashes) to be
present and is still fine with it.

Because browsers allow this, instances of such URLs appear in the wild so it
makes sense to be somewhat accommodating.

In the curl project (which primarily has RFC 3986 as guidance), we've
nonetheless decided to accept one, two, or three slashes, since we've seen both
one and three in the wild. I don't believe in being more lenient than that.
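
The leniency described above could be sketched roughly as follows. This is
an illustration in Python, not curl's or Wget's actual code; the helper name
`normalize_slashes` and the `max_slashes` parameter are hypothetical. It
assumes the text after the slashes begins with the host, so it does not cover
the case where the host is missing entirely.

```python
import re

# Hypothetical helper for illustration only; not taken from curl or Wget.
# Matches an http/https scheme followed by one or more slashes.
_SCHEME_SLASHES = re.compile(r"^(https?):(/+)(.*)$", re.IGNORECASE | re.DOTALL)


def normalize_slashes(url, max_slashes=3):
    """Rewrite 'http:/x', 'http://x' or 'http:///x' to 'http://x'.

    Returns None when the URL has no http(s) scheme followed by slashes,
    or when it has more than max_slashes slashes after the colon.
    """
    match = _SCHEME_SLASHES.match(url)
    if match is None:
        return None
    scheme, slashes, rest = match.groups()
    if len(slashes) > max_slashes:
        return None  # stricter than TWUS, which tolerates any number
    return f"{scheme.lower()}://{rest}"


if __name__ == "__main__":
    for candidate in ("http:/example.com/a",
                      "https:///example.com/a",
                      "http:////example.com/a"):
        print(candidate, "->", normalize_slashes(candidate))
```

Rejecting four or more slashes is a choice made here to mirror the
"one, two or three" limit; a TWUS-style parser would accept them.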





[bug #58354] Wget doesn't parse URIs starting with http:/

2020-05-12 Thread Tim Ruehsen
Update of bug #58354 (project wget):

  Status: None => Confirmed

___

Follow-up Comment #2:

We could accept this when parsing input *and* a BASE is available.

It doesn't make much sense without a BASE, as the host/domain part is omitted
from such URLs.

I assume that 'https:' should also be recognized.
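
A minimal sketch of that idea, in Python rather than Wget's C: when a link
has an http: or https: scheme but no host, borrow the host from the BASE. The
function name `resolve_single_slash` is hypothetical, and keeping the link's
own scheme (rather than the BASE's) is an assumption, not something decided
in this thread.

```python
from urllib.parse import urlsplit, urlunsplit


def resolve_single_slash(link, base):
    """Resolve a host-less 'http:/path' link against base.

    Hypothetical helper for illustration; not Wget's implementation.
    Returns the link unchanged if it already names a host, and None if
    the link cannot be resolved (wrong scheme, no BASE host, no path).
    """
    parts = urlsplit(link)
    if parts.scheme not in ("http", "https"):
        return None
    if parts.netloc:
        return link  # ordinary absolute URL, nothing to repair
    base_host = urlsplit(base).netloc
    if not base_host:
        return None  # without a BASE there is no host to borrow
    if not parts.path.startswith("/"):
        return None
    # Assumption: keep the link's scheme and take only the host from BASE.
    return urlunsplit(
        (parts.scheme, base_host, parts.path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    print(resolve_single_slash(
        "http:/wp-includes/css/dist/block-library/style.min.css?ver=5.4.1",
        "http://example.com/blog/",
    ))
```

Here `example.com` stands in for whatever host the crawl started from.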






[bug #58354] Wget doesn't parse URIs starting with http:/

2020-05-12 Thread Luca Bernardi
Follow-up Comment #1, bug #58354 (project wget):

P.S. This bug occurred while crawling a website that uses the default
WordPress template.





Re: [bug #58354] Wget doesn't parse URIs starting with http:/

2020-05-12 Thread Jeffrey Walton
On Tue, May 12, 2020 at 6:45 AM Luca Bernardi wrote:
>
> URL:
>   
>
>          Summary: Wget doesn't parse URIs starting with http:/
>          Project: GNU Wget
>     Submitted by: f0ff
>     Submitted on: Tue 12 May 2020 10:45:17 AM UTC
>         Category: None
>         Severity: 3 - Normal
>         Priority: 5 - Normal
>           Status: None
>          Privacy: Public
>      Assigned to: None
>  Originator Name:
> Originator Email:
>      Open/Closed: Open
>          Release: 1.14
>  Discussion Lock: Any
> Operating System: GNU/Linux
>  Reproducibility: Every Time
>    Fixed Release: None
>  Planned Release: None
>       Regression: None
>    Work Required: None
>   Patch Included: No
>
> ___
>
> Details:
>
> Hi,
> Wget refuses to parse URIs that start with http:/ (note single slash), e.g.
> http:/wp-includes/css/dist/block-library/style.min.css?ver=5.4.1. These are
> widely accepted by browsers.
>
> Command that I've used: `wget --user-agent=Mozilla --content-disposition
> --page-requisites --adjust-extension --restrict-file-names=windows -d -e
> robots=off -m -k -E -r -l 10 -p -N -F -P crawl  -nH $IP`

You may as well make the slashes optional in the protocol string.
Berners-Lee does not like them anyway:
https://www.mentalfloss.com/uk/history/27802/10-inventors-who-came-to-regret-their-creations.

Jeff



[bug #58354] Wget doesn't parse URIs starting with http:/

2020-05-12 Thread Luca Bernardi
URL:
  

         Summary: Wget doesn't parse URIs starting with http:/
         Project: GNU Wget
    Submitted by: f0ff
    Submitted on: Tue 12 May 2020 10:45:17 AM UTC
        Category: None
        Severity: 3 - Normal
        Priority: 5 - Normal
          Status: None
         Privacy: Public
     Assigned to: None
 Originator Name:
Originator Email:
     Open/Closed: Open
         Release: 1.14
 Discussion Lock: Any
Operating System: GNU/Linux
 Reproducibility: Every Time
   Fixed Release: None
 Planned Release: None
      Regression: None
   Work Required: None
  Patch Included: No

___

Details:

Hi,
Wget refuses to parse URIs that start with http:/ (note single slash), e.g.
http:/wp-includes/css/dist/block-library/style.min.css?ver=5.4.1. These are
widely accepted by browsers.

Command that I've used: `wget --user-agent=Mozilla --content-disposition
--page-requisites --adjust-extension --restrict-file-names=windows -d -e
robots=off -m -k -E -r -l 10 -p -N -F -P crawl  -nH $IP`



___

File Attachments:


---
Date: Tue 12 May 2020 10:45:17 AM UTC  Name: out.txt  Size: 17KiB   By: f0ff


