URL: <http://savannah.gnu.org/bugs/?50383>
Summary: --local-encoding isn't used when converting a relative link in a recursive download
Project: GNU Wget
Submitted by: None
Submitted on: Wed 22 Feb 2017 11:10:42 PM UTC
Category: Program Logic
Severity: 3 - Normal
Priority: 5 - Normal
Status: None
Privacy: Public
Assigned to: None
Originator Name: William Prescott
Originator Email: appledesktop...@gmail.com
Open/Closed: Open
Discussion Lock: Any
Release: 1.19
Operating System: GNU/Linux
Reproducibility: Every Time
Fixed Release: None
Planned Release: None
Regression: None
Work Required: None
Patch Included: None

_______________________________________________________

Details:

When expanding a relative URL found on a page, Wget doesn't appear to take into account the local encoding of the URL Wget was called with. This is apparent when trying to recursively download pages encoded with Shift_JIS whose URL contains a tilde (Shift_JIS lacks ~ and has ‾ at the same code point). While the documents themselves cannot have a tilde, they are able to use relative links to move around within this path.

Wget is currently expanding relative links as if the user-provided URL was in the document's character encoding. In the case of my example here, this changes the URL's tilde to ‾. My expectation is that Wget would use the specified local encoding for the user-provided part of the base and the remote encoding for the rest of the URL.

Additionally, links on a page retrieved using "IRI fallbacking" will not be followed (noticeable on bar.html in the example). This may constitute another bug?
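
The mis-resolution described above can be modelled in a few lines of Python. This is only a sketch of the symptom, not Wget's actual code: the helper names (decode_jisx0201_roman, request_path) and the base_as_document_charset flag are hypothetical, and the two-entry table covers only the code points where the single-byte range of Shift_JIS (JIS X 0201 Roman) differs from ASCII. Treating the base path as document-encoded turns the tilde into U+203E, which percent-encodes in UTF-8 to the same %E2%80%BE sequence that appears in the example results below.

```python
from urllib.parse import quote, urljoin, urlsplit

# JIS X 0201 Roman (the single-byte half of Shift_JIS) differs from
# ASCII at two code points: 0x5C is YEN SIGN, 0x7E is OVERLINE.
JISX0201_ROMAN_OVERRIDES = {0x5C: "\u00a5", 0x7E: "\u203e"}


def decode_jisx0201_roman(raw: bytes) -> str:
    """Decode 7-bit bytes using the JIS X 0201 Roman table (hypothetical helper)."""
    return "".join(JISX0201_ROMAN_OVERRIDES.get(b, chr(b)) for b in raw)


def request_path(base_url: str, href: str, base_as_document_charset: bool) -> str:
    """Resolve href against the path of base_url and percent-encode it as UTF-8.

    base_as_document_charset=True models the reported behaviour: the
    user-provided base path is reinterpreted in the document's charset.
    False models the expectation: the base path is left as the user typed it.
    """
    base_path = urlsplit(base_url).path
    if base_as_document_charset:
        base_path = decode_jisx0201_roman(base_path.encode("ascii"))
    joined = urljoin(base_path, href)
    # Keep "/" and "~" literal; everything non-ASCII becomes %XX of its UTF-8 bytes.
    return quote(joined, safe="/~")


if __name__ == "__main__":
    base = "http://127.0.0.1/~foo/"
    print(request_path(base, "bar.html", base_as_document_charset=True))
    print(request_path(base, "bar.html", base_as_document_charset=False))
```

With the flag set, the tilde in the user-provided base no longer survives the round trip; with it cleared, the relative link resolves under the original ~foo/ path, which is the behaviour the report asks for.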
----------------------------------------
EXAMPLE CASE (test files attached as tar archive)

On server:
~foo/index.html
~foo/bar.html
~foo/baz.html (empty)

~foo/index.html is Shift_JIS encoded and contains
<meta http-equiv="Content-Type" content="text/html;charset=Shift_JIS">
<a href="bar.html">Bar</a>

~foo/bar.html is Shift_JIS encoded and contains
<meta http-equiv="Content-Type" content="text/html;charset=Shift_JIS">
<a href="baz.html">Baz</a>

Results for
wget -np -r --local-encoding=utf-8 -d 'http://127.0.0.1/~foo/'
(using Wget 1.19):

~foo/index.html works fine and is saved to "127.0.0.1/~foo/index.html"
~foo/bar.html gets tried as "%E2%80%BEfoo/bar.html" before IRI fallbacking and is then incorrectly saved to "127.0.0.1/‾foo/bar.html"
~foo/baz.html is never visited.

Mailing list discussion at
http://lists.gnu.org/archive/html/bug-wget/2017-02/msg00111.html

_______________________________________________________

File Attachments:

-------------------------------------------------------
Date: Wed 22 Feb 2017 11:10:42 PM UTC
Name: wget_output.txt
Size: 6kB
By: None
<http://savannah.gnu.org/bugs/download.php?file_id=39812>
-------------------------------------------------------
Date: Wed 22 Feb 2017 11:10:42 PM UTC
Name: example.tar.gz
Size: 297B
By: None
<http://savannah.gnu.org/bugs/download.php?file_id=39813>