Re: [Bug-wget] Tilde issue with recursive download when IRI is enabled and a page uses Shift JIS

2017-02-17 Thread William Prescott
I would imagine that it could use the value that "--local-encoding"
gets. I used UTF-8 here as an example, since that is what my terminal
is set to.

Best regards,
William Prescott
(Eli Zaretskii: Sorry for writing to you directly a few minutes ago --
I forgot that the reply button doesn't do what I expect for this
mailing list.)

On Fri, Feb 17, 2017 at 4:10 AM, Eli Zaretskii wrote:
> How should wget know that "http://example.com/~foo/bar.html" comes
> from a UTF-8 encoding?  Where should that piece of information come
> from?



Re: [Bug-wget] Tilde issue with recursive download when IRI is enabled and a page uses Shift JIS

2017-02-17 Thread William Prescott
Hello,

I was just thinking about this again. My initial message was incorrect,
since I had expected Wget to work with invalid input. However, I still have
a nagging feeling that something odd goes on with relative links, and I'd
rather hear that it's intended than to let it slip by as a bug just because
I didn't explain it well enough.

The last message I sent mentioned this a little bit:
> I would also like to note that, even when the document's links don't
> contain a tilde, Wget will still fail to fetch the pages as long as there
> is a tilde in the URL that Wget was called with.

Let's consider the (UTF-8) URL "http://example.com/~foo/bar.html".
bar.html is Shift_JIS encoded and contains:

<a href="baz.html">Baz</a>

(this time, bar.html is perfectly valid Shift_JIS and doesn't have a tilde)

A recursive download will fail, because the relative URL appears to get
processed as
sjis_to_utf8(utf8_to_sjis("http://example.com/~foo/") + sjis("baz.html"))
resulting in
http://example.com/‾foo/baz.html

I would have expected
utf8("http://example.com/~foo/") + sjis_to_utf8("baz.html")
resulting in
http://example.com/~foo/baz.html
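
The two compositions above can be sketched in shell. This is only an
illustration, not Wget's actual code path: it assumes an iconv that accepts
the SHIFT-JIS name, and it skips the utf8_to_sjis step because the ASCII
bytes of the base URL are the same either way. The variable names are made
up for the example.

```shell
base='http://example.com/~foo/'   # given on the command line, already UTF-8
link='baz.html'                   # bytes taken from the Shift_JIS page

# Observed behaviour: base and link are joined first, then the whole string
# is decoded as Shift_JIS. With glibc's iconv, byte 0x7e becomes U+203E
# (see Tim's reproduction), so the tilde in the base is lost.
observed=$(printf '%s%s' "$base" "$link" | iconv -f SHIFT-JIS -t UTF-8)

# Expected behaviour: only the bytes that came from the page are decoded,
# and the base URL is left untouched.
expected=$base$(printf '%s' "$link" | iconv -f SHIFT-JIS -t UTF-8)

printf 'observed: %s\nexpected: %s\n' "$observed" "$expected"
```

Whether the "observed" line shows the overline depends on the iconv
implementation; the "expected" line keeps the tilde regardless.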

Best regards,
William Prescott



Re: [Bug-wget] Tilde issue with recursive download when IRI is enabled and a page uses Shift JIS

2017-02-07 Thread Tim Rühsen
On Monday, 6 February 2017 19:02:16 CET William Prescott wrote:
> Thanks for the responses.
> 
> Indeed, that seems to be the case: Shift JIS replaces ASCII \ and ~
> with ¥ and ‾, respectively
> (with exceptions as per Andries' message).
> 
> In addition, RFC 3987 (Internationalized Resource Identifiers (IRIs))
> section 6.3 states that:
> "In cases where the document as a whole has a
>    native character encoding, IRIs MUST also be encoded in this
>    character encoding and converted accordingly by a parser or
>    interpreter."
> This would make it seem that the observed behavior in Wget is correct and
> that the document is faulty.
> 
> I would also like to note that, even when the document's links don't
> contain a tilde, Wget will still fail to fetch the pages as long as there
> is a tilde in the URL that Wget was called with.

Hi William,

you are on UTF-8, and thus copying a URL from the original document does
not do the Shift JIS to UTF-8 conversion. If your editor (or text viewer) is
locale/charset aware (e.g. here on KDE I use kate and can manually tell it
that the charset encoding of the viewed document is 'sjis'), set it to the
right encoding and then try copying the URL again.

Another way would be to translate your string from ShiftJIS to UTF-8 as I did 
in my example, like

$ wget `echo 'http://domain.jp/~withtilde'|iconv -f SHIFT-JIS -t utf-8`

Or you translate your whole document to UTF-8 with that trick, like
$ cat shiftjis_text.html|iconv -f SHIFT-JIS -t utf-8 >utf8_text.html

Now you should be able to copy URLs from that document.
Ah yes, that only works on Unix/Linux/BSD systems.

Regards, Tim

> On Mon, Feb 6, 2017 at 6:29 PM, Andries E. Brouwer wrote:
> > On Mon, Feb 06, 2017 at 10:55:32PM +0100, Tim Rühsen wrote:
> >> On Monday, 6 February 2017 05:02:57 CET William Prescott wrote:
> >> > Hello,
> >> >
> >> > I'm encountering a problem when recursively downloading from a website
> >> > when the URL contains a tilde and the page encoding claims to be
> >> > Shift JIS.
> >> > 
> >> > I've tried both Wget 1.17.1 (from Ubuntu 16.04) and 1.19 (from source,
> >> > with Libidn2 0.16).
> >> > I believe my local character encoding is UTF-8.
> >> > 
> >> > The first page will download okay, but then most pages after it will
> >> > get the tilde converted to "%E2%80%BE" ("‾"), which, as one would
> >> > expect, doesn't work.
> >> 
> >> Hi William,
> >> 
> >> reproducible by:
> >>
> >> $ echo '~'|iconv -f SHIFT-JIS -t utf-8
> >> ‾
> >>
> >> $ echo -n '~'|iconv -f SHIFT-JIS -t utf-8|od -t x1
> >> 000 e2 80 be
> >>
> >> So this seems not to be a Wget issue, but maybe a general character
> >> conversion issue. Not sure what Wget could do...
> >> 
> >> Regards, Tim
> > 
> > Shift JIS is not a single well-defined character set.
> > There are x-sjis-unicode, x-sjis-cp932, x-sjis-jisx0221, x-sjis-jdk117
> > that all are called "shift-jis", and are subtly different.
> > See also https://www.w3.org/TR/japanese-xml/#sjis .
> > 
> > 
> > SJIS and CP932 (the "Microsoft version of SJIS") are almost identical,
> > and CP932 does contain a tilde.
> > 
> > Java did (does?) treat SJIS 5c and 7e as ASCII 5c and 7e.
> > The docs say "This is in keeping with standard industry practice within
> > Japan."
> > 
> > Can wget use a fallback? Use the given bytes converted from SJIS.
> > When that fails use these bytes converted from CP932 (if different).
> > When that fails use these bytes without any conversion?
> > 
> > 
> > It looks like
> > http://seesaawiki.jp/w/kou1okada/d/wget%20-%20troubleshooting
> > describes the same problem. There, three successful suggestions are given
> > (for wget 1.13.4): (i) Give one of ASCII, EUC-JP or UTF-8 with the
> > --remote-encoding option, (ii) Give the --no-iri option, (iii) Export
> > LANG=C.
> > 
> > Andries



Re: [Bug-wget] Tilde issue with recursive download when IRI is enabled and a page uses Shift JIS

2017-02-06 Thread William Prescott
Thanks for the responses.

Indeed, that seems to be the case: Shift JIS replaces ASCII \ and ~
with ¥ and ‾, respectively
(with exceptions as per Andries' message).

In addition, RFC 3987 (Internationalized Resource Identifiers (IRIs))
section 6.3 states that:
"In cases where the document as a whole has a
   native character encoding, IRIs MUST also be encoded in this
   character encoding and converted accordingly by a parser or
   interpreter."
This would make it seem that the observed behavior in Wget is correct and that
the document is faulty.

I would also like to note that, even when the document's links don't contain
a tilde, Wget will still fail to fetch the pages as long as there is a tilde in
the URL that Wget was called with.

Best regards,
William Prescott

On Mon, Feb 6, 2017 at 6:29 PM, Andries E. Brouwer wrote:
> On Mon, Feb 06, 2017 at 10:55:32PM +0100, Tim Rühsen wrote:
>> On Monday, 6 February 2017 05:02:57 CET William Prescott wrote:
>> > Hello,
>> >
>> > I'm encountering a problem when recursively downloading from a website when
>> > the URL contains a tilde and the page encoding claims to be Shift JIS.
>> >
>> > I've tried both Wget 1.17.1 (from Ubuntu 16.04) and 1.19 (from source,
>> > with Libidn2 0.16).
>> > I believe my local character encoding is UTF-8.
>> >
>> > The first page will download okay, but then most pages after it will get
>> > the tilde converted to "%E2%80%BE" ("‾"), which, as one would expect,
>> > doesn't work.
>>
>> Hi William,
>>
>> reproducible by:
>>
>> $ echo '~'|iconv -f SHIFT-JIS -t utf-8
>> ‾
>>
>> $ echo -n '~'|iconv -f SHIFT-JIS -t utf-8|od -t x1
>> 000 e2 80 be
>>
>> So this seems not to be a Wget issue, but maybe a general character conversion
>> issue. Not sure what Wget could do...
>>
>> Regards, Tim
>
>
> Shift JIS is not a single well-defined character set.
> There are x-sjis-unicode, x-sjis-cp932, x-sjis-jisx0221, x-sjis-jdk117
> that all are called "shift-jis", and are subtly different.
> See also https://www.w3.org/TR/japanese-xml/#sjis .
>
>
> SJIS and CP932 (the "Microsoft version of SJIS") are almost identical,
> and CP932 does contain a tilde.
>
> Java did (does?) treat SJIS 5c and 7e as ASCII 5c and 7e.
> The docs say "This is in keeping with standard industry practice within 
> Japan."
>
> Can wget use a fallback? Use the given bytes converted from SJIS.
> When that fails use these bytes converted from CP932 (if different).
> When that fails use these bytes without any conversion?
>
>
> It looks like http://seesaawiki.jp/w/kou1okada/d/wget%20-%20troubleshooting
> describes the same problem. There, three successful suggestions are given
> (for wget 1.13.4): (i) Give one of ASCII, EUC-JP or UTF-8 with the
> --remote-encoding option, (ii) Give the --no-iri option, (iii) Export LANG=C.
>
> Andries
>



Re: [Bug-wget] Tilde issue with recursive download when IRI is enabled and a page uses Shift JIS

2017-02-06 Thread Andries E. Brouwer
On Mon, Feb 06, 2017 at 10:55:32PM +0100, Tim Rühsen wrote:
> On Monday, 6 February 2017 05:02:57 CET William Prescott wrote:
> > Hello,
> > 
> > I'm encountering a problem when recursively downloading from a website when
> > the URL contains a tilde and the page encoding claims to be Shift JIS.
> > 
> > I've tried both Wget 1.17.1 (from Ubuntu 16.04) and 1.19 (from source,
> > with Libidn2 0.16).
> > I believe my local character encoding is UTF-8.
> > 
> > The first page will download okay, but then most pages after it will get the
> > tilde converted to "%E2%80%BE" ("‾"), which, as one would expect, doesn't
> > work.
> 
> Hi William,
> 
> reproducible by:
>
> $ echo '~'|iconv -f SHIFT-JIS -t utf-8
> ‾
>
> $ echo -n '~'|iconv -f SHIFT-JIS -t utf-8|od -t x1
> 000 e2 80 be
>
> So this seems not to be a Wget issue, but maybe a general character conversion
> issue. Not sure what Wget could do...
> 
> Regards, Tim


Shift JIS is not a single well-defined character set.
There are x-sjis-unicode, x-sjis-cp932, x-sjis-jisx0221, x-sjis-jdk117
that all are called "shift-jis", and are subtly different.
See also https://www.w3.org/TR/japanese-xml/#sjis .


SJIS and CP932 (the "Microsoft version of SJIS") are almost identical,
and CP932 does contain a tilde.
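
The difference can be checked from a shell. This is a small sketch that
assumes the local iconv knows both the CP932 and SHIFT-JIS names (glibc's
does); results for the second command may vary between iconv
implementations.

```shell
# Byte 0x7e decoded as CP932 stays an ASCII tilde, so od shows 7e.
printf '~' | iconv -f CP932 -t UTF-8 | od -An -tx1

# The same byte decoded as SHIFT-JIS. Tim's glibc run in this thread
# reported e2 80 be (U+203E OVERLINE) for this conversion.
printf '~' | iconv -f SHIFT-JIS -t UTF-8 | od -An -tx1
```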

Java did (does?) treat SJIS 5c and 7e as ASCII 5c and 7e.
The docs say "This is in keeping with standard industry practice within Japan."

Can wget use a fallback? Use the given bytes converted from SJIS.
When that fails use these bytes converted from CP932 (if different).
When that fails use these bytes without any conversion?
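
As a rough sketch of that fallback order, written as a shell function
rather than as a patch to Wget: decode_with_fallback is a hypothetical
helper, and "fails" is taken here to mean that iconv rejects the bytes,
which is an assumption. In the tilde case the SJIS conversion succeeds and
only the later fetch fails, so a real implementation would have to retry
after the HTTP error instead.

```shell
# Hypothetical helper: read bytes on stdin, print them as UTF-8.
# Tries SHIFT-JIS first, then CP932, then passes the bytes through unchanged.
decode_with_fallback() {
    input=$(cat)
    for enc in SHIFT-JIS CP932; do
        # iconv exits non-zero when the input is invalid in that encoding.
        if out=$(printf '%s' "$input" | iconv -f "$enc" -t UTF-8 2>/dev/null); then
            printf '%s\n' "$out"
            return 0
        fi
    done
    # Last resort: no conversion at all.
    printf '%s\n' "$input"
}

printf 'baz.html' | decode_with_fallback   # prints baz.html
```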


It looks like http://seesaawiki.jp/w/kou1okada/d/wget%20-%20troubleshooting
describes the same problem. There, three successful suggestions are given
(for wget 1.13.4): (i) Give one of ASCII, EUC-JP or UTF-8 with the
--remote-encoding option, (ii) Give the --no-iri option, (iii) Export LANG=C.
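
Spelled out as commands, the three suggestions might look like the wrappers
below. The function names are invented for this example, the URL argument is
whatever start page is being mirrored, and whether each variant helps with a
given Wget version has not been verified here. Note that claiming UTF-8 for
a page that is really Shift JIS may mis-decode links containing Japanese
characters.

```shell
# (i) Override the encoding Wget assumes for the remote pages.
fetch_remote_utf8() { wget -r --remote-encoding=UTF-8 "$1"; }

# (ii) Turn off IRI handling, so no charset conversion is applied to links.
fetch_no_iri() { wget -r --no-iri "$1"; }

# (iii) Run Wget in the C locale.
fetch_c_locale() { LANG=C wget -r "$1"; }

# Example call (not executed here):
#   fetch_no_iri 'http://example.com/~foo/bar.html'
```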

Andries



Re: [Bug-wget] Tilde issue with recursive download when IRI is enabled and a page uses Shift JIS

2017-02-06 Thread Tim Rühsen
On Monday, 6 February 2017 05:02:57 CET William Prescott wrote:
> Hello,
> 
> I'm encountering a problem when recursively downloading from a website when
> the URL contains a tilde and the page encoding claims to be Shift JIS.
> 
> I've tried both Wget 1.17.1 (from Ubuntu 16.04) and 1.19 (from source,
> with Libidn2 0.16).
> I believe my local character encoding is UTF-8.
> 
> The first page will download okay, but then most pages after it will get the
> tilde converted to "%E2%80%BE" ("‾"), which, as one would expect, doesn't
> work.

Hi William,

reproducible by:

$ echo '~'|iconv -f SHIFT-JIS -t utf-8
‾

$ echo -n '~'|iconv -f SHIFT-JIS -t utf-8|od -t x1
000 e2 80 be

So this seems not to be a Wget issue, but maybe a general character conversion
issue. Not sure what Wget could do...

Regards, Tim