Re: IDN patch for wget

2006-03-22 Thread Hrvoje Niksic
[ Moving the discussion from the patches list to the general
  discussion list, followed by more people. ]

Juho Vähä-Herttua [EMAIL PROTECTED] writes:

 Thank you for mentioning this feature, I forgot to explicitly mention
 it in my mail. Currently wget doesn't handle the charset at all on
 HTML pages, so the recursive feature is already horribly broken on
 some websites.

That is a different issue, and it only arises on sites that use a
non-8-bit-wide fixed-width encoding, such as UTF-16.  ("Such as" is a
euphemism because I know of no other such encoding that is in wide
use.)

On the other hand, the IDN feature, as implemented by your patch,
simply doesn't work (it silently malfunctions) whenever the HTML/HTTP
charset is different than the charset of the user's locale --
regardless of whether it is UTF-16, Latin *, UTF-8, or something else.

 So someone could file this into wget bugs list, but I can tell you
 it's not easy to resolve.

It's not that hard, either -- you can always transform UTF-16 into
UTF-8 and work with that.

 However, I don't see how this is related to IDN, it is related to
 all domain names and correct HTML parsing.

The problem *you* described (retrieving UTF-16 pages) is not at all
related to IDN.  However, the problem *I* described (charsets in HTML
and in the user's locale differing) is very much related to IDN because your
patch doesn't address the problem at all, and you don't seem to have a
problem with that.

Before IDN, Wget would simply send to the server whatever it found in
the HTML.  With IDN, charset-aware processing is done, and it has to
take the page charset into account.  Your patch doesn't do that -- it
silently assumes (or so I believe; you never confirmed this) that the
charset of u->host is the charset of the user's locale.  That breaks
with any page that specifies a different charset and attempts to link
to a non-ASCII domain.


Re: IDN patch for wget

2006-03-22 Thread Hrvoje Niksic
Juho Vähä-Herttua [EMAIL PROTECTED] writes:

 It is very much related to IDN.  If wget detected the correct
 charset of web page content, it would be trivial to do the
 conversions to IDN knowing that; this is what most web browsers do.
 Wget never even detects the correct charset.

And that is why your IDN patch is incomplete.  You made it sound like
proper IDN support was as simple as an eight-line addition to url.c,
when in fact it's (unfortunately) not.

 I also asked for comments on invalid domain name resolving but I
 never got answers, I suppose you don't think that's an issue.

I don't know enough about IDN to know how this should be handled.

 But for me this conversation is already taking too much useful time
 away from other projects compared to how much interest I have in
 this issue.

I'm sorry to hear that; I was under the impression that you were
interested in making Wget support IDN, not just providing a partial
proof-of-concept patch.


Re: IDN patch for wget

2006-03-22 Thread Hrvoje Niksic
Juho Vähä-Herttua [EMAIL PROTECTED] writes:

 It's not that hard, either -- you can always transform UTF-16 into
 UTF-8 and work with that.

 No you can't. Then the filenames in URLs that should be as escaped UTF-16 will
 be transformed into escaped UTF-8.

Can you elaborate on this?  What I had in mind was:

1. start with a stream of UTF-16 sequences
2. convert that into a string of UCS code points
3. encode that into UTF-8
now work with UTF-8 consistently
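
In code, the whole thing is roughly this (an untested sketch written
off the top of my head; utf16_to_utf8 is just an illustrative name,
not an existing Wget function):

--- clip ---
/* Untested sketch: decode a UTF-16 code unit stream (native byte
   order, surrogate pairs handled) and re-encode it as UTF-8.
   Returns a malloc'd, NUL-terminated string, or NULL on malformed
   input or allocation failure.  */

#include <stdint.h>
#include <stdlib.h>

char *
utf16_to_utf8 (const uint16_t *in, size_t len)
{
  /* Worst case: each UTF-16 unit expands to at most 3 UTF-8 bytes,
     and each surrogate pair (2 units) to 4 bytes.  */
  char *out = malloc (3 * len + 1);
  char *p = out;
  size_t i;

  if (!out)
    return NULL;

  for (i = 0; i < len; i++)
    {
      uint32_t cp = in[i];

      /* Combine a surrogate pair into one code point.  */
      if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < len
          && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF)
        {
          cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
          i++;
        }
      else if (cp >= 0xD800 && cp <= 0xDFFF)
        {
          free (out);          /* lone surrogate: malformed input */
          return NULL;
        }

      /* Standard UTF-8 encoding of one code point.  */
      if (cp < 0x80)
        *p++ = cp;
      else if (cp < 0x800)
        {
          *p++ = 0xC0 | (cp >> 6);
          *p++ = 0x80 | (cp & 0x3F);
        }
      else if (cp < 0x10000)
        {
          *p++ = 0xE0 | (cp >> 12);
          *p++ = 0x80 | ((cp >> 6) & 0x3F);
          *p++ = 0x80 | (cp & 0x3F);
        }
      else
        {
          *p++ = 0xF0 | (cp >> 18);
          *p++ = 0x80 | ((cp >> 12) & 0x3F);
          *p++ = 0x80 | ((cp >> 6) & 0x3F);
          *p++ = 0x80 | (cp & 0x3F);
        }
    }
  *p = '\0';
  return out;
}
--- clip ---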

What do you mean by file names as escaped UTF-16?

 silently assumes (or so I believe; you never confirmed this) that the
 charset of u->host is the charset of the user's locale.  That breaks
 with any page that specifies a different charset and attempts to link
 to a non-ASCII domain.

 I confirmed this quite clearly in my earlier mail:

I must have missed that part of the mail; sorry about that.

 It assumes this with the function I used, but it also supports  
 conversions from unicode strings where conversions are made
 manually.

So Wget has not only to call libidn, but also to call an unspecified
library that converts charsets encountered in HTML (potentially a
large set) to Unicode?


Re: IDN patch for wget

2006-03-22 Thread Juho Vähä-Herttua

On 22.3.2006, at 17:10, Hrvoje Niksic wrote:

Can you elaborate on this?  What I had in mind was:

1. start with a stream of UTF-16 sequences
2. convert that into a string of UCS code points
3. encode that into UTF-8
now work with UTF-8 consistently

What do you mean by file names as escaped UTF-16?


I will take that back a little; after trying it in real life it's
actually not a very bad idea.  What I meant was that the URL uses
8-bit escaping, but if UTF-16 strings have non-ASCII characters, how
are they encoded?  With a few tests I found out that Opera, Firefox
and Konqueror all do exactly what you suggested: they convert the URL
to UTF-8 and then escape those 8-bit sequences.  I first thought some
would use the raw UTF-16 byte representation, but I was wrong; I
don't see any use for it either.  Safari doesn't seem to like
non-ASCII characters in wide charsets at all, which seems reasonable.



It assumes this with the function I used, but it also supports
conversions from unicode strings where conversions are made
manually.


So Wget has not only to call libidn, but also to call an unspecified
library that converts charsets encountered in HTML (potentially a
large set) to Unicode?


Libidn links to iconv (which is a prerequisite for any
internationalization) and can handle the conversion itself.  If it
didn't, it would be more feasible to call iconv and just write the
punycode encoding manually.  Is it possible to have multiple charsets
in a single HTML file?  Because all we need is for wget to tell the
url handler which charset we are using right now.  If the url comes
from the command line, it would be the current locale.  If finding
out the charset from HTTP/HTML turns out to be too hard, I suggest
either limiting IDN support to the command line or dropping the whole
thing.
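
To make it concrete, once the charset of the hostname is known the
whole conversion is roughly this (just a sketch I haven't built into
wget; host_to_idn is a made-up name, while iconv and
idna_to_ascii_8z are what iconv/libidn provide):

--- clip ---
/* Untested sketch: convert HOST, known to be in encoding CHARSET
   (e.g. taken from the HTTP header or a <meta> tag), to its ASCII
   "xn--" form.  Returns a malloc'd string or NULL on failure.  */

#include <iconv.h>
#include <idna.h>
#include <stdlib.h>
#include <string.h>

char *
host_to_idn (const char *host, const char *charset)
{
  iconv_t cd;
  size_t inleft = strlen (host), outleft = 4 * inleft + 1;
  char *utf8 = malloc (outleft), *out = utf8;
  char *in = (char *) host;
  char *ace = NULL;

  if (!utf8)
    return NULL;

  /* Step 1: recode the host name from the page charset to UTF-8.  */
  cd = iconv_open ("UTF-8", charset);
  if (cd == (iconv_t) -1
      || iconv (cd, &in, &inleft, &out, &outleft) == (size_t) -1)
    {
      if (cd != (iconv_t) -1)
        iconv_close (cd);
      free (utf8);
      return NULL;
    }
  iconv_close (cd);
  *out = '\0';

  /* Step 2: punycode/ACE conversion; libidn expects UTF-8 input.  */
  if (idna_to_ascii_8z (utf8, &ace, 0) != IDNA_SUCCESS)
    ace = NULL;
  free (utf8);
  return ace;
}
--- clip ---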


To answer earlier comments, I don't remember ever saying that my
patch was complete or full and proper IDN support.  I just
demonstrated that it's easy to convert hostnames to IDN using libidn.
I had no idea at that time that wget ignores all charsets in HTML
files altogether, but I found out quite soon.  I'm interested in
making wget support IDN -- to a certain point.  And my question about
DNS queries can be expressed with the following patch.  Why not do:


--- clip ---
Index: src/url.c
===================================================================
--- src/url.c   (revision 2135)
+++ src/url.c   (working copy)
@@ -836,8 +836,8 @@
      converted to %HH by reencode_escapes).  */
   if (strchr (u->host, '%'))
     {
-      url_unescape (u->host);
-      host_modified = true;
+      error_code = PE_INVALID_HOST_NAME;
+      goto error;
     }
   if (params_b)
--- clip ---

I don't understand the explanation of supporting binary characters
in hostnames, since they are not supported in RFC 1035 section 2.3.1.
It is mentioned, though, that this syntax is only preferred, but I'm
not aware of any applications that would break the specification.
Instead they all use punycode to fulfill the requirements of the
specification mentioned before.



Juho


Re: IDN patch for wget

2006-03-22 Thread Hrvoje Niksic
Juho Vähä-Herttua [EMAIL PROTECTED] writes:

 On 22.3.2006, at 17:10, Hrvoje Niksic wrote:
 Can you elaborate on this?  What I had in mind was:

 1. start with a stream of UTF-16 sequences
 2. convert that into a string of UCS code points
 3. encode that into UTF-8
 now work with UTF-8 consistently

 What do you mean by file names as escaped UTF-16?

 I will take that back a little; after trying it in real life it's
 actually not a very bad idea.

Thanks.  The nice thing about such a transformation is that the
program doesn't have to be aware of the properties of individual
Unicode characters; it only needs to mechanically decode UTF-16 and
encode UTF-8, which shouldn't be hard given that both encodings are
fairly simple and well-defined.

 What I meant was that the URL uses 8-bit escaping, but if UTF-16
 strings have non-ASCII characters, how are they encoded?

In Wget's scenario they would be converted to a sequence of UTF-8
bytes representing those non-ASCII characters, and subsequently each
byte would be escaped as %HH, as you say Opera and Firefox do.  In
fact, everything would work exactly the same way as if the HTML
contained the link encoded as UTF-8.  Since most web servers accept
UTF-8 in URLs (or so I'm told), there's a good chance that things
would just work.
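
The escaping step itself is trivial; roughly this (a sketch, not the
actual url.c code; escape_utf8_path is a made-up name):

--- clip ---
/* Untested sketch: escape every byte outside a small set of safe
   ASCII characters as %HH, as would be done to a path that has
   already been re-encoded to raw UTF-8.  Returns a malloc'd string
   or NULL on allocation failure.  */

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char *
escape_utf8_path (const char *s)
{
  size_t len = strlen (s);
  char *out = malloc (3 * len + 1);   /* worst case: every byte escaped */
  char *p = out;

  if (!out)
    return NULL;
  for (; *s; s++)
    {
      unsigned char c = (unsigned char) *s;
      if (isalnum (c) || strchr ("-._~/", c))
        *p++ = c;                     /* safe byte: copy as-is */
      else
        p += sprintf (p, "%%%02X", c); /* anything else becomes %HH */
    }
  *p = '\0';
  return out;
}
--- clip ---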

 So Wget has not only to call libidn, but also to call an unspecified
 library that converts charsets encountered in HTML (potentially a
 large set) to Unicode?

 Libidn links to iconv (which is a prerequisite for any
 internationalization)

Please note that Wget is portable to systems without iconv, or with
old and very limited versions of iconv.  While I agree that using
iconv is beneficial for many purposes, I also believe that Wget should
try to remain as flexible as possible, which includes doing the right
thing even without external libraries like iconv.  The
above-referenced UTF-16 -> UTF-8 transformation is an example of that.

For proper construction of IDN hostnames, something like iconv (and
possibly also libidn) appears to be necessary.  Maybe that was part of
the reason why I wasn't too eager to consider IDN in Wget, at least
until it becomes popular enough for supporting it to be a necessity.

 To answer earlier comments, I don't remember ever saying that my
 patch was complete or full and proper IDN support.

You're right, you didn't say that.  I understood that from the
statement that it's very easy to do with GNU libidn, whereas I now
see that you meant that GNU libidn makes construction of IDN easy,
nothing else.

 And my question about DNS queries can be expressed with the
 following patch.  Why not do:

 --- clip ---
 Index: src/url.c
 ===================================================================
 --- src/url.c   (revision 2135)
 +++ src/url.c   (working copy)
 @@ -836,8 +836,8 @@
       converted to %HH by reencode_escapes).  */
    if (strchr (u->host, '%'))
      {
 -      url_unescape (u->host);
 -      host_modified = true;
 +      error_code = PE_INVALID_HOST_NAME;
 +      goto error;
      }
    if (params_b)
 --- clip ---

Because that would break, for example, this URL:

http://%77%77%77.%63%6e%6e.%63%6f%6d/

which Opera and Konqueror handle (but Firefox doesn't).  I kind of
like the idea that % escapes work anywhere in the URL.  I'm not sure
if the current RFCs consider the above an acceptable alternative to
"http://www.cnn.com/".
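
For reference, the unescaping that makes the above URL work is just
this (a sketch in the spirit of url_unescape, not the actual code;
unescape_in_place is a made-up name):

--- clip ---
/* Untested sketch of %HH unescaping, done in place:
   "%77%77%77.%63%6e%6e.%63%6f%6d" decodes to "www.cnn.com".  */

#include <ctype.h>

static int
hexval (int c)
{
  if (isdigit (c))
    return c - '0';
  return tolower (c) - 'a' + 10;
}

void
unescape_in_place (char *s)
{
  char *w = s;
  for (; *s; s++)
    {
      if (*s == '%' && isxdigit ((unsigned char) s[1])
          && isxdigit ((unsigned char) s[2]))
        {
          /* Replace the three-character %HH sequence with one byte.  */
          *w++ = hexval ((unsigned char) s[1]) * 16
                 + hexval ((unsigned char) s[2]);
          s += 2;
        }
      else
        *w++ = *s;
    }
  *w = '\0';
}
--- clip ---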

We might want to search for invalid characters after unescaping the
host name, but I saw no reason to do that.  For one, getaddrinfo is (I
suppose) perfectly capable of rejecting invalid host names or simply
of not finding them.  Also, if such host names happen to (somehow)
work in some environments (for example by libc's getaddrinfo
performing the IDN translation automatically), why prevent that?
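
(If I remember correctly, GNU libc offers an AI_IDN flag that makes
getaddrinfo itself perform the IDN translation when asked.  A sketch,
with the caveat that the flag is a GNU extension and doesn't exist on
other systems, hence the #ifdef:)

--- clip ---
/* Untested sketch: resolve a possibly non-ASCII host name, letting
   glibc's getaddrinfo do the IDNA encoding via AI_IDN (interpreted
   in the current locale's codeset).  */

#define _GNU_SOURCE
#include <netdb.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>

int
main (int argc, char **argv)
{
  struct addrinfo hints = { 0 }, *res;
  int err;

  if (argc != 2)
    {
      fprintf (stderr, "usage: %s HOSTNAME\n", argv[0]);
      return EXIT_FAILURE;
    }
  hints.ai_socktype = SOCK_STREAM;
#ifdef AI_IDN
  hints.ai_flags |= AI_IDN;     /* let libc do the IDN translation */
#endif
  err = getaddrinfo (argv[1], NULL, &hints, &res);
  if (err != 0)
    {
      fprintf (stderr, "getaddrinfo: %s\n", gai_strerror (err));
      return EXIT_FAILURE;
    }
  puts ("resolved");
  freeaddrinfo (res);
  return EXIT_SUCCESS;
}
--- clip ---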

I hope this answers your question.