On 22.3.2006, at 17:10, Hrvoje Niksic wrote:
Can you elaborate on this?  What I had in mind was:

1. start with a stream of UTF-16 sequences
2. convert that into a string of UCS code points
3. encode that into UTF-8
now work with UTF-8 consistently

What do you mean by file names as "escaped UTF-16"?

I'll take that back a little: after trying it in real life, it's actually not a bad idea. What I meant was that URLs use 8-bit escaping, but if UTF-16 strings contain non-ASCII characters, how are those encoded? With a few tests I found that Opera, Firefox and Konqueror all do exactly what you suggested: they convert the URL to UTF-8 and then escape those 8-bit sequences. I first thought some might use the raw UTF-16 byte representation, but I was wrong; I don't see any use for that either. Safari doesn't seem to accept non-ASCII characters in wide charsets at all, which seems reasonable.

It assumes this with the function I used, but it also supports
conversions from unicode strings where conversions are made
manually.

So Wget has not only to call libidn, but also to call an unspecified
library that converts charsets encountered in HTML (potentially a
large set) to Unicode?

Libidn links to iconv (which is a prerequisite for any internationalization) and can handle the conversion itself. If it couldn't, it would be more feasible to call iconv directly and write the punycode encoding manually. Is it possible to have multiple charsets in a single HTML file? All we need is for Wget to tell the URL handler which charset is in use right now. If the URL comes from the command line, that would be the current locale. If finding out the charset from HTTP/HTML turns out to be too hard, I suggest either limiting IDN support to the command line or dropping the whole thing.

To answer earlier comments: I don't remember ever saying my patch was complete or full, proper IDN support. I just demonstrated that it's easy to convert hostnames to IDN using libidn. At the time I had no idea that Wget ignores all charsets in HTML files altogether, but I found out soon enough. I'm interested in making Wget support IDN -- up to a point. And my question about DNS queries can be expressed with the following patch. Why not do:

--- clip ---
Index: src/url.c
===================================================================
--- src/url.c   (revision 2135)
+++ src/url.c   (working copy)
@@ -836,8 +836,8 @@
      converted to %HH by reencode_escapes).  */
   if (strchr (u->host, '%'))
     {
-      url_unescape (u->host);
-      host_modified = true;
+      error_code = PE_INVALID_HOST_NAME;
+      goto error;
     }
   if (params_b)
--- clip ---

I don't understand the explanation about supporting binary characters in hostnames, since they are not allowed by RFC 1035, section 2.3.1. The RFC does say that this syntax is only "preferred", but I'm not aware of any application that breaks the specification; instead they all use punycode precisely to meet the requirements mentioned above.


Juho
