On 22.3.2006, at 17:10, Hrvoje Niksic wrote:
Can you elaborate on this? What I had in mind was:
1. start with a stream of UTF-16 sequences
2. convert that into a string of UCS code points
3. encode that into UTF-8, then work with UTF-8 consistently
What do you mean by file names as "escaped UTF-16"?
I'll take that back a little; after trying it in real life, it's
actually not a bad idea. What I meant was that a URL uses 8-bit
escaping, but if the UTF-16 strings contain non-ASCII characters, how
are those encoded? With a few tests I found out that Opera, Firefox
and Konqueror all do exactly what you suggested: they convert the URL
to UTF-8 and then escape those 8-bit sequences. I first thought some
would use the raw UTF-16 byte representation, but I was wrong; I
can't see any use for that either. Safari doesn't seem to accept
non-ASCII characters in wide charsets at all, which seems
reasonable.
The function I used assumes this, but libidn also supports
conversions from Unicode strings, where the conversion is done
manually.
So Wget has not only to call libidn, but also to call an unspecified
library that converts charsets encountered in HTML (potentially a
large set) to Unicode?
Libidn links to iconv (which is a prerequisite for any
internationalization) and can handle the conversion itself. If it
didn't, it would be more feasible to call iconv directly and write
the punycode encoding manually. Is it possible to have multiple
charsets in a single HTML file? All we need is for Wget to tell the
URL handler which charset is in use at the moment. If the URL comes
from the command line, that would be the current locale. If finding
out the charset from HTTP/HTML turns out to be too hard, I suggest
either limiting IDN support to the command line or dropping the
whole thing.
To answer earlier comments: I don't remember ever saying my patch was
complete or constituted full and proper IDN support. I just
demonstrated that it's easy to convert hostnames to IDN using libidn.
At that time I had no idea that Wget ignores all charsets in HTML
files altogether, but I found out quite soon. I'm interested in
making Wget support IDN, up to a certain point. And my question about
DNS queries can be expressed with the following patch. Why not do:
--- clip ---
Index: src/url.c
===================================================================
--- src/url.c (revision 2135)
+++ src/url.c (working copy)
@@ -836,8 +836,8 @@
converted to %HH by reencode_escapes). */
if (strchr (u->host, '%'))
{
- url_unescape (u->host);
- host_modified = true;
+ error_code = PE_INVALID_HOST_NAME;
+ goto error;
}
if (params_b)
--- clip ---
I don't understand the explanation about supporting binary characters
in hostnames, since they are not allowed by RFC 1035 section 2.3.1.
That section does say the syntax is only "preferred", but I'm not
aware of any application that breaks the specification. Instead they
all use punycode to meet the requirements of the specification
mentioned before.
Juho