On 22.3.2006, at 17:10, Hrvoje Niksic wrote:
Can you elaborate on this? What I had in mind was:
1. start with a stream of UTF-16 sequences
2. convert that into a string of UCS code points
3. encode that into UTF-8, then work with UTF-8 consistently
What do you mean by file names as "escaped UTF-16"?
I'll take that back a little; after trying it in real life, it's
actually not a bad idea. What I meant was that a URL uses 8-bit
escaping, but if the UTF-16 strings contain non-ASCII characters, how
are those encoded? With a few tests I found out that Opera, Firefox
and Konqueror all do exactly what you suggested: they convert the URL
to UTF-8 and then escape those 8-bit sequences. I first thought some
would use the raw UTF-16 byte representation, but I was wrong; I
can't see any use for that either. Safari doesn't seem to accept
non-ASCII characters in wide charsets at all, which seems
reasonable.
The function I used assumes this, but libidn also supports
conversions from Unicode strings, where the conversion is done
manually.
So Wget has not only to call libidn, but also to call an unspecified
library that converts charsets encountered in HTML (potentially a
large set) to Unicode?
Libidn links to iconv (which is a prerequisite for any
internationalization) and can handle the conversion itself. If it
didn't, it would be more feasible to call iconv directly and write
the punycode encoding manually. Is it possible to have multiple
charsets in a single HTML file? All we need is for Wget to tell the
URL handler which charset is in use at the moment. If the URL comes
from the command line, that would be the current locale. If finding
out the charset from HTTP/HTML turns out to be too hard, I suggest
either limiting IDN support to the command line or dropping the
whole thing.
To answer earlier comments: I don't remember ever saying my patch was
complete or constituted full and proper IDN support. I just
demonstrated that it's easy to convert hostnames to IDN using libidn.
At that time I had no idea that Wget ignores all charsets in HTML
files altogether, but I found out quite soon. I'm interested in
making Wget support IDN, up to a certain point. And my question about
DNS queries can be expressed with the following patch. Why not do:
--- clip ---
Index: src/url.c
===================================================================
--- src/url.c (revision 2135)
+++ src/url.c (working copy)
@@ -836,8 +836,8 @@
converted to %HH by reencode_escapes). */
if (strchr (u->host, '%'))
{
- url_unescape (u->host);
- host_modified = true;
+ error_code = PE_INVALID_HOST_NAME;
+ goto error;
}
if (params_b)
--- clip ---
I don't understand the explanation about supporting binary characters
in hostnames, since they are not allowed by RFC 1035 section 2.3.1.
That section does say the syntax is only "preferred", but I'm not
aware of any application that breaks the specification. Instead they
all use punycode to meet the requirements of the specification
mentioned before.
Juho