"Gisle Vanem" <[EMAIL PROTECTED]> writes:
> I have to use the ACE form www.xn--troms-zua.no
> which is a bit of a pain.
> Ref. http://www.norid.no/domenenavnbaser/ace/?language=en
>
> Why is wget munging the hostname here? It seems to call
> reencode_escapes() on the hostname part. Why, I don't know.
Calling reencode_escapes() is correct anywhere in the URL; what Wget
needs to do is unescape the host part of the URL before using it
further.
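To illustrate the unescaping step, here is a minimal sketch of in-place
%HH decoding. The names hex_value and unescape_in_place are hypothetical
and are not Wget's own url_unescape; the sketch also assumes that
malformed sequences (a '%' not followed by two hex digits) should be
copied through unchanged.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper (not part of Wget): numeric value of one hex
   digit.  The caller must have checked isxdigit() first.  */
static int
hex_value (unsigned char c)
{
  if (c >= '0' && c <= '9')
    return c - '0';
  return tolower (c) - 'a' + 10;
}

/* Hypothetical sketch: decode %HH sequences in S in place.  The result
   is never longer than the input, so no allocation is needed.
   Assumption: a '%' that is not followed by two hex digits is copied
   unchanged rather than treated as an error.  */
static void
unescape_in_place (char *s)
{
  char *in = s;
  char *out = s;

  while (*in)
    {
      if (in[0] == '%'
          && isxdigit ((unsigned char) in[1])
          && isxdigit ((unsigned char) in[2]))
        {
          /* Combine the two hex digits into a single byte.  */
          *out++ = (char) (hex_value ((unsigned char) in[1]) * 16
                           + hex_value ((unsigned char) in[2]));
          in += 3;
        }
      else
        *out++ = *in++;
    }
  *out = '\0';
}
```

With this sketch, a host string such as "www.troms%F8.no" (assuming the
8-bit character arrived as the single byte 0xF8) is turned back into
the raw bytes that were originally typed.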
> If it were not for the "Host:" header, the name could remain
> un-escaped. I don't know what the standard says about this case.
> Should the header contain "Host: www.xn--troms-zua.no"?
The Host header is (I think) not URL-escaped, so we can simply send
the 8-bit characters as we received them.
Here's a patch; please let me know if it works for you.
2004-03-19 Hrvoje Niksic <[EMAIL PROTECTED]>
* url.c (url_parse): Decode %HH sequences in host name.
Index: src/url.c
===================================================================
RCS file: /pack/anoncvs/wget/src/url.c,v
retrieving revision 1.110
diff -u -r1.110 url.c
--- src/url.c 2003/12/15 10:22:54 1.110
+++ src/url.c 2004/03/19 20:57:43
@@ -999,6 +999,15 @@
   host_modified = lowercase_str (u->host);
+  /* Decode %HH sequences in host name.  This is important not so much
+     to support %HH sequences, but to support binary characters (which
+     will have been converted to %HH by reencode_escapes).  */
+  if (strchr (u->host, '%'))
+    {
+      url_unescape (u->host);
+      host_modified = 1;
+    }
+
   if (params_b)
     u->params = strdupdelim (params_b, params_e);
   if (query_b)