Re: non-ASCII in host names

Hrvoje Niksic Fri, 19 Mar 2004 12:59:38 -0800

"Gisle Vanem" <[EMAIL PROTECTED]> writes:

> I have to use the ACE form www.xn--troms-zua.no
> which is a bit of a pain.
> Ref. http://www.norid.no/domenenavnbaser/ace/?language=en
>
> Why is wget munging the hostname here? It seems to call
> reencode_escapes() on the hostname part. I don't know why.


Calling reencode_escapes() is correct anywhere in the URL; what Wget
needs to do is unescape the host part of the URL before using it
further.
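
To illustrate what that unescaping step amounts to, here is a minimal
standalone sketch. It is not Wget's url_unescape(); the names hex_value
and decode_percent_host are hypothetical, and leaving malformed sequences
and %00 untouched is an assumption made for this example.

```c
#include <ctype.h>
#include <string.h>

/* Numeric value of a hex digit.  The caller guarantees isxdigit (c).  */
static int
hex_value (unsigned char c)
{
  if (c >= '0' && c <= '9')
    return c - '0';
  return tolower (c) - 'a' + 10;
}

/* Decode %HH sequences in S in place (hypothetical helper, not part of
   Wget).  Returns 1 if S was changed, 0 otherwise.  Malformed sequences
   and %00 are copied through unchanged; decoding %00 would silently
   truncate the string.  */
static int
decode_percent_host (char *s)
{
  char *src = s, *dst = s;
  int changed = 0;

  while (*src)
    {
      if (src[0] == '%'
          && isxdigit ((unsigned char) src[1])
          && isxdigit ((unsigned char) src[2]))
        {
          int value = hex_value ((unsigned char) src[1]) * 16
                      + hex_value ((unsigned char) src[2]);
          if (value != 0)
            {
              *dst++ = (char) value;
              src += 3;
              changed = 1;
              continue;
            }
        }
      *dst++ = *src++;
    }
  *dst = '\0';
  return changed;
}
```

Given a buffer holding "www.troms%C3%B8.no" (an example input assumed
here), the call rewrites it to the raw 8-bit bytes and returns 1; a host
name without any %HH sequence is left alone and the call returns 0.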

> If it were not for the "Host:" header, the name could remain
> un-escaped. I don't know what the standard says about this case.
> Should the header contain "Host:www.xn--troms-zua.no"?

The Host header is (I think) not URL-escaped, so we can simply send
the 8-bit characters as we received them.

Here's a patch; please let me know if it works for you.

2004-03-19  Hrvoje Niksic  <[EMAIL PROTECTED]>

        * url.c (url_parse): Decode %HH sequences in host name.

Index: src/url.c
===================================================================
RCS file: /pack/anoncvs/wget/src/url.c,v
retrieving revision 1.110
diff -u -r1.110 url.c
--- src/url.c   2003/12/15 10:22:54     1.110
+++ src/url.c   2004/03/19 20:57:43
@@ -999,6 +999,15 @@
 
   host_modified = lowercase_str (u->host);
 
+  /* Decode %HH sequences in host name.  This is important not so much
+     to support %HH sequences, but to support binary characters (which
+     will have been converted to %HH by reencode_escapes).  */
+  if (strchr (u->host, '%'))
+    {
+      url_unescape (u->host);
+      host_modified = 1;
+    }
+
   if (params_b)
     u->params = strdupdelim (params_b, params_e);
   if (query_b)
