Re: IDN patch for wget

Hrvoje Niksic Wed, 22 Mar 2006 02:14:57 -0800

[ Moving the discussion from the patches list to the general
  discussion list, which more people follow. ]

Juho Vähä-Herttua <[EMAIL PROTECTED]> writes:

> Thank you for mentioning this feature, I forgot to explicitly mention
> it in my mail. Currently wget doesn't handle the charset at all on
> HTML pages, so the recursive feature is already horribly broken on
> some websites.

That is a different issue, and it only arises on sites that use an
encoding built on code units wider than 8 bits, such as UTF-16.
("Such as" is a euphemism because I know of no other such encoding
that is in wide use.)

On the other hand, the IDN feature, as implemented by your patch,
simply doesn't work (it silently malfunctions) whenever the HTML/HTTP
charset is different from the charset of the user's locale --
regardless of whether it is UTF-16, Latin *, UTF-8, or something else.

> So someone could file this into wget bugs list, but I can tell you
> it's not easy to resolve.

It's not that hard, either -- you can always transform UTF-16 into
UTF-8 and work with that.
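
To illustrate the kind of transformation meant here, a minimal sketch
in Python (not Wget code; Wget itself is written in C). The function
name utf16_to_utf8 is made up for this example, and the sketch assumes
the page starts with a byte order mark so that the codec can pick the
byte order by itself.

```python
def utf16_to_utf8(raw: bytes) -> bytes:
    """Re-encode a UTF-16 page as UTF-8 (hypothetical helper).

    The "utf-16" codec consumes a leading byte order mark and uses it
    to choose the byte order; this sketch assumes one is present.
    """
    text = raw.decode("utf-16")
    return text.encode("utf-8")


# A tiny UTF-16 page containing a non-ASCII link.
page = '<a href="http://bücher.example/">books</a>'.encode("utf-16")
converted = utf16_to_utf8(page)
```

After the conversion, every ASCII character is a single byte again, so
a byte-oriented HTML parser can find the tags and attributes as usual.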

> However, I don't see how this is related to IDN, it is related to
> all domain names and correct HTML parsing.

The problem *you* described (retrieving UTF-16 pages) is not at all
related to IDN.  However, the problem *I* described (charsets in HTML
and in user's locale differing) is closely related to IDN because your
patch doesn't address the problem at all, and you don't seem to have a
problem with that.

Before IDN, Wget would simply send to the server whatever it found in
the HTML.  With IDN, charset-aware processing is done, and it has to
take the page charset into account.  Your patch doesn't do that -- it
silently assumes (or so I believe; you never confirmed this) that the
charset of u->host is the charset of the user's locale.  That breaks
with any page that specifies a different charset and attempts to link
to a non-ASCII domain.
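
The failure mode can be sketched in a few lines of Python.  This is
not the patch's code; it only assumes, as described above, that the
host bytes are interpreted in the locale charset.  The helper name
host_to_ace, the host name, and the two charsets are made up for the
illustration.  Python's built-in "idna" codec does the ASCII
conversion.

```python
def host_to_ace(host_bytes: bytes, assumed_charset: str) -> bytes:
    """Decode raw host bytes with the given charset, then convert the
    result to its ASCII-compatible (xn--) form.  Hypothetical helper."""
    return host_bytes.decode(assumed_charset).encode("idna")


# Host name as it appears in a page served as ISO-8859-2:
# byte 0xE8 is "č" in ISO-8859-2 but "è" in ISO-8859-1.
raw_host = "čaj.example".encode("iso-8859-2")

# Taking the page charset into account gives the intended name.
right = host_to_ace(raw_host, "iso-8859-2")

# Assuming an ISO-8859-1 locale instead raises no error; it just
# produces the ASCII form of a different domain name.
wrong = host_to_ace(raw_host, "iso-8859-1")
```

Both calls succeed, which is the point: nothing signals that the
second result refers to a domain the page author never linked to.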
