[ Moving the discussion from the patches list to the general discussion list, followed by more people. ]
Juho Vähä-Herttua <[EMAIL PROTECTED]> writes:

> Thank you for mentioning this feature, I forgot to explicitly mention
> it in my mail. Currently wget doesn't handle the charset at all on
> HTML pages, so the recursive feature is already horribly broken on
> some websites.

That is a different issue, and it only arises on sites that use a
non-8-bit-wide fixed width encoding, such as UTF-16. ("Such as" is a
euphemism because I know of no other such encoding that is in wide
use.)

On the other hand, the IDN feature, as implemented by your patch,
simply doesn't work (it silently malfunctions) whenever the HTML/HTTP
charset is different than the charset of the user's locale --
regardless of whether it is UTF-16, Latin *, UTF-8, or something else.

> So someone could file this into wget bugs list, but I can tell you
> it's not easy to resolve.

It's not that hard, either -- you can always transform UTF-16 into
UTF-8 and work with that.

> However, I don't see how this is related to IDN, it is related to
> all domain names and correct HTML parsing.

The problem *you* described (retrieving UTF-16 pages) is not at all
related to IDN. However, the problem *I* described (charsets in HTML
and in user's locale differing) is very related to IDN because your
patch doesn't address the problem at all, and you don't seem to have a
problem with that.

Before IDN, Wget would simply send to the server whatever it found in
the HTML. With IDN, charset-aware processing is done, and it has to
take the page charset into account. Your patch doesn't do that -- it
silently assumes (or so I believe; you never confirmed this) that the
charset of u->host is the charset of the user's locale. That breaks
with any page that specifies a different charset and attempts to link
to a non-ASCII domain.
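
To make the charset point concrete, here is a minimal sketch in Python rather than Wget's C. It is not taken from the patch or from Wget: the helper name `host_to_ace` and the sample host are made up for illustration. It assumes the raw host bytes and the page's declared charset are both available, and it relies on Python's built-in "idna" codec, which implements the IDNA 2003 conversion.

```python
def host_to_ace(raw_host: bytes, page_charset: str) -> str:
    """Hypothetical helper: turn host bytes found in a page into an ASCII host."""
    # Step 1: interpret the bytes in the charset the page declared,
    # not in the charset of the user's locale.
    host = raw_host.decode(page_charset)
    # Step 2: the built-in "idna" codec converts each non-ASCII label
    # to its "xn--" form and leaves ASCII labels alone.
    return host.encode("idna").decode("ascii")


name = "b\u00fccher.example"  # "bücher.example", a made-up host

# The same host arrives as different bytes depending on the page charset,
# but decoding with the right charset always yields the same ASCII name.
for charset in ("latin-1", "utf-8", "utf-16"):
    raw = name.encode(charset)
    print(charset, host_to_ace(raw, charset))

# Decoding with the wrong charset (UTF-8 bytes read as Latin-1) already
# gives a different string before any IDN processing happens.
misread = name.encode("utf-8").decode("latin-1")
print(misread == name)
```

The only design choice here is the order of operations: decode with the page charset first, then do the IDN conversion on the resulting Unicode string. The UTF-16 case goes through the same path, which is the "transform and work with that" approach mentioned above.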