Mauro Tortonesi <[EMAIL PROTECTED]> writes:

> but i suspect we will probably have to add foreign charset support
> to wget one of these days. for example, suppose we are doing a
> recursive HTTP retrieval and the HTML pages we retrieve are not
> encoded in ASCII but in UTF-16 (an encoding in which it is perfectly
> fine to have null bytes in the stream). what do we do in that
> situation?

I've never seen a UTF-16 HTML page (which doesn't mean they don't
exist), nor have I seen requests to add UTF-16 support.  If and when
UTF-16 becomes an issue, it's not hard to add rudimentary support for
converting the ASCII subset of UTF-16 to ASCII, so that we can find
the links.

In fact, we could be even smarter -- Wget could mechanically convert
UTF-16 to UTF-8, and parse the UTF-8 content as if it were ASCII,
without ever being aware of the charset intricacies.  The nice thing
about UTF-8 is that it can be handled with normal C string functions
without corrupting the international characters.
