On Monday 21 February 2005 03:57 pm, Hrvoje Niksic wrote:
> Mauro Tortonesi <[EMAIL PROTECTED]> writes:
> > but i suspect we will probably have to add foreign charset support
> > to wget one of these days. for example, suppose we are doing a
> > recursive HTTP retrieval and the HTML pages we retrieve are not
> > encoded in ASCII but in UTF-16 (an encoding in which it is perfectly
> > fine to have null bytes in the stream). what do we do in that
> > situation?
>
> I've never seen a UTF-16 HTML page (which doesn't mean they don't
> exist), nor have I seen reports that requested adding support for
> UTF-16.  If/when UTF-16 becomes an issue, it's not that hard to add
> rudimentary support for converting the (ASCII subset of) UTF-16 to
> ASCII, so that we can find the links.
>
> In fact, we could be even smarter -- Wget could mechanically convert
> UTF-16 to UTF-8, and parse UTF-8 contents as if it were ASCII, without
> ever being aware of the charset intricacies.  

ok.
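the mechanical conversion could look roughly like the sketch below. this is only an illustration (the function name and interface are invented, not existing wget code); real code would also need BOM detection and a UTF-16BE variant. an output buffer of 2*len + 1 bytes is always sufficient.

```c
#include <stddef.h>

/* Hypothetical helper (not part of wget): convert UTF-16LE bytes to
   UTF-8 so an ASCII-oriented parser can scan the result for links.
   Handles the BMP and surrogate pairs; returns the number of UTF-8
   bytes written and NUL-terminates the output. */
size_t utf16le_to_utf8(const unsigned char *in, size_t len, char *out)
{
    size_t i = 0, o = 0;
    while (i + 1 < len) {
        unsigned long cp = in[i] | ((unsigned long)in[i + 1] << 8);
        i += 2;
        /* combine a surrogate pair into one code point */
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < len) {
            unsigned long lo = in[i] | ((unsigned long)in[i + 1] << 8);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                i += 2;
            }
        }
        if (cp < 0x80) {                       /* ASCII: 1 byte */
            out[o++] = (char)cp;
        } else if (cp < 0x800) {               /* 2-byte sequence */
            out[o++] = (char)(0xC0 | (cp >> 6));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {             /* 3-byte sequence */
            out[o++] = (char)(0xE0 | (cp >> 12));
            out[o++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        } else {                               /* 4-byte sequence */
            out[o++] = (char)(0xF0 | (cp >> 18));
            out[o++] = (char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        }
    }
    out[o] = '\0';
    return o;
}
```

after this pass, the ASCII subset (tags, attribute names, most URLs) comes out byte-identical to plain ASCII, so the existing link extractor would work unchanged.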

> The nice thing about UTF-8 is that it can be handled with normal C string 
> functions without corrupting the international characters.

this is the reason simone and i wanted to use UTF-8 for the internal 
representation of strings in wget. and i am still not sure whether we should 
do that in order to make string handling safe.
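the property hrvoje mentions is easy to see: every byte of a multi-byte UTF-8 sequence has the high bit set, so a byte-oriented search for an ASCII token can never produce a false match inside an international character. a minimal illustration (the function is a made-up example, not wget code):

```c
#include <string.h>

/* Byte-wise search works on UTF-8 because no byte of a multi-byte
   UTF-8 sequence falls in the ASCII range 0x00-0x7F.  Returns 1 if
   the ASCII token occurs in the (possibly multi-byte) haystack. */
int utf8_contains_token(const char *haystack, const char *ascii_token)
{
    return strstr(haystack, ascii_token) != NULL;
}
```

so strstr, strchr, strtok and friends keep working on UTF-8 buffers as long as the needle is plain ASCII, which is all the HTML parser needs.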

anyway, i think we should consider writing a module for processing retrieved 
HTTP resources (typically, but not only, HTML pages) with a well-defined 
interface for external plugins, so that it could also support extensions 
written in perl or python. that way it would be easier to add e.g. 
javascript support to wget.
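such a plugin interface might be sketched as a table of function pointers, something like the following. all names here are invented for illustration; the demo plugin just counts "href=" occurrences to keep the example short.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical plugin interface -- a sketch, not existing wget code.
   A plugin declares which content types it understands and how to
   extract links from a retrieved resource body. */
struct resource_plugin {
    const char *name;
    int (*accepts)(const char *content_type);  /* nonzero = handled */
    size_t (*count_links)(const char *body);   /* simplified hook */
};

/* demo plugin: naive HTML handler counting "href=" occurrences */
static int html_accepts(const char *ct)
{
    return strncmp(ct, "text/html", 9) == 0;
}

static size_t html_count_links(const char *body)
{
    size_t n = 0;
    const char *p = body;
    while ((p = strstr(p, "href=")) != NULL) {
        n++;
        p += 5;
    }
    return n;
}

/* dispatch: run the first plugin that accepts the content type */
size_t process_resource(const struct resource_plugin *plugins,
                        size_t nplugins, const char *content_type,
                        const char *body)
{
    size_t i;
    for (i = 0; i < nplugins; i++)
        if (plugins[i].accepts(content_type))
            return plugins[i].count_links(body);
    return 0;
}

const struct resource_plugin demo_plugins[] = {
    { "html", html_accepts, html_count_links },
};
```

a perl or python binding would then only have to wrap accepts() and the link-extraction hook, and a javascript-aware plugin could slot in the same way.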

-- 
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
Institute of Human & Machine Cognition   http://www.ihmc.us
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it
