On Monday 21 February 2005 03:57 pm, Hrvoje Niksic wrote:
> Mauro Tortonesi <[EMAIL PROTECTED]> writes:
> > but i suspect we will probably have to add foreign charset support
> > to wget one of these days. for example, suppose we are doing a
> > recursive HTTP retrieval and the HTML pages we retrieve are not
> > encoded in ASCII but in UTF-16 (an encoding in which it is perfectly
> > fine to have null bytes in the stream). what do we do in that
> > situation?
>
> I've never seen a UTF-16 HTML page (which doesn't mean they don't
> exist), nor have I seen reports requesting that support for UTF-16 be
> added. If/when UTF-16 becomes an issue, it's not that hard to add
> rudimentary support for converting the (ASCII subset of) UTF-16 to
> ASCII, so that we can find the links.
>
> In fact, we could be even smarter -- Wget could mechanically convert
> UTF-16 to UTF-8, and parse the UTF-8 contents as if they were ASCII,
> without ever being aware of the charset intricacies.
ok.

> The nice thing about UTF-8 is that it can be handled with normal C
> string functions without corrupting the international characters.

this is the reason why simone and i wanted to use UTF-8 for the internal
representation of strings in wget, and i am still not sure whether we
should do that in order to make interpolation of strings safe.

anyway, i think we should consider writing a module for processing
retrieved HTTP resources (typically, but not only, HTML pages) with a
well-defined interface for external plugins, so that it could support
extensions written in perl or python. in this way it would be easier to
add e.g. javascript support to wget.

-- 
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
Institute of Human & Machine Cognition   http://www.ihmc.us
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it