Re: Doing character encoding/decoding within libwww?
On 9/23/07, Bjoern Hoehrmann [EMAIL PROTECTED] wrote:

> Well that is necessarily so to keep the interface simple. Going from
> LWP::Simple::get to LWP::UserAgent->new->get(...) is easy enough to
> not warrant adding functionality to LWP::Simple.

My concern, though, is that with this approach, LWP::Simple isn't just
lacking features: it's harmful. Users of LWP::Simple today cannot
guarantee that the octets they get are usable as text. So long as
applications use it, these applications will never be properly
internationalizable, and we will continue seeing new applications
written that don't properly handle character encodings.

> Actually that is not the case, there are plenty of, say,
> application/* formats, like the XML types, that carry encoding
> information in the header, without replicating it in the content
> (likewise, information in the content may not be replicated in the
> header, and the two may contradict each other).

I didn't notice that application/xml and +xml media types also made
the HTTP charset authoritative. Basically, my thought is that if it
follows these rules (by placing it in the HTTP headers), it seems
appropriate to decode it as text. Otherwise, the charset information
will require some closer inspection, but that could easily be done by
the caller even if they use LWP::Simple.

> Well, automagic decoding of content cannot be added to LWP::Simple
> without some opt-in switch as that would break a lot of programs,
> and if you require some opt-in, you might as well require switching
> the module.

That's certainly a good argument. You could also just supplement its
methods with variants that attempt to return text instead of octets,
and deprecate or at least discourage the use of the other methods when
you're expecting text. (It might be appropriate to print out a warning
when an octet-based method is used to fetch a textual media type.)

If LWP::Simple can't be easily changed to manage character encodings
cluefully, reasonably completely, and transparently to the caller, the
responsible thing to do would be to add some verbiage to its
documentation making this clear and discouraging its use altogether
for retrieving text.

David
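The "variants that attempt to return text" idea could look roughly like the sketch below. Both function names (text_from_response and get_text) are hypothetical and are not part of LWP::Simple; the sketch assumes HTTP::Response is installed and relies on its decoded_content method, and the choice to return undef for non-textual media types is only one possible policy.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical helper: returns a character string for textual media
# types (text/*, application/xml, application/*+xml), undef otherwise.
sub text_from_response {
    my ($response) = @_;
    return unless $response->is_success;

    # content_type in scalar context is the lower-cased media type
    # without its parameters.
    my $type = $response->content_type;
    return unless $type =~ m{^(?:text/.+|application/(?:.+\+)?xml)$};

    # raise_error makes a decoding failure die instead of returning undef.
    return $response->decoded_content(raise_error => 1);
}

# Hypothetical text-returning counterpart to a simple fetch function.
sub get_text {
    my ($url) = @_;
    require LWP::UserAgent;    # loaded lazily; only needed for real fetches
    return text_from_response(LWP::UserAgent->new->get($url));
}
```

A caller expecting text would use get_text($url) and treat undef as "not text, or not retrievable", leaving the existing octet-returning functions untouched.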
Re: Doing character encoding/decoding within libwww?
On 9/22/07, Bjoern Hoehrmann [EMAIL PROTECTED] wrote:

> Generally speaking, this is rather difficult as some content may not
> be textual at all, and textual formats vary in how applications are
> to detect the encoding (e.g., XML has different rules than HTML,
> text/plain has no rules beyond looking at the charset parameter, and
> so on). If you want a general-purpose solution, a good start would
> be a module taking a HTTP::Response object and detecting the
> encoding, possibly decoding it on request.

Fortunately, we know the Content-Type at this point, so we can decide
if it's appropriate to decode it as text, and if so, how to go about
doing it. HTML::Encoding seems like it approaches the problem
reasonably well, but ideally, I'd like to be able some day to use
LWP::Simple's get() and get back a logical text string for text/* or
application/*+xml. Similarly, getprint() should do the Right Thing
with respect to my locale.

Users of LWP::Simple can't invoke another layer of processing, even if
they wanted to. So, today, it's either get back octets that may or may
not be useful as text, or use the full-blown LWP::UserAgent and add
another layer (the perhaps too-specifically-named HTML::Encoding) to
make sure you get text right. It just seems like we can simplify that.

Thanks for the feedback.

David
Re: Doing character encoding/decoding within libwww?
On 9/22/07, Bjoern Hoehrmann [EMAIL PROTECTED] wrote:

> * Bill Moseley wrote:
>> If you have the response object:
>>
>>     $response->decoded_content;
>
> That removes content encodings like gzip and deflate, but David is
> asking about character encodings like utf-8 and iso-8859-1. Content
> encodings are applied after character encodings.

So after reading Bill's response, I thought to myself the same thing,
but added, "...though that sounds like it would be the perfect place
to implement this." After checking the code, decoded_content does
indeed decode character encodings and returns text instead of octets!
I don't think it used to do that, but that's great.

It still doesn't help in the LWP::Simple case, though, and if someone
is actually using LWP::Simple for their application, they probably
aren't going to spend the time needed to ensure the octets they get
back are meaningful text either. But this certainly simplifies the
problem.

What would people think about just changing LWP::Simple to use
decoded_content instead of content?

David
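The difference between content and decoded_content described above can be seen without any network access by building a response by hand. This is a small illustration, assuming HTTP::Response is installed; the sample body and header values are made up for the example.

```perl
use strict;
use warnings;
use HTTP::Response;

# Build a response locally so the example needs no network access.
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/plain; charset=UTF-8' ],
    "caf\xC3\xA9",    # "café" as five UTF-8 octets
);

my $octets = $response->content;            # raw octets from the body
my $text   = $response->decoded_content;    # characters, per the header charset

printf "octets: %d, characters: %d\n", length $octets, length $text;
# prints "octets: 5, characters: 4"
```

The same call on a response whose Content-Type carries no usable charset falls back to whatever defaults decoded_content applies, which is the limitation discussed later in the thread.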
Re: Doing character encoding/decoding within libwww?
On 9/22/07, Bill Moseley [EMAIL PROTECTED] wrote:

> It's been a long day. What other mime types are you thinking of
> other than text/*?

The most complete implementation imaginable would start with at least
these:

    text/html          (html-specific rules)
    text/xml           (xml-specific rules)
    text/*             (general-purpose text rules)
    application/*+xml  (xml-specific rules)

You'd probably also want this to be extensible, so that I can add my
own media types at run-time to guarantee my non-obvious textual media
type is handled properly.

On the other hand, I'm less convinced now that dipping into the HTML
or XML content to figure out the proper encoding is necessarily the
proper thing to do here. My complaint about LWP::Simple was that the
HTTP Content-Type (charset) information is lost by the time it gets to
the caller. If the data isn't in text at that point, it will never
reliably get there. But for HTML and XML, if the character encoding is
actually specified in the content rather than in the HTTP headers,
then it isn't as important to deal with it up front.

I could see a case then for dealing with text/* only and returning
octets for everything else, since text/* is the only media type that
has character encoding details in the HTTP headers.

That being said, applications based on LWP::Simple are likely to work
better with HTML and XML assistance for the reason I gave earlier:
users of LWP::Simple probably aren't going to take the time to do the
proper parsing and decoding. Yes, it's still their fault for not
coding a robust application, but helping them do that is, I think,
still a valid goal, if we can do it safely.

David
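The media-type list above, together with the request for run-time extensibility, amounts to an ordered lookup table. The following is only a sketch of that idea in plain Perl; the names rule_for and register_text_type and the rule labels are hypothetical and do not correspond to anything in libwww or HTML::Encoding.

```perl
use strict;
use warnings;

# Hypothetical rule table. Entries are checked in order, so the specific
# types come before the text/* catch-all.
my @rules = (
    [ qr{^text/html$}i               => 'html' ],
    [ qr{^text/xml$}i                => 'xml'  ],
    [ qr{^application/[^/;]+\+xml$}i => 'xml'  ],
    [ qr{^text/}i                    => 'text' ],
);

# Lets a caller add a media type at run time; new entries take priority.
sub register_text_type {
    my ($pattern, $rule) = @_;
    unshift @rules, [ $pattern => $rule ];
    return;
}

# Returns the name of the rule set to apply, or undef for "leave as octets".
sub rule_for {
    my ($media_type) = @_;
    for my $entry (@rules) {
        my ($pattern, $rule) = @$entry;
        return $rule if $media_type =~ $pattern;
    }
    return;
}

for my $type (qw(text/html application/atom+xml text/csv image/png)) {
    printf "%-22s %s\n", $type, rule_for($type) // 'octets';
}
```

A caller with a non-obvious textual type could then write, for example, register_text_type(qr{^application/x-mytext$}, 'text') before fetching.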
Re: Doing character encoding/decoding within libwww?
On Sat, Sep 22, 2007 at 11:53:14PM -0700, David Nesting wrote:

> On the other hand, I'm less convinced now that dipping into the HTML
> or XML content to figure out the proper encoding is necessarily the
> proper thing to do here.

Well, it's often needed since content providers may not have the
ability to alter the server's Content-Type header to add the correct
charset.

On the other hand, it probably depends on what you plan to do with the
content. Passing it off to a parser (e.g. libxml2) would also figure
out the encoding. I have a program that uses LWP and uses
decoded_content, but then I re-encode the result before passing it on
to the next tool in the chain, which will also decode. But I've also
considered parsing the content, removing any content-specified
charsets, and returning utf8 at all times.

> My complaint about LWP::Simple was that the HTTP Content-Type
> (charset) information is lost by the time it gets to the caller. If
> the data isn't in text at that point, it will never reliably get
> there. But for HTML and XML, if the character encoding is actually
> specified in the content rather than in the HTTP headers, then it
> isn't as important to deal with it up front.
>
> I could see a case then for dealing with text/* only and returning
> octets for everything else, since text/* is the only media type that
> has character encoding details in the HTTP headers.
>
> That being said, applications based on LWP::Simple are likely to
> work better with HTML and XML assistance for the reason I gave
> earlier: users of LWP::Simple probably aren't going to take the time
> to do the proper parsing and decoding. Yes, it's still their fault
> for not coding a robust application, but helping them do that is I
> think still a valid goal, if we can do it safely.

I'd tend to agree. Make LWP::Simple return decoded content, and if you
need more control, don't use LWP::Simple.

--
Bill Moseley [EMAIL PROTECTED]
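The decode-then-re-encode step Bill describes can be sketched with the core Encode module alone. The helper name to_utf8_octets is hypothetical, and the sketch assumes the source charset is already known (for example from the Content-Type header); it does not rewrite any charset label inside the content itself.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: normalise octets in a known charset to UTF-8 octets.
sub to_utf8_octets {
    my ($octets, $charset) = @_;

    # Octets -> characters; die on malformed input rather than guessing.
    my $text = Encode::decode($charset, $octets,
        Encode::FB_CROAK() | Encode::LEAVE_SRC());

    # Characters -> UTF-8 octets for the next tool in the chain.
    return Encode::encode('UTF-8', $text);
}

my $latin1 = "caf\xE9";                               # 4 octets in ISO-8859-1
my $utf8   = to_utf8_octets($latin1, 'ISO-8859-1');   # 5 octets in UTF-8
printf "%d octets in, %d octets out\n", length $latin1, length $utf8;
```

If the content carries its own charset declaration (an XML declaration or an HTML meta element), that label would then disagree with the re-encoded octets, which is why Bill mentions removing content-specified charsets as part of the same idea.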
Re: Doing character encoding/decoding within libwww?
* David Nesting wrote:
> The most complete implementation imaginable would start with at
> least these:
>
>     text/html          (html-specific rules)
>     text/xml           (xml-specific rules)
>     text/*             (general-purpose text rules)
>     application/*+xml  (xml-specific rules)

HTML::Encoding does all of these, except text/* (for which there are
no rules beyond checking the charset parameter, though you might also
try to check for a Unicode signature at the beginning, which almost
always indicates the Unicode encoding form; HTML::Encoding can do both
but is not designed to do that for arbitrary types).

> On the other hand, I'm less convinced now that dipping into the HTML
> or XML content to figure out the proper encoding is necessarily the
> proper thing to do here. My complaint about LWP::Simple was that the
> HTTP Content-Type (charset) information is lost by the time it gets
> to the caller.

Well that is necessarily so to keep the interface simple. Going from
LWP::Simple::get to LWP::UserAgent->new->get(...) is easy enough to
not warrant adding functionality to LWP::Simple.

> I could see a case then for dealing with text/* only and returning
> octets for everything else, since text/* is the only media type that
> has character encoding details in the HTTP headers.

Actually that is not the case, there are plenty of, say, application/*
formats, like the XML types, that carry encoding information in the
header, without replicating it in the content (likewise, information
in the content may not be replicated in the header, and the two may
contradict each other).

> Yes, it's still their fault for not coding a robust application, but
> helping them do that is I think still a valid goal, if we can do it
> safely.

Well, automagic decoding of content cannot be added to LWP::Simple
without some opt-in switch as that would break a lot of programs, and
if you require some opt-in, you might as well require switching the
module.

--
Björn Höhrmann · mailto:[EMAIL PROTECTED] · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
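The "Unicode signature" check mentioned above for text/* can be sketched in a few lines of plain Perl. The function name encoding_from_signature is hypothetical; this is not how HTML::Encoding implements it, only an illustration of looking for a byte order mark at the start of the octets.

```perl
use strict;
use warnings;

# Hypothetical helper: guess the Unicode encoding form from a leading
# byte order mark. Returns undef when no signature is present.
sub encoding_from_signature {
    my ($octets) = @_;

    # Longer signatures first: the UTF-32LE mark begins with the
    # UTF-16LE mark.
    return 'UTF-32LE' if $octets =~ /^\xFF\xFE\x00\x00/;
    return 'UTF-32BE' if $octets =~ /^\x00\x00\xFE\xFF/;
    return 'UTF-8'    if $octets =~ /^\xEF\xBB\xBF/;
    return 'UTF-16LE' if $octets =~ /^\xFF\xFE/;
    return 'UTF-16BE' if $octets =~ /^\xFE\xFF/;
    return;
}

my $guess = encoding_from_signature("\xEF\xBB\xBFhello");
print defined $guess ? "signature says $guess\n" : "no signature\n";
```

A result of undef means only that no signature was found; the charset parameter, or format-specific rules for HTML and XML, would still have to be consulted.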
Re: Doing character encoding/decoding within libwww?
* David Nesting wrote:
> For most uses of libwww, developers do little with character
> encoding. Indeed, for general-case use of LWP::Simple, they can't,
> because that information isn't even exposed. Has any thought gone
> into doing this internally within libwww, so that when I fetch
> content, I get back text instead of octets?

Generally speaking, this is rather difficult as some content may not
be textual at all, and textual formats vary in how applications are to
detect the encoding (e.g., XML has different rules than HTML,
text/plain has no rules beyond looking at the charset parameter, and
so on). If you want a general-purpose solution, a good start would be
a module taking a HTTP::Response object and detecting the encoding,
possibly decoding it on request.

> I'd be happy to help work on some of this, but the fact that I see
> no use of character encodings within libwww makes me wonder if this
> is more of a policy decision not to do it.

There was a bit of a discussion to somehow use HTML::Encoding for some
parts of it, which pretty much solves the problem for HTML and XML, cf
the list archives. Help on improving HTML::Encoding would be welcome,
I have little time to work on it at the moment.

--
Björn Höhrmann
Re: Doing character encoding/decoding within libwww?
On Fri, Sep 21, 2007 at 12:49:26PM -0700, David Nesting wrote:

> For most uses of libwww, developers do little with character
> encoding. Indeed, for general-case use of LWP::Simple, they can't,
> because that information isn't even exposed. Has any thought gone
> into doing this internally within libwww, so that when I fetch
> content, I get back text instead of octets?

If you have the response object:

    $response->decoded_content;

--
Bill Moseley [EMAIL PROTECTED]
Re: Doing character encoding/decoding within libwww?
* Bill Moseley wrote:
> On Fri, Sep 21, 2007 at 12:49:26PM -0700, David Nesting wrote:
>> For most uses of libwww, developers do little with character
>> encoding. Indeed, for general-case use of LWP::Simple, they can't,
>> because that information isn't even exposed. Has any thought gone
>> into doing this internally within libwww, so that when I fetch
>> content, I get back text instead of octets?
>
> If you have the response object:
>
>     $response->decoded_content;

That removes content encodings like gzip and deflate, but David is
asking about character encodings like utf-8 and iso-8859-1. Content
encodings are applied after character encodings.

--
Björn Höhrmann
Re: Doing character encoding/decoding within libwww?
On Sat, Sep 22, 2007 at 11:50:53PM +0200, Bjoern Hoehrmann wrote:

> * Bill Moseley wrote:
>> On Fri, Sep 21, 2007 at 12:49:26PM -0700, David Nesting wrote:
>>> For most uses of libwww, developers do little with character
>>> encoding. Indeed, for general-case use of LWP::Simple, they can't,
>>> because that information isn't even exposed. Has any thought gone
>>> into doing this internally within libwww, so that when I fetch
>>> content, I get back text instead of octets?
>>
>> If you have the response object:
>>
>>     $response->decoded_content;
>
> That removes content encodings like gzip and deflate, but David is
> asking about character encodings like utf-8 and iso-8859-1. Content
> encodings are applied after character encodings.

    sub decoded_content {
        $content_ref = \Encode::decode($charset, $$content_ref,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());

--
Bill Moseley [EMAIL PROTECTED]
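The two flags in the excerpt above can be tried in isolation with the core Encode module. This is a standalone illustration of what FB_CROAK and LEAVE_SRC do, not a reproduction of the decoded_content implementation; the sample strings are made up for the example.

```perl
use strict;
use warnings;
use Encode ();

# FB_CROAK: die on malformed input. LEAVE_SRC: do not modify the source.
my $flags = Encode::FB_CROAK() | Encode::LEAVE_SRC();

# Valid UTF-8: five octets decode to four characters, and LEAVE_SRC
# keeps the source string intact.
my $octets = "caf\xC3\xA9";
my $text   = Encode::decode('UTF-8', $octets, $flags);
printf "%d octets -> %d characters\n", length $octets, length $text;

# Latin-1 octets are not valid UTF-8, so FB_CROAK makes decode() die.
my $latin1  = "caf\xE9s";
my $decoded = eval { Encode::decode('UTF-8', $latin1, $flags) };
print "decode failed: $@" unless defined $decoded;
```

Dying on malformed input means a wrong charset label surfaces as an error the caller has to handle, rather than as silently corrupted text.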
Re: Doing character encoding/decoding within libwww?
* Bill Moseley wrote:
>     sub decoded_content {
>         $content_ref = \Encode::decode($charset, $$content_ref,
>             Encode::FB_CROAK() | Encode::LEAVE_SRC());

The documentation I re-read earlier even says that... This is still a
far cry from being generally useful though, it only works for text/*
and only if the encoding is specified in the header, or the format
does not use some kind of inline label that is inconsistent with the
default. Most of the time this is not the case, however.

--
Björn Höhrmann
Re: Doing character encoding/decoding within libwww?
On Sun, Sep 23, 2007 at 01:22:21AM +0200, Bjoern Hoehrmann wrote:

> * Bill Moseley wrote:
>>     sub decoded_content {
>>         $content_ref = \Encode::decode($charset, $$content_ref,
>>             Encode::FB_CROAK() | Encode::LEAVE_SRC());
>
> The documentation I re-read earlier even says that... This is still
> a far cry from being generally useful though, it only works for
> text/* and only if the encoding is specified in the header, or the
> format does not use some kind of inline label that is inconsistent
> with the default. Most of the time this is not the case, however.

It will also find meta content-type in the markup, IIRC.

It's been a long day. What other mime types are you thinking of other
than text/*?

--
Bill Moseley [EMAIL PROTECTED]