Re: Doing character encoding/decoding within libwww?

2007-09-24 Thread David Nesting
On 9/23/07, Bjoern Hoehrmann [EMAIL PROTECTED] wrote:


 Well that is necessarily so to keep the interface simple. Going from
 LWP::Simple::get to LWP::UserAgent->new->get(...) is easy enough to not
 warrant adding functionality to LWP::Simple.


My concern, though, is that with this approach, LWP::Simple isn't just
lacking features: it's harmful.  Users of LWP::Simple today cannot guarantee
that the octets they get are usable as text.  So long as applications use
it, these applications will never be properly internationalizable and we
will continue seeing new applications written that don't properly handle
character encodings.

Actually that is not the case, there are plenty of, say, application/*
 formats, like the XML types, that carry encoding information in the
 header, without replicating it in the content (likewise, information in
 the content may not be replicated in the header, and the two may contra-
 dict each other).


I didn't notice that application/xml and +xml media types also made the HTTP
charset authoritative.  Basically, my thought is that if it follows these
rules (by placing it in the HTTP headers), it seems appropriate to decode it
as text.  Otherwise, the charset information will require some closer
inspection, but that could easily be done by the caller even if they use
LWP::Simple.


 Well, automagic decoding of content cannot be added to LWP::Simple with-
 out some opt-in switch as that would break a lot of programs, and if you
 require some opt-in, you might as well require switching the module.


That's certainly a good argument.  You could also just supplement its
methods with variants that attempt to return text instead of octets, and
deprecate or at least discourage the use of the other methods when you're
expecting text.  (It might be appropriate to print out a warning when an
octet-based method is used to fetch a textual media type.)

If LWP::Simple can't be easily changed to manage character encodings
cluefully, reasonably completely, and transparently to the caller, the
responsible thing to do would be to add some verbiage to its documentation
making this clear and discouraging its use altogether for retrieving text.

David


Re: Doing character encoding/decoding within libwww?

2007-09-23 Thread David Nesting
On 9/22/07, Bjoern Hoehrmann [EMAIL PROTECTED] wrote:

 Generally speaking, this is rather difficult as some content may not be
 textual at all, and textual formats vary in how applications are to de-
 tect the encoding (e.g., XML has different rules than HTML, text/plain
 has no rules beyond looking at the charset parameter, and so on). If you
 want a general-purpose solution, a good start would be a module taking a
 HTTP::Response object and detecting the encoding, possibly decoding it
 on request.


Fortunately, we know the Content-Type at this point, so we can decide if
it's appropriate to decode it as text, and if so, how to go about doing it.

HTML::Encoding seems like it approaches the problem reasonably well, but
ideally, I'd like to be able some day to use LWP::Simple's get() and get
back a logical text string for text/* or application/*+xml.  Similarly,
getprint() should do the Right Thing with respect to my locale.  Users of
LWP::Simple can't invoke another layer of processing, even if they wanted
to.  So, today, it's either get back octets that may or may not be useful
as text or use the full blown LWP::UserAgent and add another layer
(perhaps too-specifically-named HTML::Encoding) to make sure you get text
right.  It just seems like we can simplify that.

Thanks for the feedback.

David


Re: Doing character encoding/decoding within libwww?

2007-09-23 Thread David Nesting
On 9/22/07, Bjoern Hoehrmann [EMAIL PROTECTED] wrote:

 * Bill Moseley wrote:
 If you have the response object:
 $response->decoded_content;

 That removes content encodings like gzip and deflate, but David is
 asking about character encodings like utf-8 and iso-8859-1. Content
 encodings are applied after character encodings.


So after reading Bill's response, I thought to myself the same thing, but
added, "...though that sounds like it would be the perfect place to
implement this."

After checking the code, decoded_content does indeed decode character
encodings and returns text instead of octets!  I don't think it used to do
that, but that's great.

It still doesn't help in the LWP::Simple case, though, and if someone is
actually using LWP::Simple for their application, they probably aren't going
to spend the time needed to ensure the octets they get back are meaningful
text either.  But this certainly simplifies the problem.

What would people think about just changing LWP::Simple to use
decoded_content instead of content?
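
To make the difference between the two methods concrete, here is a minimal
sketch.  It assumes HTTP::Response (from the HTTP::Message distribution) is
installed; the response is built by hand rather than fetched, and the sample
body is made up for illustration.

```perl
use strict;
use warnings;
use HTTP::Response;

# Build a response by hand (no network access): the body is the two
# UTF-8 octets that encode the single character U+00E9.
my $res = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/plain; charset=UTF-8' ],
    "\xC3\xA9",
);

my $octets = $res->content;          # raw octets, exactly as received
my $text   = $res->decoded_content;  # character string, decoded per charset

printf "content: %d octets, decoded_content: %d character(s)\n",
    length($octets), length($text);
```

A caller that treats the result of content as text would see two characters
here instead of one, which is the kind of silent breakage being discussed.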

David


Re: Doing character encoding/decoding within libwww?

2007-09-23 Thread David Nesting
On 9/22/07, Bill Moseley [EMAIL PROTECTED] wrote:

 It's been a long day.  What other mime types are you thinking of other
 than text/*?


The most complete implementation imaginable would start with at least these:

text/html (html-specific rules)
text/xml (xml-specific rules)
text/* (general-purpose text rules)
application/*+xml (xml-specific rules)

You'd probably also want this to be extensible, so that I can add my own
media types at run-time to guarantee my non-obvious textual media type is
handled properly.
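
One way to picture that list together with the run-time extension point is
an ordered table of patterns.  This is only a sketch: the names @rules,
add_rule and rules_for are hypothetical and not part of libwww, and the rule
names are just labels for whichever detection code would be called.

```perl
use strict;
use warnings;

# Ordered list of [pattern, rule name]; the first match wins, so the
# specific types must come before the text/* catch-all.
my @rules = (
    [ qr{^text/html$}i,               'html' ],
    [ qr{^text/xml$}i,                'xml'  ],
    [ qr{^application/[^/;]+\+xml$}i, 'xml'  ],
    [ qr{^text/}i,                    'text' ],
);

# Register an extra media type at run time, ahead of the defaults.
sub add_rule {
    my ($pattern, $name) = @_;
    unshift @rules, [ $pattern, $name ];
    return;
}

# Return the rule name for a Content-Type value, or nothing for a
# type that should be left as octets.
sub rules_for {
    my ($content_type) = @_;
    my ($media_type) = split /\s*;/, $content_type // '';  # drop parameters
    $media_type //= '';
    $media_type =~ s/^\s+|\s+$//g;
    for my $rule (@rules) {
        return $rule->[1] if $media_type =~ $rule->[0];
    }
    return;
}
```

For example, add_rule(qr{^application/x-mytext$}i, 'text') would make an
otherwise opaque application type go through the general-purpose text rules.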

On the other hand, I'm less convinced now that dipping into the HTML or XML
content to figure out the proper encoding is necessarily the proper thing to
do here.  My complaint about LWP::Simple was that the HTTP Content-Type
(charset) information is lost by the time it gets to the caller.  If the
data isn't in text at that point, it will never reliably get there.  But for
HTML and XML, if the character encoding is actually specified in the
content rather than in the HTTP headers, then it isn't as important to
deal with it up front.  I could see a case then for dealing with text/* only and
returning octets for everything else, since text/* is the only media type
that has character encoding details in the HTTP headers.  That being said,
applications based on LWP::Simple are likely to work better with HTML and
XML assistance for the reason I gave earlier: users of LWP::Simple
probably aren't going to take the time to do the proper parsing and
decoding.  Yes, it's still their fault for not coding a robust
application, but helping them do that is I think still a valid goal, if we
can do it safely.

David


Re: Doing character encoding/decoding within libwww?

2007-09-23 Thread Bill Moseley
On Sat, Sep 22, 2007 at 11:53:14PM -0700, David Nesting wrote:
 On the other hand, I'm less convinced now that dipping into the HTML or XML
 content to figure out the proper encoding is necessarily the proper thing to
 do here.

Well, it's often needed since content providers may not have the
ability to alter the server's Content-Type header to add the correct
charset.

On the other hand, it probably depends on what you plan to do with the
content.  Passing off to a parser (e.g. libxml2) would also figure out
the encoding.

I have a program that uses LWP and used decoded_content but then I
re-encode it before passing it on to the next tool in the chain that
also will decode.  But, I've also considered parsing the content and
removing any content-specified charsets and returning utf8 at all
times.

 My complaint about LWP::Simple was that the HTTP Content-Type
 (charset) information is lost by the time it gets to the caller.  If
 the data isn't in text at that point, it will never reliably get
 there.  But for HTML and XML, if the character encoding is actually
 specified in the content rather than in the HTTP headers, then it
 isn't as important to deal with it up front.  I could see a case
 then for dealing with text/* only and returning octets for
 everything else, since text/* is the only media type that has
 character encoding details in the HTTP headers.  That being said,
 applications based on LWP::Simple are likely to work better with
 HTML and XML assistance for the reason I gave earlier: users of
 LWP::Simple probably aren't going to take the time to do the proper
 parsing and decoding.  Yes, it's still their fault for not coding
 a robust application, but helping them do that is I think still a
 valid goal, if we can do it safely.

I'd tend to agree.  Make LWP::Simple return decoded content and if you
need more control don't use LWP::Simple.

-- 
Bill Moseley
[EMAIL PROTECTED]



Re: Doing character encoding/decoding within libwww?

2007-09-23 Thread Bjoern Hoehrmann
* David Nesting wrote:
The most complete implementation imaginable would start with at least these:

text/html (html-specific rules)
text/xml (xml-specific rules)
text/* (general-purpose text rules)
application/*+xml (xml-specific rules)

HTML::Encoding does all of these, except text/* (for which there are no
rules beyond checking the charset parameter, though you might also try
to check for a Unicode signature at the beginning, which almost always
indicates the Unicode encoding form; HTML::Encoding can do both but is
not designed to do that for arbitrary types).
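
Checking for a Unicode signature amounts to comparing the first few octets
against the known byte order marks.  A minimal sketch follows; the names
@signatures and sniff_signature are made up for illustration, and this is
not how HTML::Encoding itself is implemented.

```perl
use strict;
use warnings;

# Byte order marks, longest first so that UTF-32LE (FF FE 00 00)
# is not reported as UTF-16LE (FF FE).
my @signatures = (
    [ "\x00\x00\xFE\xFF", 'UTF-32BE' ],
    [ "\xFF\xFE\x00\x00", 'UTF-32LE' ],
    [ "\xEF\xBB\xBF",     'UTF-8'    ],
    [ "\xFE\xFF",         'UTF-16BE' ],
    [ "\xFF\xFE",         'UTF-16LE' ],
);

# Return the encoding name suggested by a leading signature, or
# nothing if the octets do not start with one.
sub sniff_signature {
    my ($octets) = @_;
    for my $sig (@signatures) {
        my ($bom, $name) = @$sig;
        return $name if substr($octets, 0, length $bom) eq $bom;
    }
    return;
}
```

The result is only a hint: UTF-16LE text whose first character after the
signature is U+0000 starts with the same four octets as a UTF-32LE
signature, so the longest-first ordering is a heuristic rather than a proof.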

On the other hand, I'm less convinced now that dipping into the HTML or XML
content to figure out the proper encoding is necessarily the proper thing to
do here.  My complaint about LWP::Simple was that the HTTP Content-Type
(charset) information is lost by the time it gets to the caller.

Well that is necessarily so to keep the interface simple. Going from
LWP::Simple::get to LWP::UserAgent->new->get(...) is easy enough to not
warrant adding functionality to LWP::Simple.

I could see a case then for dealing with text/* only and returning octets
for everything else, since text/* is the only media type that has character
encoding details in the HTTP headers.

Actually that is not the case, there are plenty of, say, application/*
formats, like the XML types, that carry encoding information in the
header, without replicating it in the content (likewise, information in
the content may not be replicated in the header, and the two may contra-
dict each other).

Yes, it's still their fault for not coding a robust application, but
helping them do that is I think still a valid goal, if we can do it safely.

Well, automagic decoding of content cannot be added to LWP::Simple with-
out some opt-in switch as that would break a lot of programs, and if you
require some opt-in, you might as well require switching the module.
-- 
Björn Höhrmann · mailto:[EMAIL PROTECTED] · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 


Re: Doing character encoding/decoding within libwww?

2007-09-22 Thread Bjoern Hoehrmann
* David Nesting wrote:
For most uses of libwww, developers do little with character encoding.
Indeed, for general-case use of LWP::Simple, they can't, because that
information isn't even exposed.  Has any thought gone into doing this
internally within libwww, so that when I fetch content, I get back text
instead of octets?

Generally speaking, this is rather difficult as some content may not be
textual at all, and textual formats vary in how applications are to de-
tect the encoding (e.g., XML has different rules than HTML, text/plain
has no rules beyond looking at the charset parameter, and so on). If you
want a general-purpose solution, a good start would be a module taking a
HTTP::Response object and detecting the encoding, possibly decoding it
on request.

I'd be happy to help work on some of this, but the fact that I see no
use of character encodings within libwww makes me wonder if this is more
of a policy decision not to do it.

There was a bit of a discussion to somehow use HTML::Encoding for some
parts of it, which pretty much solves the problem for HTML and XML, cf
the list archives. Help on improving HTML::Encoding would be welcome,
I have little time to work on it at the moment.
-- 
Björn Höhrmann · mailto:[EMAIL PROTECTED] · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 


Re: Doing character encoding/decoding within libwww?

2007-09-22 Thread Bill Moseley
On Fri, Sep 21, 2007 at 12:49:26PM -0700, David Nesting wrote:
 For most uses of libwww, developers do little with character encoding.
 Indeed, for general-case use of LWP::Simple, they can't, because that
 information isn't even exposed.  Has any thought gone into doing this
 internally within libwww, so that when I fetch content, I get back text
 instead of octets?

If you have the response object:

$response->decoded_content;

-- 
Bill Moseley
[EMAIL PROTECTED]



Re: Doing character encoding/decoding within libwww?

2007-09-22 Thread Bjoern Hoehrmann
* Bill Moseley wrote:
On Fri, Sep 21, 2007 at 12:49:26PM -0700, David Nesting wrote:
 For most uses of libwww, developers do little with character encoding.
 Indeed, for general-case use of LWP::Simple, they can't, because that
 information isn't even exposed.  Has any thought gone into doing this
 internally within libwww, so that when I fetch content, I get back text
 instead of octets?

If you have the response object:

$response->decoded_content;

That removes content encodings like gzip and deflate, but David is
asking about character encodings like utf-8 and iso-8859-1. Content
encodings are applied after character encodings.
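
The ordering of the two layers can be shown with core modules alone.  In
this sketch the sample string and variable names are made up; gzip stands
in for the content encoding and UTF-8 for the character encoding.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use IO::Compress::Gzip     qw(gzip   $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

my $text = "na\x{EF}ve";                 # 5 characters

# Sender: character encoding first, then content encoding.
my $octets = encode('UTF-8', $text);     # 6 octets
gzip(\$octets => \my $wire) or die "gzip failed: $GzipError";

# Receiver: undo the layers in the opposite order.
gunzip(\$wire => \my $unzipped) or die "gunzip failed: $GunzipError";
my $round_trip = decode('UTF-8', $unzipped);

print "round trip ok\n" if $round_trip eq $text;
```

Removing only the gzip layer leaves $unzipped, which is still a string of
octets; it becomes text only after the second step.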
-- 
Björn Höhrmann · mailto:[EMAIL PROTECTED] · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 


Re: Doing character encoding/decoding within libwww?

2007-09-22 Thread Bill Moseley
On Sat, Sep 22, 2007 at 11:50:53PM +0200, Bjoern Hoehrmann wrote:
 * Bill Moseley wrote:
 On Fri, Sep 21, 2007 at 12:49:26PM -0700, David Nesting wrote:
  For most uses of libwww, developers do little with character encoding.
  Indeed, for general-case use of LWP::Simple, they can't, because that
  information isn't even exposed.  Has any thought gone into doing this
  internally within libwww, so that when I fetch content, I get back text
  instead of octets?
 
 If you have the response object:
 
 $response->decoded_content;
 
 That removes content encodings like gzip and deflate, but David is
 asking about character encodings like utf-8 and iso-8859-1. Content
 encodings are applied after character encodings.

sub decoded_content {


$content_ref = \Encode::decode($charset, $$content_ref,
   Encode::FB_CROAK() | Encode::LEAVE_SRC());
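
For anyone unfamiliar with those two flags, here is a small standalone
sketch of what that decode call does.  It uses only the core Encode module;
the sample strings and variable names are made up and are not taken from
the libwww source.

```perl
use strict;
use warnings;
use Encode ();

my $valid   = "caf\xC3\xA9";   # well-formed UTF-8, 5 octets
my $invalid = "caf\xE9s";      # Latin-1 octets, malformed as UTF-8

# FB_CROAK makes decode() die on malformed input instead of
# substituting U+FFFD; LEAVE_SRC keeps the source buffer unmodified.
my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

my $text = Encode::decode('UTF-8', $valid, $check);   # 4 characters

# With the wrong charset label the call dies, so trap the error.
my $decoded_bad = eval { Encode::decode('UTF-8', $invalid, $check) };
my $error = $@;

print "valid input decoded to ", length($text), " characters\n";
print "invalid input rejected\n" if $error;
```

So a mislabelled charset surfaces as an exception at the decode call rather
than as silently corrupted text, and the caller has to decide what to do
with it.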

-- 
Bill Moseley
[EMAIL PROTECTED]



Re: Doing character encoding/decoding within libwww?

2007-09-22 Thread Bjoern Hoehrmann
* Bill Moseley wrote:
sub decoded_content {


$content_ref = \Encode::decode($charset, $$content_ref,
   Encode::FB_CROAK() | Encode::LEAVE_SRC());

The documentation I re-read earlier even says that... This is still a
far cry from being generally useful though, it only works for text/*
and only if the encoding is specified in the header, or the format does
not use some kind of inline label that is inconsistent with the default.
Most of the time this is not the case, however.
-- 
Björn Höhrmann · mailto:[EMAIL PROTECTED] · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 


Re: Doing character encoding/decoding within libwww?

2007-09-22 Thread Bill Moseley
On Sun, Sep 23, 2007 at 01:22:21AM +0200, Bjoern Hoehrmann wrote:
 * Bill Moseley wrote:
 sub decoded_content {
 
 
 $content_ref = \Encode::decode($charset, $$content_ref,
    Encode::FB_CROAK() | Encode::LEAVE_SRC());
 
 The documentation I re-read earlier even says that... This is still a
 far cry from being generally useful though, it only works for text/*
 and only if the encoding is specified in the header, or the format does
 not use some kind of inline label that is inconsistent with the default.
 Most of the time this is not the case, however.

It will also find meta content-type in the markup, IIRC.

It's been a long day.  What other mime types are you thinking of other
than text/*?

-- 
Bill Moseley
[EMAIL PROTECTED]