Re: Freeing HTTP::Message from HTML::Parser dependency

2012-01-16 Thread Bjoern Hoehrmann
* Christopher J. Madsen wrote:
My repo is https://github.com/madsen/io-html but since it's built with
dzil, I also made a gist of the processed module to make it easier to
read the docs: https://gist.github.com/1623654

I took a quick look at HTTP::Message, and I think you'd just need to do

elsif ($self-content_is_html) {
   require IO::HTML;
   my $charset = IO::HTML::find_charset_in($$cref);
   return $charset if $charset;
}

You're already doing the BOM and valid-UTF8 checks; all you need is the
meta check, which is what find_charset_in does.

It is not clear to me that the combination would actually conform to the
HTML5 proposal, for instance, HTTP::Message seems to recognize UTF-32
BOMs, but as I recall the HTML5 proposal does not allow that.

Your UTF-8 validation code seems wrong to me, you consider the sequence
F0 80 to be incomplete, but it's actually invalid, same for ED 80, see
the chart in http://bjoern.hoehrmann.de/utf-8/decoder/dfa/#design.

Anyway, if people think this is the way to go, maybe HTTP::Message can
adopt the Content-Type header charset extraction tests in HTML::Encoding
so they don't get lost as my module becomes redundant?
-- 
Björn Höhrmann · mailto:bjo...@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 


Re: Freeing HTTP::Message from HTML::Parser dependency

2012-01-16 Thread Christopher J. Madsen
On 1/16/2012 6:53 PM, Bjoern Hoehrmann wrote:
 * Christopher J. Madsen wrote:
 My repo is https://github.com/madsen/io-html but since it's built with
 dzil, I also made a gist of the processed module to make it easier to
 read the docs: https://gist.github.com/1623654

 It is not clear to me that the combination would actually conform to the
 HTML5 proposal, for instance, HTTP::Message seems to recognize UTF-32
 BOMs, but as I recall the HTML5 proposal does not allow that.

Dropping support for UTF-32 from HTTP::Message is a separate issue from
removing HTML::Parser.  I've got no comment on that.

 Your UTF-8 validation code seems wrong to me, you consider the sequence
 F0 80 to be incomplete, but it's actually invalid, same for ED 80, see
 the chart in http://bjoern.hoehrmann.de/utf-8/decoder/dfa/#design.

I guess the RE could be improved, but I'm not sure it's worth the effort
and added complication to catch a tiny fraction of false positives.

 Anyway, if people think this is the way to go, maybe HTTP::Message can
 adopt the Content-Type header charset extraction tests in HTML::Encoding
 so they don't get lost as my module becomes redundant?

I thought it already did that?

-- 
Chris Madsen  p...@cjmweb.net
    http://www.cjmweb.net  



Re: Freeing HTTP::Message from HTML::Parser dependency

2012-01-16 Thread Bjoern Hoehrmann
* Christopher J. Madsen wrote:
Dropping support for UTF-32 from HTTP::Message is a separate issue from
removing HTML::Parser.  I've got no comment on that.

(It's not quite as black and white as that, HTML5 could be exempted in
the algorithm, for instance.)

 Your UTF-8 validation code seems wrong to me, you consider the sequence
 F0 80 to be incomplete, but it's actually invalid, same for ED 80, see
 the chart in http://bjoern.hoehrmann.de/utf-8/decoder/dfa/#design.

I guess the RE could be improved, but I'm not sure it's worth the effort
and added complication to catch a tiny fraction of false positives.

Why make the check at all if you don't care if it's right?

 Anyway, if people think this is the way to go, maybe HTTP::Message can
 adopt the Content-Type header charset extraction tests in HTML::Encoding
 so they don't get lost as my module becomes redundant?

I thought it already did that?

Not as far as I can tell; links welcome though.
-- 
Björn Höhrmann · mailto:bjo...@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 


Re: Freeing HTTP::Message from HTML::Parser dependency

2012-01-16 Thread Christopher J. Madsen
On 1/16/2012 9:52 PM, Bjoern Hoehrmann wrote:
 * Christopher J. Madsen wrote:
 Your UTF-8 validation code seems wrong to me, you consider the sequence
 F0 80 to be incomplete, but it's actually invalid, same for ED 80, see
 the chart in http://bjoern.hoehrmann.de/utf-8/decoder/dfa/#design.

 I guess the RE could be improved, but I'm not sure it's worth the effort
 and added complication to catch a tiny fraction of false positives.
 
 Why make the check at all if you don't care if it's right?

I can't use a simple utf8::decode check, because I read a fixed number
of bytes, and that might have cut a multi-byte character in half.  So I
use Encode::FB_QUIET, and then check the leftovers to make sure that
it's a single, plausible, partial UTF-8 character.  I have to check the
leftovers, or the whole test would be meaningless.  I just make sure
it's a start byte followed by an appropriate number of continuation bytes.

As you say, certain start bytes can't validly be followed by certain
continuation bytes, but writing an RE for those rules is more complexity
than I think the problem warrants.  What are the odds that I had 1021
bytes of valid UTF-8 (including at least 1 multi-byte character)
followed by bytes that match my current RE but a strict test could have
rejected?  I'm already just assuming that the next bytes would be
additional continuation bytes.

 Anyway, if people think this is the way to go, maybe HTTP::Message can
 adopt the Content-Type header charset extraction tests in HTML::Encoding
 so they don't get lost as my module becomes redundant?

 I thought it already did that?
 
 Not as far as I can tell; links welcome though.

At the beginning of content_charset, it calls content_type_charset
(which is actually a HTTP::Headers method).

Or were you talking about t/01http.t and its associated input files?

-- 
Chris Madsen  p...@cjmweb.net
    http://www.cjmweb.net