Re: Freeing HTTP::Message from HTML::Parser dependency
* Christopher J. Madsen wrote:
> My repo is https://github.com/madsen/io-html but since it's built with dzil, I also made a gist of the processed module to make it easier to read the docs: https://gist.github.com/1623654
>
> I took a quick look at HTTP::Message, and I think you'd just need to do
>
>     elsif ($self->content_is_html) {
>         require IO::HTML;
>         my $charset = IO::HTML::find_charset_in($$cref);
>         return $charset if $charset;
>     }
>
> You're already doing the BOM and valid-UTF8 checks; all you need is the meta check, which is what find_charset_in does.

It is not clear to me that the combination would actually conform to the HTML5 proposal, for instance, HTTP::Message seems to recognize UTF-32 BOMs, but as I recall the HTML5 proposal does not allow that.

Your UTF-8 validation code seems wrong to me, you consider the sequence F0 80 to be incomplete, but it's actually invalid, same for ED 80, see the chart in http://bjoern.hoehrmann.de/utf-8/decoder/dfa/#design.

Anyway, if people think this is the way to go, maybe HTTP::Message can adopt the Content-Type header charset extraction tests in HTML::Encoding so they don't get lost as my module becomes redundant?

--
Björn Höhrmann · mailto:bjo...@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Re: Freeing HTTP::Message from HTML::Parser dependency
On 1/16/2012 6:53 PM, Bjoern Hoehrmann wrote:
> * Christopher J. Madsen wrote:
>> My repo is https://github.com/madsen/io-html but since it's built with dzil, I also made a gist of the processed module to make it easier to read the docs: https://gist.github.com/1623654
>
> It is not clear to me that the combination would actually conform to the HTML5 proposal, for instance, HTTP::Message seems to recognize UTF-32 BOMs, but as I recall the HTML5 proposal does not allow that.

Dropping support for UTF-32 from HTTP::Message is a separate issue from removing HTML::Parser. I've got no comment on that.

> Your UTF-8 validation code seems wrong to me, you consider the sequence F0 80 to be incomplete, but it's actually invalid, same for ED 80, see the chart in http://bjoern.hoehrmann.de/utf-8/decoder/dfa/#design.

I guess the RE could be improved, but I'm not sure it's worth the effort and added complication to catch a tiny fraction of false positives.

> Anyway, if people think this is the way to go, maybe HTTP::Message can adopt the Content-Type header charset extraction tests in HTML::Encoding so they don't get lost as my module becomes redundant?

I thought it already did that?

--
Chris Madsen p...@cjmweb.net
http://www.cjmweb.net
Re: Freeing HTTP::Message from HTML::Parser dependency
* Christopher J. Madsen wrote:
> Dropping support for UTF-32 from HTTP::Message is a separate issue from removing HTML::Parser. I've got no comment on that.

(It's not quite as black and white as that, HTML5 could be exempted in the algorithm, for instance.)

>> Your UTF-8 validation code seems wrong to me, you consider the sequence F0 80 to be incomplete, but it's actually invalid, same for ED 80, see the chart in http://bjoern.hoehrmann.de/utf-8/decoder/dfa/#design.
>
> I guess the RE could be improved, but I'm not sure it's worth the effort and added complication to catch a tiny fraction of false positives.

Why make the check at all if you don't care if it's right?

>> Anyway, if people think this is the way to go, maybe HTTP::Message can adopt the Content-Type header charset extraction tests in HTML::Encoding so they don't get lost as my module becomes redundant?
>
> I thought it already did that?

Not as far as I can tell; links welcome though.

--
Björn Höhrmann · mailto:bjo...@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
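For reference, a stricter test for a truncated UTF-8 character can be sketched in Perl. This is an illustration only and not code from IO::HTML or HTTP::Message: the name is_valid_utf8_prefix is hypothetical, and the byte ranges are assumed from the well-formed byte sequence table in RFC 3629. It checks whether a short byte string could be the beginning of one valid multi-byte character that was cut off.

```perl
use strict;
use warnings;

# Hypothetical pattern: a lead byte plus any continuation bytes seen so far,
# restricted to the combinations RFC 3629 allows.
my $VALID_PREFIX = qr/\A(?:
      [\xC2-\xDF]                           # 2-byte lead
    | \xE0 [\xA0-\xBF]?                     # E0 lead, overlong forms excluded
    | [\xE1-\xEC\xEE\xEF] [\x80-\xBF]?      # ordinary 3-byte leads
    | \xED [\x80-\x9F]?                     # ED lead, surrogates excluded
    | \xF0 (?: [\x90-\xBF] [\x80-\xBF]? )?  # F0 lead, overlong forms excluded
    | [\xF1-\xF3] [\x80-\xBF]{0,2}          # ordinary 4-byte leads
    | \xF4 (?: [\x80-\x8F] [\x80-\xBF]? )?  # F4 lead, capped at U+10FFFF
)\z/x;

# Hypothetical helper: returns 1 if the bytes could start a valid
# multi-byte character, 0 otherwise.
sub is_valid_utf8_prefix {
    my ($bytes) = @_;
    return $bytes =~ $VALID_PREFIX ? 1 : 0;
}
```

Under these ranges F0 80 is rejected, because F0 must be followed by a byte in 90..BF. ED is rejected only when followed by A0..BF (the surrogate range), so ED 80 is still accepted as a prefix here.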
Re: Freeing HTTP::Message from HTML::Parser dependency
On 1/16/2012 9:52 PM, Bjoern Hoehrmann wrote:
> * Christopher J. Madsen wrote:
>>> Your UTF-8 validation code seems wrong to me, you consider the sequence F0 80 to be incomplete, but it's actually invalid, same for ED 80, see the chart in http://bjoern.hoehrmann.de/utf-8/decoder/dfa/#design.
>>
>> I guess the RE could be improved, but I'm not sure it's worth the effort and added complication to catch a tiny fraction of false positives.
>
> Why make the check at all if you don't care if it's right?

I can't use a simple utf8::decode check, because I read a fixed number of bytes, and that might have cut a multi-byte character in half. So I use Encode::FB_QUIET, and then check the leftovers to make sure that it's a single, plausible, partial UTF-8 character. I have to check the leftovers, or the whole test would be meaningless.

I just make sure it's a start byte followed by an appropriate number of continuation bytes. As you say, certain start bytes can't validly be followed by certain continuation bytes, but writing an RE for those rules is more complexity than I think the problem warrants.

What are the odds that I had 1021 bytes of valid UTF-8 (including at least 1 multi-byte character) followed by bytes that match my current RE but a strict test could have rejected? I'm already just assuming that the next bytes would be additional continuation bytes.

>>> Anyway, if people think this is the way to go, maybe HTTP::Message can adopt the Content-Type header charset extraction tests in HTML::Encoding so they don't get lost as my module becomes redundant?
>>
>> I thought it already did that?
>
> Not as far as I can tell; links welcome though.

At the beginning of content_charset, it calls content_type_charset (which is actually a HTTP::Headers method). Or were you talking about t/01http.t and its associated input files?

--
Chris Madsen p...@cjmweb.net
http://www.cjmweb.net
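The approach described in this message might look roughly like the following. It is a sketch under stated assumptions, not the module's actual code: the helper name buffer_looks_like_utf8 and the exact regex are illustrative. It relies on the documented behaviour of Encode::decode with Encode::FB_QUIET, which stops at the first bytes it cannot decode and leaves the unprocessed remainder in the argument it was given.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if $buf (a byte string that may have been cut
# at an arbitrary offset) decodes as UTF-8 apart from one plausible partial
# character at the very end.
sub buffer_looks_like_utf8 {
    my ($buf) = @_;    # a copy; decode() overwrites it with the leftovers
    my $decoded = Encode::decode('UTF-8', $buf, Encode::FB_QUIET);

    # Loose leftover check: nothing at all, or a start byte followed by
    # no more continuation bytes than that start byte allows.
    return 0 unless $buf =~ /\A(?:
          [\xC2-\xDF]                   # start of a 2-byte character
        | [\xE0-\xEF] [\x80-\xBF]?      # start of a 3-byte character
        | [\xF0-\xF4] [\x80-\xBF]{0,2}  # start of a 4-byte character
    )?\z/x;

    # Pure ASCII proves nothing, so require one non-ASCII character.
    return $decoded =~ /[^\x00-\x7F]/ ? 1 : 0;
}
```

With this loose leftover pattern, a buffer ending in F0 80 is still accepted, which is the false positive being debated in the thread.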