Re: First draft of work published which removes the dependency on HTML::Parser from HTTP::Message
* Mark Stosberg wrote:
>On Fri, 5 Feb 2010 22:05:27 +0100 Gisle Aas wrote:
>
>> http://dev.w3.org/html5/spec/Overview.html#determining-the-character-encoding
>> specifies how to pre-scan an HTML document to sniff the charset.
>> Would it not be simpler to just implement the algorithm as specified
>> instead of using a generic parser? The use of HTML::Parser to
>> implement this sniffing was just me trying a shortcut since
>> HTML::Parser seemed to implement a superset of these rules.
>
>Those rules look somewhat involved to me, especially knowing that we already
>have both XS and Pure Perl parsers at hand.
>
>Two thoughts:
>
>1. What about using HTML::Encoding, after adapting it so it has only a
>conditional dependency on HTML::Parser, and only uses HTML::Parser if
>available?
>(It already tries several detection methods before getting to HTML::Parser):
>
>http://search.cpan.org/~bjoern/HTML-Encoding/
>
>A variation on this idea would be for *it* to use a pure Perl HTML parser
>instead of skipping the HTML parsing check completely.

Here via https://rt.cpan.org/Ticket/Display.html?id=66313 (well, I am on
the list anyway, but apparently didn't see this). I note that I would have
no problem using a different module to parse the document for character
encoding meta elements (the code wrapping HTML::Parser is very simple, so
this should not be too hard, but I am not up to date on what alternatives
are available). Also, if someone made a pure Perl implementation of the
algorithm noted above in a manner that fits easily with my module, I'd be
happy to use that instead as well (although I doubt the proposal there
captures reality very well, but I have not kept track of the latest
developments there either).
I'd also be quite happy to invite a co-maintainer on board. I mostly wrote
the module for the W3C Markup Validator (where it is still in use) and for
some simple scraping tasks, so my interest in it is somewhat limited these
days, but I am happy to do some easy fixes anyway. Any thoughts on this?

-- 
Björn Höhrmann · mailto:bjo...@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Re: First draft of work published which removes the dependency on HTML::Parser from HTTP::Message
On Fri, 5 Feb 2010 22:05:27 +0100 Gisle Aas wrote:

> http://dev.w3.org/html5/spec/Overview.html#determining-the-character-encoding
> specifies how to pre-scan an HTML document to sniff the charset.
> Would it not be simpler to just implement the algorithm as specified
> instead of using a generic parser? The use of HTML::Parser to
> implement this sniffing was just me trying a shortcut since
> HTML::Parser seemed to implement a superset of these rules.

Those rules look somewhat involved to me, especially knowing that we
already have both XS and Pure Perl parsers at hand.

Two thoughts:

1. What about using HTML::Encoding, after adapting it so it has only a
conditional dependency on HTML::Parser, and only uses HTML::Parser if
available? (It already tries several detection methods before getting to
HTML::Parser):

http://search.cpan.org/~bjoern/HTML-Encoding/

A variation on this idea would be for *it* to use a pure Perl HTML parser
instead of skipping the HTML parsing check completely.

2. I note this from the spec page you reference:

"This algorithm is a willful violation of the HTTP specification, which
requires that the encoding be assumed to be ISO-8859-1 in the absence of a
character encoding declaration to the contrary, and of RFC 2046, which
requires that the encoding be assumed to be US-ASCII in the absence of a
character encoding declaration to the contrary. This specification's third
approach is motivated by a desire to be maximally compatible with legacy
content. [HTTP] [RFC2046]"

According to this, we can skip all this encoding-detection work and still
be HTTP spec compliant (although it might be more user-friendly to keep
trying to guess).

Mark

-- 
http://mark.stosberg.com/
Re: First draft of work published which removes the dependency on HTML::Parser from HTTP::Message
http://dev.w3.org/html5/spec/Overview.html#determining-the-character-encoding
specifies how to pre-scan an HTML document to sniff the charset. Would it
not be simpler to just implement the algorithm as specified instead of
using a generic parser? The use of HTML::Parser to implement this sniffing
was just me trying a shortcut since HTML::Parser seemed to implement a
superset of these rules.

--Gisle
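[Editorial note] The pre-scan Gisle refers to boils down to examining the first bytes of the document for a meta charset declaration. A deliberately simplified sketch in Python (not Perl, and not the full spec algorithm: it skips comment handling, proper attribute tokenization, and the special cases such as UTF-16 remapping):

```python
import re

def prescan_charset(head_bytes):
    """Rough sketch of the HTML5 encoding pre-scan: look through the first
    1024 bytes for a <meta> charset declaration. Not spec-complete."""
    head = head_bytes[:1024]
    # Loosely matches both <meta charset="..."> and
    # <meta http-equiv="Content-Type" content="...; charset=...">
    m = re.search(rb'<meta[^>]+charset\s*=\s*["\']?\s*([a-zA-Z0-9_-]+)',
                  head, re.I)
    if m:
        return m.group(1).decode('ascii').lower()
    return None

print(prescan_charset(b'<html><head><meta charset="UTF-8"></head>'))  # utf-8
```

The real algorithm walks the byte stream token by token rather than using a regex, which is exactly why implementing it "as specified" is more involved than it first looks.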
First draft of work published which removes the dependency on HTML::Parser from HTTP::Message
I now have working code published which allows HTTP::Message to work
without the dependency on HTML::Parser.

This is useful because it's a step towards splitting out some of the HTTP
modules into their own distribution without this dependency, which in turn
depends on a C compiler. So, this project could help allow parts of LWP to
be used in places where a C compiler is not available, or when it would be
more convenient to distribute one code line that could be used directly on
multiple architectures. (But this is not the only use of HTML::Parser in
the distribution: LWP::UserAgent makes use of HTML::HeadParser, which in
turn uses HTML::Parser.)

My code is here:

http://github.com/markstos/libwww-perl/tree/remove-html-parser-dependency

The solution passes the numerous existing tests for charset detection, as
well as a new one I added. However, I'm not yet recommending that the work
be merged, because the approach is not clean. Essentially, I have embedded
a fairly full-featured Pure Perl HTML parser into HTTP::Message. :) The
code was taken from my fork of the "HTML::Parser::Simple" project and
specialized somewhat for this case:

http://github.com/markstos/html--parser--simple

I think a cleaner approach would be to publish this Pure Perl HTML parser
separately, and then have an option to use it if HTML::Parser is not
available.

A little history about HTML::Parser::Simple: Ron Savage created the
project based on the htmlparser.js JavaScript parser by John Resig. That
branch was not trying to be particularly compatible with anything; it
defines a new API. In particular, it bundles a parse tree *consumer*,
Tree::Simple, as well as a parse tree producer.

I forked the project and made some incompatible changes to pursue a
different goal: create a pure Perl HTML parser that is compatible with the
HTML::Parser API. More specifically, I wanted to emulate the HTML::Parser
2.x API closely enough that my parser could be used in place of it with
HTML::FillInForm.
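[Editorial note] For readers unfamiliar with the HTML::Parser 2.x style of API: it is event-driven, with the caller supplying start/end/text callbacks. The toy Python sketch below illustrates that shape only; it is not the code in the branch, and it ignores comments, entities, and malformed markup:

```python
import re

class TinyEventParser:
    """Toy event-driven parser in the spirit of the HTML::Parser 2.x
    callback API (start/end/text handlers). Illustration only."""
    TOKEN = re.compile(r'<\s*(/?)\s*([a-zA-Z][a-zA-Z0-9]*)([^>]*)>')

    def __init__(self, start=None, end=None, text=None):
        self.start, self.end, self.text = start, end, text

    def parse(self, html):
        pos = 0
        for m in self.TOKEN.finditer(html):
            # Emit any text sitting between the previous token and this one.
            if m.start() > pos and self.text:
                self.text(html[pos:m.start()])
            closing, tag, attrs = m.group(1), m.group(2).lower(), m.group(3)
            if closing:
                if self.end:
                    self.end(tag)
            elif self.start:
                self.start(tag, attrs.strip())
            pos = m.end()
        if pos < len(html) and self.text:
            self.text(html[pos:])

events = []
p = TinyEventParser(
    start=lambda tag, attrs: events.append(('start', tag)),
    end=lambda tag: events.append(('end', tag)),
    text=lambda t: events.append(('text', t)),
)
p.parse('<head><title>Hi</title></head>')
print(events)
```

A meta-charset scan in this style just registers a `start` handler that watches for `meta` tags, which is why an event API is a natural fit for the HTTP::Message use case.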
My work met that goal: it passes the HTML::FillInForm test suite apart
from some minor failures that I don't think matter. This new case of
parsing meta tags is another specialized use of the parser that gives me
another reason to publish the work.

Here's the problem: while I care about these specific goals for an
HTML::Parser that is "compatible enough", I'm not really interested in
personally pursuing a Pure Perl HTML::Parser that is 100% compatible with
HTML::Parser just for the sake of it. In short, I'm sure there will be
change requests beyond the uses I care about, and I'm not interested in
maintaining the module to extend it for other uses.

There's also the matter of what to name it, since a version of
HTML::Parser::Simple already exists. There's always HTML::Parser::PP or
HTML::Parser::PurePerl, but those names just invite the idea that the goal
is to be 100% compatible with HTML::Parser. I'll discuss the matter
further with Ron Savage to get his thoughts.

Feedback on the topic from other LWP users is welcome.

Mark

-- 
http://mark.stosberg.com/