Re: First draft of work published which removes the dependency on HTML::Parser from HTTP::Message

2011-03-12 Thread Bjoern Hoehrmann
* Mark Stosberg wrote:
>On Fri, 5 Feb 2010 22:05:27 +0100
>Gisle Aas  wrote:
>
>> http://dev.w3.org/html5/spec/Overview.html#determining-the-character-encoding
>> specifies how to pre-scan an HTML document to sniff the charset.
>> Would it not be simpler to just implement the algorithm as specified
>> instead of using a generic parser.  The use of HTML::Parser to
>> implement this sniffing was just me trying a shortcut since
>> HTML::Parser seemed to implement a superset of these rules.
>
>Those rules look somewhat involved to me, especially knowing that we already
>have both XS and Pure Perl parsers at hand. 
>
>Two thoughts:
>
>1. What about using HTML::Encoding, after adapting it so that it has only a
>conditional dependency on HTML::Parser, and uses HTML::Parser only when it is
>available? (It already tries several detection methods before getting to
>HTML::Parser):
>
>http://search.cpan.org/~bjoern/HTML-Encoding/
>
>A variation on this idea would be for *it* to use a pure Perl HTML parser
>instead of skipping the HTML parsing check completely.

Here via https://rt.cpan.org/Ticket/Display.html?id=66313 (I am on the list
anyway, but apparently did not see this). I would have no problem using a
different module to parse the document for character encoding meta elements;
the code wrapping HTML::Parser is very simple, so this should not be too hard,
but I am not up to date on what alternatives are available. Also, if someone
made a pure Perl implementation of the algorithm noted above in a manner that
would fit easily with my module, I'd be happy to use that instead as well
(although I doubt the proposal there captures reality very well, but I have
not kept track of the latest developments there either).

I'd also be quite happy to invite a co-maintainer on board. I mostly wrote
the module for the W3C Markup Validator (where it is still in use) and for
some simple scraping, so my interest in it is somewhat limited these days,
but I am still happy to make easy fixes.

Any thoughts on this?
-- 
Björn Höhrmann · mailto:bjo...@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 


Re: First draft of work published which removes the dependency on HTML::Parser from HTTP::Message

2010-02-05 Thread Mark Stosberg
On Fri, 5 Feb 2010 22:05:27 +0100
Gisle Aas  wrote:

> http://dev.w3.org/html5/spec/Overview.html#determining-the-character-encoding
> specifies how to pre-scan an HTML document to sniff the charset.
> Would it not be simpler to just implement the algorithm as specified
> instead of using a generic parser.  The use of HTML::Parser to
> implement this sniffing was just me trying a shortcut since
> HTML::Parser seemed to implement a superset of these rules.

Those rules look somewhat involved to me, especially knowing that we already
have both XS and Pure Perl parsers at hand. 

Two thoughts:

1. What about using HTML::Encoding, after adapting it so that it has only a
conditional dependency on HTML::Parser, and uses HTML::Parser only when it is
available? (It already tries several detection methods before getting to
HTML::Parser):

http://search.cpan.org/~bjoern/HTML-Encoding/

A variation on this idea would be for *it* to use a pure Perl HTML parser
instead of skipping the HTML parsing check completely.
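A conditional load along those lines is straightforward in Perl. The sketch
below is illustrative only: the sniffing function and the fallback regex are
hypothetical stand-ins, not HTML::Encoding's actual logic.

```perl
use strict;
use warnings;

# Load HTML::Parser only if it is installed; fall back to a crude
# regex heuristic otherwise. The fallback is a hypothetical stand-in.
my $HAVE_HTML_PARSER = eval { require HTML::Parser; 1 } ? 1 : 0;

sub sniff_meta_charset {
    my ($html) = @_;
    if ($HAVE_HTML_PARSER) {
        my $charset;
        my $p = HTML::Parser->new(
            api_version => 3,
            start_h     => [sub {
                my ($tag, $attr) = @_;
                return unless $tag eq 'meta';
                if (my $c = $attr->{charset}) {           # <meta charset=...>
                    $charset = $c;
                }
                elsif (($attr->{'http-equiv'} || '') =~ /^content-type$/i
                       && ($attr->{content} || '') =~ /charset=["']?([\w-]+)/i) {
                    $charset = $1;                        # http-equiv form
                }
            }, 'tagname,attr'],
        );
        $p->parse($html);
        $p->eof;
        return $charset;
    }
    # Pure-Perl fallback: a deliberately crude regex scan.
    return $1 if $html =~ /<meta[^>]+charset=["']?([\w-]+)/i;
    return undef;
}
```

Either branch returns the declared charset (or undef), so callers need not
care whether the XS parser was available.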

2. I note this from the spec page you reference:

"This algorithm is a willful violation of the HTTP specification, which 
requires that the encoding be assumed to be ISO-8859-1 in the absence of a 
character encoding declaration to the contrary, and of RFC 2046, which requires 
that the encoding be assumed to be US-ASCII in the absence of a character 
encoding declaration to the contrary. This specification's third approach is 
motivated by a desire to be maximally compatible with legacy content. [HTTP] 
[RFC2046]"

According to this, we can skip all this encoding-detection work and still be
HTTP spec compliant (although it might be more user-friendly to keep trying
to guess).

Mark


-- 
http://mark.stosberg.com/





Re: First draft of work published which removes the dependency on HTML::Parser from HTTP::Message

2010-02-05 Thread Gisle Aas
http://dev.w3.org/html5/spec/Overview.html#determining-the-character-encoding
specifies how to pre-scan an HTML document to sniff the charset.
Would it not be simpler to just implement the algorithm as specified
instead of using a generic parser.  The use of HTML::Parser to
implement this sniffing was just me trying a shortcut since
HTML::Parser seemed to implement a superset of these rules.
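The core of that pre-scan can be sketched in pure Perl. This is a rough
approximation only (a BOM check plus a bounded regex scan of the first 1024
bytes), not a faithful implementation of the spec's byte-by-byte state
machine, and the function name is invented:

```perl
use strict;
use warnings;

# Rough pure-Perl approximation of the HTML5 encoding pre-scan:
# byte-order marks win, then at most the first 1024 bytes are scanned
# for a <meta> charset declaration. Several spec edge cases are skipped.
sub prescan_charset {
    my ($bytes) = @_;

    # Step 1: a byte-order mark overrides everything else.
    return 'UTF-8'    if $bytes =~ /^\xEF\xBB\xBF/;
    return 'UTF-16BE' if $bytes =~ /^\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /^\xFF\xFE/;

    # Step 2: only the first 1024 bytes participate in the pre-scan.
    my $head = substr($bytes, 0, 1024);

    # <meta charset="..."> (HTML5 style); this also happens to catch
    # charset=... inside an http-equiv content attribute.
    if ($head =~ /<meta[^>]+\bcharset\s*=\s*["']?\s*([\w.:-]+)/i) {
        return $1;
    }
    return undef;    # no declaration found in the pre-scan window
}
```

A complete implementation would also handle unquoted and oddly-delimited
attribute values, which the spec's tokenizer spells out byte by byte.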

--Gisle


First draft of work published which removes the dependency on HTML::Parser from HTTP::Message

2010-02-04 Thread Mark Stosberg

I now have working code published which allows HTTP::Message to work without
the dependency on HTML::Parser. This is useful because it's a step towards
splitting out some of the HTTP modules into their own distribution without
this dependency; HTML::Parser in turn requires a C compiler. So this project
could help allow parts of LWP to be used where a C compiler is not available,
or when it would be more convenient to distribute a single code base that can
be used directly on multiple architectures.
( But this is not the only use of HTML::Parser by the distribution:
LWP::UserAgent uses HTML::HeadParser, which in turn uses HTML::Parser. )

My code is here:

http://github.com/markstos/libwww-perl/tree/remove-html-parser-dependency

The solution passes the numerous existing tests for charset detection, as well
as a new one I added.

However, I'm not yet recommending that the work be merged because the approach
is not clean.

Essentially I have embedded a fairly full-featured pure Perl HTML parser
into HTTP::Message. :) The code was taken from my fork of the
"HTML::Parser::Simple" project and specialized somewhat for this case:

 http://github.com/markstos/html--parser--simple

I think a cleaner approach would be to publish this Pure Perl HTML parser, and
then have an option to use it if HTML::Parser is not available.

A little history about HTML::Parser::Simple:

Ron Savage created the project based on the htmlparser.js JavaScript parser by
John Resig.  That branch was not trying to be particularly compatible with
anything; it defines a new API.  In particular, it bundles a parse tree
*consumer*, Tree::Simple, as well as a parse tree producer.

I forked the project and made some incompatible changes to pursue a different
goal: create a pure Perl HTML parser that is compatible with the HTML::Parser
API. More specifically, I wanted to emulate the HTML::Parser 2.x API closely
enough that my parser could be used in place of it with HTML::FillInForm. My
work met that goal: it passes the HTML::FillInForm test suite, apart from
some minor failures that I don't think matter.
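For context, the HTML::Parser 2.x calling convention being emulated here is
the subclass-and-override style: the parser invokes `start`, `end`, and
`text` methods that the consumer overrides. A toy sketch of that dispatch
shape follows; `Toy::Parser` and `Tag::Collector` are invented names, and the
tokenizer is deliberately minimal (no comments, entities, or broken-markup
handling):

```perl
use strict;
use warnings;

# Toy illustration of the HTML::Parser 2.x subclass API: the parser
# calls overridable start/end/text methods as it tokenizes.
package Toy::Parser;

sub new   { return bless {}, shift }
sub start { }    # ($self, $tag, \%attr, \@attrseq, $origtext)
sub end   { }    # ($self, $tag, $origtext)
sub text  { }    # ($self, $text)

sub parse {
    my ($self, $html) = @_;
    while (length $html) {
        if ($html =~ s{^<\s*/\s*(\w+)\s*>}{}) {                # end tag
            $self->end(lc $1, $&);
        }
        elsif ($html =~ s{^<\s*(\w+)((?:\s+[\w-]+(?:\s*=\s*"[^"]*")?)*)\s*/?>}{}) {
            my ($tag, $attrstr, $orig) = (lc $1, $2, $&);      # start tag
            my (%attr, @attrseq);
            while ($attrstr =~ /([\w-]+)(?:\s*=\s*"([^"]*)")?/g) {
                $attr{lc $1} = defined $2 ? $2 : lc $1;
                push @attrseq, lc $1;
            }
            $self->start($tag, \%attr, \@attrseq, $orig);
        }
        elsif ($html =~ s{^([^<]+)}{}) {                       # text node
            $self->text($1);
        }
        else {
            $html =~ s{^.}{}s;                                 # skip stray '<'
        }
    }
    return $self;
}

# A consumer overrides the callbacks, HTML::FillInForm-style:
package Tag::Collector;
our @ISA = ('Toy::Parser');
sub start {
    my ($self, $tag, $attr) = @_;
    push @{ $self->{tags} }, $tag;
}

package main;
my $p = Tag::Collector->new;
$p->parse(q{<html><head><meta charset="utf-8"></head></html>});
```

After `parse`, `$p->{tags}` holds the start tags in document order. A real
replacement would of course need the full tokenizer (comments, entities,
malformed markup) that HTML::Parser provides.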

This new case of parsing meta tags is another specialized use of the parser
that gives me another reason to publish the work.

Here's the problem: while I care about these specific goals for an
HTML::Parser that is "compatible enough", I'm not really interested in
personally pursuing a pure Perl HTML::Parser that is 100% compatible with
HTML::Parser just for its own sake.  In short, I'm sure there will be change
requests beyond the uses I care about, and I'm not interested in maintaining
and extending the module for other uses.

There's also the matter of what to name it, since a version of
HTML::Parser::Simple already exists. There's always HTML::Parser::PP or
HTML::Parser::PurePerl, but those names just invite the idea that the goal is
to be 100% compatible with HTML::Parser.

I'll discuss the matter further with Ron Savage to get his thoughts.

Feedback on the topic from other LWP users is welcome.

Mark

-- 
http://mark.stosberg.com/