Re: First draft of work published which removes the dependency on HTML::Parser from HTTP::Message

2011-03-12 Thread Bjoern Hoehrmann
* Mark Stosberg wrote:
>On Fri, 5 Feb 2010 22:05:27 +0100
>Gisle Aas  wrote:
>
>> http://dev.w3.org/html5/spec/Overview.html#determining-the-character-encoding
>> specifies how to pre-scan an HTML document to sniff the charset.
>> Would it not be simpler to just implement the algorithm as specified
>> instead of using a generic parser?  The use of HTML::Parser to
>> implement this sniffing was just me trying a shortcut since
>> HTML::Parser seemed to implement a superset of these rules.
>
>Those rules look somewhat involved to me, especially knowing that we already
>have both XS and Pure Perl parsers at hand. 
>
>Two thoughts:
>
>1. What about using HTML::Encoding, after adapting it so it has only
>a conditional dependency on HTML::Parser, and only uses HTML::Parser if
>available?
>(It already tries several detection methods before getting to HTML::Parser):
>
>http://search.cpan.org/~bjoern/HTML-Encoding/
>
>A variation on this idea would be for *it* to use a pure Perl HTML parser
>instead of skipping the HTML parsing check completely.

I got here via https://rt.cpan.org/Ticket/Display.html?id=66313 (I am on
the list anyway, but apparently did not see this). I would have no
problem using a different module to parse the document for character
encoding meta elements. The code wrapping HTML::Parser is very simple,
so this should not be too hard, but I am not up to date on which
alternatives are available. Also, if someone made a pure Perl
implementation of the algorithm noted above in a manner that fits easily
with my module, I would be happy to use that instead as well (although I
doubt the proposal there captures reality very well, and I have not kept
track of the latest developments there either).

I'd also be quite happy to invite a co-maintainer on board. I mostly
wrote the module for the W3C Markup Validator (where it is still in use)
and for some simple scraping stuff, so my interest in it is somewhat
limited these days, but I am happy to do some easy fixes anyway.

Any thoughts on this?
-- 
Björn Höhrmann · mailto:bjo...@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 


Re: First draft of work published which removes the dependency on HTML::Parser from HTTP::Message

2010-02-05 Thread Mark Stosberg
On Fri, 5 Feb 2010 22:05:27 +0100
Gisle Aas  wrote:

> http://dev.w3.org/html5/spec/Overview.html#determining-the-character-encoding
> specifies how to pre-scan an HTML document to sniff the charset.
> Would it not be simpler to just implement the algorithm as specified
> instead of using a generic parser?  The use of HTML::Parser to
> implement this sniffing was just me trying a shortcut since
> HTML::Parser seemed to implement a superset of these rules.

Those rules look somewhat involved to me, especially knowing that we already
have both XS and Pure Perl parsers at hand. 

Two thoughts:

1. What about using HTML::Encoding, after adapting it so it has only
a conditional dependency on HTML::Parser, and only uses HTML::Parser if
available?
(It already tries several detection methods before getting to HTML::Parser):

http://search.cpan.org/~bjoern/HTML-Encoding/

A variation on this idea would be for *it* to use a pure Perl HTML parser
instead of skipping the HTML parsing check completely.

2. I note this from the spec page you reference:

"This algorithm is a willful violation of the HTTP specification, which 
requires that the encoding be assumed to be ISO-8859-1 in the absence of a 
character encoding declaration to the contrary, and of RFC 2046, which requires 
that the encoding be assumed to be US-ASCII in the absence of a character 
encoding declaration to the contrary. This specification's third approach is 
motivated by a desire to be maximally compatible with legacy content. [HTTP] 
[RFC2046]"

According to this, we can skip all this encoding-detection work and still be
HTTP spec compliant (although it might be more user-friendly to keep trying
to guess).
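
As a rough illustration of the "no sniffing" option, the sketch below takes the charset from a Content-Type header value when one is declared and otherwise falls back to ISO-8859-1, the HTTP default named in the spec text quoted above. It is written in Python purely for illustration (not Perl, and not HTTP::Message code); the function name `declared_or_default_charset` is hypothetical.

```python
from email.message import Message


def declared_or_default_charset(content_type, default="iso-8859-1"):
    """Return the charset declared in a Content-Type value, else `default`.

    Hypothetical helper: no document sniffing is attempted at all.
    """
    msg = Message()
    # Use a neutral placeholder type when no header value was supplied.
    msg["Content-Type"] = content_type or "application/octet-stream"
    # get_content_charset() returns the lowercased charset parameter,
    # or None when the header carries no charset parameter.
    return msg.get_content_charset() or default
```

A caller would decode the body with the returned label and accept that undeclared non-Latin-1 content will be misread, which is the user-friendliness trade-off mentioned above.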

Mark


-- 
http://mark.stosberg.com/


Re: First draft of work published which removes the dependency on HTML::Parser from HTTP::Message

2010-02-05 Thread Gisle Aas
http://dev.w3.org/html5/spec/Overview.html#determining-the-character-encoding
specifies how to pre-scan an HTML document to sniff the charset.
Would it not be simpler to just implement the algorithm as specified
instead of using a generic parser?  The use of HTML::Parser to
implement this sniffing was just me trying a shortcut since
HTML::Parser seemed to implement a superset of these rules.
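
To give a feel for what a dedicated pre-scan could look like without a generic parser, here is a deliberately simplified sketch. It is not the algorithm from the spec: it only strips complete comments from the first 1024 bytes and looks for a `charset=` value inside `<meta>` tags, and it ignores the spec's other edge cases. It is in Python for illustration only, and the names `sniff_meta_charset` and `limit` are hypothetical.

```python
import re

# Simplified approximation, NOT the full HTML5 pre-scan algorithm.
_COMMENT = re.compile(rb"<!--.*?-->", re.DOTALL)
_META = re.compile(rb"<meta\b[^>]*>", re.IGNORECASE)
_CHARSET = re.compile(rb"""charset\s*=\s*["']?\s*([A-Za-z0-9._:-]+)""",
                      re.IGNORECASE)


def sniff_meta_charset(data: bytes, limit: int = 1024):
    """Return a charset label found in a <meta> tag near the start, or None."""
    # Only look at the leading bytes, and drop complete comments so that a
    # commented-out <meta> tag is not picked up.
    head = _COMMENT.sub(b"", data[:limit])
    for tag in _META.finditer(head):
        match = _CHARSET.search(tag.group(0))
        if match:
            return match.group(1).decode("ascii").lower()
    return None
```

This covers both the `<meta charset=...>` form and the `http-equiv` form with `content="text/html; charset=..."`, since each carries a `charset=` token inside the tag. Unterminated comments and attribute-level parsing are left out of this sketch.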

--Gisle