Re: HTML::Parser bug
On Sun, Mar 20, 2005 at 01:51:25PM -0800, Bill Moseley wrote:
> On Sun, Mar 20, 2005 at 06:02:26PM +0300, [EMAIL PROTECTED] wrote:
> > Hello libwww,
> >
> > using it to parse HTML forms etc., I noticed that it recognizes a
> > strange comment like <!--> as the start of a comment, not as a
> > complete empty comment, as IE does.
>
> Doesn't seem like that's a valid comment.
>
> http://www.w3.org/TR/WD-html40-970917/intro/sgmltut.html#h-3.1.4

Well, the HTML::Parser perldoc says:

    HTML::Parser is not a generic SGML parser. We have tried to make
    it able to deal with the HTML that is actually out there, and it
    normally parses as closely as possible to the way the popular web
    browsers do it instead of strictly following one of the many HTML
    specifications from W3C. Where there is disagreement, there is
    often an option that you can enable to get the official behaviour.

But do all versions of IE parse this the same way?
What do other popular user agents do?

--
Reinier
Re: HTML::Parser bug
On Mon, Mar 21, 2005 at 06:51:42PM +0100, Reinier Post wrote:
> On Sun, Mar 20, 2005 at 01:51:25PM -0800, Bill Moseley wrote:
> > On Sun, Mar 20, 2005 at 06:02:26PM +0300, [EMAIL PROTECTED] wrote:
> > > Hello libwww,
> > >
> > > using it to parse HTML forms etc., I noticed that it recognizes a
> > > strange comment like <!--> as the start of a comment, not as a
> > > complete empty comment, as IE does.
> >
> > Doesn't seem like that's a valid comment.
> >
> > http://www.w3.org/TR/WD-html40-970917/intro/sgmltut.html#h-3.1.4
>
> Well, the HTML::Parser perldoc says:
>
>     HTML::Parser is not a generic SGML parser. We have tried to make
>     it able to deal with the HTML that is actually out there, and it
>     normally parses as closely as possible to the way the popular web
>     browsers do it instead of strictly following one of the many HTML
>     specifications from W3C. Where there is disagreement, there is
>     often an option that you can enable to get the official behaviour.

Hard to imagine handling every possibility as an option.

I would have thought an empty comment would be at a minimum:

    <!-- -->

or maybe

    <!>

although I'm still trying to grasp the concept of an empty comment.

--
Bill Moseley
[EMAIL PROTECTED]
RE: HTML::Parser bug
Although not identical to your short comment, Microsoft intentionally
uses similar comments like

    <!--[if gte mso 9]>
    (something read by MSIE 5+ but correctly considered to be a comment
    by other browsers)
    <![endif]-->

See http://office.microsoft.com/en-us/assistance/HA010549981033.aspx
for more info.

Forrest Cahoon
not speaking for merrill corporation

-----Original Message-----
From: Reinier Post [mailto:[EMAIL PROTECTED]]
Sent: Monday, March 21, 2005 11:52 AM
To: libwww@perl.org
Subject: Re: HTML::Parser bug

On Sun, Mar 20, 2005 at 01:51:25PM -0800, Bill Moseley wrote:
> On Sun, Mar 20, 2005 at 06:02:26PM +0300, [EMAIL PROTECTED] wrote:
> > Hello libwww,
> >
> > using it to parse HTML forms etc., I noticed that it recognizes a
> > strange comment like <!--> as the start of a comment, not as a
> > complete empty comment, as IE does.
>
> Doesn't seem like that's a valid comment.
>
> http://www.w3.org/TR/WD-html40-970917/intro/sgmltut.html#h-3.1.4

Well, the HTML::Parser perldoc says:

    HTML::Parser is not a generic SGML parser. We have tried to make
    it able to deal with the HTML that is actually out there, and it
    normally parses as closely as possible to the way the popular web
    browsers do it instead of strictly following one of the many HTML
    specifications from W3C. Where there is disagreement, there is
    often an option that you can enable to get the official behaviour.

But do all versions of IE parse this the same way?
What do other popular user agents do?

--
Reinier
Re: HTML::Parser bug
> like <!--> as the start of a comment, not as a complete empty
> comment, as IE does.

Lots of browsers allow crap that modules don't.

--
Andy Lester => [EMAIL PROTECTED] => www.petdance.com => AIM:petdance
Re: HTML::Parser bug?
>>>>> "Pedro" == Pedro Proença <[EMAIL PROTECTED]> writes:

Pedro> Hi all,
Pedro> When I pass the following string to HTML::Parser::parse()

Pedro> String containing entities to be replaced, for instance uarr2;a;

Pedro> this is what I get in my text handler:

Pedro> String containing entities to be replaced, for instance

Pedro> I am using Perl 5.6.0 on Mandrake Linux 8.0 (kernel 2.4.3-20mdk) and
Pedro> the latest HTML::Parser version (3.25).

Pedro> Is this a known problem? Is there any workaround for it?

    $ perl
    use HTML::Parser;
    my @a;
    my $p = HTML::Parser->new(
      handlers => { text => [\@a, "text"] });
    $p->parse("String containing entities to be replaced, for instance uarr2;a");
    $p->eof;
    print map "[$_->[0]]", @a;
    ^D
    [String containing entities to be replaced, for instance][ uarr2;a]
    $

Looks fine to me. Try that example. Notice that it pulls the text in
as two pieces. That's expected unless you also set
$p->unbroken_text(1) before parsing.

print "Just another Perl hacker,";

--
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<[EMAIL PROTECTED]> <URL:http://www.stonehenge.com/merlyn/>
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!
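The unbroken_text() point can be tried in isolation. What follows is only a rough sketch, assuming an HTML::Parser 3.x install; the helper name text_chunks and the sample input are made up for illustration and are not from the thread.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: parse $html and return the text chunks the parser
# reported, in order. With an array ref as the handler, each event pushes
# an array ref holding the argspec values (here only "text").
sub text_chunks {
    my ($html, $unbroken) = @_;
    my @events;
    my $p = HTML::Parser->new(
        api_version => 3,
        text_h      => [\@events, "text"],
    );
    $p->unbroken_text(1) if $unbroken;   # ask for text in one piece
    $p->parse($html);
    $p->eof;                             # flush any buffered text
    return map { $_->[0] } @events;
}

my $html   = "fish &amp; chips";         # sample input, an assumption
my @pieces = text_chunks($html, 0);      # may arrive in several pieces
my @joined = text_chunks($html, 1);      # buffered until eof

print "default:  ", join("", map { "[$_]" } @pieces), "\n";
print "unbroken: ", join("", map { "[$_]" } @joined), "\n";
```

How the default run is split depends on the parser version, so the sketch prints the pieces rather than assuming a particular split. Note that the "text" argspec passes the source text unchanged; "dtext" is the argspec that decodes entities.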
Re: HTML::Parser bug?
Perhaps you could give us an example of the text you are trying to
parse that includes a comment that gets passed to the 'comment' event
handler, but doesn't get passed to the 'default' event handler when the
'comment' handler isn't defined.

A short example script that shows the problem would also be handy. I'd
be especially interested in seeing all HTML::Parser method calls.

--
Mac :})
** I may forward private database questions to the DBI mail lists. **

----- Original Message -----
From: "Hugo Haas" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Friday, July 07, 2000 3:28 PM
Subject: HTML::Parser bug?

> The man page says about handlers:
>
>     Events
>         Handlers for the following events can be registered:
>     [..]
>         default
>             This event is triggered for events that do not have a
>             specific handler. You can set up a handler for this event
>             to catch stuff you did not want to catch explicitly.
>
> so I didn't assign any handler to the comment event, thinking default
> would be called:
>
>     $p->handler(default => 'text', 'self, text');
>
> This doesn't do what I was expecting, whereas:
>
>     $p->handler(comment => 'text', 'self, text');
>     $p->handler(default => 'text', 'self, text');
>
> this does (with version 3.08 and 3.10).
>
> Is it me not reading the documentation right (in that case, I think
> that it is unclear) or is it a bug?
Re: HTML::Parser bug?
On Fri, Jul 07, 2000, Michael A. Chase wrote:
> Perhaps you could give us an example of the text you are trying to
> parse that includes a comment that gets passed to the 'comment' event
> handler, but doesn't get passed to the 'default' event handler when
> the 'comment' handler isn't defined.

Sorry, I realized that I had sent my message without enough details,
but you replied before I could submit an example.

> A short example script that shows the problem would also be handy.
> I'd be especially interested in seeing all HTML::Parser method calls.

I was running a test on an excerpt of an HTML file (this is not valid
HTML by itself, but I did that to isolate the problem):

    <!-- test --> <a href="fdasfafdas"></a>

Here's a sample script:

    use strict;
    require HTML::Parser;

    my $p = HTML::Parser->new;
    $p->handler(@EVENT@ => \&text, 'text');
    $p->parse_file('/tmp/foo.html');

    sub text() {
        my ($t) = @_;
        print $t . "\n";
    }

With @EVENT@ being 'comment':

    [hugo:pts/2] larve:~ perl -w test.pl
    <!-- test -->
    [hugo:pts/2] larve:~

With @EVENT@ being 'default':

    [hugo:pts/2] larve:~ perl -w test.pl
    [hugo:pts/2] larve:~

--
Hugo Haas, Webmaster, Systems Team - W3C/MIT
mailto:[EMAIL PROTECTED] - tel:+1-617-452-2092
Re: HTML::Parser bug?
On Fri, Jul 07, 2000, Michael A. Chase wrote:
> I quote:
>
>     If new() is called without any arguments, it will create a parser
>     that uses callback methods compatible with version 2 of
>     HTML::Parser. See the section on "version 2 compatibility" below
>     for details.
>
> An HTML::Parser v2 compatible parser has handlers defined for the
> usual events, so the default handler does not get called.

I missed that! Sorry for the trouble.

It works much better like that indeed.

Thanks,

Hugo
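For anyone who hits the same thing, here is a rough sketch of the working setup, assuming HTML::Parser 3.x. The @seen array and the inline HTML string are illustrative stand-ins for the test file in the thread, not part of the original script.

```perl
use strict;
use warnings;
use HTML::Parser;

my @seen;   # source text of every event that reaches the default handler

# Calling new() with an argument (here api_version => 3) avoids the
# version 2 compatibility handlers, so events without a specific handler
# fall through to 'default'.
my $p = HTML::Parser->new(api_version => 3);

$p->handler(
    default => sub {
        my ($text) = @_;
        # Skip events that carry no source text.
        push @seen, $text if defined $text && length $text;
    },
    'text',
);

# Inline stand-in for the test file used in the thread.
$p->parse('<!-- test --> <a href="fdasfafdas"></a>');
$p->eof;

print "$_\n" for @seen;
```

Because the default handler receives every event that has no handler of its own, the comment shows up alongside the tags and the text between them.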