Re: [xml] Patch to improve HTMLparser's robustness

2008-04-23 Thread Daniel Veillard
On Tue, Apr 22, 2008 at 12:18:20PM -0400, Daniel Veillard wrote:
 On Tue, Apr 22, 2008 at 03:56:33PM +0200, Arnold Hendriks wrote:
  Daniel Veillard wrote:
I think the embedding error condition should be noted somewhere in the 
  parser state and disable at least partially the closing tag processing so
  that the 'end text' paragraph shows up as a sibling of the 'embbeded text'
  paragraph.

  It probably should generate an error, yes. My patch simply ignores the 
  situtation.
 
   but break the normal cases, which is not acceptable, nice try ;-)

  Proper patch, reusing ctxt-depth which is not used in the HTML parser
yet to count the number of times an opening tag has been ignored, and 
reused to drop the closing tags. Of course extra or missing ending tags
are still possible, but at this point one can only do heuristics. Works
properly for me, will commit soonish unless i hear a good reason against it
in the meantime:

wei:~/XML - ./xmllint --html autoskip.html
autoskip.html:3: HTML parser error : htmlParseStartTag: misplaced html tag
html xml:lang=en xmlns=foobar
 ^
autoskip.html:4: HTML parser error : htmlParseStartTag: misplaced body tag
body
 ^
!DOCTYPE html PUBLIC -//W3C//DTD HTML 4.0 Transitional//EN 
http://www.w3.org/TR/REC-html40/loose.dtd;
htmlbody
psome text

/p
pembbeded text/p


pend text
/p
/body/html
wei:~/XML -

Daniel

-- 
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard  | virtualization library  http://libvirt.org/
[EMAIL PROTECTED]  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine  http://rpmfind.net/
Index: HTMLparser.c
===
--- HTMLparser.c(revision 3739)
+++ HTMLparser.c(working copy)
@@ -3482,6 +3482,7 @@ htmlParseStartTag(htmlParserCtxtPtr ctxt
 htmlParseStartTag: misplaced html tag\n,
 name, NULL);
discardtag = 1;
+   ctxt-depth++;
 }
 if ((ctxt-nameNr != 1)  
(xmlStrEqual(name, BAD_CASThead))) {
@@ -3489,6 +3490,7 @@ htmlParseStartTag(htmlParserCtxtPtr ctxt
 htmlParseStartTag: misplaced head tag\n,
 name, NULL);
discardtag = 1;
+   ctxt-depth++;
 }
 if (xmlStrEqual(name, BAD_CASTbody)) {
int indx;
@@ -3498,6 +3500,7 @@ htmlParseStartTag(htmlParserCtxtPtr ctxt
 htmlParseStartTag: misplaced body tag\n,
 name, NULL);
discardtag = 1;
+   ctxt-depth++;
}
}
 }
@@ -3648,7 +3651,6 @@ htmlParseEndTag(htmlParserCtxtPtr ctxt)
 name = htmlParseHTMLName(ctxt);
 if (name == NULL)
 return (0);
-
 /*
  * We should definitely be at the ending S? '' part
  */
@@ -3669,6 +3671,18 @@ htmlParseEndTag(htmlParserCtxtPtr ctxt)
 NEXT;
 
 /*
+ * if we ignored misplaced tags in htmlParseStartTag don't pop them
+ * out now.
+ */
+if ((ctxt-depth  0) 
+(xmlStrEqual(name, BAD_CAST html) ||
+ xmlStrEqual(name, BAD_CAST body) ||
+xmlStrEqual(name, BAD_CAST head))) {
+   ctxt-depth--;
+   return (0);
+}
+
+/*
  * If the name read is not one of the element in the parsing stack
  * then return, it's just an error.
  */
___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml


Re: [xml] Patch to improve HTMLparser's robustness

2008-04-22 Thread Daniel Veillard
On Tue, Apr 22, 2008 at 03:56:33PM +0200, Arnold Hendriks wrote:
 Daniel Veillard wrote:
 I didn't forgot about the issue, and got a bit of time to test yesterday
 and look at it. First the patch makes senses it fixes a serious problem,
 there is no leak, that's fine, but the result is still problematic
[..]
   Basically the error is correctly displayed, but the close of the embedded
 body and html tags generate a serious mess. We are able to detect the 
 embedding
 but the autoclose kind of misbehaves. moreover if using the push parser the
 autoclose ends the document immediately:
   
 Can I cheat? :) Given the fact that nothing should appear between 
 /body and /html, and /html is always the last tag, its' easiest to 
 just ignore them and let the autoclose deal with it...
[...]
 
 Which looks good enough to me. It's probably at least enough to get it 
 properly through my html email sanitizer.

  unfortunately that means you don't get the SAX end element callbacks for
body and head when they arise. it's a bit too much cheating it would kill apps
relying on those. I far prefer marking the fact that some element ends need to
be ignored instead. I need to think about it, maybe the real solution is to
still push the html/body/head on the nameTab stack but not generate the
associated callbacks ...

 
   I think the embedding error condition should be noted somewhere in the 
 parser state and disable at least partially the closing tag processing so
 that the 'end text' paragraph shows up as a sibling of the 'embbeded text'
 paragraph.
   
 It probably should generate an error, yes. My patch simply ignores the 
 situtation.

  but break the normal cases, which is not acceptable, nice try ;-)

Daniel

-- 
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard  | virtualization library  http://libvirt.org/
[EMAIL PROTECTED]  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine  http://rpmfind.net/
___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml


Re: [xml] Patch to improve HTMLparser's robustness

2008-04-22 Thread Liam R E Quin
On Tue, 2008-04-22 at 15:56 +0200, Arnold Hendriks wrote:
  
 Can I cheat? :) Given the fact that nothing should appear between 
 /body and /html, and /html is always the last tag, its' easiest to 
 just ignore them and let the autoclose deal with it...

In practice I expect it's not uncommon to find text after /html --
e.g. a site like geocities that appends ads to a user's page.

Liam

-- 
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/
Ankh: irc.sorcery.net irc.gnome.org www.advogato.org

___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml