Re: [xml] Patch to improve HTMLparser's robustness

Daniel Veillard Wed, 23 Apr 2008 05:00:07 -0700

On Tue, Apr 22, 2008 at 12:18:20PM -0400, Daniel Veillard wrote:
> On Tue, Apr 22, 2008 at 03:56:33PM +0200, Arnold Hendriks wrote:
> > Daniel Veillard wrote:
> > >  I think the embedding error condition should be noted somewhere in the 
> > >parser state and disable at least partially the closing tag processing so
> > >that the 'end text' paragraph shows up as a sibling of the 'embbeded text'
> > >paragraph.
> > >  
> > It probably should generate an error, yes. My patch simply ignores the 
> > situation.
> 
>   but that breaks the normal cases, which is not acceptable, nice try ;-)


  Proper patch: it reuses ctxt->depth, which the HTML parser does not use
yet, to count how many opening tags have been ignored, and then uses that
count to drop the corresponding closing tags. Of course extra or missing
ending tags are still possible, but at this point one can only do heuristics.
Works properly for me, will commit soonish unless I hear a good reason
against it in the meantime:

wei:~/XML -> ./xmllint --html autoskip.html
autoskip.html:3: HTML parser error : htmlParseStartTag: misplaced <html> tag
<html xml:lang="en" xmlns="foobar">
     ^
autoskip.html:4: HTML parser error : htmlParseStartTag: misplaced <body> tag
<body>
     ^
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" 
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p>some text

</p>
<p>embbeded text</p>


<p>end text
</p>
</body></html>
wei:~/XML ->
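
  To make the counting heuristic easier to follow in isolation, here is a
minimal standalone sketch of the idea. It is not the libxml2 code: the
struct sketch_ctxt type and the sketch_* function names are hypothetical
stand-ins, and only the depth counter mirrors what the patch does with
ctxt->depth.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the parser context; only the counter matters. */
struct sketch_ctxt {
    int depth;  /* ignored opening tags whose closing tag was not seen yet */
};

/* Returns 1 for the three tag names the patch treats specially. */
int sketch_is_structural(const char *name)
{
    return strcmp(name, "html") == 0 ||
           strcmp(name, "head") == 0 ||
           strcmp(name, "body") == 0;
}

/* Called where a misplaced opening tag is discarded. */
void sketch_ignore_start(struct sketch_ctxt *ctxt)
{
    assert(ctxt != NULL);
    ctxt->depth++;
}

/*
 * Called for every closing tag. Returns 1 if the tag should be dropped
 * because it presumably closes an opening tag that was ignored earlier,
 * and 0 if it should go through the normal end-tag processing.
 */
int sketch_drop_end(struct sketch_ctxt *ctxt, const char *name)
{
    assert(ctxt != NULL && name != NULL);
    if (ctxt->depth > 0 && sketch_is_structural(name)) {
        ctxt->depth--;
        return 1;
    }
    return 0;
}
```

  In the autoskip.html run above, the misplaced <html> and <body> would
bump the counter to 2, the first pair of matching closing tags encountered
would be dropped and bring it back to 0, and the last </body></html> would
then be processed normally. The counter does not remember which tag name
was ignored, so an ignored <html> can absorb a later </body>; that is part
of why this remains a heuristic.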

Daniel

-- 
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard      | virtualization library  http://libvirt.org/
[EMAIL PROTECTED]  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine  http://rpmfind.net/

Index: HTMLparser.c
===================================================================
--- HTMLparser.c        (revision 3739)
+++ HTMLparser.c        (working copy)
@@ -3482,6 +3482,7 @@ htmlParseStartTag(htmlParserCtxtPtr ctxt
                     "htmlParseStartTag: misplaced <html> tag\n",
                     name, NULL);
        discardtag = 1;
+       ctxt->depth++;
     }
     if ((ctxt->nameNr != 1) && 
        (xmlStrEqual(name, BAD_CAST"head"))) {
@@ -3489,6 +3490,7 @@ htmlParseStartTag(htmlParserCtxtPtr ctxt
                     "htmlParseStartTag: misplaced <head> tag\n",
                     name, NULL);
        discardtag = 1;
+       ctxt->depth++;
     }
     if (xmlStrEqual(name, BAD_CAST"body")) {
        int indx;
@@ -3498,6 +3500,7 @@ htmlParseStartTag(htmlParserCtxtPtr ctxt
                             "htmlParseStartTag: misplaced <body> tag\n",
                             name, NULL);
                discardtag = 1;
+               ctxt->depth++;
            }
        }
     }
@@ -3648,7 +3651,6 @@ htmlParseEndTag(htmlParserCtxtPtr ctxt)
     name = htmlParseHTMLName(ctxt);
     if (name == NULL)
         return (0);
-
     /*
      * We should definitely be at the ending "S? '>'" part
      */
@@ -3669,6 +3671,18 @@ htmlParseEndTag(htmlParserCtxtPtr ctxt)
         NEXT;
 
     /*
+     * if we ignored misplaced tags in htmlParseStartTag don't pop them
+     * out now.
+     */
+    if ((ctxt->depth > 0) &&
+        (xmlStrEqual(name, BAD_CAST "html") ||
+         xmlStrEqual(name, BAD_CAST "body") ||
+         xmlStrEqual(name, BAD_CAST "head"))) {
+       ctxt->depth--;
+       return (0);
+    }
+
+    /*
      * If the name read is not one of the element in the parsing stack
      * then return, it's just an error.
      */

_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
[email protected]
http://mail.gnome.org/mailman/listinfo/xml
