Daniel Veillard wrote:
> I didn't forget about the issue, and got a bit of time to test yesterday
> and look at it. First, the patch makes sense: it fixes a serious problem,
> there is no leak, that's fine, but the result is still problematic
>
>   

> </body>
> </html>
> <p>end text
> </body></html>
>
>   

>   Basically the error is correctly displayed, but the close of the embedded
> body and html tags generates a serious mess. We are able to detect the
> embedding but the autoclose kind of misbehaves. Moreover, if using the push
> parser the autoclose ends the document immediately:
>   
Can I cheat? :) Given that nothing should appear between </body> and
</html>, and </html> is always the last tag, it's easiest to just ignore
these two end tags and let the autoclose deal with it...

vz202:~/libxml2/trunk # svn diff HTMLparser.c
Index: HTMLparser.c
===================================================================
--- HTMLparser.c        (revision 3739)
+++ HTMLparser.c        (working copy)
@@ -3646,7 +3646,9 @@
     SKIP(2);
 
     name = htmlParseHTMLName(ctxt);
-    if (name == NULL)
+    if (name == NULL
+       || xmlStrEqual(name, BAD_CAST "html")
+       || xmlStrEqual(name, BAD_CAST "body") )
         return (0);
 
     /*


With this patch, I get:

<html xml:lang="en" xmlns="foobar">
     ^
autoskip.html:4: HTML parser error : htmlParseStartTag: misplaced <body> tag
<body>
     ^
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" 
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p>some text

</p>
<p>embbeded text</p>
&gt;
&gt;
<p>end text
&gt;&gt;
</p>
</body></html>

Which looks good enough to me. It's probably at least enough to get it 
properly through my html email sanitizer.
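
For readers who want the condition from the diff in isolation, here is a
minimal stand-alone sketch. end_tag_is_ignored() is a hypothetical helper
name, not a libxml2 function, and strcmp() stands in for xmlStrEqual(). It
assumes the tag name has already been lowercased before the comparison.

```c
#include <stddef.h>
#include <string.h>

/*
 * Stand-alone illustration of the condition added by the patch.
 * end_tag_is_ignored() is a hypothetical helper, not part of libxml2;
 * strcmp() stands in for xmlStrEqual().
 * Assumption: the caller passes an already lowercased tag name.
 */
static int end_tag_is_ignored(const char *name)
{
    if (name == NULL)
        return 1;   /* no parsable name: the early return that existed before */

    /* New in the patch: skip </html> and </body>, leave them to autoclose. */
    return strcmp(name, "html") == 0 || strcmp(name, "body") == 0;
}
```

With this check, an end tag such as </p> is still processed normally; only
the two root-level end tags are dropped.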


>   I think the embedding error condition should be noted somewhere in the 
> parser state and disable at least partially the closing tag processing so
> that the 'end text' paragraph shows up as a sibling of the 'embbeded text'
> paragraph.
>   
It probably should generate an error, yes. My patch simply ignores the 
situation.
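
One possible shape for the state tracking suggested above, as a toy sketch
only: count the misplaced <html>/<body> start tags that were rejected, then
report and skip the same number of matching end tags. toy_state,
toy_misplaced_start() and toy_end_tag() are all hypothetical names, nothing
here is taken from HTMLparser.c, and tag order is not checked.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical parser state; this is not libxml2's htmlParserCtxt. */
typedef struct {
    int pending_misplaced;  /* rejected nested <html>/<body> start tags */
    int errors;             /* number of end tags reported and skipped */
} toy_state;

/* Call when a misplaced <html> or <body> start tag has been rejected. */
static void toy_misplaced_start(toy_state *st)
{
    st->pending_misplaced++;
}

/*
 * Returns 1 if the end tag should be processed normally, 0 if it belongs
 * to a rejected start tag and is skipped with an error.
 */
static int toy_end_tag(toy_state *st, const char *name)
{
    int is_root = strcmp(name, "html") == 0 || strcmp(name, "body") == 0;

    if (is_root && st->pending_misplaced > 0) {
        st->pending_misplaced--;
        st->errors++;
        fprintf(stderr, "unexpected end tag: %s\n", name);
        return 0;
    }
    return 1;
}
```

In the test document, the two embedded end tags would be skipped with an
error each, while the final </body></html> pair would still close the
document, so the 'end text' paragraph stays inside the outer body.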

-- 
Arnold Hendriks <[EMAIL PROTECTED]>
B-Lex Information Technologies <http://www.b-lex.com/>
Postbus 545, 7500 AM Enschede, The Netherlands

B-Lex: +31 (0)53 4836543
Mobile: +31 (0)6 51710159
MSN: [EMAIL PROTECTED]
ICQ: 86313731

_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
[email protected]
http://mail.gnome.org/mailman/listinfo/xml
