On Wed, Dec 12, 2007 at 07:53:10PM +0100, Arnold Hendriks wrote:
> I've been running into problems parsing incoming email messages through 
> libxml2's HTML parser, which when seeing tags such as <html 
> xml:lang="en" xmlns:....> in an unexpected place, will just eat the 
> '<html' part and turn the attributes of that html tag into normal text, 
> causing odd code to appear at the top of email messages. This mostly 
> affects Outlook/Exchange generated messages.
> 
> The attached patch tries to fix it. It works for me, but I wonder 
> whether I haven't introduced memory allocation issues with it, and hope 
> the patch (or a similar solution) can be integrated into a future libxml 
> release.

  Hi Arnold,

I didn't forget about the issue; I got a bit of time yesterday to test the
patch and look at it. First, the patch makes sense and fixes a serious problem,
and there is no leak, so that's fine. But the result is still problematic:

laptop:~/XML -> cat autoskip.html
<html><body>
<p>some text
<html xml:lang="en" xmlns="foobar">
<body>
<p>embbeded text</p>
</body>
</html>
<p>end text
</body></html>
laptop:~/XML -> xmllint --html autoskip.html
autoskip.html:3: HTML parser error : htmlParseStartTag: misplaced <html> tag
<html xml:lang="en" xmlns="foobar">
     ^
autoskip.html:4: HTML parser error : htmlParseStartTag: misplaced <body> tag
<body>
     ^
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" 
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<p>some text

</p>
<p>embbeded text</p>
</body>
<html><body><p>end text
</p></body></html>
</html>
laptop:~/XML ->

  Basically the error is correctly reported, but the closing tags of the
embedded body and html elements generate a serious mess. We are able to detect
the embedding, but the autoclose misbehaves. Moreover, when using the push
parser, the autoclose ends the document immediately:

laptop:~/XML -> xmllint --html --push autoskip.html
autoskip.html:3: HTML parser error : htmlParseStartTag: misplaced <html> tag
<html xml:lang="en" xmlns="foobar">
     ^
autoskip.html:4: HTML parser error : htmlParseStartTag: misplaced <body> tag
<body>
     ^
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" 
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p>some text

</p>
<p>embbeded text</p>
</body></html>
laptop:~/XML -> 

  I think the embedding error condition should be noted somewhere in the
parser state and should disable, at least partially, the closing tag
processing, so that the 'end text' paragraph shows up as a sibling of the
'embbeded text' paragraph.
  Either that, or we show the full subdocument structure. I don't feel the
current processing is good, even if it's clearly better with your patch than
without.
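
  To make the first option concrete, here is a minimal sketch of the idea, not
libxml2 code and not the committed patch. It uses Python's standard
html.parser purely as an illustration; the class name EmbeddedWrapperFilter,
the function strip_embedded_wrappers and the two counters are hypothetical
names chosen for this example. The filter records each misplaced <html> or
<body> start tag in its state and then swallows the matching end tag instead
of letting it close the outer document.

```python
from html import escape
from html.parser import HTMLParser


class EmbeddedWrapperFilter(HTMLParser):
    """Illustrative filter: drop nested <html>/<body> tags and their end tags."""

    WRAPPERS = ("html", "body")

    def __init__(self):
        super().__init__(convert_charrefs=True)
        # Wrapper elements that are legitimately open right now.
        self.opened = {tag: 0 for tag in self.WRAPPERS}
        # Misplaced wrapper start tags whose end tag has not been seen yet.
        self.misplaced = {tag: 0 for tag in self.WRAPPERS}
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.WRAPPERS:
            if self.opened[tag] > 0:
                # Embedded document: remember the error, emit nothing.
                self.misplaced[tag] += 1
                return
            self.opened[tag] += 1
        attr_text = "".join(f' {k}="{escape(v or "")}"' for k, v in attrs)
        self.out.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        if tag in self.WRAPPERS:
            if self.misplaced[tag] > 0:
                # This end tag belongs to a skipped start tag: ignore it
                # so the outer element stays open.
                self.misplaced[tag] -= 1
                return
            if self.opened[tag] == 0:
                return  # stray end tag with nothing to close
            self.opened[tag] -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)


def strip_embedded_wrappers(document: str) -> str:
    """Return the document with embedded <html>/<body> wrappers removed."""
    parser = EmbeddedWrapperFilter()
    parser.feed(document)
    parser.close()  # flush any buffered text
    return "".join(parser.out)


# Same content as autoskip.html above.
SAMPLE = """<html><body>
<p>some text
<html xml:lang="en" xmlns="foobar">
<body>
<p>embbeded text</p>
</body>
</html>
<p>end text
</body></html>"""

if __name__ == "__main__":
    print(strip_embedded_wrappers(SAMPLE))
```

  This only rewrites the tag stream; it builds no tree and drops comments and
the DOCTYPE. The point is just that with the skipped start tags counted in
the state, the embedded </body> and </html> no longer terminate the outer
document, so all three paragraphs stay inside the single outer body.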
  I committed your patch, but there is still some cleanup remaining if you
want to look at it.

  Thanks !

Daniel

-- 
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard      | virtualization library  http://libvirt.org/
[EMAIL PROTECTED]  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine  http://rpmfind.net/
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml
