On Wed, Dec 12, 2007 at 07:53:10PM +0100, Arnold Hendriks wrote:
> I've been running into problems parsing incoming email messages through
> libxml2's HTML parser, which when seeing tags such as <html
> xml:lang="en" xmlns:....> in an unexpected place, will just eat the
> '<html' part and turn the attributes of that html tag into normal text,
> causing odd code to appear at the top of email messages. This mostly
> affects Outlook/Exchange generated messages.
>
> The attached patch tries to fix it. It works for me, but I wonder
> whether I haven't introduced memory allocation issues with it, and hope
> the patch (or a similar solution) can be integrated into a future libxml
> release.

  Hi Arnold,

I didn't forget about the issue, and got a bit of time yesterday to test
it and look at it. First, the patch makes sense: it fixes a serious
problem and there is no leak, so that's fine. But the result is still
problematic:

laptop:~/XML -> cat autoskip.html
<html><body>
<p>some text
<html xml:lang="en" xmlns="foobar">
<body>
<p>embbeded text</p>
</body>
</html>
<p>end text
</body></html>
laptop:~/XML -> xmllint --html autoskip.html
autoskip.html:3: HTML parser error : htmlParseStartTag: misplaced <html> tag
<html xml:lang="en" xmlns="foobar">
     ^
autoskip.html:4: HTML parser error : htmlParseStartTag: misplaced <body> tag
<body>
     ^
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<p>some text
</p>
<p>embbeded text</p>
</body>
<html><body><p>end text
</p></body></html>
</html>
laptop:~/XML ->

Basically the error is correctly displayed, but the closing of the
embedded body and html tags generates a serious mess. We are able to
detect the embedding, but the autoclose misbehaves. Moreover, when using
the push parser, the autoclose ends the document immediately:

laptop:~/XML -> xmllint --html --push autoskip.html
autoskip.html:3: HTML parser error : htmlParseStartTag: misplaced <html> tag
<html xml:lang="en" xmlns="foobar">
     ^
autoskip.html:4: HTML parser error : htmlParseStartTag: misplaced <body> tag
<body>
     ^
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p>some text
</p>
<p>embbeded text</p>
</body></html>
laptop:~/XML ->

I think the embedding error condition should be recorded somewhere in
the parser state and should disable, at least partially, the closing tag
processing, so that the 'end text' paragraph shows up as a sibling of
the 'embbeded text' paragraph. Either that, or we show the full
subdocument structure. I don't feel the current processing is good, even
though it's clearly better with your patch than without.
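To make the first option a bit more concrete, here is a rough sketch of
the idea as a toy model in Python, using only the standard library's
html.parser. It is not libxml2 code and not a proposed patch; the names
EmbeddedDocSketch, pending and flatten are made up for illustration. It
counts each misplaced <html> or <body> start tag in the parser state and
swallows the matching end tag, so the outer document stays open.
Attributes are dropped on output to keep it short.

```python
from html.parser import HTMLParser

# Tags that should be opened only once per document.
ROOT_TAGS = ("html", "body")


class EmbeddedDocSketch(HTMLParser):
    """Toy model: skip misplaced <html>/<body> tags and their end tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.seen = set()                          # root tags opened legitimately
        self.pending = {t: 0 for t in ROOT_TAGS}   # misplaced starts awaiting an end tag
        self.open_p = False                        # is a <p> currently open?
        self.errors = []
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in ROOT_TAGS:
            if tag in self.seen:
                # Embedding error: remember it so the matching end tag is ignored.
                self.pending[tag] += 1
                self.errors.append(f"misplaced <{tag}> tag")
                return
            self.seen.add(tag)
        elif tag == "p":
            if self.open_p:
                self.out.append("</p>")            # auto-close the previous paragraph
            self.open_p = True
        self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in ROOT_TAGS:
            if self.pending[tag] > 0:
                self.pending[tag] -= 1             # end of an embedded document: skip it
                return
            if self.open_p:
                self.out.append("</p>")
                self.open_p = False
        elif tag == "p":
            if not self.open_p:
                return                             # stray </p>, nothing to close
            self.open_p = False
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.out.append(text)


def flatten(source):
    """Return (serialized output, list of error messages) for an HTML string."""
    parser = EmbeddedDocSketch()
    parser.feed(source)
    parser.close()
    return "".join(parser.out), parser.errors


SAMPLE = """<html><body>
<p>some text
<html xml:lang="en" xmlns="foobar">
<body>
<p>embbeded text</p>
</body>
</html>
<p>end text
</body></html>
"""

if __name__ == "__main__":
    result, errors = flatten(SAMPLE)
    print(result)
    print(errors)
```

Run on the autoskip.html content above, this reports the same two
misplaced-tag errors and keeps the three paragraphs as siblings inside a
single body. A real fix in the parser would of course also have to deal
with the node stack and the push mode, which this sketch ignores.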

I committed your patch, but there is still some cleanup remaining if you
want to look at it.

  Thanks!

Daniel

--
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard      | virtualization library  http://libvirt.org/
[EMAIL PROTECTED]  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine  http://rpmfind.net/
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml