Re: [xml] LibXML Incorrectly Parses Tables by Omitting Implied TBODY

Daniel Veillard Mon, 26 Sep 2011 01:00:15 -0700

On Thu, Sep 22, 2011 at 11:52:16AM -0700, Alan Hogan wrote:
> According to HTML 4†, HTML5‡, and all the browsers I have tested* (including 
> Firefox, IE7/8/9, Chrome, Safari, Opera, Android, iOS):
> - No <table> should be without a <tbody>. 
> 
> - No <tr> should exist outside of a <thead>, <tfoot>, or <tbody>. 
> 
> - The first <tr> encountered in a <table>, if not within a <thead> or 
> <tfoot>, and if no <tbody> was manually defined, implies that a <tbody> 
> element was just created as well (as the parent of this and all 
> subsequent <tr>s). 
> 
> LibXML, however, seems happy to parse <tr> elements as if they were direct 
> children of a table. 
> This is simply wrong, nonstandard, and incompatible with user agents. 
> 
> It is creating a headache for me because CSS / XPath selections will not 
> act as expected, and in an asymmetrical way with regards to actual users' 
> browsers.
> 
> Can we get this to be considered a bug? 
> 
> After all, it’s not that the document author declared there was no 
> tbody. Wittingly or no, they implied its presence; LibXML is simply 
> failing to make the correct inference.


  The big problem is that when you start making inferences like that
you do change the document. In some basic cases it's rather hard to go
wrong, but real-world HTML is not about basic cases; it's about an ocean
of HTML broken in all possible ways.

  Even something as simple as implying <body> gets nasty really fast.
Assume a document starts with <p>; you would think that per the rules you
can add an implicit <html><body> ... well, until you hit

<p>blah
<title>foo

Yes, that's wrong; yes, it exists; engines will parse and render this
silently. And no, I won't try to fix it: maybe <p> was added by some
broken customization layer, maybe the beginner who typed this thought
title was a good substitute for h1. If libxml2 starts doing this, it will
be setting policies on how to handle brokenness, and since it's a library,
it's the wrong place to put them. Browsers are mostly end applications,
so it's fine for them to implement policies; libxml2, as a building
block, can't.

Now for tables, that's even more complex.
Sometimes the best thing at the parser level is to just *parse* and leave
the interpretation of the result to the application, because if you try
to interpret based on the specification, well, in real HTML you're
guaranteed to blow up one way or another.
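
To illustrate what "leave it to the application" could look like, here is
a minimal sketch of a post-parse pass that wraps stray <tr> children of a
<table> in a <tbody>. It is only an assumption about one possible
application policy, not libxml2 code: it uses Python's standard
xml.etree.ElementTree on an already well-formed tree, and the name
wrap_stray_rows is hypothetical. It is deliberately simpler than what
browsers do.

```python
import xml.etree.ElementTree as ET


def wrap_stray_rows(root):
    """Wrap <tr> elements that are direct children of a <table> in a <tbody>.

    Hypothetical application-level policy, applied after parsing; the
    parser's output is taken as-is and only then reinterpreted.
    """
    # Snapshot the tables first so that editing the tree is safe.
    for table in list(root.iter("table")):
        tbody = None
        for child in list(table):
            if child.tag == "tr":
                if tbody is None:
                    # Create the implied <tbody> where the first stray row sat.
                    tbody = ET.Element("tbody")
                    table.insert(list(table).index(child), tbody)
                table.remove(child)
                tbody.append(child)
            else:
                # <thead>, <tfoot>, an explicit <tbody>, etc. end the current run.
                tbody = None
    return root


if __name__ == "__main__":
    doc = ET.fromstring("<table><tr><td>a</td></tr><tr><td>b</td></tr></table>")
    print(ET.tostring(wrap_stray_rows(doc), encoding="unicode"))
```

The point of keeping this outside the parser is that each application can
decide whether it wants this rewrite at all, and what to do with the
broken cases it will inevitably meet.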

Daniel

-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
dan...@veillard.com  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml
