On Mon, Jan 09, 2006 at 02:44:34PM +0100, iSteve wrote:
> Greetings,
> for the past week, I've been fixing various bugs in gtkhtml2. Recently, 
> I've found an issue that I -- hopefully correctly -- traced back to 
> libxml2's HTML parser.
> 
> When parsing a html such as:
> <html><body> xxx <div>aaa</div> yyy <div>bbb</div> zzz </body></html>
> 
> I get the 'xxx', 'yyy' and 'zzz' wrapped into paragraphs ("p" element, 
> eg. "[...]<body><p>xxx</p><div>[...]).
> 
> The html:
> <html><body>some <img src="foo.bar"> text</body></html>
> turns into:
> <html><body><p>some <img src="foo.bar"> text</p></body></html>
> 
> The reason is apparently that each text should be in its own block; 
> unfortunately, wrapping them right into paragraph elements has quite a 
> few drawbacks:
> 
>  a) During later processing, eg. a stylesheet may (and in fact does) 
> get applied to the "p" element; imagine, for example, having a 
> background-image set for all <p>, and you'll suddenly see it even where 
> it shouldn't be at all... It may therefore also break rendering of eg. 
> float (please find the two attached test HTMLs, one without "p" 
> elements, one with them).
> 
>  b) It doesn't appear to be compliant with the standard either; at 
> least I didn't find any such rule in the HTML 4.01 standard.
> 
>  c) I have no idea why the text goes into <p> in the second example, 
> too...

  The spec for body is at:

  http://www.w3.org/TR/REC-html40/struct/global.html#h-7.5.1
    <!ELEMENT BODY O O (%block;|SCRIPT)+ +(INS|DEL) -- document body -->

  I'm not sure text nodes are to be accepted directly as children of a body
element: that content model lists only block-level elements and SCRIPT, with
no #PCDATA.

  For div, it seems adding the <p> is superfluous, since %flow; covers inline
content and text as well as blocks:

  http://www.w3.org/TR/REC-html40/struct/global.html#edef-DIV
<!ELEMENT DIV - - (%flow;)*            -- generic language/style container -->
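
  To make the two content models concrete, here is a minimal C sketch of
which children each one admits. It only illustrates the DTD fragments quoted
above and is not libxml2's htmlCheckParagraph() logic. The function names,
the "#text" pseudo-name for a text node, and the use of the Strict %block;
list are assumptions made for the example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical pseudo-name for a text node (#PCDATA in DTD terms). */
#define TEXT_NODE "#text"

/* %block; as defined in the HTML 4.01 Strict DTD. */
static const char *const block_elements[] = {
    "p", "h1", "h2", "h3", "h4", "h5", "h6", "ul", "ol", "pre", "dl",
    "div", "noscript", "blockquote", "form", "hr", "table", "fieldset",
    "address", NULL
};

/* Return 1 if name appears in the NULL-terminated list, 0 otherwise. */
static int in_list(const char *name, const char *const *list)
{
    for (size_t i = 0; list[i] != NULL; i++) {
        if (strcmp(name, list[i]) == 0)
            return 1;
    }
    return 0;
}

/* BODY: (%block;|SCRIPT)+ +(INS|DEL). No #PCDATA and no inline elements,
 * so text and <img> are both rejected as direct children. */
int body_accepts(const char *child)
{
    return in_list(child, block_elements)
        || strcmp(child, "script") == 0
        || strcmp(child, "ins") == 0
        || strcmp(child, "del") == 0;
}

/* Return 1 if text may appear directly inside parent.
 * Only the two elements discussed here are modelled: DIV is (%flow;)*,
 * which includes #PCDATA; anything else (BODY in particular) reports 0. */
int accepts_text(const char *parent)
{
    return strcmp(parent, "div") == 0;
}
```

  Under these assumptions body_accepts("#text") and body_accepts("img") are
both 0, which is consistent with the whole run 'some <img src="foo.bar"> text'
ending up inside one <p>, while accepts_text("div") is 1.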

> I do not believe that wrapping the text into paragraph (which, I 
> believe, is performed by htmlCheckParagraph()) is the best way; perhaps 
> setting the tag name to eg. NULL instead, or a zero-size string (as a 

  An element with no name, or with an empty name, would break so much
code that assumes a correct tree that nothing could justify such a hack, sorry!
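
  As a rough illustration of the kind of code that would break, consider a
consumer that walks siblings and compares element names. The struct and
function below are hypothetical stand-ins, not libxml2 types; the point is
only that the NULL guard shown here is something existing callers generally
do not have.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical minimal node, standing in for a parsed element. */
struct node {
    const char *name;         /* element name; callers assume non-NULL */
    const struct node *next;  /* next sibling, or NULL at the end */
};

/* Count the siblings, starting at first, whose name equals name. */
size_t count_named(const struct node *first, const char *name)
{
    size_t n = 0;

    for (const struct node *cur = first; cur != NULL; cur = cur->next) {
        /* Without this guard, strcmp() on a NULL name is undefined
         * behaviour; every name comparison would need the same check. */
        if (cur->name == NULL)
            continue;
        if (strcmp(cur->name, name) == 0)
            n++;
    }
    return n;
}
```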

> special value) would be a better way to resolve the point a) and b). If 
> no styling and rendering would be applied to the reported block (by the 
> aforementioned fix), it would imply that c) would no longer matter 
> anyway, too.

Daniel

-- 
Daniel Veillard      | Red Hat http://redhat.com/
[EMAIL PROTECTED]  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/