Re: [xml] Apparently incorrect paragraph wrapping in HTML parser

Daniel Veillard Wed, 11 Jan 2006 23:00:39 -0800

On Thu, Jan 12, 2006 at 01:29:39AM +0000, Gary Coady wrote:
> [EMAIL PROTECTED] wrote:
> > Hi All,
> > 
> > I personally believe that it should be based on the DTD being used for the
> > HTML.  
> > 
> > I use XHTML Strict (-//W3C//DTD XHTML 1.0 Strict//EN) and I would expect any
> > conversion from XML to XHTML to make an XML document that is valid against
> > the DTD.  If libxml2 does what you want then this will *not* be the case and
> > hence all my XHTML would be invalid.
> > 
> > In the future people will have problems with their XHTML if they do not
> > consider using the strict version.  The semantic web and machine to machine
> > communications will need to depend on the documents being as compliant as
> > possible to the standard.  
> > 
> > "-//W3C//DTD XHTML 1.0 Transitional//EN" is supposed to be for
> > "transitional" use while one is going from HTML 4.0 to XHTML.  I believe that
> > XHTML1.0-Strict is the expected standard until it is replaced by the W3C XML
> > Schema version.  For this reason I believe Libxml2 should automatically
> > provide XHTML1.0-Strict.  If not then should libxml2 be creating HTML 1.0?
> > ;--)  
> 
> I agree with most of the above, but an alteration would not affect the
> behaviour you're worried about; XHTML should be parsed with the XML
> parser, not the legacy HTML parser - and this issue involves the
> behaviour of the latter.


  Right.

> A few months ago, I came across a "bug" where whitespace nodes as a
> direct child of the <body> tag would be removed. The problem is similar
> in that pure whitespace nodes are forbidden by the strict DTD, but
> allowed by the transitional DTD.
> 
> In this case, the applied patch checked the DTD in use with code like
> 
>     dtd = xmlGetIntSubset(ctxt->myDoc);
>     if (dtd != NULL && dtd->ExternalID != NULL) {
>         if (!xmlStrcasecmp(dtd->ExternalID,
>                 BAD_CAST "-//W3C//DTD HTML 4.01//EN") ||
>             !xmlStrcasecmp(dtd->ExternalID,
>                 BAD_CAST "-//W3C//DTD HTML 4//EN"))
>         {
> (line 2060, HTMLparser.c).
> 
> This code assumes that HTML 4 and HTML 4.01 are the only strict non-XML
> DTDs in existence.
> 
> Something similar might be useful for this issue - the <p> tags are not
> needed for a Transitional DTD. I'll have a look to see if there's an
> easy fix at the weekend, if nobody's supplied a patch before that :-)
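
  For illustration only, the DOCTYPE test quoted above could be written as a
table lookup, so that recognising another strict public identifier means adding
a table entry rather than another || branch. This is a sketch, not the code in
HTMLparser.c: is_strict_html_public_id and strict_ids are hypothetical names,
and POSIX strcasecmp stands in for xmlStrcasecmp so that the example builds
without libxml2. The two identifiers are the ones from the quoted patch.

```c
#include <stddef.h>
#include <strings.h>   /* strcasecmp (POSIX), used here in place of xmlStrcasecmp */

/* Public identifiers treated as strict; both come from the quoted patch. */
static const char *const strict_ids[] = {
    "-//W3C//DTD HTML 4.01//EN",
    "-//W3C//DTD HTML 4//EN",
};

/*
 * Hypothetical helper: returns 1 if external_id matches one of the strict
 * public identifiers (case-insensitively), 0 otherwise or if it is NULL.
 */
int is_strict_html_public_id(const char *external_id)
{
    size_t i;

    if (external_id == NULL)
        return 0;
    for (i = 0; i < sizeof(strict_ids) / sizeof(strict_ids[0]); i++) {
        if (strcasecmp(external_id, strict_ids[i]) == 0)
            return 1;
    }
    return 0;
}
```

  In the parser the argument would presumably be dtd->ExternalID (cast from
xmlChar *), with the same NULL check on dtd as in the quoted fragment.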

  Yes, thanks! That sounds like the right approach to me. I would just
merge that with a new htmlParserOption, HTML_PARSE_STRICT, which could either
be passed by the user to maintain the current behaviour or be activated by
default when the DOCTYPE is read, if it happens to be a Strict HTML one.
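
  Roughly, the two ways of turning the option on could be combined as below.
This is only a sketch of the idea: HTML_PARSE_STRICT is a proposal at this
point, so the SKETCH_HTML_PARSE_STRICT name, its bit value, and both function
names are made up for the example and do not come from libxml2.

```c
/* Hypothetical bit for the proposed option; name and value are invented here. */
#define SKETCH_HTML_PARSE_STRICT (1 << 20)

/*
 * Combine what the caller asked for with what the DOCTYPE implies.
 * doctype_is_strict is nonzero when the DOCTYPE that was read was
 * recognised as a Strict HTML one.
 */
int sketch_effective_options(int user_options, int doctype_is_strict)
{
    if (doctype_is_strict)
        user_options |= SKETCH_HTML_PARSE_STRICT;
    return user_options;
}

/* Nonzero when strict handling (the current behaviour) should apply. */
int sketch_is_strict(int options)
{
    return (options & SKETCH_HTML_PARSE_STRICT) != 0;
}
```

  With this shape a user who sets the flag explicitly gets the current
behaviour whatever the DOCTYPE says, and other option bits pass through
unchanged.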

  Does that make sense?

Daniel

-- 
Daniel Veillard      | Red Hat http://redhat.com/
[EMAIL PROTECTED]  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
[email protected]
http://mail.gnome.org/mailman/listinfo/xml
