On Wed, Nov 09, 2005 at 03:10:11PM +1100, Michael Day wrote:
>
> Hi,
>
> HTMLparser currently parses comments by looking for a --> to end the
> comment. However, this does not handle SGML comments, in which -- is used
> to toggle whether > ends the comment. It is possible for an SGML comment
> to look like this:
>
> <!-- Hel>lo -- world --> good>bye -- world >
>
> The whole thing is one comment, broken down like this:
>
> "<!--" starts the comment
> " Hel>lo " comment text ('>' is treated as text)
> "--" toggles state ('>' will end the comment)
> " world " comment text
> "--" toggles state ('>' will be treated as text)
> "> good>bye " comment text ('>' is treated as text)
> "--" toggles state ('>' will end the comment)
> " world " comment text
> ">" ends the comment
>
> This looks pretty scary, but this is how Mozilla handles HTML comments in
> standards mode and Opera is going to do the same. The Acid2 test from the
> Web Standards Project includes an SGML comment:
>
> http://www.webstandards.org/act/acid2/
>
> For further info on SGML comments in HTML, see:
>
> http://www.howtocreate.co.uk/SGMLComments.html
>
> I have a patch for HTMLparser.c to make it parse SGML comments. It also
> strips "--" from the text of the comment node, which is different from the
> existing behaviour:
>
> <!-- Hello -->
> comment(" Hello ") // identical to old behaviour
>
> <!-- Hello ---- world -->
> comment(" Hello world ") // old behaviour includes "----"
>
> <!-- Hello -- --> -- world >
> comment(" Hello > world ")
>
> Stripping out the "--" from the text of the comment node also makes it
> possible to take documents that were parsed by HTMLparser and serialise
> them as well-formed XML, which is sometimes not possible now.
>
> Would this patch be acceptable?
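
For reference, the toggle rule described in the quoted message can be sketched in a few lines of C. This is only an illustration under stated assumptions, not the submitted patch and not libxml2 code: the function name sgml_comment_parse and its signature are hypothetical, the input is assumed to be a complete NUL-terminated string already in memory (so the parser's input-buffer refilling is not modelled at all), and bytes are copied as-is without any multibyte handling.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, NOT part of libxml2.
 *
 * Parses one SGML-style comment at the start of src, which must begin
 * with "<!--". The comment text is copied into out with every "--"
 * delimiter stripped. Returns the number of input bytes consumed, or 0
 * if src does not start a comment, the comment is unterminated, or out
 * is too small to hold the text plus the terminating NUL.
 */
size_t sgml_comment_parse(const char *src, char *out, size_t out_size)
{
    size_t i = 4;      /* read position, just past "<!--" */
    size_t n = 0;      /* write position in out */
    int gt_ends = 0;   /* 0: '>' is plain text, 1: '>' ends the comment */

    assert(src != NULL && out != NULL);
    if (out_size == 0 || strncmp(src, "<!--", 4) != 0)
        return 0;

    while (src[i] != '\0') {
        if (src[i] == '-' && src[i + 1] == '-') {
            gt_ends = !gt_ends;        /* "--" toggles the state */
            i += 2;
        } else if (src[i] == '>' && gt_ends) {
            out[n] = '\0';             /* '>' closes the comment */
            return i + 1;
        } else {
            if (n + 1 >= out_size)     /* keep room for the NUL */
                return 0;
            out[n++] = src[i++];       /* ordinary comment text */
        }
    }
    return 0;                          /* ran out of input: unterminated */
}
```

Called on the one-comment example from the quoted message, this sketch consumes the whole string and yields the text with each "--" removed; the '>' characters that appear while the toggle is off stay in the output as ordinary text.
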
Sounds like a good idea to fix the parser behaviour to be more correct, yes.
I don't really know SGML, so such patches are welcome. I just have one
problem with the code: it calls GROW only when the end of the buffer is
detected with a NUL. I would rather have it called more preemptively
in the loop to avoid a potential weakness in the case of multibyte chars.
Note also that I prefer patches to cut and paste of full routines, since
a patch gives me the context of what was changed.
Thanks!
Daniel
--
Daniel Veillard | Red Hat http://redhat.com/
[EMAIL PROTECTED] | libxml GNOME XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/