On Thu, Aug 02, 2007 at 04:05:42PM +0400, Andrey C. (aka mohmad) wrote:
> Greetings,
>
> Daniel Veillard wrote:
> >>In recovery mode, a parent 'script' or 'style' section is parsed
> >>wrongly if it contains an embedded section of the same name.
> >>Say an HTML document contains the following script section:
> >>================================Cut here===================================
> >><script language=javascript>
> >>...
> >>document.write('<script language=vbscript\>blah</script\>');
> >>...
> >></script>
> >>================================Cut here===================================
> >>Its content is escaped incorrectly.
> >>
> >>
> >>After this document is processed with the HTML SAX parser in RECOVERY
> >>mode, the original section comes out corrupted:
> >>================================Cut here===================================
> >><script language=javascript>
> >>...
> >>document.write('<script language=vbscript\>blah</script>
> >>================================Cut here===================================
> >>
> >>Because the parent tag and the embedded one have the same name, the
> >>parser stops parsing the parent section prematurely, as soon as it
> >>meets the end of the embedded section
> >>(see HTMLparser.c, htmlParseScript function, line 2689).
> >
> > Well I'm sure that HTML breaks in a number of places, not just in
> > libxml2; it looks to me like a case of data broken beyond recovery.
> >
> >>Possible patch is attached.
> >
> > Could you try to explain your patch in English, i.e. what kind of
> > workaround you suggest? This may help discuss it,
>
> In RECOVER mode, while processing script|style tags, the patch counts
> the number of embedded tags that have the same name as the parent.
> Processing of the script|style tag stops only if the counter isn't
> greater than zero; otherwise it's assumed that the end of an embedded
> script|style tag has been reached, and it is treated as CDATA.
>
> Pseudo code:
> htmlParseScript()
> {
>     mtags = 0;
>     tagname = {script|style};
>
>     if ((cur == '<'))
>     {
>         if ((NXT(1) == '/'))
>         {
>             if (recovery && curtagname == tagname)
>                 if (mtags-- <= 0)
>                     break; // the end of tag being processed
>         } else if (recovery && curtagname == tagname)
>             ++mtags; // the same embedded tag
>     }
>
>     // treat parsed content as CDATA
> }
Seems it would trivially break if
<script/>
is embedded in the content. Sorry, it looks like it will break more
documents than it might fix, which is the hard dilemma for any attempt to
fix broken input.
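
For illustration only, here is a minimal standalone sketch of the trade-off
discussed in this thread. It is not libxml2 code and not the attached patch:
find_script_end() and starts_with_nocase() are hypothetical names, the scan
handles only the tag name "script", and it assumes the input is the text that
follows the opening <script> tag. With count_nested set to 0 it stops at the
first "</script", which is the behaviour reported; with count_nested set to 1
it applies the counter from the pseudo code.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: case-insensitive "does s start with name?" */
static int starts_with_nocase(const char *s, const char *name)
{
    while (*name != '\0') {
        if (tolower((unsigned char) *s) != tolower((unsigned char) *name))
            return 0;   /* also covers s ending before name does */
        s++;
        name++;
    }
    return 1;
}

/*
 * Hypothetical sketch, not the libxml2 implementation.
 * s is the text following an opening <script> tag. Returns the offset of
 * the "</script" accepted as the end of the section, or -1 if no closing
 * tag is accepted before the end of the input.
 */
static long find_script_end(const char *s, int count_nested)
{
    int mtags = 0;      /* embedded same-name tags seen and not yet closed */
    size_t i;

    assert(s != NULL);
    for (i = 0; s[i] != '\0'; i++) {
        if (s[i] != '<')
            continue;
        if (s[i + 1] == '/') {
            if (starts_with_nocase(s + i + 2, "script")) {
                /* Without counting, the first closing tag always wins. */
                if (!count_nested || mtags-- <= 0)
                    return (long) i;
            }
        } else if (count_nested && starts_with_nocase(s + i + 1, "script")) {
            mtags++;    /* "<script/>" is counted here too */
        }
    }
    return -1;
}
```

On the document.write() example from the report, the counting variant skips
the embedded closing tag and stops at the real one. On content that contains
<script/>, the counter is incremented but never matched, so the real closing
tag is consumed as content and the function returns -1, which is the failure
described above.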
Daniel
--
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard | virtualization library http://libvirt.org/
[EMAIL PROTECTED] | libxml GNOME XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/