On Thu, Aug 02, 2007 at 04:05:42PM +0400, Andrey C. (aka mohmad) wrote:
> Greetings,
> 
> Daniel Veillard wrote:
> >>In recovery mode, a parent 'script' or 'style' section is parsed 
> >>incorrectly if it contains an embedded section of the same kind.
> >>Say an HTML document contains the following script section:
> >>================================Cut here===================================
> >><script language=javascript>
> >>...
> >>document.write('<script language=vbscript\>blah</script\>');
> >>...
> >></script>
> >>================================Cut here===================================
> >>Its content is escaped incorrectly.
> >>
> >>
> >>After this document is processed with the HTML SAX parser in RECOVERY 
> >>mode, the original section looks corrupted:
> >>================================Cut here===================================
> >><script language=javascript>
> >>...
> >>document.write('<script language=vbscript\>blah</script>
> >>================================Cut here===================================
> >>
> >>Because the parent tag and the embedded one have the same name, the 
> >>parser ends parsing of the parent section prematurely, as soon as it 
> >>meets the end of the embedded section
> >>(see HTMLparser.c, htmlParseScript function, line 2689).
> >
> >  Well, I'm sure that HTML breaks in a number of places, not just in
> >libxml2. This looks to me like a case of data broken beyond recovery.
> >
> >>Possible patch is attached.
> >
> >  Could you try to explain your patch in English, i.e. what kind of
> >workaround you suggest? This may help discuss it.
> 
> In RECOVER mode, while processing script|style tags, the patch counts the 
> number of embedded tags that have the same name as the parent.
> Processing of the script|style tag stops at a closing tag only if the 
> counter is not greater than zero; otherwise it is assumed that the end of 
> an embedded script|style tag has been reached, and it is treated as CDATA.
> 
> Pseudo code:
> htmlParseScript()
> {
>   mtags = 0;
>   tagname = {script|style};
> 
>   if ((cur == '<'))
>   {
>      if ((NXT(1) == '/'))
>      {
>         if (recovery && curtagname == tagname)
>            if (mtags-- <= 0)
>               break; // the end of tag being processed
>      } else if (recovery && curtagname == tagname)
>         ++mtags; // the same embedded tag
>   }
> 
>   // treat parsed content as CDATA
> }

  It seems this would trivially break if

    <script/> 

is embedded in the content. Sorry, it looks like it will break more documents
than it might fix, which is the hard dilemma for any attempt to fix broken
input.
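
To make the objection concrete, here is a minimal C sketch of the counting
rule from the pseudo code. It is not the submitted patch and not libxml2
code: starts_tag and find_section_end are hypothetical names, and tag-name
matching is simplified to a case-insensitive prefix check followed by a
non-alphanumeric character.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not a libxml2 function): returns 1 if p begins
 * with the tag name, compared case-insensitively, and the name is
 * followed by a non-alphanumeric character. */
static int starts_tag(const char *p, const char *name)
{
    size_t i;

    for (i = 0; name[i] != '\0'; i++) {
        if (tolower((unsigned char)p[i]) != tolower((unsigned char)name[i]))
            return 0;
    }
    return !isalnum((unsigned char)p[i]);
}

/* Hypothetical sketch of the counting rule: scan the raw content of a
 * script|style section and return the offset of the "</name" that is
 * taken as its end, or -1 if no such closing tag is found. */
static long find_section_end(const char *content, const char *name)
{
    int depth = 0;          /* same-name opening tags seen so far */
    const char *p;

    for (p = content; *p != '\0'; p++) {
        if (*p != '<')
            continue;
        if (p[1] == '/') {
            /* closing tag: it ends the section only when no embedded
             * same-name tag is still open */
            if (starts_tag(p + 2, name) && depth-- <= 0)
                return (long)(p - content);
        } else if (starts_tag(p + 1, name)) {
            depth++;        /* embedded tag with the parent's name */
        }
    }
    return -1;
}
```

With the content from the report (an embedded <script ...\> ... </script\>
pair followed by the real </script>), this sketch returns the offset of the
last </script>. With content such as var s = '<script/>'; followed by
</script>, the counter is incremented but never balanced, so the real closing
tag is swallowed as CDATA and the function returns -1.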

Daniel

-- 
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard      | virtualization library  http://libvirt.org/
[EMAIL PROTECTED]  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine  http://rpmfind.net/
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
[email protected]
http://mail.gnome.org/mailman/listinfo/xml