Greetings,

Daniel Veillard wrote:
>> In recovery mode, a parent 'script' or 'style' section is parsed 
>> incorrectly if it contains an embedded section of the same kind.
>> Say an HTML document contains the following script section:
>> ================================Cut here===================================
>> <script language=javascript>
>> ...
>> document.write('<script language=vbscript\>blah</script\>');
>> ...
>> </script>
>> ================================Cut here===================================
>> Its content is escaped incorrectly.
>>
>>
>> After this document is processed with the HTML SAX parser in RECOVERY mode, 
>> the original section comes out corrupted:
>> ================================Cut here===================================
>> <script language=javascript>
>> ...
>> document.write('<script language=vbscript\>blah</script>
>> ================================Cut here===================================
>>
>> Because the parent tag and the embedded one have the same name, the 
>> parser stops parsing the parent section prematurely, as soon as it 
>> meets the end of the embedded section.
>> (see HTMLparser.c, htmlParseScript function, line 2689).
> 
>   Well, I'm sure that HTML breaks in a number of places, not just in libxml2;
> it looks to me like a case of data broken beyond recovery.
> 
>> Possible patch is attached.
> 
>   Could you try to explain your patch in English, i.e. what kind of workaround
> you suggest? This may help discuss it.

In RECOVER mode, while a script|style tag is being processed, the patch counts 
the embedded tags that have the same name as the parent tag.
Processing of the script|style tag stops at a closing tag only if the counter 
is not greater than zero. Otherwise the closing tag is assumed to end an 
embedded script|style tag, the counter is decremented, and the closing tag is 
treated as CDATA.

Pseudo code:
htmlParseScript()
{
   mtags = 0;                 // embedded same-name tags currently open
   tagname = {script|style};  // name of the tag being processed

   while (cur != 0)
   {
      if (cur == '<')
      {
         if (NXT(1) == '/')
         {
            if (recovery && curtagname == tagname)
               if (mtags-- <= 0)
                  break; // the end of the tag being processed
         } else if (recovery && curtagname == tagname)
            ++mtags; // an embedded tag with the same name
      }

      // treat parsed content as CDATA and advance to the next character
   }
}
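
For illustration, here is the same counting idea as a small standalone C
sketch. It is not the patch itself and does not use libxml2: the names
find_raw_text_end and starts_with_tag are made up for this example, and it
assumes the section content is a NUL-terminated string. Outside recovery mode
it simply stops at the first matching closing tag, which is a simplification
rather than a description of what the parser really does.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: case-insensitive check that the text at p begins
 * with the tag name and that the name ends there, so "scripts" does not
 * match "script". */
static int starts_with_tag(const char *p, const char *tag)
{
    size_t i;

    for (i = 0; tag[i] != '\0'; i++) {
        if (tolower((unsigned char)p[i]) != tolower((unsigned char)tag[i]))
            return 0;
    }
    return !isalnum((unsigned char)p[i]);
}

/* Hypothetical function: return the offset of the '<' of the closing tag
 * that ends the parent section, or the length of buf if there is none.
 * In recover mode, embedded opening tags with the same name are counted
 * and each one absorbs one closing tag. */
size_t find_raw_text_end(const char *buf, const char *tag, int recover)
{
    int depth = 0;   /* embedded same-name tags currently open */
    size_t i;

    for (i = 0; buf[i] != '\0'; i++) {
        if (buf[i] != '<')
            continue;
        if (buf[i + 1] == '/') {
            if (!starts_with_tag(buf + i + 2, tag))
                continue;        /* some other closing tag: plain content */
            if (!recover || depth <= 0)
                return i;        /* the end of the tag being processed */
            depth--;             /* closes an embedded tag, keep going */
        } else if (recover && starts_with_tag(buf + i + 1, tag)) {
            depth++;             /* an embedded tag with the same name */
        }
    }
    return i;
}
```

Called on the content of the script section from the example above with
recover set to 1, the function skips the embedded </script\> and returns the
offset of the final </script>. With recover set to 0 it returns the offset of
the embedded closing tag, which is the premature stop described in the report.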

Andrey.
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
[email protected]
http://mail.gnome.org/mailman/listinfo/xml
