Greetings,
Daniel Veillard wrote:
>> In recovery mode, parent 'script' or 'style' section will be parsed
>> wrongly if it contains the same embedded one.
>> Say, an HTML document contains following script section:
>> ================================Cut here===================================
>> <script language=javascript>
>> ...
>> document.write('<script language=vbscript\>blah</script\>');
>> ...
>> </script>
>> ================================Cut here===================================
>> Its content is escaped incorrectly.
>>
>>
>> After this document is processed with the HTML SAX parser in RECOVERY mode,
>> the original section looks corrupted:
>> ================================Cut here===================================
>> <script language=javascript>
>> ...
>> document.write('<script language=vbscript\>blah</script>
>> ================================Cut here===================================
>>
>> Because the parent tag and the embedded one have the same name, the parser
>> stops parsing the parent section prematurely, as soon as it meets the end
>> of the embedded section.
>> (see HTMLparser.c, htmlParseScript function, line 2689).
>
> Well I'm sure that HTML breaks in a number of places, not just in libxml2
> looks to me a case of broken beyond recovery data.
>
>> Possible patch is attached.
>
> Could you try to explain your patch in english, i.e. what kind of workaround
> you suggest, this may help discuss it,
In RECOVER mode, while a script|style tag is being processed, the patch counts
the embedded tags that have the same name as the parent.
Processing of the script|style tag stops at a closing tag only if the counter
is not greater than zero; otherwise the closing tag is assumed to end an
embedded script|style tag, and it is treated as CDATA.
Pseudo code:
htmlParseScript()
{
    mtags = 0;                    // open embedded tags named like the parent
    tagname = {script|style};     // name of the tag being processed
    while (more input)
    {
        if (cur == '<')
        {
            if (NXT(1) == '/')
            {
                if (recovery && curtagname == tagname)
                    if (mtags-- <= 0)
                        break;    // the end of the tag being processed
            }
            else if (recovery && curtagname == tagname)
                ++mtags;          // the same embedded tag
        }
        // treat parsed content as CDATA
    }
}
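
To make the counting idea easier to discuss, here is a minimal standalone
sketch in C. It is not the attached patch and it does not touch libxml2
internals: the names raw_text_end and starts_with_tag are made up for this
example, and the input is assumed to be a NUL-terminated buffer that starts
right after the parent's opening tag.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if s starts with the tag name `name`
 * (compared case-insensitively) and the name is not continued by another
 * letter or digit, so "scripts" does not match "script". */
static int starts_with_tag(const char *s, const char *name)
{
    size_t i;

    for (i = 0; name[i] != '\0'; i++) {
        if (tolower((unsigned char)s[i]) != tolower((unsigned char)name[i]))
            return 0;
    }
    return !isalnum((unsigned char)s[i]);
}

/* Hypothetical scanner: returns the offset of the "</name" that closes the
 * parent element, or the length of buf if no such closing tag is found. */
static size_t raw_text_end(const char *buf, const char *name)
{
    int mtags = 0;      /* open embedded tags with the same name */
    size_t i;

    for (i = 0; buf[i] != '\0'; i++) {
        if (buf[i] != '<')
            continue;
        if (buf[i + 1] == '/') {
            if (starts_with_tag(buf + i + 2, name)) {
                if (mtags-- <= 0)
                    break;      /* the end of the tag being processed */
            }
        } else if (starts_with_tag(buf + i + 1, name)) {
            ++mtags;            /* the same embedded tag */
        }
    }
    return i;                   /* everything before this offset is CDATA */
}
```

Called on the body of the script section from the report with name "script",
raw_text_end skips the embedded </script\> and stops at the final </script>,
so the whole document.write(...) line stays inside the CDATA.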
Andrey.
_______________________________________________
xml mailing list, project page http://xmlsoft.org/
http://mail.gnome.org/mailman/listinfo/xml