Hello All,

In recovery mode, a parent 'script' or 'style' section is parsed incorrectly if
it contains an embedded section with the same tag name.
Say an HTML document contains the following script section:
================================Cut here===================================
<script language=javascript>
...
document.write('<script language=vbscript\>blah</script\>');
...
</script>
================================Cut here===================================
Its content is escaped incorrectly: the '>' characters are escaped, but the
"</" sequence is not.


After this document is processed with the HTML SAX parser in RECOVERY mode, the
original section comes out corrupted:
================================Cut here===================================
<script language=javascript>
...
document.write('<script language=vbscript\>blah</script>
================================Cut here===================================

Because the parent tag and the embedded one have the same name, the parser
stops parsing the parent section prematurely, as soon as it reaches the end tag
of the embedded section
(see HTMLparser.c, htmlParseScript function, line 2689).
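
To illustrate the idea outside of libxml2, here is a minimal standalone sketch
of the nesting counter: each embedded opening tag with the parent's name bumps
a depth counter, and a matching close tag ends the section only when the depth
is back to zero. The helper names starts_with_nocase and find_outer_close are
hypothetical and exist only for this example; they are not part of
HTMLparser.c, and the sketch assumes a NUL-terminated buffer that starts right
after the parent's opening tag.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: case-insensitive test that `s` starts with `name`. */
static int starts_with_nocase(const char *s, const char *name) {
    while (*name != '\0') {
        /* A NUL in `s` never matches a non-NUL in `name`, so no over-read. */
        if (tolower((unsigned char)*s) != tolower((unsigned char)*name))
            return 0;
        s++;
        name++;
    }
    return 1;
}

/*
 * Hypothetical helper: return the offset of the "</name" that closes the
 * parent element, or -1 if none is found. `content` is assumed to start
 * just after the parent's opening tag.
 */
static long find_outer_close(const char *content, const char *name) {
    int depth = 0;   /* number of embedded same-name sections still open */
    size_t i;

    for (i = 0; content[i] != '\0'; i++) {
        if (content[i] != '<')
            continue;
        if (content[i + 1] == '/') {
            if (starts_with_nocase(content + i + 2, name)) {
                if (depth == 0)
                    return (long)i;   /* this close tag ends the parent */
                depth--;              /* it only closes an embedded one */
            }
        } else if (starts_with_nocase(content + i + 1, name)) {
            depth++;                  /* embedded opening tag, same name */
        }
    }
    return -1;
}
```

Like the comparison in htmlParseScript, this matches on a name prefix only, so
a tag such as "<scripted" would also be counted; it is a sketch of the counting
approach, not a complete tokenizer.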

A possible patch is attached.

Kind regards,
Andrey C.

--- HTMLparser.c~       2007-07-20 23:47:40.000000000 +0400
+++ HTMLparser.c        2007-07-30 17:04:45.000000000 +0400
@@ -2680,41 +2680,51 @@
 static void
 htmlParseScript(htmlParserCtxtPtr ctxt) {
     xmlChar buf[HTML_PARSER_BIG_BUFFER_SIZE + 5];
+    short mtags = 0;
     int nbchar = 0;
     int cur,l;
 
     SHRINK;
     cur = CUR_CHAR(l);
     while (IS_CHAR_CH(cur)) {
-       if ((cur == '<') && (NXT(1) == '/')) {
-            /*
-             * One should break here, the specification is clear:
-             * Authors should therefore escape "</" within the content.
-             * Escape mechanisms are specific to each scripting or
-             * style sheet language.
-             *
-             * In recovery mode, only break if end tag match the
-             * current tag, effectively ignoring all tags inside the
-             * script/style block and treating the entire block as
-             * CDATA.
-             */
-            if (ctxt->recovery) {
-                if (xmlStrncasecmp(ctxt->name, ctxt->input->cur+2, 
-                                  xmlStrlen(ctxt->name)) == 0) 
-                {
-                    break; /* while */
-                } else {
-                   htmlParseErr(ctxt, XML_ERR_TAG_NAME_MISMATCH,
-                                "Element %s embeds close tag\n",
-                                ctxt->name, NULL);
-               }
-            } else {
-                if (((NXT(2) >= 'A') && (NXT(2) <= 'Z')) ||
-                    ((NXT(2) >= 'a') && (NXT(2) <= 'z'))) 
-                {
-                    break; /* while */
-                }
-            }
+        if ((cur == '<')) {
+           if ((NXT(1) == '/')) {
+               /*
+                * One should break here, the specification is clear:
+                * Authors should therefore escape "</" within the content.
+                * Escape mechanisms are specific to each scripting or
+                * style sheet language.
+                *
+                * In recovery mode, only break if end tag match the
+                * current tag, effectively ignoring all tags inside the
+                * script/style block and treating the entire block as
+                * CDATA.
+                */
+               if (ctxt->recovery) {
+                   if (xmlStrncasecmp(ctxt->name, ctxt->input->cur+2, 
+                                      xmlStrlen(ctxt->name)) == 0)
+                   {
+                       if (mtags-- <= 0)
+                           break; /* while */
+                   } else {
+                       htmlParseErr(ctxt, XML_ERR_TAG_NAME_MISMATCH,
+                                    "Element %s embeds close tag\n",
+                                    ctxt->name, NULL);
+                   }
+               } else {
+                   if (((NXT(2) >= 'A') && (NXT(2) <= 'Z')) ||
+                       ((NXT(2) >= 'a') && (NXT(2) <= 'z'))) 
+                   {
+                       break; /* while */
+                   }
+               }
+           } /* </  */
+           else if (ctxt->recovery &&
+                    xmlStrncasecmp(ctxt->name, ctxt->input->cur+1,
+                                   xmlStrlen(ctxt->name)) == 0)
+           {
+               ++mtags;
+           }
        }
        COPY_BUF(l,buf,nbchar,cur);
        if (nbchar >= HTML_PARSER_BIG_BUFFER_SIZE) {

_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
[email protected]
http://mail.gnome.org/mailman/listinfo/xml
