Re: [xml] Encodings precedence

Daniel Veillard Mon, 16 May 2011 01:25:17 -0700

On Fri, May 13, 2011 at 09:09:12AM -0400, Extra Fu wrote:
> Hello,
> 
> I'm using libxml2 (2.7.6) and I have a question regarding encoding
> precedence.
> 
> I have an array of bytes (UTF-8 HTML data) and I invoke
> htmlCreatePushParserCtxt() with the encoding set to XML_CHAR_ENCODING_UTF8.
> When I walk the document's nodes, everything is fine unless the HTML file
> was poorly generated, such as:
> 
> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head>
> <meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
> ...
> 
> The charset specified here is wrong, as the HTML data is truly UTF-8 (I know
> for sure). Nonetheless, the charset specified by the meta tag seems to take
> precedence over the encoding specified in htmlCreatePushParserCtxt().
> 
> That is, when walking the document's nodes using that wrong charset, it
> seems that the xmlNodePtr's content isn't in UTF-8, which messes up my
> handler as it expects UTF-8 data.
> 
> How can I best handle this? I could of course strip the charset parameter
> from the meta tag before calling htmlCreatePushParserCtxt(), but I would
> rather "force" libxml to trust me and use UTF-8 on that poorly generated
> content.


  Yes, that's a problem. You have hit a libxml2 deficiency:
there is no way to force the parser to ignore the encoding defined in
the document. In your case the encoding you provide is UTF-8, which is
the internal one, and as a result libxml2 behaves as if no hint had been
given on context creation.
  For XML, the way to deal with encodings is defined in appendix F
   http://www.w3.org/TR/REC-xml/#sec-guessing
where the encoding given by the "environment" normally preempts any
internally defined one.
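
  As a rough illustration of that precedence rule only (this is not
libxml2 code, and pick_encoding() and its parameters are hypothetical
names), a caller-side sketch might look like this:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not a libxml2 function.
 * external: encoding supplied by the caller or transport, or NULL.
 * declared: encoding named inside the document, or NULL.
 * fallback: what to use when neither is known (must not be NULL).
 */
static const char *
pick_encoding(const char *external, const char *declared,
              const char *fallback)
{
    assert(fallback != NULL);

    if (external != NULL && external[0] != '\0')
        return external;    /* environment preempts the document */
    if (declared != NULL && declared[0] != '\0')
        return declared;    /* otherwise use the XMLDecl / meta value */
    return fallback;        /* nothing known: caller-chosen default */
}
```

With the poster's input, pick_encoding("UTF-8", "Windows-1252", "UTF-8")
returns the external "UTF-8". The difficulty described above is that
libxml2 treats a UTF-8 hint as if no hint had been given, so the value
declared in the document ends up being used.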
  Still, I think the simplest fix is to provide a way to force ignoring
internal encodings when necessary, e.g. when the framework automatically
transcodes the document encoding. The attached patch does this, and it
includes a new --noenc option for xmllint:

paphio:~/XML -> cat tst.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head>
<meta http-equiv="Content-Type" content="text/html; charset=foo">
</head>
<body>
  some content
</body>
</html>
paphio:~/XML -> xmllint --html --noout tst.html
tst.html:2: HTML parser error : htmlCheckEncoding: unknown encoding foo
<meta http-equiv="Content-Type" content="text/html; charset=foo">
                                                                ^
paphio:~/XML -> xmllint --html --noout --noenc tst.html
paphio:~/XML ->

  I also modified the output code so that an unknown internal encoding
no longer results in a silently dropped document with no error:

paphio:~/XML -> xmllint --html --noenc tst.html
output error : unknown encoding foo
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head><meta http-equiv="Content-Type" content="text/html; charset=foo"></head>
<body>
  some content
</body>
</html>
paphio:~/XML ->

Works for XML too:

paphio:~/XML -> xmllint enc.xml
enc.xml:1: parser error : Unsupported encoding foo
<?xml version="1.0" encoding="foo"?>
                                  ^
paphio:~/XML -> xmllint --noenc enc.xml
<?xml version="1.0"?>
<tst/>
paphio:~/XML ->

In that case the encoding is completely dropped from the output, which
differentiates this from the case where the encoding is just passed to
the parser: there the encoding= is preserved.

This may not be a good option for you if you are stuck with a released
version, but it's better to fix libxml2 there, and as you say right now
you will have to preprocess the input...
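
For readers stuck on a release without the patch, the preprocessing
mentioned above could be sketched roughly as follows. This is only an
illustration: blank_charset_params() is a hypothetical helper, not part
of libxml2. It overwrites each "charset=<value>" run with spaces before
the buffer is handed to the parser. It is deliberately crude. It matches
anywhere in the buffer (not just inside a meta element in the head), and
it only targets the unquoted http-equiv form shown in the original
message.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of libxml2: overwrite every
 * "charset=<value>" run in buf with spaces, in place, so the buffer
 * length and all byte offsets stay unchanged. Matching is ASCII
 * case-insensitive. Returns the number of occurrences blanked.
 */
static int
blank_charset_params(char *buf, size_t len)
{
    static const char key[] = "charset=";
    const size_t klen = sizeof(key) - 1;
    int count = 0;
    size_t i = 0;

    assert(buf != NULL || len == 0);

    while (i + klen <= len) {
        size_t k = 0;
        size_t end;

        while (k < klen &&
               tolower((unsigned char) buf[i + k]) == key[k])
            k++;
        if (k < klen) {         /* no match at this offset */
            i++;
            continue;
        }

        /* Extend over the value, stopping at a quote or delimiter. */
        end = i + klen;
        while (end < len && buf[end] != '"' && buf[end] != '\'' &&
               buf[end] != ';' && buf[end] != '>' &&
               !isspace((unsigned char) buf[end]))
            end++;

        memset(buf + i, ' ', end - i);
        count++;
        i = end;
    }
    return count;
}
```

Blanking rather than deleting keeps the buffer size unchanged, so the
same array can still be passed to htmlCreatePushParserCtxt() with its
original length.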

Daniel
-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
[email protected]  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/

commit a1bc2f2ba4b5317885205d4f71c7c4b1c99ec870
Author: Daniel Veillard <[email protected]>
Date:   Mon May 16 16:03:50 2011 +0800

    Add options to ignore the internal encoding
    
    For both XML and HTML, the document can provide an encoding
    either in XMLDecl in XML, or as a meta element in HTML head.
    This adds options to ignore those encodings if the encoding
    is known in advance, for example if the content had been converted
    before being passed to the parser.
    
    * parser.c include/libxml/parser.h: add XML_PARSE_IGNORE_ENC option
      for XML parsing
    * include/libxml/HTMLparser.h HTMLparser.c: adds the
      HTML_PARSE_IGNORE_ENC for HTML parsing
    * HTMLtree.c: fix the handling of saving when an unknown encoding is
      defined in meta document header
    * xmllint.c: add a --noenc option to activate the new parser options

diff --git a/HTMLparser.c b/HTMLparser.c
index 4d43b93..1a4d80d 100644
--- a/HTMLparser.c
+++ b/HTMLparser.c
@@ -3448,7 +3448,8 @@ static void
 htmlCheckEncoding(htmlParserCtxtPtr ctxt, const xmlChar *attvalue) {
     const xmlChar *encoding;
 
-    if ((ctxt == NULL) || (attvalue == NULL))
+    if ((ctxt == NULL) || (attvalue == NULL) ||
+        (ctxt->options & HTML_PARSE_IGNORE_ENC))
        return;
 
     /* do not change encoding */
@@ -3500,7 +3501,9 @@ htmlCheckEncoding(htmlParserCtxtPtr ctxt, const xmlChar *attvalue) {
                xmlSwitchToEncoding(ctxt, handler);
                ctxt->charset = XML_CHAR_ENCODING_UTF8;
            } else {
-               ctxt->errNo = XML_ERR_UNSUPPORTED_ENCODING;
+               htmlParseErr(ctxt, XML_ERR_UNSUPPORTED_ENCODING,
+                            "htmlCheckEncoding: unknown encoding %s\n",
+                            encoding, NULL);
            }
        }
 
@@ -6537,6 +6540,10 @@ htmlCtxtUseOptions(htmlParserCtxtPtr ctxt, int options)
        ctxt->options |= HTML_PARSE_NODEFDTD;
         options -= HTML_PARSE_NODEFDTD;
     }
+    if (options & HTML_PARSE_IGNORE_ENC) {
+       ctxt->options |= HTML_PARSE_IGNORE_ENC;
+        options -= HTML_PARSE_IGNORE_ENC;
+    }
     ctxt->dictNames = 0;
     return (options);
 }
diff --git a/HTMLtree.c b/HTMLtree.c
index b508583..f23ae02 100644
--- a/HTMLtree.c
+++ b/HTMLtree.c
@@ -481,7 +481,7 @@ htmlNodeDumpFileFormat(FILE *out, xmlDocPtr doc,
        if (enc != XML_CHAR_ENCODING_UTF8) {
            handler = xmlFindCharEncodingHandler(encoding);
            if (handler == NULL)
-               return(-1);
+               htmlSaveErr(XML_SAVE_UNKNOWN_ENCODING, NULL, encoding);
        }
     }
 
@@ -562,11 +562,9 @@ htmlDocDumpMemoryFormat(xmlDocPtr cur, xmlChar**mem, int *size, int format) {
            }
 
            handler = xmlFindCharEncodingHandler(encoding);
-           if (handler == NULL) {
-               *mem = NULL;
-               *size = 0;
-               return;
-           }
+           if (handler == NULL)
+                htmlSaveErr(XML_SAVE_UNKNOWN_ENCODING, NULL, encoding);
+
        } else {
            handler = xmlFindCharEncodingHandler(encoding);
        }
@@ -587,7 +585,7 @@ htmlDocDumpMemoryFormat(xmlDocPtr cur, xmlChar**mem, int *size, int format) {
        return;
     }
 
-       htmlDocContentDumpFormatOutput(buf, cur, NULL, format);
+    htmlDocContentDumpFormatOutput(buf, cur, NULL, format);
 
     xmlOutputBufferFlush(buf);
     if (buf->conv != NULL) {
@@ -1061,7 +1059,7 @@ htmlDocDump(FILE *f, xmlDocPtr cur) {
 
            handler = xmlFindCharEncodingHandler(encoding);
            if (handler == NULL)
-               return(-1);
+               htmlSaveErr(XML_SAVE_UNKNOWN_ENCODING, NULL, encoding);
        } else {
            handler = xmlFindCharEncodingHandler(encoding);
        }
@@ -1120,7 +1118,7 @@ htmlSaveFile(const char *filename, xmlDocPtr cur) {
 
            handler = xmlFindCharEncodingHandler(encoding);
            if (handler == NULL)
-               return(-1);
+               htmlSaveErr(XML_SAVE_UNKNOWN_ENCODING, NULL, encoding);
        }
     }
 
@@ -1181,7 +1179,7 @@ htmlSaveFileFormat(const char *filename, xmlDocPtr cur,
 
            handler = xmlFindCharEncodingHandler(encoding);
            if (handler == NULL)
-               return(-1);
+               htmlSaveErr(XML_SAVE_UNKNOWN_ENCODING, NULL, encoding);
        }
         htmlSetMetaEncoding(cur, (const xmlChar *) encoding);
     } else {
diff --git a/include/libxml/HTMLparser.h b/include/libxml/HTMLparser.h
index fbcc811..10a3d65 100644
--- a/include/libxml/HTMLparser.h
+++ b/include/libxml/HTMLparser.h
@@ -184,7 +184,8 @@ typedef enum {
     HTML_PARSE_NOBLANKS        = 1<<8, /* remove blank nodes */
     HTML_PARSE_NONET   = 1<<11,/* Forbid network access */
     HTML_PARSE_NOIMPLIED= 1<<13,/* Do not add implied html/body... elements */
-    HTML_PARSE_COMPACT  = 1<<16 /* compact small text nodes */
+    HTML_PARSE_COMPACT  = 1<<16,/* compact small text nodes */
+    HTML_PARSE_IGNORE_ENC=1<<21 /* ignore internal document encoding hint */
 } htmlParserOption;
 
 XMLPUBFUN void XMLCALL
diff --git a/include/libxml/parser.h b/include/libxml/parser.h
index 47b3df1..aabb96c 100644
--- a/include/libxml/parser.h
+++ b/include/libxml/parser.h
@@ -1105,8 +1105,9 @@ typedef enum {
                                   crash if you try to modify the tree) */
     XML_PARSE_OLD10    = 1<<17,/* parse using XML-1.0 before update 5 */
     XML_PARSE_NOBASEFIX = 1<<18,/* do not fixup XINCLUDE xml:base uris */
-    XML_PARSE_HUGE      = 1<<19, /* relax any hardcoded limit from the parser */
-    XML_PARSE_OLDSAX    = 1<<20 /* parse using SAX2 interface from before 2.7.0 */
+    XML_PARSE_HUGE      = 1<<19,/* relax any hardcoded limit from the parser */
+    XML_PARSE_OLDSAX    = 1<<20,/* parse using SAX2 interface before 2.7.0 */
+    XML_PARSE_IGNORE_ENC= 1<<21 /* ignore internal document encoding hint */
 } xmlParserOption;
 
 XMLPUBFUN void XMLCALL
diff --git a/parser.c b/parser.c
index 9ab8641..02a1877 100644
--- a/parser.c
+++ b/parser.c
@@ -9922,6 +9922,13 @@ xmlParseEncodingDecl(xmlParserCtxtPtr ctxt) {
        } else {
            xmlFatalErr(ctxt, XML_ERR_STRING_NOT_STARTED, NULL);
        }
+
+        /*
+         * Non standard parsing, allowing the user to ignore encoding
+         */
+        if (ctxt->options & XML_PARSE_IGNORE_ENC)
+            return(encoding);
+
        /*
         * UTF-16 encoding stwich has already taken place at this stage,
         * more over the little-endian/big-endian selection is already done
@@ -14561,6 +14568,10 @@ xmlCtxtUseOptionsInternal(xmlParserCtxtPtr ctxt, int options, const char *encodi
        ctxt->options |= XML_PARSE_OLDSAX;
         options -= XML_PARSE_OLDSAX;
     }
+    if (options & XML_PARSE_IGNORE_ENC) {
+       ctxt->options |= XML_PARSE_IGNORE_ENC;
+        options -= XML_PARSE_IGNORE_ENC;
+    }
     ctxt->linenumbers = 1;
     return (options);
 }
diff --git a/xmllint.c b/xmllint.c
index b7af32f..745330d 100644
--- a/xmllint.c
+++ b/xmllint.c
@@ -130,6 +130,7 @@ static int copy = 0;
 #endif /* LIBXML_TREE_ENABLED */
 static int recovery = 0;
 static int noent = 0;
+static int noenc = 0;
 static int noblanks = 0;
 static int noout = 0;
 static int nowrap = 0;
@@ -2975,6 +2976,7 @@ static void usage(const char *name) {
     printf("\t--recover : output what was parsable on broken XML documents\n");
     printf("\t--huge : remove any internal arbitrary parser limits\n");
     printf("\t--noent : substitute entity references by their value\n");
+    printf("\t--noenc : ignore any encoding specified inside the document\n");
     printf("\t--noout : don't output the result tree\n");
     printf("\t--path 'paths': provide a set of paths for resources\n");
     printf("\t--load-trace : print trace of all external entites loaded\n");
@@ -3129,6 +3131,10 @@ main(int argc, char **argv) {
                 (!strcmp(argv[i], "--noent"))) {
            noent++;
            options |= XML_PARSE_NOENT;
+       } else if ((!strcmp(argv[i], "-noenc")) ||
+                (!strcmp(argv[i], "--noenc"))) {
+           noenc++;
+           options |= XML_PARSE_IGNORE_ENC;
        } else if ((!strcmp(argv[i], "-nsclean")) ||
                 (!strcmp(argv[i], "--nsclean"))) {
            options |= XML_PARSE_NSCLEAN;
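
The htmlCtxtUseOptions() and xmlCtxtUseOptionsInternal() hunks above
follow the same pattern: each recognized flag is copied into
ctxt->options and subtracted from the requested mask, and whatever is
left is returned to the caller. The htmlCheckEncoding() hunk then tests
the stored flag and returns early. A minimal standalone sketch of that
pattern, using hypothetical demo_* names rather than the real libxml2
types (only the bit positions are taken from the patch), might be:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; the bit positions mirror the patch. */
enum {
    DEMO_PARSE_COMPACT    = 1 << 16,
    DEMO_PARSE_IGNORE_ENC = 1 << 21
};

struct demo_ctxt {
    int options;
};

/*
 * Store each recognized flag in the context and remove it from the
 * requested mask. A nonzero return value holds the unrecognized bits.
 */
static int
demo_use_options(struct demo_ctxt *ctxt, int options)
{
    assert(ctxt != NULL);

    if (options & DEMO_PARSE_COMPACT) {
        ctxt->options |= DEMO_PARSE_COMPACT;
        options -= DEMO_PARSE_COMPACT;
    }
    if (options & DEMO_PARSE_IGNORE_ENC) {
        ctxt->options |= DEMO_PARSE_IGNORE_ENC;
        options -= DEMO_PARSE_IGNORE_ENC;
    }
    return options;
}

/*
 * Same guard shape as the patched htmlCheckEncoding(): returns 1 if a
 * declared encoding should be examined, 0 if it must be skipped.
 */
static int
demo_should_check_encoding(const struct demo_ctxt *ctxt,
                           const char *declared)
{
    if (ctxt == NULL || declared == NULL ||
        (ctxt->options & DEMO_PARSE_IGNORE_ENC))
        return 0;
    return 1;
}
```

Against a libxml2 build that contains this commit, the corresponding
real call for the push parser case in the original question would be
htmlCtxtUseOptions(ctxt, HTML_PARSE_IGNORE_ENC) on the context returned
by htmlCreatePushParserCtxt(), before feeding any chunks.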

_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
[email protected]
http://mail.gnome.org/mailman/listinfo/xml
