[xml] [PATCH] Encoding related issues

Olli Pottonen Tue, 07 Jul 2015 18:14:05 -0700

Hi,

I was playing around with lxml and noticed that it sometimes fails to decode
UTF-16. So I investigated and experimented and ended up writing patches for
more than just the UTF-16 problem.



First problem: a missing space after '<!DOCTYPE' is a fatal error
which should be reported, but it is not.

Example 1:

  static const char content[] =
    "<!DOCTYPEroot><root/>";
  int length = sizeof(content);

  xmlDocPtr doc;
  doc = xmlReadMemory(content, length, "noname.xml", NULL, 0);

  if (doc != NULL)
    fprintf(stdout, "Ex 1: failure; accepted an invalid document.\n");
  else
    fprintf(stdout, "Ex 1: success; rejected an invalid document.\n");

You will find a solution in attached patch 1.
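
As a rough standalone illustration of the rule (not the code in the
patch; the helper names here are made up for this example), the check
amounts to requiring one XML whitespace character right after the
nine-character keyword:

```c
#include <stddef.h>
#include <string.h>

/* XML 1.0 production S: space, tab, carriage return, line feed. */
static int is_xml_space(unsigned char c)
{
    return c == 0x20 || c == 0x09 || c == 0x0D || c == 0x0A;
}

/*
 * Hypothetical helper: returns 1 if `s` (of length `len`) starts with
 * "<!DOCTYPE" followed by at least one whitespace character, else 0.
 */
static int doctype_has_required_space(const char *s, size_t len)
{
    static const char kw[] = "<!DOCTYPE";
    const size_t kwlen = sizeof(kw) - 1;  /* 9 characters */

    if (s == NULL || len <= kwlen || memcmp(s, kw, kwlen) != 0)
        return 0;
    return is_xml_space((unsigned char) s[kwlen]);
}
```

With this sketch, "<!DOCTYPE root>" passes and the "<!DOCTYPEroot>" of
Example 1 does not.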



Section 4.3.3 of the XML 1.0 standard states that XML processors must
be able to read entities in UTF-16. Section 3.10 of the Unicode
standard specifies how UTF-16 is read: the serialisation order (little
endian vs. big endian) is detected from the leading byte order mark
(BOM). However, libxml2 fails to read the mark and assumes that UTF-16
is always little endian.

The Unicode standard is available at
http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf#page=43

Example 2:

  // UTF-16 (big endian) encoded '<root/>'
  static const char content[] =
    "\xfe\xff\000<\000r\000o\000o\000t\000/\000>";
  int length = sizeof(content);

  xmlDocPtr doc;
  doc = xmlReadMemory(content, length, "noname.xml", "UTF-16", 0);
  if (doc == NULL)
    fprintf(stdout, "Ex 2: failure; failed to parse valid document.\n");
  else
    fprintf(stdout, "Ex 2: success; parsed valid document.\n");

The funny thing is that libxml2 fails to parse this document when the
encoding, UTF-16, is correctly specified, but if the encoding argument
is NULL, the encoding is detected and the document parsed correctly.

Attached patch 2 fixes UTF-16 decoding.
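
For reference, the BOM rule itself is small. The following is only a
standalone sketch of it, not the patch; the enum and function names are
invented for this example and are not libxml2 identifiers:

```c
#include <stddef.h>

/* Hypothetical result codes for this sketch. */
typedef enum {
    UTF16_BOM_NONE = 0,  /* no byte order mark found */
    UTF16_BOM_LE,        /* FF FE: little endian */
    UTF16_BOM_BE         /* FE FF: big endian */
} utf16_bom;

/* Inspect the first two bytes of a buffer for a UTF-16 BOM. */
static utf16_bom utf16_detect_bom(const unsigned char *buf, size_t len)
{
    if (buf == NULL || len < 2)
        return UTF16_BOM_NONE;
    if (buf[0] == 0xFF && buf[1] == 0xFE)
        return UTF16_BOM_LE;
    if (buf[0] == 0xFE && buf[1] == 0xFF)
        return UTF16_BOM_BE;
    return UTF16_BOM_NONE;
}
```

The content of Example 2 starts with FE FF, so a decoder that follows
this rule would pick big endian instead of assuming little endian.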



Next, not really a bug but a missing feature. UTF-32 can be
autodetected based on a byte order marker, but libxml2 does not do
that. Solution in patch 3.
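
A minimal sketch of the UTF-32 case, again with made-up names and
independent of the actual patch:

```c
#include <stddef.h>

/* Hypothetical result codes for this sketch. */
typedef enum {
    UTF32_BOM_NONE = 0,
    UTF32_BOM_LE,   /* FF FE 00 00 */
    UTF32_BOM_BE    /* 00 00 FE FF */
} utf32_bom;

/* Inspect the first four bytes of a buffer for a UTF-32 BOM. */
static utf32_bom utf32_detect_bom(const unsigned char *buf, size_t len)
{
    if (buf == NULL || len < 4)
        return UTF32_BOM_NONE;
    if (buf[0] == 0xFF && buf[1] == 0xFE && buf[2] == 0x00 && buf[3] == 0x00)
        return UTF32_BOM_LE;
    if (buf[0] == 0x00 && buf[1] == 0x00 && buf[2] == 0xFE && buf[3] == 0xFF)
        return UTF32_BOM_BE;
    return UTF32_BOM_NONE;
}
```

A check like this has to run before any two-byte UTF-16 test, because
FF FE alone also matches the little endian UTF-16 BOM. Since U+0000 is
not a legal XML character, FF FE 00 00 cannot be the start of a valid
UTF-16 document, so giving UTF-32 priority should be safe.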



When parsing an encoding declaration or a text declaration, encoding
variables are taken to mean something they do not mean, which causes
some problems.

First, assume the options XML_PARSE_IGNORE_ENC | XML_PARSE_DTDLOAD are
used. Let the document reference an external subset with a text
declaration, e.g. <?xml encoding="ascii"?>. Then we get the error
"Missing encoding in text declaration". This sort of makes sense (if
the existence of a declaration is ignored, it seems to be missing) but
is probably not correct.

Example 3:

Let there be file ext.xml with content "<?xml encoding='ascii'?>".

  static const char content[] =
    "<!DOCTYPE root SYSTEM 'ext.xml'><root/>";
  int length = sizeof(content);

  xmlDocPtr doc;
  doc = xmlReadMemory(content, length, "noname.xml", NULL,
                      XML_PARSE_IGNORE_ENC | XML_PARSE_DTDLOAD);
  if (doc == NULL)
    fprintf(stdout, "Ex 3: failure; failed to parse valid document.\n");
  else
    fprintf(stdout, "Ex 3: success; parsed valid document.\n");

Also, if the encoding declaration is ignored (either because the
declaration does not matter, or because of the XML_PARSE_IGNORE_ENC
option), missing whitespace after it is not detected.

Example 4:

  // whitespace missing after 'UTF-8'
  static const char content[] =
    "<?xml version='1.0' encoding='UTF-8'standalone='yes'?><root/>";
  int length = sizeof(content);

  xmlDocPtr doc;
  doc = xmlReadMemory(content, length, "noname.xml", NULL, 0);
  if (doc != NULL)
    fprintf(stdout, "Ex 4: failure; accepted an invalid document.\n");
  else
    fprintf(stdout, "Ex 4: success; rejected an invalid document.\n");

Patch 4 addresses this bug.



The XML standard states that, in the absence of an external encoding
declaration and a BOM, it is a fatal error for a document not to be in
UTF-8. This is not reported as it should be.

Example 5:

  // UTF-16BE (no BOM) encoded '<?xml version="1.0"?><root/>'
  static const char content[] =
    "\x00<\x00?\x00x\x00m\x00l\x00 \x00v\x00"
    "e\x00r\x00s\x00i\x00o\x00n\x00=\x00\"\x00" "1\x00.\x00"
    "0\x00\"\x00?\x00>\x00<\x00r\x00o\x00o\x00t\x00/\x00>";
  int length = sizeof(content);

  xmlDocPtr doc;
  doc = xmlReadMemory(content, length, "noname.xml", NULL, 0);
  if (doc != NULL)
    fprintf(stdout, "Ex 5: failure; accepted an invalid document.\n");
  else
    fprintf(stdout, "Ex 5: success; rejected an invalid document.\n");

The standard also states that, in the absence of an external encoding
declaration, it is a fatal error for the XML declaration to claim that
the document is in an encoding which it does not actually use. In
several cases this error is ignored.

Example 6. Document is in UTF-16 but claims to be in UTF-8.

  // UTF-16 encoded '<?xml version='1.0' encoding='utf-8'?><root />'
  static const char content[] =
    "\xff\xfe<\x00?\x00x\x00m\x00l\x00 \x00v\x00"
    "e\x00r\x00s\x00i\x00o\x00n\x00=\x00'\x00" "1\x00.\x00" "0\x00'\x00 \x00"
    "e\x00n\x00" "c\x00o\x00" "d\x00i\x00n\x00g\x00=\x00'\x00u\x00t\x00"
    "f\x00-\x00" "8\x00'\x00?\x00>\x00<\x00r\x00o\x00o\x00t\x00 \x00/\x00>\x00";
  int length = sizeof(content);

  xmlDocPtr doc;
  doc = xmlReadMemory(content, length, "noname.xml", NULL, 0);
  if (doc != NULL)
    fprintf(stdout, "Ex 6: failure; accepted an invalid document.\n");
  else
    fprintf(stdout, "Ex 6: success; rejected an invalid document.\n");


Example 7. Document is in little endian UTF-16 (i.e., has BOM) but
incorrectly claims to be in UTF-16LE (i.e., claims to not have a
BOM). (Alternative interpretation: document is in UTF-16LE as it
claims, but starts with U+FEFF character. Fatal error nonetheless.)


  // UTF-16 (with BOM) '<?xml version='1.0' encoding='utf-16le'?><root/>'
  static const char content[] =
    "\xff\xfe<\x00?\x00x\x00m\x00l\x00 \x00v\x00"
    "e\x00r\x00s\x00i\x00o\x00n\x00=\x00'\x00" "1\x00.\x00" "0\x00'\x00 \x00"
    "e\x00n\x00" "c\x00o\x00" "d\x00i\x00n\x00g\x00=\x00'\x00u\x00t\x00"
    "f\x00-\x00" "1\x00" "6\x00l\x00"
    "e\x00'\x00?\x00>\x00<\x00r\x00o\x00o\x00t\x00/\x00>\x00";
  int length = sizeof(content);
  xmlDocPtr doc;
  doc = xmlReadMemory(content, length, "noname.xml", NULL, 0);
  if (doc != NULL)
    fprintf(stdout, "Ex 7: failure; accepted an invalid document.\n");
  else
    fprintf(stdout, "Ex 7: success; rejected an invalid document.\n");

Example 8. The document may look like valid ascii, but because of the
byte order mark at the very beginning, it is not:

  static const char content[] =
    "\xef\xbb\xbf<?xml version='1.0' encoding='ascii'?><root/>";
  int length = sizeof(content);

  xmlDocPtr doc;
  doc = xmlReadMemory(content, length, "noname.xml", NULL, 0);
  if (doc != NULL)
    fprintf(stdout, "Ex 8: failure; accepted an invalid document.\n");
  else
    fprintf(stdout, "Ex 8: success; rejected an invalid document.\n");

Example 9. Change encoding on the fly, ascii -> utf-32.

  static const char content[] =
    "<?xml version='1.0' encoding='utf-32'"
    "\x00\x00\x00?\x00\x00\x00>\x00\x00\x00<\x00\x00\x00r\x00\x00\x00o"
    "\x00\x00\x00o\x00\x00\x00t\x00\x00\x00/\x00\x00\x00>";
  int length = sizeof(content);

  xmlDocPtr doc;
  doc = xmlReadMemory(content, length, "noname.xml", NULL, 0);
  if (doc != NULL)
    fprintf(stdout, "Ex 9: failure; accepted an invalid document.\n");
  else
    fprintf(stdout, "Ex 9: success; rejected an invalid document.\n");

Example 10. Change encoding on the fly, ascii -> cp424 (EBCDIC).

  static const char content[] =
    "<?xml version='1.0' encoding='cp424'onL\x99\x96\x96\xa3\x61n";
  int length = sizeof(content);

  xmlDocPtr doc;
  doc = xmlReadMemory(content, length, "noname.xml", NULL, 0);
  if (doc != NULL)
    fprintf(stdout, "Ex 10: failure; accepted an invalid document.\n");
  else
    fprintf(stdout, "Ex 10: success; rejected an invalid document.\n");
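
To see what the tail of Example 10 says, the bytes after the
declaration can be decoded by hand. The sketch below maps only the
handful of EBCDIC code points that appear in the example. It assumes
the usual EBCDIC positions for these characters (as in code page 037,
which cp424 is assumed to share for them), and the function names are
made up for illustration:

```c
#include <stddef.h>

/* Map the few EBCDIC bytes used in Example 10 to ASCII; '.' otherwise. */
static char ebcdic_subset_to_ascii(unsigned char c)
{
    switch (c) {
    case 0x4C: return '<';
    case 0x61: return '/';
    case 0x6E: return '>';
    case 0x6F: return '?';
    case 0x96: return 'o';
    case 0x99: return 'r';
    case 0xA3: return 't';
    default:   return '.';
    }
}

/* Decode `len` bytes into `out`, which must hold len + 1 chars. */
static void decode_ebcdic_subset(const unsigned char *in, size_t len,
                                 char *out)
{
    size_t i;

    for (i = 0; i < len; i++)
        out[i] = ebcdic_subset_to_ascii(in[i]);
    out[len] = '\0';
}
```

Under that assumption, the bytes 'o' 'n' 'L' 99 96 96 A3 61 'n' read as
"?><root/>", i.e. the rest of the declaration and the root element.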

Example 11. Here we have a surrogate pair which is valid in UTF-16 but
invalid in UCS-2.

  // UTF-16 encoded '<?xml version="1.0" encoding="UCS-2"?><U+10000/>'
  static const char content[] =
    "\xff\xfe<\x00?\x00x\x00m\x00l\x00 \x00v\x00"
    "e\x00r\x00s\x00i\x00o\x00n\x00=\x00\"\x00" "1\x00.\x00" "0\x00\"\x00 \x00"
    "e\x00n\x00" "c\x00o\x00" "d\x00i\x00n\x00g\x00=\x00\"\x00U\x00"
    "C\x00S\x00-\x00" "2\x00\"\x00?\x00>\x00<\x00\x00\xd8\x00\xdc/\x00>\x00";
  int length = sizeof(content);

  xmlDocPtr doc;
  doc = xmlReadMemory(content, length, "noname.xml", NULL, 0);
  if (doc != NULL)
    fprintf(stdout, "Ex 11: failure; accepted an invalid document.\n");
  else
    fprintf(stdout, "Ex 11: success; rejected an invalid document.\n");

Patch 5 addresses these problems.
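
To illustrate the UCS-2 case of Example 11: UCS-2 has no surrogate
mechanism, so any 16-bit unit in the range D800..DFFF is an error
there, while UTF-16 combines a high and a low surrogate into one code
point. A standalone sketch of both (hypothetical helper names, not
taken from the patch):

```c
#include <stddef.h>

/*
 * Returns 1 if the little endian 16-bit units in `buf` include any
 * value in the surrogate range D800..DFFF, else 0.
 */
static int utf16le_has_surrogate(const unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i + 1 < len; i += 2) {
        unsigned int unit = buf[i] | ((unsigned int) buf[i + 1] << 8);
        if (unit >= 0xD800 && unit <= 0xDFFF)
            return 1;
    }
    return 0;
}

/*
 * Combine a high surrogate (D800..DBFF) and a low surrogate
 * (DC00..DFFF) into the code point they encode in UTF-16.
 * The caller is expected to have checked the ranges.
 */
static unsigned long utf16_combine_surrogates(unsigned int hi,
                                              unsigned int lo)
{
    return 0x10000UL + ((unsigned long) (hi - 0xD800) << 10)
                     + (unsigned long) (lo - 0xDC00);
}
```

The bytes 00 D8 00 DC in Example 11 are the units D800 DC00, which
combine to U+10000 in UTF-16 but are not valid UCS-2.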



There is a small glitch of not growing the input buffer at the right
time. This sometimes leads to errors; parsing the perfectly valid
document in the next example fails.

Example 12:

  // UTF-16LE encoded '<?xml  version = "1.0" encoding = "utf-16le"?><root/>'
  static const char content[] =
    "<\x00?\x00x\x00m\x00l\x00 \x00 \x00v\x00"
    "e\x00r\x00s\x00i\x00o\x00n\x00 \x00=\x00 \x00\"\x00"
    "1\x00.\x00" "0\x00\"\x00 \x00" "e\x00n\x00" "c\x00o\x00"
    "d\x00i\x00n\x00g\x00 \x00=\x00 \x00\"\x00u\x00t\x00"
    "f\x00-\x00" "1\x00" "6\x00l\x00"
    "e\x00\"\x00?\x00>\x00<\x00r\x00o\x00o\x00t\x00/\x00>\x00";
  int length = sizeof(content);

  xmlDocPtr doc;
  doc = xmlReadMemory(content, length, "noname.xml", NULL, 0);
  if (doc == NULL)
    fprintf(stdout, "Ex 12: failure; failed to parse valid document.\n");
  else
    fprintf(stdout, "Ex 12: success; parsed valid document.\n");

Patch 6 sets this right.



The XML standard allows an empty external entity, and an external
entity may start with a BOM. However, a BOM in an otherwise empty
external entity confuses libxml2, which assumes that a BOM may only
occur in a string of at least 4 bytes. A UTF-16 BOM causes parsing to
fail, and a UTF-8 BOM is interpreted as a #xFEFF character.

Patch 7 fixes these bugs, and also simplifies the code by avoiding
unnecessary copying of data.
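
The point about short inputs can be sketched as a length-aware BOM
check that never requires more bytes than the BOM itself. This is only
an illustration with an invented function name, not the code of the
patch:

```c
#include <stddef.h>

/*
 * Returns the number of leading bytes of `buf` that form a UTF-8 or
 * UTF-16 byte order mark (3 or 2), or 0 if there is none. Works for
 * buffers shorter than 4 bytes, e.g. an entity that is only a BOM.
 */
static size_t bom_length(const unsigned char *buf, size_t len)
{
    if (buf == NULL)
        return 0;
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return 3;
    if (len >= 2 &&
        ((buf[0] == 0xFF && buf[1] == 0xFE) ||
         (buf[0] == 0xFE && buf[1] == 0xFF)))
        return 2;
    return 0;
}
```

For an external entity consisting of nothing but EF BB BF, this returns
3, which equals the entity length, so the remaining content is empty
rather than a stray #xFEFF character.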


Patch 8 simplifies some unnecessarily complicated encoding processing
in HTMLparser.c and some minor things elsewhere.


Patch 9 implements the HTML 5 encoding detection algorithm, which is
more extensive and robust than the current encoding sniffing algorithm
in HTMLparser.c. For example, it ignores commented-out declarations.
In any case, it is not used by default, only when options instruct the
parser to do so.



I can't say I really like the new code. It is convoluted and repeats itself.
However, without breaking backwards compatibility I could do no better.

I'd welcome feedback especially about these questions:

Exactly when should we use ctxt->encoding and when ctxt->input->encoding?

xmlSwitchEncoding() in parserInternals.c assumes that input in
UTF-16LE or UTF-16BE might contain a UTF-8 BOM, "As we expect this
function to be called after xmlCharEncInFunc". Why? xmlCharEncInFunc()
seems never to be called. Also, if the input has already been decoded
by xmlCharEncInFunc() (why else would the BOM be in UTF-8?), can
xmlSwitchEncoding() just set out to decode it again, as it does?
Overall this just seems wrong, but there may be something I missed.

Does UCS-2 have different encoding schemes, with/without BOM, like UTF-16?
How about UCS-4?

The XML standard says that UTF-16 must have a BOM. Should a missing BOM
be an XML_ERR_WARNING, an XML_ERR_ERROR, or something else?



Attached you will find the patches mentioned above, all the code examples
given above, and an example XML file associated with one example.


Regards
 Olli Pottonen

bugdemo.c
Description: Binary data

ext.xml
Description: XML document

commit dd5550b9df5da33401997386d2d3e6af3bb6f3c2
Author: Olli Pottonen <olli.potto...@iki.fi>
Date:   Tue Jul 7 23:26:27 2015 +1000

    Require whitespace after '<!DOCTYPE'

diff --git a/parser.c b/parser.c
index fe603ac..d8c3ee3 100644
--- a/parser.c
+++ b/parser.c
@@ -8326,6 +8326,10 @@ xmlParseDocTypeDecl(xmlParserCtxtPtr ctxt) {
      * We know that '<!DOCTYPE' has been detected.
      */
     SKIP(9);
+    if (!IS_BLANK_CH(CUR)) {
+      xmlFatalErrMsg(ctxt, XML_ERR_SPACE_REQUIRED,
+                    "Space required after '<!DOCTYPE'\n");
+    }
 
     SKIP_BLANKS;

commit 7733865c3e942f6ec171b43d7202ed763e689a09
Author: Olli Pottonen <olli.potto...@iki.fi>
Date:   Tue Jul 7 23:26:59 2015 +1000

    Properly distinguish between UTF-16, UTF-16LE, UTF-16BE.
    
    UTF-16, UTF-16LE and UTF-16BE are three different schemes. UTF-16LE
    and UTF-16BE use little endian and big endian byte order,
    respectively, and do not have a byte order mark (BOM). UTF-16 is
    either little endian or big endian, as indicated by BOM, which must be
    present. Fix mixup over UTF-16 and UTF-16LE.
    
    Same goes for UTF-32 as well.
    
    See http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf#page=43.

diff --git a/encoding.c b/encoding.c
index 574e1ae..6807306 100644
--- a/encoding.c
+++ b/encoding.c
@@ -50,6 +50,7 @@
 
 static xmlCharEncodingHandlerPtr xmlUTF16LEHandler = NULL;
 static xmlCharEncodingHandlerPtr xmlUTF16BEHandler = NULL;
+static xmlCharEncodingHandlerPtr xmlUTF16Handler = NULL;
 
 typedef struct _xmlCharEncodingAlias xmlCharEncodingAlias;
 typedef xmlCharEncodingAlias *xmlCharEncodingAliasPtr;
@@ -911,7 +912,7 @@ UTF8ToUTF16BE(unsigned char* outb, int *outlen,
 /**
  * xmlDetectCharEncoding:
  * @in:  a pointer to the first bytes of the XML entity, must be at least
- *       2 bytes long (at least 4 if encoding is UTF4 variant).
+ *       2 bytes long (at least 4 if encoding is UTF32 variant).
  * @len:  pointer to the length of the buffer
  *
  * Guess the encoding of the entity using the first bytes of the entity content
@@ -964,12 +965,15 @@ xmlDetectCharEncoding(const unsigned char* in, int len)
            (in[2] == 0xBF))
            return(XML_CHAR_ENCODING_UTF8);
     }
-    /* For UTF-16 we can recognize by the BOM */
+    /* For UTF-16 we can recognize by the BOM. For backwards
+     * compatibility, return the wrong value; if there is BOM, the
+     * encoding is UTF-16, not UTF-16LE nor UTF-16BE.
+     */
     if (len >= 2) {
-       if ((in[0] == 0xFE) && (in[1] == 0xFF))
-           return(XML_CHAR_ENCODING_UTF16BE);
-       if ((in[0] == 0xFF) && (in[1] == 0xFE))
-           return(XML_CHAR_ENCODING_UTF16LE);
+       if ((in[0] == 0xFE) && (in[1] == 0xFF))
+          return(XML_CHAR_ENCODING_UTF16BE);
+       if ((in[0] == 0xFF) && (in[1] == 0xFE))
+          return(XML_CHAR_ENCODING_UTF16LE);
     }
     return(XML_CHAR_ENCODING_NONE);
 }
@@ -1164,26 +1168,15 @@ xmlParseCharEncoding(const char* name)
     if (!strcmp(upper, "UTF-8")) return(XML_CHAR_ENCODING_UTF8);
     if (!strcmp(upper, "UTF8")) return(XML_CHAR_ENCODING_UTF8);
 
-    /*
-     * NOTE: if we were able to parse this, the endianness of UTF16 is
-     *       already found and in use
-     */
-    if (!strcmp(upper, "UTF-16")) return(XML_CHAR_ENCODING_UTF16LE);
-    if (!strcmp(upper, "UTF16")) return(XML_CHAR_ENCODING_UTF16LE);
+    if (!strcmp(upper, "UTF-16LE")) return(XML_CHAR_ENCODING_UTF16LE);
+    if (!strcmp(upper, "UTF16LE")) return(XML_CHAR_ENCODING_UTF16LE);
+    if (!strcmp(upper, "UTF-16BE")) return(XML_CHAR_ENCODING_UTF16BE);
+    if (!strcmp(upper, "UTF16BE")) return(XML_CHAR_ENCODING_UTF16BE);
 
     if (!strcmp(upper, "ISO-10646-UCS-2")) return(XML_CHAR_ENCODING_UCS2);
     if (!strcmp(upper, "UCS-2")) return(XML_CHAR_ENCODING_UCS2);
     if (!strcmp(upper, "UCS2")) return(XML_CHAR_ENCODING_UCS2);
 
-    /*
-     * NOTE: if we were able to parse this, the endianness of UCS4 is
-     *       already found and in use
-     */
-    if (!strcmp(upper, "ISO-10646-UCS-4")) return(XML_CHAR_ENCODING_UCS4LE);
-    if (!strcmp(upper, "UCS-4")) return(XML_CHAR_ENCODING_UCS4LE);
-    if (!strcmp(upper, "UCS4")) return(XML_CHAR_ENCODING_UCS4LE);
-
-
     if (!strcmp(upper,  "ISO-8859-1")) return(XML_CHAR_ENCODING_8859_1);
     if (!strcmp(upper,  "ISO-LATIN-1")) return(XML_CHAR_ENCODING_8859_1);
     if (!strcmp(upper,  "ISO LATIN 1")) return(XML_CHAR_ENCODING_8859_1);
@@ -1231,19 +1224,19 @@ xmlGetCharEncodingName(xmlCharEncoding enc) {
         case XML_CHAR_ENCODING_UTF8:
            return("UTF-8");
         case XML_CHAR_ENCODING_UTF16LE:
-           return("UTF-16");
+           return("UTF-16LE");
         case XML_CHAR_ENCODING_UTF16BE:
-           return("UTF-16");
+           return("UTF-16BE");
         case XML_CHAR_ENCODING_EBCDIC:
             return("EBCDIC");
         case XML_CHAR_ENCODING_UCS4LE:
-            return("ISO-10646-UCS-4");
+            return("ISO-10646-UCS-4LE");
         case XML_CHAR_ENCODING_UCS4BE:
-            return("ISO-10646-UCS-4");
+            return("ISO-10646-UCS-4BE");
         case XML_CHAR_ENCODING_UCS4_2143:
-            return("ISO-10646-UCS-4");
+            return("ISO-10646-UCS-4-2143");
         case XML_CHAR_ENCODING_UCS4_3412:
-            return("ISO-10646-UCS-4");
+            return("ISO-10646-UCS-4-3412");
         case XML_CHAR_ENCODING_UCS2:
             return("ISO-10646-UCS-2");
         case XML_CHAR_ENCODING_8859_1:
@@ -1411,7 +1404,10 @@ xmlInitCharEncodingHandlers(void) {
           xmlNewCharEncodingHandler("UTF-16LE", UTF16LEToUTF8, UTF8ToUTF16LE);
     xmlUTF16BEHandler =
           xmlNewCharEncodingHandler("UTF-16BE", UTF16BEToUTF8, UTF8ToUTF16BE);
-    xmlNewCharEncodingHandler("UTF-16", UTF16LEToUTF8, UTF8ToUTF16);
+    // There is no decoder for UTF-16; either UTF16BEToUTF8 or
+    // UTF16LEToUTF8, is used, depending on Byte Order Mark.
+    xmlUTF16Handler =
+          xmlNewCharEncodingHandler("UTF-16", NULL, UTF8ToUTF16);
     xmlNewCharEncodingHandler("ISO-8859-1", isolat1ToUTF8, UTF8Toisolat1);
     xmlNewCharEncodingHandler("ASCII", asciiToUTF8, UTF8Toascii);
     xmlNewCharEncodingHandler("US-ASCII", asciiToUTF8, UTF8Toascii);
@@ -1423,7 +1419,8 @@ xmlInitCharEncodingHandlers(void) {
           xmlNewCharEncodingHandler("UTF-16LE", UTF16LEToUTF8, NULL);
     xmlUTF16BEHandler =
           xmlNewCharEncodingHandler("UTF-16BE", UTF16BEToUTF8, NULL);
-    xmlNewCharEncodingHandler("UTF-16", UTF16LEToUTF8, NULL);
+    xmlUTF16Handler =
+          xmlNewCharEncodingHandler("UTF-16", NULL, UTF8ToUTF16);
     xmlNewCharEncodingHandler("ISO-8859-1", isolat1ToUTF8, NULL);
     xmlNewCharEncodingHandler("ASCII", asciiToUTF8, NULL);
     xmlNewCharEncodingHandler("US-ASCII", asciiToUTF8, NULL);
@@ -1434,6 +1431,10 @@ xmlInitCharEncodingHandlers(void) {
 #endif
 #endif
 
+    xmlAddEncodingAlias("UTF-16", "UTF16");
+    xmlAddEncodingAlias("UTF-32", "UTF32");
+    xmlAddEncodingAlias("UCS-4", "ISO-10646-UCS-4");
+    xmlAddEncodingAlias("UCS4", "ISO-10646-UCS-4");
 }
 
 /**
@@ -1520,20 +1521,25 @@ xmlGetCharEncodingHandler(xmlCharEncoding enc) {
             handler = xmlFindCharEncodingHandler("IBM-037");
             if (handler != NULL) return(handler);
            break;
-        case XML_CHAR_ENCODING_UCS4BE:
-            handler = xmlFindCharEncodingHandler("ISO-10646-UCS-4");
+
+        case XML_CHAR_ENCODING_UCS4LE:
+            handler = xmlFindCharEncodingHandler("ISO-10646-UCS-4LE");
             if (handler != NULL) return(handler);
-            handler = xmlFindCharEncodingHandler("UCS-4");
+            handler = xmlFindCharEncodingHandler("UCS-4LE");
             if (handler != NULL) return(handler);
-            handler = xmlFindCharEncodingHandler("UCS4");
+            handler = xmlFindCharEncodingHandler("UCS4LE");
+            if (handler != NULL) return(handler);
+            handler = xmlFindCharEncodingHandler("UTF-32LE");
             if (handler != NULL) return(handler);
            break;
-        case XML_CHAR_ENCODING_UCS4LE:
-            handler = xmlFindCharEncodingHandler("ISO-10646-UCS-4");
+        case XML_CHAR_ENCODING_UCS4BE:
+            handler = xmlFindCharEncodingHandler("ISO-10646-UCS-4BE");
+            if (handler != NULL) return(handler);
+            handler = xmlFindCharEncodingHandler("UCS-4BE");
             if (handler != NULL) return(handler);
-            handler = xmlFindCharEncodingHandler("UCS-4");
+            handler = xmlFindCharEncodingHandler("UCS4BE");
             if (handler != NULL) return(handler);
-            handler = xmlFindCharEncodingHandler("UCS4");
+            handler = xmlFindCharEncodingHandler("UTF-32BE");
             if (handler != NULL) return(handler);
            break;
         case XML_CHAR_ENCODING_UCS4_2143:
diff --git a/include/libxml/encoding.h b/include/libxml/encoding.h
index 7967cc6..cd25b2f 100644
--- a/include/libxml/encoding.h
+++ b/include/libxml/encoding.h
@@ -39,7 +39,7 @@ extern "C" {
  *
  * Predefined values for some standard encodings.
  * Libxml does not do beforehand translation on UTF8 and ISOLatinX.
- * It also supports ASCII, ISO-8859-1, and UTF16 (LE and BE) by default.
+ * It also supports ASCII, ISO-8859-1, and all variants of UTF16 by default.
  *
  * Anything else would have to be translated to UTF8 before being
  * given to the parser itself. The BOM for UTF16 and the encoding
@@ -52,8 +52,12 @@ extern "C" {
  * to be sure to enable iconv and to provide iconv libs for the encoding
  * support needed.
  *
- * Note that the generic "UTF-16" is not a predefined value.  Instead, only
- * the specific UTF-16LE and UTF-16BE are present.
+ * Note that UTF-16, UTF-16LE and UTF-16BE are three different things.
+ * UTF-16 must have byte order marker, UTF-16LE and UTF-16BE must not.
+ *
+ * Similarly UTF-32, UTF-32LE and UTF-32BE are three different things.
+ * However UTF-32 is also known as UCS-4 (and, in addition to little endian
+ * and big endian, there are two unusual byte orders.)
  */
 typedef enum {
     XML_CHAR_ENCODING_ERROR=   -1, /* No char encoding detected */
diff --git a/parser.c b/parser.c
index d8c3ee3..d7457fc 100644
--- a/parser.c
+++ b/parser.c
@@ -15039,7 +15039,7 @@ xmlCtxtResetPush(xmlParserCtxtPtr ctxt, const char *chunk,
     if ((encoding == NULL) && (chunk != NULL) && (size >= 4))
         enc = xmlDetectCharEncoding((const xmlChar *) chunk, size);
 
-    buf = xmlAllocParserInputBuffer(enc);
+    buf = xmlAllocParserInputBuffer(XML_CHAR_ENCODING_NONE);
     if (buf == NULL)
         return(1);
 
diff --git a/parserInternals.c b/parserInternals.c
index df204fd..e01a252 100644
--- a/parserInternals.c
+++ b/parserInternals.c
@@ -939,6 +939,7 @@ xmlSwitchEncoding(xmlParserCtxtPtr ctxt, xmlCharEncoding enc)
     int len = -1;
 
     if (ctxt == NULL) return(-1);
+    int length = 0;
     switch (enc) {
        case XML_CHAR_ENCODING_ERROR:
            __xmlErrEncoding(ctxt, XML_ERR_UNKNOWN_ENCODING,
@@ -981,12 +982,39 @@ xmlSwitchEncoding(xmlParserCtxtPtr ctxt, xmlCharEncoding enc)
             ctxt->input->cur += 3;
         }
         len = 90;
+
+        length = ctxt->input->end - ctxt->input->cur;
+        if ((ctxt->input->cur != NULL) && (length >= 2) &&
+            (ctxt->input->cur[0] == 0xFF) && (ctxt->input->cur[1] == 0xFE)) {
+            ctxt->input->cur += 2;
+        }
+        else if ((ctxt->input->cur != NULL) && (length >= 2) &&
+            (ctxt->input->cur[0] == 0xFE) && (ctxt->input->cur[1] == 0xFF)) {
+            ctxt->input->cur += 2;
+        }
+       len = 90;
        break;
     case XML_CHAR_ENCODING_UCS2:
         len = 90;
        break;
     case XML_CHAR_ENCODING_UCS4BE:
+        length = ctxt->input->end - ctxt->input->cur;
+       if ((ctxt->input->cur != NULL) && (length >= 4) &&
+           (ctxt->input->cur[0] == 0x00) && (ctxt->input->cur[1] == 0x00) &&
+           (ctxt->input->cur[2] == 0xFE) && (ctxt->input->cur[3] == 0xFF)) {
+           ctxt->input->cur += 4;
+       }
+       len = 180;
+       break;
     case XML_CHAR_ENCODING_UCS4LE:
+        length = ctxt->input->end - ctxt->input->cur;
+       if ((ctxt->input->cur != NULL) && (length >= 4) &&
+           (ctxt->input->cur[0] == 0xFF) && (ctxt->input->cur[1] == 0xFE) &&
+           (ctxt->input->cur[2] == 0x00) && (ctxt->input->cur[3] == 0x00)) {
+           ctxt->input->cur += 4;
+       }
+       len = 180;
+       break;
     case XML_CHAR_ENCODING_UCS4_2143:
     case XML_CHAR_ENCODING_UCS4_3412:
         len = 180;
@@ -1025,12 +1053,12 @@ xmlSwitchEncoding(xmlParserCtxtPtr ctxt, xmlCharEncoding enc)
            case XML_CHAR_ENCODING_UCS4LE:
                __xmlErrEncoding(ctxt, XML_ERR_UNSUPPORTED_ENCODING,
                               "encoding not supported %s\n",
-                              BAD_CAST "USC4 little endian", NULL);
+                              BAD_CAST "UCS4 little endian", NULL);
                break;
            case XML_CHAR_ENCODING_UCS4BE:
                __xmlErrEncoding(ctxt, XML_ERR_UNSUPPORTED_ENCODING,
                               "encoding not supported %s\n",
-                              BAD_CAST "USC4 big endian", NULL);
+                              BAD_CAST "UCS4 big endian", NULL);
                break;
            case XML_CHAR_ENCODING_EBCDIC:
                __xmlErrEncoding(ctxt, XML_ERR_UNSUPPORTED_ENCODING,
@@ -1132,16 +1160,6 @@ xmlSwitchInputEncodingInt(xmlParserCtxtPtr ctxt, xmlParserInputPtr input,
                 return (0);
 
             /*
-             * "UTF-16" can be used for both LE and BE
-             if ((!xmlStrncmp(BAD_CAST input->buf->encoder->name,
-             BAD_CAST "UTF-16", 6)) &&
-             (!xmlStrncmp(BAD_CAST handler->name,
-             BAD_CAST "UTF-16", 6))) {
-             return(0);
-             }
-             */
-
-            /*
              * Note: this is a bit dangerous, but that's what it
              * takes to use nearly compatible signature for different
              * encodings.
@@ -1160,21 +1178,6 @@ xmlSwitchInputEncodingInt(xmlParserCtxtPtr ctxt, xmlParserInputPtr input,
            unsigned int use;
 
             /*
-             * Specific handling of the Byte Order Mark for
-             * UTF-16
-             */
-            if ((handler->name != NULL) &&
-                (!strcmp(handler->name, "UTF-16LE") ||
-                 !strcmp(handler->name, "UTF-16")) &&
-                (input->cur[0] == 0xFF) && (input->cur[1] == 0xFE)) {
-                input->cur += 2;
-            }
-            if ((handler->name != NULL) &&
-                (!strcmp(handler->name, "UTF-16BE")) &&
-                (input->cur[0] == 0xFE) && (input->cur[1] == 0xFF)) {
-                input->cur += 2;
-            }
-            /*
              * Errata on XML-1.0 June 20 2001
              * Specific handling of the Byte Order Mark for
              * UTF-8
@@ -1266,6 +1269,68 @@ static int
 xmlSwitchToEncodingInt(xmlParserCtxtPtr ctxt,
                        xmlCharEncodingHandlerPtr handler, int len) {
     int ret = 0;
+    const char *newEncoding = NULL;
+
+    if (handler != NULL && handler->name != NULL &&
+       !strcmp(handler->name, "UTF-16")) {
+        /*
+        * "UTF-16" means "either little endian or big endian as indicated
+        * by byte order mark". So let's check the mark.
+        */
+       const xmlChar *in = ctxt->input->cur;
+       int length = ctxt->input->end - ctxt->input->cur;
+       if (length == 0) {
+         /* No input. This should not happen. */
+         __xmlRaiseError(NULL, NULL, NULL,
+                         ctxt, NULL, XML_FROM_PARSER,
+                         XML_ERR_INVALID_ENCODING, XML_ERR_WARNING,
+                         NULL, 0, NULL, NULL,
+                         NULL, 0, 0, "Empty UTF-16 string (no byte mark)",
+                         NULL, NULL);
+           return (0);
+       }
+        if (length == 1) {
+           __xmlErrEncoding(ctxt, XML_IO_ENCODER,
+               "Decoding error for UTF-16: only one byte", NULL, NULL);
+           return (-1);
+       }
+        if ((in[0] == 0xFF) && (in[1] == 0xFE)) {
+           newEncoding = "UTF-16LE";
+           ctxt->input->cur += 2;
+        } else if ((in[0] == 0xFE) && (in[1] == 0xFF)) {
+           newEncoding = "UTF-16BE";
+           ctxt->input->cur += 2;
+       } else {
+           /* Error. XML REC says UTF-16 must have BOM. Not fatal however. */
+           __xmlRaiseError(NULL, NULL, NULL,
+                           ctxt, NULL, XML_FROM_PARSER,
+                           XML_ERR_INVALID_ENCODING, XML_ERR_WARNING,
+                           NULL, 0, NULL, NULL,
+                           NULL, 0, 0, "No byte order mark for UTF-16",
+                           NULL, NULL);
+           if ((in[0] == 0x3C) && (in[1] == 0x00)) {
+               newEncoding = "UTF-16LE";
+           } else if ((in[0] == 0x00) && (in[1] == 0x3C)) {
+               newEncoding = "UTF-16BE";
+           } else {
+               /* Unicode standard says BE should be the default, but
+                * for backwards compatibility we use LE. XML standard
+                * allows anything: after an error (missing BOM)
+                * results are undefined, and we can recover as best
+                * we can. */
+               newEncoding = "UTF-16LE";
+           }
+       }
+    }
+
+    if (newEncoding != NULL) {
+       handler = xmlFindCharEncodingHandler(newEncoding);
+       if (handler == NULL) {
+           __xmlErrEncoding(ctxt, XML_IO_ENCODER,
+               "Encoding %s not supported", BAD_CAST newEncoding, NULL);
+           return (-1);
+       }
+    }
 
     if (handler != NULL) {
         if (ctxt->input != NULL) {

commit 0d1398b27f446b2f05ed7eeb51e6295a46b6de65
Author: Olli Pottonen <olli.potto...@iki.fi>
Date:   Sat Jun 13 16:42:31 2015 +1000

    Add missing part of encoding detection (XML 1.0 REC appendix F, UTF-32)

diff --git a/encoding.c b/encoding.c
index 6807306..3f89fd2 100644
--- a/encoding.c
+++ b/encoding.c
@@ -932,6 +932,12 @@ xmlDetectCharEncoding(const unsigned char* in, int len)
        if ((in[0] == 0x3C) && (in[1] == 0x00) &&
            (in[2] == 0x00) && (in[3] == 0x00))
            return(XML_CHAR_ENCODING_UCS4LE);
+       if ((in[0] == 0xFF) && (in[1] == 0xFE) &&
+           (in[2] == 0x00) && (in[3] == 0x00))
+           return(XML_CHAR_ENCODING_UCS4LE);
+       if ((in[0] == 0x00) && (in[1] == 0x00) &&
+           (in[2] == 0xFE) && (in[3] == 0xFF))
+           return(XML_CHAR_ENCODING_UCS4BE);
        if ((in[0] == 0x00) && (in[1] == 0x00) &&
            (in[2] == 0x3C) && (in[3] == 0x00))
            return(XML_CHAR_ENCODING_UCS4_2143);

commit 27015b41e71261b25f99f9c0b151a239700feb1e
Author: Olli Pottonen <olli.potto...@iki.fi>
Date:   Tue Jul 7 22:23:04 2015 +1000

    Fix detecting whether encoding declaration is present.
    
    Incorrect mechanisms were used to detect whether an encoding
    declaration is present or not. This caused some bugs. Firstly, missing
    whitespace, a fatal error, is not detected, e.g.
    <?xml version='1.0' encoding='UTF-8'standalone='yes'?><root/>.
    
    Also, if XML_PARSE_IGNORE_ENC option is specified, we get a
    missing encoding declaration error in an external entity even if an
    encoding declaration is present.

diff --git a/parser.c b/parser.c
index d7457fc..93f8d35 100644
--- a/parser.c
+++ b/parser.c
@@ -6991,7 +6991,6 @@ xmlParseMarkupDecl(xmlParserCtxtPtr ctxt) {
 void
 xmlParseTextDecl(xmlParserCtxtPtr ctxt) {
     xmlChar *version;
-    const xmlChar *encoding;
 
     /*
      * We know that '<?xml' is here.
@@ -7024,16 +7023,22 @@ xmlParseTextDecl(xmlParserCtxtPtr ctxt) {
     ctxt->input->version = version;
 
     /*
-     * We must have the encoding declaration
+     * We must have the encoding declaration.
+     * Unfortunately xmlParseEncodingDecl() has no reliable, backwards
+     * compatible way of telling us whether there is one. Hack around that.
      */
-    encoding = xmlParseEncodingDecl(ctxt);
+    const xmlChar *preEncodingCur = ctxt->input->cur;
+    SKIP_BLANKS;
+    xmlParseEncodingDecl(ctxt);
     if (ctxt->errNo == XML_ERR_UNSUPPORTED_ENCODING) {
        /*
         * The XML REC instructs us to stop parsing right here
         */
         return;
     }
-    if ((encoding == NULL) && (ctxt->errNo == XML_ERR_OK)) {
+    int hasEncodingDecl = (ctxt->input->cur != preEncodingCur);
+
+    if (!hasEncodingDecl) {
        xmlFatalErrMsg(ctxt, XML_ERR_MISSING_ENCODING,
                       "Missing encoding in text declaration\n");
     }
@@ -10639,6 +10644,8 @@ xmlParseXMLDecl(xmlParserCtxtPtr ctxt) {
        }
        xmlFatalErrMsg(ctxt, XML_ERR_SPACE_REQUIRED, "Blank needed here\n");
     }
+    SKIP_BLANKS;
+    const xmlChar * preEncodingCur = ctxt->input->cur;
     xmlParseEncodingDecl(ctxt);
     if (ctxt->errNo == XML_ERR_UNSUPPORTED_ENCODING) {
        /*
@@ -10650,7 +10657,8 @@ xmlParseXMLDecl(xmlParserCtxtPtr ctxt) {
     /*
      * We may have the standalone status.
      */
-    if ((ctxt->input->encoding != NULL) && (!IS_BLANK_CH(RAW))) {
+    int hasEncodingDecl = (ctxt->input->cur != preEncodingCur);
+    if (hasEncodingDecl && (!IS_BLANK_CH(RAW))) {
         if ((RAW == '?') && (NXT(1) == '>')) {
            SKIP(2);
            return;
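
Illustration only, not part of the patch: the detection trick used in this
commit (compare the read cursor before and after the encoding-declaration
parser has run) can be sketched in isolation. Everything below (mini_ctxt,
skip_blanks, parse_encoding_decl, has_encoding_decl) is a hypothetical
stand-in and not libxml2 API. The sketch assumes blanks are skipped before
the cursor is recorded, so that whitespace alone does not count as progress.

```c
#include <string.h>

/* Hypothetical miniature of the parser state: just a read cursor. */
typedef struct {
    const char *cur;
} mini_ctxt;

/* Advance the cursor past XML whitespace. */
static void skip_blanks(mini_ctxt *c) {
    while (*c->cur == ' ' || *c->cur == '\t' ||
           *c->cur == '\n' || *c->cur == '\r')
        c->cur++;
}

/*
 * Consume encoding='...' or encoding="..." if it is present at the
 * cursor; otherwise leave the cursor untouched.
 */
static void parse_encoding_decl(mini_ctxt *c) {
    const char *p = c->cur;
    if (strncmp(p, "encoding=", 9) != 0)
        return;
    p += 9;
    char quote = *p;
    if (quote != '\'' && quote != '"')
        return;
    const char *end = strchr(p + 1, quote);
    if (end == NULL)
        return;
    c->cur = end + 1;
}

/* Return 1 if an encoding declaration was consumed, 0 otherwise. */
static int has_encoding_decl(mini_ctxt *c) {
    skip_blanks(c);                  /* blanks must not count as progress */
    const char *before = c->cur;
    parse_encoding_decl(c);
    return c->cur != before;
}
```

After has_encoding_decl() returns 1, the caller can check whether the next
character is a blank, which is how the missing-whitespace case in the commit
message would be caught.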

commit 0ad5bfe98c8b76d52db905a847d48187d75d15d1
Author: Olli Pottonen <olli.potto...@iki.fi>
Date:   Tue Jul 7 20:29:50 2015 +1000

    Detect fatal encoding errors.

diff --git a/HTMLparser.c b/HTMLparser.c
index d329d3b..8717d0b 100644
--- a/HTMLparser.c
+++ b/HTMLparser.c
@@ -3507,7 +3507,6 @@ htmlCheckEncodingDirect(htmlParserCtxtPtr ctxt, const xmlChar *encoding) {
         return;
 
     if (encoding != NULL) {
-       xmlCharEncoding enc;
        xmlCharEncodingHandlerPtr handler;
 
        while ((*encoding == ' ') || (*encoding == '\t')) encoding++;
@@ -3516,37 +3515,16 @@ htmlCheckEncodingDirect(htmlParserCtxtPtr ctxt, const xmlChar *encoding) {
            xmlFree((xmlChar *) ctxt->input->encoding);
        ctxt->input->encoding = xmlStrdup(encoding);
 
-       enc = xmlParseCharEncoding((const char *) encoding);
-       /*
-        * registered set of known encodings
-        */
-       if (enc != XML_CHAR_ENCODING_ERROR) {
-           if (((enc == XML_CHAR_ENCODING_UTF16LE) ||
-                (enc == XML_CHAR_ENCODING_UTF16BE) ||
-                (enc == XML_CHAR_ENCODING_UCS4LE) ||
-                (enc == XML_CHAR_ENCODING_UCS4BE)) &&
-               (ctxt->input->buf != NULL) &&
-               (ctxt->input->buf->encoder == NULL)) {
-               htmlParseErr(ctxt, XML_ERR_INVALID_ENCODING,
-                            "htmlCheckEncoding: wrong encoding meta\n",
-                            NULL, NULL);
-           } else {
-               xmlSwitchEncoding(ctxt, enc);
-           }
-           ctxt->charset = XML_CHAR_ENCODING_UTF8;
+        handler = xmlFindCharEncodingHandler((const char *) encoding);
+
+       if (handler == NULL || !xmlEncHandlerAsciiCompatible(handler)) {
+           htmlParseErr(ctxt, XML_ERR_INVALID_ENCODING,
+                        "htmlCheckEncoding: wrong encoding meta %s\n",
+                        encoding, NULL);
        } else {
-           /*
-            * fallback for unknown encodings
-            */
-           handler = xmlFindCharEncodingHandler((const char *) encoding);
-           if (handler != NULL) {
-               xmlSwitchToEncoding(ctxt, handler);
-               ctxt->charset = XML_CHAR_ENCODING_UTF8;
-           } else {
-               htmlParseErr(ctxt, XML_ERR_UNSUPPORTED_ENCODING,
-                            "htmlCheckEncoding: unknown encoding %s\n",
-                            encoding, NULL);
-           }
+           xmlSwitchToEncoding(ctxt, handler);
+           xmlFree((xmlChar *) ctxt->encoding);
+           ctxt->encoding = xmlStrdup(encoding);
        }
 
        if ((ctxt->input->buf != NULL) &&
diff --git a/encoding.c b/encoding.c
index 3f89fd2..3f19d71 100644
--- a/encoding.c
+++ b/encoding.c
@@ -949,7 +949,8 @@ xmlDetectCharEncoding(const unsigned char* in, int len)
            return(XML_CHAR_ENCODING_EBCDIC);
        if ((in[0] == 0x3C) && (in[1] == 0x3F) &&
            (in[2] == 0x78) && (in[3] == 0x6D))
-           return(XML_CHAR_ENCODING_UTF8);
+           /* Something ascii compatible, do not know what. */
+           return(XML_CHAR_ENCODING_NONE);
        /*
         * Although not part of the recommendation, we also
         * attempt an "auto-recognition" of UTF-16LE and
@@ -1275,6 +1276,178 @@ xmlGetCharEncodingName(xmlCharEncoding enc) {
     return(NULL);
 }
 
+/**
+ * xmlCompatibleEncodings:
+ * @fname: name of the encoding as determined by xmlDetectCharEncoding()
+ *          and xmlSwitchEncoding()
+ * @sname: name of encoding indicated in the XML declaration
+ *
+ * Helper function for xmlParseEncodingDecl() for determining
+ * whether the document really is in the declared encoding.
+ */
+xmlEncodingCompatibility
+xmlCompatibleEncodings(const xmlChar *fname, const xmlChar *sname)
+{
+    xmlChar upper[20];
+    int i;
+    for (i = 0;i < 19;i++) {
+        upper[i] = toupper(sname[i]);
+       if (upper[i] == 0) break;
+    }
+    upper[i] = 0;
+
+    if (fname == NULL) {
+      /* fname == NULL indicates some yet unknown ascii-compatible encoding */
+        if (!xmlStrcmp(upper, BAD_CAST "UTF-8") ||
+           !xmlStrcmp(upper, BAD_CAST "UTF8") ||
+           !xmlStrcmp(upper, BAD_CAST "CSUTF8"))
+           /* The default, which is used when fname == NULL,
+            * is UTF-8, same as sname. */
+           return(XML_ENC_COMP_OK);
+       else
+            /* Don't know, let xmlParseEncodingDecl() figure it out. */
+           return(XML_ENC_COMP_UNKNOWN);
+    }
+
+    /*
+     * If xmlDetectCharEncoding() said it is UTF-8, that is not a
+     * preliminary guess, but a certain conclusion based on presence
+     * of a BOM. Then the only valid declaration is UTF-8.
+     */
+    if (!xmlStrcmp(fname, BAD_CAST "UTF-8")) {
+        if (!xmlStrcmp(upper, BAD_CAST "UTF-8") ||
+           !xmlStrcmp(upper, BAD_CAST "UTF8") ||
+           !xmlStrcmp(upper, BAD_CAST "CSUTF8"))
+           return(XML_ENC_COMP_OK);
+       else
+           return(XML_ENC_COMP_ERR);
+    }
+
+    else if (!xmlStrcmp(fname, BAD_CAST "UTF-16BE")) {
+        if (!xmlStrcmp(upper, BAD_CAST "UTF-16BE") ||
+           !xmlStrcmp(upper, BAD_CAST "UTF16BE") ||
+           !xmlStrcmp(upper, BAD_CAST "CSUTF16BE"))
+           return(XML_ENC_COMP_OK);
+       else if (!xmlStrcmp(upper, BAD_CAST "UTF-16") ||
+                !xmlStrcmp(upper, BAD_CAST "UTF16") ||
+                !xmlStrcmp(upper, BAD_CAST "CSUTF16"))
+         /* UTF-16BE is equivalent to the variant of UTF-16 with no BOM.
+          * In XML missing BOM is an error, but not fatal. */
+         return XML_ENC_COMP_BOM_MISSING;
+       else if (!xmlStrcmp(upper, BAD_CAST "UCS-2BE") ||
+                !xmlStrcmp(upper, BAD_CAST "UCS-2") ||
+                !xmlStrcmp(upper, BAD_CAST "ISO-10646-UCS-2") ||
+                !xmlStrcmp(upper, BAD_CAST "ISO-10646-UCS-2BE") ||
+                !xmlStrcmp(upper, BAD_CAST "UCS2BE") ||
+                !xmlStrcmp(upper, BAD_CAST "UCS2") ||
+                !xmlStrcmp(upper, BAD_CAST "CSUNICODE"))
+         /* Ok, compatible, UTF-16 and UCS-2 are almost the same.
+          * Not exactly the same however, this requires special handling. */
+           return(XML_ENC_COMP_UCS2);
+       else
+           return(XML_ENC_COMP_ERR);
+    }
+
+    else if (!xmlStrcmp(fname, BAD_CAST "UTF-16LE")) {
+        if (!xmlStrcmp(upper, BAD_CAST "UTF-16LE") ||
+           !xmlStrcmp(upper, BAD_CAST "UTF16LE") ||
+           !xmlStrcmp(upper, BAD_CAST "CSUTF16LE"))
+           return(XML_ENC_COMP_OK);
+       else if (!xmlStrcmp(upper, BAD_CAST "UCS-2LE") ||
+                !xmlStrcmp(upper, BAD_CAST "UCS-2") ||
+                !xmlStrcmp(upper, BAD_CAST "ISO-10646-UCS-2") ||
+                !xmlStrcmp(upper, BAD_CAST "ISO-10646-UCS-2LE") ||
+                !xmlStrcmp(upper, BAD_CAST "UCS2LE") ||
+                !xmlStrcmp(upper, BAD_CAST "UCS2") ||
+                !xmlStrcmp(upper, BAD_CAST "CSUNICODE"))
+         /* Ok, compatible, but requires special handling. */
+           return(XML_ENC_COMP_UCS2);
+       else
+           return(XML_ENC_COMP_ERR);
+    }
+
+    /*
+     * If xmlDetectCharEncoding() said it is UTF-16, there must be a
+     * BOM. Then UTF-16LE and UTF-16BE, which have no BOM, are not compatible.
+     */
+    else if (!xmlStrcmp(fname, BAD_CAST "UTF-16")) {
+        if (!xmlStrcmp(upper, BAD_CAST "UTF-16") ||
+           !xmlStrcmp(upper, BAD_CAST "UTF16") ||
+           !xmlStrcmp(upper, BAD_CAST "CSUTF16"))
+           return(XML_ENC_COMP_OK);
+       else if (!xmlStrcmp(upper, BAD_CAST "UCS-2") ||
+                !xmlStrcmp(upper, BAD_CAST "UCS2") ||
+                !xmlStrcmp(upper, BAD_CAST "ISO-10646-UCS-2"))
+           return(XML_ENC_COMP_UCS2);
+       else
+         return(XML_ENC_COMP_ERR);
+    }
+
+    /* UTF-32 a.k.a. UCS-4 is handled almost the same as UTF-16. */
+    else if (!xmlStrcmp(fname, BAD_CAST "ISO-10646-UCS-4BE")) {
+        if (!xmlStrcmp(upper, BAD_CAST "UTF-32BE") ||
+           !xmlStrcmp(upper, BAD_CAST "UTF32BE") ||
+           !xmlStrcmp(upper, BAD_CAST "UTF-32") ||
+           !xmlStrcmp(upper, BAD_CAST "UTF32BE") ||
+           !xmlStrcmp(upper, BAD_CAST "UTF32") ||
+           !xmlStrcmp(upper, BAD_CAST "UCS-4BE") ||
+           !xmlStrcmp(upper, BAD_CAST "UCS-4") ||
+           !xmlStrcmp(upper, BAD_CAST "UCS4BE") ||
+           !xmlStrcmp(upper, BAD_CAST "UCS4") ||
+           !xmlStrcmp(upper, BAD_CAST "ISO-10646-UCS-4") ||
+           !xmlStrcmp(upper, BAD_CAST "ISO-10646-UCS-4BE") ||
+           !xmlStrcmp(upper, BAD_CAST "CSUTF32BE") ||
+           !xmlStrcmp(upper, BAD_CAST "CSUTF32"))
+           return(XML_ENC_COMP_OK);
+       else
+           return(XML_ENC_COMP_ERR);
+    }
+
+    else if (!xmlStrcmp(fname, BAD_CAST "ISO-10646-UCS-4LE")) {
+        if (!xmlStrcmp(upper, BAD_CAST "UTF-32LE") ||
+           !xmlStrcmp(upper, BAD_CAST "UTF32LE") ||
+           !xmlStrcmp(upper, BAD_CAST "UCS-4LE") ||
+           !xmlStrcmp(upper, BAD_CAST "UCS4LE") ||
+           !xmlStrcmp(upper, BAD_CAST "ISO-10646-UCS-4LE") ||
+           !xmlStrcmp(upper, BAD_CAST "CSUTF32LE"))
+           return(XML_ENC_COMP_OK);
+       else
+           return(XML_ENC_COMP_ERR);
+    }
+
+    /*
+     * If xmlDetectCharEncoding() said it is UTF-32, there must be a
+     * BOM. Then the only valid declaration is UTF-32.
+     */
+    else if (!xmlStrcmp(fname, BAD_CAST "ISO-10646-UCS-4")) {
+      if (!xmlStrcmp(upper, BAD_CAST "UTF-32") ||
+         !xmlStrcmp(upper, BAD_CAST "UTF32") ||
+         !xmlStrcmp(upper, BAD_CAST "UCS-4") ||
+         !xmlStrcmp(upper, BAD_CAST "UCS4") ||
+         !xmlStrcmp(upper, BAD_CAST "ISO-10646-UCS-4") ||
+         !xmlStrcmp(upper, BAD_CAST "CSUTF32"))
+         return(XML_ENC_COMP_OK);
+      else
+         return(XML_ENC_COMP_ERR);
+    }
+
+    /*
+     * TODO: this is incomplete, there are several EBCDIC variants.
+     */
+    else if (!xmlStrcmp(fname, BAD_CAST "EBCDIC")) {
+        if (!xmlStrcmp(upper, BAD_CAST "EBCDIC"))
+           return(XML_ENC_COMP_OK);
+       else
+           return(XML_ENC_COMP_ERR);
+    }
+
+    else {
+        xmlEncodingErr(XML_ERR_INTERNAL_ERROR,
+                      "Unexpected encoding %s\n", (const char*) fname);
+       return(XML_ENC_COMP_ERR);
+    }
+}
+
 /************************************************************************
  *                                                                     *
  *                     Char encoding handlers                          *
@@ -2931,6 +3104,39 @@ xmlCharEncCloseFunc(xmlCharEncodingHandler *handler) {
 }
 
 /**
+ * xmlEncHandlerAsciiCompatible:
+ * @handler: an XML character encoding handler
+ *
+ * This function finds out whether the handler is for an ASCII
+ * compatible encoding (e.g. UTF-8, ISO-8859-X, ASCII) or non-compatible
+ * (e.g. UTF-16, UTF-32, EBCDIC).
+ */
+int
+xmlEncHandlerAsciiCompatible(xmlCharEncodingHandler *handler) {
+    unsigned char test_out[11], test_in[] = "<?xml";
+    int outlen = 11, inlen = 5;
+    if (handler->input != NULL) {
+        if (handler->input(test_out, &outlen, test_in, &inlen) < 0)
+           return(0);
+    }
+#ifdef LIBXML_ICONV_ENABLED
+    else if (handler->iconv_in != NULL) {
+        if (xmlIconvWrapper(handler->iconv_in, test_out,
+                           &outlen, test_in, &inlen) < 0)
+           return(0);
+    }
+#endif /* LIBXML_ICONV_ENABLED */
+#ifdef LIBXML_ICU_ENABLED
+    else if (handler->uconv_in != NULL) {
+        if (xmlUconvWrapper(handler->uconv_in, 1, test_out,
+                           &outlen, test_in, &inlen) < 0)
+           return(0);
+    }
+#endif /* LIBXML_ICU_ENABLED */
+    return(outlen == 5 && inlen == 5 && !xmlStrncmp(test_in, test_out, 5));
+}
+
+/**
  * xmlByteConsumed:
  * @ctxt: an XML parser context
  *
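
Illustration only, not part of the patch: the idea behind
xmlEncHandlerAsciiCompatible() is to push the ASCII bytes "<?xml" through a
decoder and see whether they come out unchanged. The sketch below shows that
probe with two toy decoders. The decode_fn signature and both decoders are
assumptions made for this example; they are not the libxml2 handler interface.

```c
#include <string.h>

/*
 * Hypothetical decoder signature: converts inlen input bytes to UTF-8,
 * returns the number of output bytes written, or -1 on error.
 */
typedef int (*decode_fn)(unsigned char *out, int outcap,
                         const unsigned char *in, int inlen);

/* ISO-8859-1 to UTF-8: bytes below 0x80 pass through unchanged. */
static int decode_latin1(unsigned char *out, int outcap,
                         const unsigned char *in, int inlen) {
    int o = 0;
    for (int i = 0; i < inlen; i++) {
        if (in[i] < 0x80) {
            if (o + 1 > outcap) return -1;
            out[o++] = in[i];
        } else {
            if (o + 2 > outcap) return -1;
            out[o++] = (unsigned char)(0xC0 | (in[i] >> 6));
            out[o++] = (unsigned char)(0x80 | (in[i] & 0x3F));
        }
    }
    return o;
}

/* UTF-16LE to UTF-8, BMP only; a trailing odd byte is ignored. */
static int decode_utf16le(unsigned char *out, int outcap,
                          const unsigned char *in, int inlen) {
    int o = 0;
    for (int i = 0; i + 1 < inlen; i += 2) {
        unsigned int cp = in[i] | ((unsigned int)in[i + 1] << 8);
        if (cp >= 0xD800 && cp <= 0xDFFF) return -1; /* no surrogates here */
        if (cp < 0x80) {
            if (o + 1 > outcap) return -1;
            out[o++] = (unsigned char)cp;
        } else if (cp < 0x800) {
            if (o + 2 > outcap) return -1;
            out[o++] = (unsigned char)(0xC0 | (cp >> 6));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            if (o + 3 > outcap) return -1;
            out[o++] = (unsigned char)(0xE0 | (cp >> 12));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    return o;
}

/* Return 1 if decoding "<?xml" yields exactly the same five bytes. */
static int is_ascii_compatible(decode_fn decode) {
    static const unsigned char probe[] = "<?xml";
    unsigned char out[16];
    int n = decode(out, (int)sizeof(out), probe, 5);
    return n == 5 && memcmp(out, probe, 5) == 0;
}
```

With these toy decoders, is_ascii_compatible(decode_latin1) holds and
is_ascii_compatible(decode_utf16le) does not, because the UTF-16LE decoder
reads the probe as two 16-bit code units instead of five characters.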
diff --git a/include/libxml/encoding.h b/include/libxml/encoding.h
index cd25b2f..9121a00 100644
--- a/include/libxml/encoding.h
+++ b/include/libxml/encoding.h
@@ -156,6 +156,15 @@ struct _xmlCharEncodingHandler {
 #endif /* LIBXML_ICU_ENABLED */
 };
 
+/* Return values of xmlCompatibleEncodings */
+typedef enum {
+  XML_ENC_COMP_UNKNOWN=        0,
+  XML_ENC_COMP_OK=             1, /* Ok, compatible */
+  XML_ENC_COMP_ERR=            2, /* Fatal error, incompatible */
+  XML_ENC_COMP_BOM_MISSING=    3, /* BOM missing, non-fatal error */
+  XML_ENC_COMP_UCS2=           4  /* Ok, switch UTF-16 -> UCS2 */
+} xmlEncodingCompatibility;
+
 #ifdef __cplusplus
 }
 #endif
@@ -199,6 +208,10 @@ XMLPUBFUN xmlCharEncoding XMLCALL
 XMLPUBFUN const char * XMLCALL
        xmlGetCharEncodingName          (xmlCharEncoding enc);
 
+XMLPUBFUN xmlEncodingCompatibility XMLCALL
+       xmlCompatibleEncodings          (const xmlChar *fname,
+                                        const xmlChar *sname);
+
 /*
  * Interfaces directly used by the parsers.
  */
@@ -222,6 +235,9 @@ XMLPUBFUN int XMLCALL
 XMLPUBFUN int XMLCALL
        xmlCharEncCloseFunc             (xmlCharEncodingHandler *handler);
 
+XMLPUBFUN int XMLCALL
+       xmlEncHandlerAsciiCompatible    (xmlCharEncodingHandler *handler);
+
 /*
  * Export a few useful functions
  */
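
Illustration only, not part of the patch: the compatibility decision made by
xmlCompatibleEncodings() is essentially a lookup from (detected encoding,
declared encoding) to a verdict. A table-driven sketch of that idea follows.
The type and function names are hypothetical, and the table lists only a small
subset of the aliases handled in the patch.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical verdicts, mirroring the cases distinguished in the patch. */
typedef enum { COMP_OK, COMP_ERR, COMP_BOM_MISSING, COMP_UCS2 } comp_result;

typedef struct {
    const char *detected;   /* name produced by sniffing the input   */
    const char *declared;   /* name found in the XML declaration     */
    comp_result result;
} comp_rule;

/* Subset of the rules; anything not listed is treated as an error. */
static const comp_rule rules[] = {
    { "UTF-16BE", "UTF-16BE", COMP_OK },
    { "UTF-16BE", "UTF-16",   COMP_BOM_MISSING },
    { "UTF-16BE", "UCS-2",    COMP_UCS2 },
    { "UTF-16LE", "UTF-16LE", COMP_OK },
    { "UTF-16LE", "UCS-2",    COMP_UCS2 },
    { "UTF-16",   "UTF-16",   COMP_OK },
    { "UTF-16",   "UCS-2",    COMP_UCS2 },
};

/* Case-insensitive string equality for ASCII encoding names. */
static int eq_nocase(const char *a, const char *b) {
    while (*a && *b) {
        if (toupper((unsigned char)*a) != toupper((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;
}

/* Look up the verdict for a detected/declared pair. */
static comp_result check_compat(const char *detected, const char *declared) {
    for (size_t i = 0; i < sizeof(rules) / sizeof(rules[0]); i++) {
        if (strcmp(rules[i].detected, detected) == 0 &&
            eq_nocase(rules[i].declared, declared))
            return rules[i].result;
    }
    return COMP_ERR;
}
```

A table keeps the alias lists in one place; whether that is preferable to the
explicit if/else chain in the patch is a matter of taste.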
diff --git a/parser.c b/parser.c
index 93f8d35..b6da7c4 100644
--- a/parser.c
+++ b/parser.c
@@ -10426,68 +10426,133 @@ xmlParseEncodingDecl(xmlParserCtxtPtr ctxt) {
            xmlFatalErr(ctxt, XML_ERR_STRING_NOT_STARTED, NULL);
        }
 
-        /*
-         * Non standard parsing, allowing the user to ignore encoding
-         */
-        if (ctxt->options & XML_PARSE_IGNORE_ENC) {
-           xmlFree((xmlChar *) encoding);
-            return(NULL);
-       }
+       if (encoding == NULL)
+           return(NULL);
 
        /*
-        * UTF-16 encoding stwich has already taken place at this stage,
-        * more over the little-endian/big-endian selection is already done
-        */
-        if ((encoding != NULL) &&
-           ((!xmlStrcasecmp(encoding, BAD_CAST "UTF-16")) ||
-            (!xmlStrcasecmp(encoding, BAD_CAST "UTF16")))) {
-           /*
-            * If no encoding was passed to the parser, that we are
-            * using UTF-16 and no decoder is present i.e. the
-            * document is apparently UTF-8 compatible, then raise an
-            * encoding mismatch fatal error
-            */
-           if ((ctxt->encoding == NULL) &&
-               (ctxt->input->buf != NULL) &&
-               (ctxt->input->buf->encoder == NULL)) {
-               xmlFatalErrMsg(ctxt, XML_ERR_INVALID_ENCODING,
-                 "Document labelled UTF-16 but has UTF-8 content\n");
-           }
-           if (ctxt->encoding != NULL)
-               xmlFree((xmlChar *) ctxt->encoding);
-           ctxt->encoding = encoding;
-       }
-       /*
-        * UTF-8 encoding is handled natively
+        * XML REC Section 4.3.3: "In the absence of information
+        * provided by an external transport protocol (e.g. HTTP or
+        * MIME), it is a fatal error for an entity including an
+        * encoding declaration to be presented to the XML processor
+        * in an encoding other than that named in the declaration"
+        *
+        * The presence of information by an external protocol is
+        * indicated by XML_PARSE_IGNORE_ENC in ctxt->options.
+        *
+        * In absence of information by external protocol, we may have
+        * sniffed that the content is UTF-16 or UTF-32, or UTF-8.
+        * This initial guess cannot be completely incorrect, for we have
+        * successfully parsed the document so far, but it may not be perfectly
+        * precise, e.g. it is not always possible to distinguish between
+        * UTF-8 and ASCII.
+        *
+        * xmlCompatibleEncodings() tells us whether the initial
+        * guess and declared encoding are compatible.
         */
-        else if ((encoding != NULL) &&
-           ((!xmlStrcasecmp(encoding, BAD_CAST "UTF-8")) ||
-            (!xmlStrcasecmp(encoding, BAD_CAST "UTF8")))) {
-           if (ctxt->encoding != NULL)
-               xmlFree((xmlChar *) ctxt->encoding);
-           ctxt->encoding = encoding;
+        if (ctxt->options & XML_PARSE_IGNORE_ENC) {
+           xmlFree((xmlChar *) encoding);
+           return(NULL);
        }
-       else if (encoding != NULL) {
-           xmlCharEncodingHandlerPtr handler;
 
-           if (ctxt->input->encoding != NULL)
-               xmlFree((xmlChar *) ctxt->input->encoding);
-           ctxt->input->encoding = encoding;
+        xmlCharEncodingHandlerPtr handler = NULL;
+       xmlEncodingCompatibility compatible =
+           xmlCompatibleEncodings(ctxt->encoding, encoding);
 
-            handler = xmlFindCharEncodingHandler((const char *) encoding);
-           if (handler != NULL) {
-               xmlSwitchToEncoding(ctxt, handler);
-           } else {
-               xmlFatalErrMsgStr(ctxt, XML_ERR_UNSUPPORTED_ENCODING,
-                       "Unsupported encoding %s\n", encoding);
+       if (compatible == XML_ENC_COMP_ERR) {
+           xmlFatalErrMsgStr(ctxt, XML_ERR_INVALID_ENCODING,
+                "Incorrect encoding declaration %s\n", encoding);
+           xmlFree((xmlChar *) encoding);
+           return(NULL);
+       } else if (compatible == XML_ENC_COMP_BOM_MISSING) {
+           xmlWarningMsg(ctxt, XML_ERR_INVALID_ENCODING,
+               "Missing Byte Order Mark\n", NULL, NULL);
+           xmlFree((xmlChar *) encoding);
+           return(NULL);
+       } else if (compatible == XML_ENC_COMP_OK) {
+         /* Correct decoder already in use, no need to
+          * xmlSwitchToEncoding()
+          */
+           xmlFree((xmlChar *) ctxt->encoding);
+           ctxt->encoding = encoding;
+           return(NULL);
+       } else if (compatible == XML_ENC_COMP_UCS2) {
+           /* Switch from UTF-16 to UCS-2. Keep current UTF-16 byte
+            * order (big endian/little endian).
+            */
+           if(ctxt->input->buf->encoder == NULL) {
+               xmlFatalErr(ctxt, XML_ERR_INTERNAL_ERROR,
+                           "no encoder");
+               xmlFree((xmlChar *) encoding);
                return(NULL);
            }
+
+           if (!strcmp(ctxt->input->buf->encoder->name, "UTF-16LE")) {
+               handler = xmlFindCharEncodingHandler("UCS-2LE");
+           } else if (!strcmp(ctxt->input->buf->encoder->name, "UTF-16BE")) {
+               handler = xmlFindCharEncodingHandler("UCS-2BE");
+           } else {
+               xmlFatalErr(ctxt, XML_ERR_INTERNAL_ERROR,
+                         "unexpected encoder");
+           }
+       } else { /* compatible == XML_ENC_COMP_UNKNOWN. Ascii-like. */
+           handler = xmlFindCharEncodingHandler((const char *) encoding);
+       }
+
+       if (handler == NULL) {
+           xmlFatalErrMsgStr(ctxt, XML_ERR_INVALID_ENCODING,
+                             "Unsupported encoding %s\n", encoding);
+           xmlFree((xmlChar *) encoding);
+           return(NULL);
        }
+       if (compatible == XML_ENC_COMP_UNKNOWN &&
+           !xmlEncHandlerAsciiCompatible(handler)) {
+           xmlFatalErrMsgStr(ctxt, XML_ERR_INVALID_ENCODING,
+                             "Document starts with ASCII but declares "
+                             "incompatible encoding %s\n", encoding);
+       }
+
+       xmlSwitchToEncoding(ctxt, handler);
+       xmlFree((void*) ctxt->encoding);
+       ctxt->encoding = encoding;
     }
     return(encoding);
 }
 
 /**
+ * checkNoEncodingDecl:
+ * @ctxt:  an XML parser context
+ *
+ * Necessary checks if there is no encoding declaration in the XML
+ * declaration or text declaration. XML REC section 4.3.3: "Unless an
+ * encoding is determined by a higher-level protocol, it is also a
+ * fatal error if an XML entity contains no encoding declaration and
+ * its content is not legal UTF-8 or UTF-16."
+ *
+ * Note that UTF-16 only refers to UTF-16 (with BOM),
+ * UTF-16LE or UTF-16BE (without BOM) won't do:
+ *
+ * "In the absence of information provided by an external
+ * transport protocol (e.g. HTTP or MIME), it is a fatal error
+ * ... for an entity which begins with neither a Byte Order Mark
+ * nor an encoding declaration to use an encoding other than
+ * UTF-8."
+ *
+ * For UTF-8, ctxt->encoding is either NULL or "UTF-8".
+ */
+static int checkNoEncodingDecl(xmlParserCtxtPtr ctxt) {
+    if ((ctxt->options & XML_PARSE_IGNORE_ENC) == 0 &&
+       ctxt->encoding != NULL &&
+       xmlStrcmp(ctxt->encoding, BAD_CAST "UTF-8") &&
+       xmlStrcmp(ctxt->encoding, BAD_CAST "UTF-16")) {
+       xmlFatalErrMsgStr(ctxt, XML_ERR_INVALID_ENCODING,
+           "Encoding declaration missing (neither UTF-8 nor UTF-16 with BOM)\n",
+           NULL);
+       return(-1);
+    }
+    return(0);
+}
+
+/**
  * xmlParseSDDecl:
  * @ctxt:  an XML parser context
  *
@@ -10639,6 +10704,7 @@ xmlParseXMLDecl(xmlParserCtxtPtr ctxt) {
      */
     if (!IS_BLANK_CH(RAW)) {
         if ((RAW == '?') && (NXT(1) == '>')) {
+           checkNoEncodingDecl(ctxt);
            SKIP(2);
            return;
        }
@@ -10658,6 +10724,9 @@ xmlParseXMLDecl(xmlParserCtxtPtr ctxt) {
      * We may have the standalone status.
      */
     int hasEncodingDecl = (ctxt->input->cur != preEncodingCur);
+    if (!hasEncodingDecl && checkNoEncodingDecl(ctxt) < 0) {
+        return;
+    }
     if (hasEncodingDecl && (!IS_BLANK_CH(RAW))) {
         if ((RAW == '?') && (NXT(1) == '>')) {
            SKIP(2);
@@ -10798,8 +10867,11 @@ xmlParseDocument(xmlParserCtxtPtr ctxt) {
        ctxt->standalone = ctxt->input->standalone;
        SKIP_BLANKS;
     } else {
+        if (checkNoEncodingDecl(ctxt) < 0)
+           return(-1);
        ctxt->version = xmlCharStrdup(XML_DEFAULT_VERSION);
     }
+
     if ((ctxt->sax) && (ctxt->sax->startDocument) && (!ctxt->disableSAX))
         ctxt->sax->startDocument(ctxt->userData);
     if (ctxt->instate == XML_PARSER_EOF)
@@ -10977,6 +11049,8 @@ xmlParseExtParsedEnt(xmlParserCtxtPtr ctxt) {
        }
        SKIP_BLANKS;
     } else {
+        if (checkNoEncodingDecl(ctxt) < 0)
+           return(-1);
        ctxt->version = xmlCharStrdup(XML_DEFAULT_VERSION);
     }
     if ((ctxt->sax) && (ctxt->sax->startDocument) && (!ctxt->disableSAX))
@@ -11392,6 +11466,7 @@ xmlParseTryOrFinish(xmlParserCtxtPtr ctxt, int terminate) {
                        ctxt->sax->endDocument(ctxt->userData);
                    goto done;
                }
+               int hasXmlDecl = 0;
                if ((cur == '<') && (next == '?')) {
                    /* PI or XML decl */
                    if (avail < 5) return(ret);
@@ -11410,6 +11485,7 @@ xmlParseTryOrFinish(xmlParserCtxtPtr ctxt, int terminate) {
                        xmlGenericError(xmlGenericErrorContext,
                                "PP: Parsing XML Decl\n");
 #endif
+                       hasXmlDecl = 1;
                        xmlParseXMLDecl(ctxt);
                        if (ctxt->errNo == XML_ERR_UNSUPPORTED_ENCODING) {
                            /*
@@ -11460,6 +11536,9 @@ xmlParseTryOrFinish(xmlParserCtxtPtr ctxt, int terminate) {
                            "PP: entering MISC\n");
 #endif
                }
+               if (!hasXmlDecl && checkNoEncodingDecl(ctxt) < 0)
+                   return(0);
+
                break;
             case XML_PARSER_START_TAG: {
                const xmlChar *name;
@@ -13090,6 +13169,8 @@ xmlParseCtxtExternalEntity(xmlParserCtxtPtr ctx, const xmlChar *URL,
            xmlFatalErrMsg(ctxt, XML_ERR_VERSION_MISMATCH,
                           "Version mismatch between document and entity\n");
        }
+    } else {
+        checkNoEncodingDecl(ctxt);
     }
 
     /*
@@ -13310,6 +13391,8 @@ xmlParseExternalEntityPrivate(xmlDocPtr doc, xmlParserCtxtPtr oldctxt,
      */
     if ((CMP5(CUR_PTR, '<', '?', 'x', 'm', 'l')) && (IS_BLANK_CH(NXT(5)))) {
        xmlParseTextDecl(ctxt);
+    } else {
+        checkNoEncodingDecl(ctxt);
     }
 
     ctxt->instate = XML_PARSER_CONTENT;
@@ -15104,6 +15187,7 @@ xmlCtxtResetPush(xmlParserCtxtPtr ctxt, const char *chunk,
     }
 
     if (encoding != NULL) {
+        ctxt->options |= XML_PARSE_IGNORE_ENC;
         xmlCharEncodingHandlerPtr hdlr;
 
         if (ctxt->encoding != NULL)
@@ -15304,6 +15388,7 @@ xmlDoRead(xmlParserCtxtPtr ctxt, const char *URL, const char *encoding,
     xmlCtxtUseOptionsInternal(ctxt, options, encoding);
     if (encoding != NULL) {
         xmlCharEncodingHandlerPtr hdlr;
+        ctxt->options |= XML_PARSE_IGNORE_ENC;
 
        hdlr = xmlFindCharEncodingHandler(encoding);
        if (hdlr != NULL)
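
Illustration only, not part of the patch: the rule enforced by
checkNoEncodingDecl() can be restated as a small pure function. The function
name and parameters below are hypothetical; "detected" stands for the name
recorded by encoding sniffing (NULL when nothing was sniffed), and
"external_info" stands for the XML_PARSE_IGNORE_ENC case where the caller or
transport supplied the encoding.

```c
#include <stddef.h>
#include <string.h>

/*
 * Return 0 if an entity without an encoding declaration is acceptable,
 * -1 if the missing declaration is a fatal error.
 */
static int check_no_encoding_decl(const char *detected, int external_info) {
    if (external_info)
        return 0;                 /* a higher-level protocol decides */
    if (detected == NULL)
        return 0;                 /* nothing sniffed: UTF-8 by default */
    if (strcmp(detected, "UTF-8") == 0)
        return 0;
    if (strcmp(detected, "UTF-16") == 0)
        return 0;                 /* "UTF-16" here means a BOM was seen */
    return -1;                    /* e.g. UTF-16LE/BE without BOM, UCS-4 */
}
```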
diff --git a/parserInternals.c b/parserInternals.c
index e01a252..957df04 100644
--- a/parserInternals.c
+++ b/parserInternals.c
@@ -940,6 +940,7 @@ xmlSwitchEncoding(xmlParserCtxtPtr ctxt, xmlCharEncoding enc)
 
     if (ctxt == NULL) return(-1);
     int length = 0;
+    const char *encodingName = NULL;
     switch (enc) {
        case XML_CHAR_ENCODING_ERROR:
            __xmlErrEncoding(ctxt, XML_ERR_UNKNOWN_ENCODING,
@@ -952,6 +953,8 @@ xmlSwitchEncoding(xmlParserCtxtPtr ctxt, xmlCharEncoding enc)
        case XML_CHAR_ENCODING_UTF8:
            /* default encoding, no conversion should be needed */
            ctxt->charset = XML_CHAR_ENCODING_UTF8;
+           xmlFree((xmlChar *) ctxt->encoding);
+           ctxt->encoding = xmlStrdup(BAD_CAST "UTF-8");
 
            /*
             * Errata on XML-1.0 June 20 2001
@@ -987,10 +990,12 @@ xmlSwitchEncoding(xmlParserCtxtPtr ctxt, xmlCharEncoding enc)
         if ((ctxt->input->cur != NULL) && (length >= 2) &&
             (ctxt->input->cur[0] == 0xFF) && (ctxt->input->cur[1] == 0xFE)) {
             ctxt->input->cur += 2;
+           encodingName = "UTF-16";
         }
         else if ((ctxt->input->cur != NULL) && (length >= 2) &&
             (ctxt->input->cur[0] == 0xFE) && (ctxt->input->cur[1] == 0xFF)) {
             ctxt->input->cur += 2;
+           encodingName = "UTF-16";
         }
        len = 90;
        break;
@@ -1003,6 +1008,7 @@ xmlSwitchEncoding(xmlParserCtxtPtr ctxt, xmlCharEncoding enc)
            (ctxt->input->cur[0] == 0x00) && (ctxt->input->cur[1] == 0x00) &&
            (ctxt->input->cur[2] == 0xFE) && (ctxt->input->cur[3] == 0xFF)) {
            ctxt->input->cur += 4;
+           encodingName = "ISO-10646-UCS-4";
        }
        len = 180;
        break;
@@ -1012,6 +1018,7 @@ xmlSwitchEncoding(xmlParserCtxtPtr ctxt, xmlCharEncoding enc)
            (ctxt->input->cur[0] == 0xFF) && (ctxt->input->cur[1] == 0xFE) &&
            (ctxt->input->cur[2] == 0x00) && (ctxt->input->cur[3] == 0x00)) {
            ctxt->input->cur += 4;
+           encodingName = "ISO-10646-UCS-4";
        }
        len = 180;
        break;
@@ -1037,6 +1044,8 @@ xmlSwitchEncoding(xmlParserCtxtPtr ctxt, xmlCharEncoding enc)
        break;
     }
     handler = xmlGetCharEncodingHandler(enc);
+    if (encodingName == NULL)
+        encodingName = xmlGetCharEncodingName(enc);
     if (handler == NULL) {
        /*
         * Default handlers.
@@ -1125,7 +1134,12 @@ xmlSwitchEncoding(xmlParserCtxtPtr ctxt, xmlCharEncoding enc)
     if (handler == NULL)
        return(-1);
     ctxt->charset = XML_CHAR_ENCODING_UTF8;
-    return(xmlSwitchToEncodingInt(ctxt, handler, len));
+    int res = xmlSwitchToEncodingInt(ctxt, handler, len);
+    if (res == 0 && encodingName != NULL) {
+        xmlFree((xmlChar *)ctxt->encoding);
+       ctxt->encoding = xmlStrdup(BAD_CAST encodingName);
+    }
+    return(res);
 }
 
 /**

commit c72abfed82ea489eb7220902fda354ed618398cd
Author: Olli Pottonen <olli.potto...@iki.fi>
Date:   Sun Jun 28 14:06:51 2015 +1000

    Grow buffer correctly before and after encoding is resolved.
    
    Before the encoding is known for certain, decode only the XML
    declaration and nothing more, to avoid using the wrong decoder. Once
    the encoding is known, decode more input to make sure there is enough
    data for parsing.

diff --git a/parser.c b/parser.c
index b6da7c4..0e73d47 100644
--- a/parser.c
+++ b/parser.c
@@ -7043,6 +7043,11 @@ xmlParseTextDecl(xmlParserCtxtPtr ctxt) {
                       "Missing encoding in text declaration\n");
     }
 
+    /*
+     * Now that encoding is finalised we can grow the input buffer freely
+     */
+    GROW;
+
     SKIP_BLANKS;
     if ((RAW == '?') && (NXT(1) == '>')) {
         SKIP(2);
@@ -10721,6 +10726,11 @@ xmlParseXMLDecl(xmlParserCtxtPtr ctxt) {
     }
 
     /*
+     * Now that encoding is finalised we can grow the input buffer freely
+     */
+    GROW;
+
+    /*
      * We may have the standalone status.
      */
     int hasEncodingDecl = (ctxt->input->cur != preEncodingCur);
@@ -10735,11 +10745,6 @@ xmlParseXMLDecl(xmlParserCtxtPtr ctxt) {
        xmlFatalErrMsg(ctxt, XML_ERR_SPACE_REQUIRED, "Blank needed here\n");
     }
 
-    /*
-     * We can grow the input buffer freely at that point
-     */
-    GROW;
-
     SKIP_BLANKS;
     ctxt->input->standalone = xmlParseSDDecl(ctxt);
 
@@ -12410,7 +12415,7 @@ xmldecl_done:
                                BAD_CAST "UTF-16")) ||
                 (xmlStrcasestr(BAD_CAST ctxt->input->buf->encoder->name,
                                BAD_CAST "UTF16")))
-                len = 90;
+                len = 80;
             else if ((xmlStrcasestr(BAD_CAST ctxt->input->buf->encoder->name,
                                     BAD_CAST "UCS-4")) ||
                      (xmlStrcasestr(BAD_CAST ctxt->input->buf->encoder->name,
diff --git a/parserInternals.c b/parserInternals.c
index 957df04..642dd60 100644
--- a/parserInternals.c
+++ b/parserInternals.c
@@ -984,7 +984,7 @@ xmlSwitchEncoding(xmlParserCtxtPtr ctxt, xmlCharEncoding enc)
             (ctxt->input->cur[2] == 0xBF)) {
             ctxt->input->cur += 3;
         }
-        len = 90;
+        len = 80;
 
         length = ctxt->input->end - ctxt->input->cur;
         if ((ctxt->input->cur != NULL) && (length >= 2) &&
@@ -997,10 +997,9 @@ xmlSwitchEncoding(xmlParserCtxtPtr ctxt, xmlCharEncoding enc)
             ctxt->input->cur += 2;
            encodingName = "UTF-16";
         }
-       len = 90;
        break;
     case XML_CHAR_ENCODING_UCS2:
-        len = 90;
+        len = 80;
        break;
     case XML_CHAR_ENCODING_UCS4BE:
         length = ctxt->input->end - ctxt->input->cur;

commit 7a7620af4b806d9a67ac883a0f84479e3a9fda6b
Author: Olli Pottonen <olli.potto...@iki.fi>
Date:   Sun Jun 28 11:38:26 2015 +1000

    Improve encoding detection.
    
    Call xmlDetectCharEncoding() even if there are fewer than 4 bytes of
    input; 2 or 3 bytes may be enough. Avoid unnecessary copying of data
    to a local array.

diff --git a/HTMLparser.c b/HTMLparser.c
index 8717d0b..9c4ec04 100644
--- a/HTMLparser.c
+++ b/HTMLparser.c
@@ -4619,7 +4619,6 @@ __htmlParseContent(void *ctxt) {
 
 int
 htmlParseDocument(htmlParserCtxtPtr ctxt) {
-    xmlChar start[4];
     xmlCharEncoding enc;
     xmlDtdPtr dtd;
 
@@ -4641,18 +4640,12 @@ htmlParseDocument(htmlParserCtxtPtr ctxt) {
     if ((ctxt->sax) && (ctxt->sax->setDocumentLocator))
         ctxt->sax->setDocumentLocator(ctxt->userData, &xmlDefaultSAXLocator);
 
-    if ((ctxt->encoding == (const xmlChar *)XML_CHAR_ENCODING_NONE) &&
-        ((ctxt->input->end - ctxt->input->cur) >= 4)) {
+    if (ctxt->encoding == NULL) {
        /*
-        * Get the 4 first bytes and decode the charset
-        * if enc != XML_CHAR_ENCODING_NONE
         * plug some encoding conversion routines.
         */
-       start[0] = RAW;
-       start[1] = NXT(1);
-       start[2] = NXT(2);
-       start[3] = NXT(3);
-       enc = xmlDetectCharEncoding(&start[0], 4);
+       int length = ctxt->input->end - ctxt->input->cur;
+       enc = xmlDetectCharEncoding(ctxt->input->cur, length);
        if (enc != XML_CHAR_ENCODING_NONE) {
            xmlSwitchEncoding(ctxt, enc);
        }
diff --git a/parser.c b/parser.c
index 0e73d47..af637ca 100644
--- a/parser.c
+++ b/parser.c
@@ -2623,9 +2623,7 @@ xmlParserHandlePEReference(xmlParserCtxtPtr ctxt) {
            } else {
                if ((entity->etype == XML_INTERNAL_PARAMETER_ENTITY) ||
                    (entity->etype == XML_EXTERNAL_PARAMETER_ENTITY)) {
-                   xmlChar start[4];
                    xmlCharEncoding enc;
-
                    /*
                     * Note: external parameter entities will not be loaded, it
                     * is not required for a non-validating parser, unless the
@@ -2664,16 +2662,11 @@ xmlParserHandlePEReference(xmlParserCtxtPtr ctxt) {
                    GROW
                     if (ctxt->instate == XML_PARSER_EOF)
                         return;
-                   if ((ctxt->input->end - ctxt->input->cur)>=4) {
-                       start[0] = RAW;
-                       start[1] = NXT(1);
-                       start[2] = NXT(2);
-                       start[3] = NXT(3);
-                       enc = xmlDetectCharEncoding(start, 4);
-                       if (enc != XML_CHAR_ENCODING_NONE) {
-                           xmlSwitchEncoding(ctxt, enc);
-                       }
-                   }
+                   int len = ctxt->input->end - ctxt->input->cur;
+                   enc = xmlDetectCharEncoding(ctxt->input->cur, len);
+                   if (enc != XML_CHAR_ENCODING_NONE) {
+                       xmlSwitchEncoding(ctxt, enc);
+                    }
 
                    if ((entity->etype == XML_EXTERNAL_PARAMETER_ENTITY) &&
                        (CMP5(CUR_PTR, '<', '?', 'x', 'm', 'l' )) &&
@@ -7080,18 +7073,12 @@ xmlParseExternalSubset(xmlParserCtxtPtr ctxt, const xmlChar *ExternalID,
     xmlDetectSAX2(ctxt);
     GROW;
 
-    if ((ctxt->encoding == NULL) &&
-        (ctxt->input->end - ctxt->input->cur >= 4)) {
-        xmlChar start[4];
-       xmlCharEncoding enc;
-
-       start[0] = RAW;
-       start[1] = NXT(1);
-       start[2] = NXT(2);
-       start[3] = NXT(3);
-       enc = xmlDetectCharEncoding(start, 4);
-       if (enc != XML_CHAR_ENCODING_NONE)
+    if (ctxt->encoding == NULL) {
+        int length = ctxt->input->end - ctxt->input->cur;
+       xmlCharEncoding enc = xmlDetectCharEncoding(ctxt->input->cur, length);
+       if (enc != XML_CHAR_ENCODING_NONE) {
            xmlSwitchEncoding(ctxt, enc);
+       }
     }
 
     if (CMP5(CUR_PTR, '<', '?', 'x', 'm', 'l')) {
@@ -10803,7 +10790,6 @@ xmlParseMisc(xmlParserCtxtPtr ctxt) {
 
 int
 xmlParseDocument(xmlParserCtxtPtr ctxt) {
-    xmlChar start[4];
     xmlCharEncoding enc;
 
     xmlInitParser();
@@ -10826,18 +10812,9 @@ xmlParseDocument(xmlParserCtxtPtr ctxt) {
     if (ctxt->instate == XML_PARSER_EOF)
        return(-1);
 
-    if ((ctxt->encoding == NULL) &&
-        ((ctxt->input->end - ctxt->input->cur) >= 4)) {
-       /*
-        * Get the 4 first bytes and decode the charset
-        * if enc != XML_CHAR_ENCODING_NONE
-        * plug some encoding conversion routines.
-        */
-       start[0] = RAW;
-       start[1] = NXT(1);
-       start[2] = NXT(2);
-       start[3] = NXT(3);
-       enc = xmlDetectCharEncoding(&start[0], 4);
+    if (ctxt->encoding == NULL) {
+        int length = ctxt->input->end - ctxt->input->cur;
+        enc = xmlDetectCharEncoding(ctxt->input->cur, length);
        if (enc != XML_CHAR_ENCODING_NONE) {
            xmlSwitchEncoding(ctxt, enc);
        }
@@ -10997,9 +10974,7 @@ xmlParseDocument(xmlParserCtxtPtr ctxt) {
 
 int
 xmlParseExtParsedEnt(xmlParserCtxtPtr ctxt) {
-    xmlChar start[4];
     xmlCharEncoding enc;
-
     if ((ctxt == NULL) || (ctxt->input == NULL))
         return(-1);
 
@@ -11016,19 +10991,12 @@ xmlParseExtParsedEnt(xmlParserCtxtPtr ctxt) {
         ctxt->sax->setDocumentLocator(ctxt->userData, &xmlDefaultSAXLocator);
 
     /*
-     * Get the 4 first bytes and decode the charset
-     * if enc != XML_CHAR_ENCODING_NONE
      * plug some encoding conversion routines.
      */
-    if ((ctxt->input->end - ctxt->input->cur) >= 4) {
-       start[0] = RAW;
-       start[1] = NXT(1);
-       start[2] = NXT(2);
-       start[3] = NXT(3);
-       enc = xmlDetectCharEncoding(start, 4);
-       if (enc != XML_CHAR_ENCODING_NONE) {
-           xmlSwitchEncoding(ctxt, enc);
-       }
+    int length = ctxt->input->end - ctxt->input->cur;
+    enc = xmlDetectCharEncoding(ctxt->input->cur, length);
+    if (enc != XML_CHAR_ENCODING_NONE) {
+        xmlSwitchEncoding(ctxt, enc);
     }
 
 
@@ -11428,28 +11396,19 @@ xmlParseTryOrFinish(xmlParserCtxtPtr ctxt, int terminate) {
                goto done;
             case XML_PARSER_START:
                if (ctxt->charset == XML_CHAR_ENCODING_NONE) {
-                   xmlChar start[4];
                    xmlCharEncoding enc;
-
                    /*
                     * Very first chars read from the document flow.
                     */
                    if (avail < 4)
                        goto done;
 
-                   /*
-                    * Get the 4 first bytes and decode the charset
-                    * if enc != XML_CHAR_ENCODING_NONE
-                    * plug some encoding conversion routines,
-                    * else xmlSwitchEncoding will set to (default)
-                    * UTF8.
-                    */
-                   start[0] = RAW;
-                   start[1] = NXT(1);
-                   start[2] = NXT(2);
-                   start[3] = NXT(3);
-                   enc = xmlDetectCharEncoding(start, 4);
-                   xmlSwitchEncoding(ctxt, enc);
+                   int length = ctxt->input->end - ctxt->input->cur;
+                   enc = xmlDetectCharEncoding(ctxt->input->cur, length);
+                   if (enc != XML_CHAR_ENCODING_NONE) {
+                       xmlSwitchEncoding(ctxt, enc);
+                   }
+
                    break;
                }
 
@@ -12583,13 +12542,10 @@ xmlCreatePushParserCtxt(xmlSAXHandlerPtr sax, void *user_data,
     xmlParserInputBufferPtr buf;
     xmlCharEncoding enc = XML_CHAR_ENCODING_NONE;
 
-    /*
-     * plug some encoding conversion routines
-     */
-    if ((chunk != NULL) && (size >= 4))
-       enc = xmlDetectCharEncoding((const xmlChar *) chunk, size);
+    if (chunk != NULL)
+        enc = xmlDetectCharEncoding((const xmlChar *) chunk, size);
 
-    buf = xmlAllocParserInputBuffer(enc);
+    buf = xmlAllocParserInputBuffer(XML_CHAR_ENCODING_NONE);
     if (buf == NULL) return(NULL);
 
     ctxt = xmlNewParserCtxt();
@@ -12791,7 +12747,6 @@ xmlIOParseDTD(xmlSAXHandlerPtr sax, xmlParserInputBufferPtr input,
     xmlDtdPtr ret = NULL;
     xmlParserCtxtPtr ctxt;
     xmlParserInputPtr pinput = NULL;
-    xmlChar start[4];
 
     if (input == NULL)
        return(NULL);
@@ -12838,6 +12793,12 @@ xmlIOParseDTD(xmlSAXHandlerPtr sax, xmlParserInputBufferPtr input,
     }
     if (enc != XML_CHAR_ENCODING_NONE) {
         xmlSwitchEncoding(ctxt, enc);
+    } else {
+        int length = ctxt->input->end - ctxt->input->cur;
+       enc = xmlDetectCharEncoding(ctxt->input->cur, length);
+       if (enc != XML_CHAR_ENCODING_NONE) {
+           xmlSwitchEncoding(ctxt, enc);
+       }
     }
 
     pinput->filename = NULL;
@@ -12860,23 +12821,6 @@ xmlIOParseDTD(xmlSAXHandlerPtr sax, xmlParserInputBufferPtr input,
     ctxt->myDoc->extSubset = xmlNewDtd(ctxt->myDoc, BAD_CAST "none",
                                       BAD_CAST "none", BAD_CAST "none");
 
-    if ((enc == XML_CHAR_ENCODING_NONE) &&
-        ((ctxt->input->end - ctxt->input->cur) >= 4)) {
-       /*
-        * Get the 4 first bytes and decode the charset
-        * if enc != XML_CHAR_ENCODING_NONE
-        * plug some encoding conversion routines.
-        */
-       start[0] = RAW;
-       start[1] = NXT(1);
-       start[2] = NXT(2);
-       start[3] = NXT(3);
-       enc = xmlDetectCharEncoding(start, 4);
-       if (enc != XML_CHAR_ENCODING_NONE) {
-           xmlSwitchEncoding(ctxt, enc);
-       }
-    }
-
     xmlParseExternalSubset(ctxt, BAD_CAST "none", BAD_CAST "none");
 
     if (ctxt->myDoc != NULL) {
@@ -12979,9 +12923,10 @@ xmlSAXParseDTD(xmlSAXHandlerPtr sax, const xmlChar *ExternalID,
            xmlFree(systemIdCanonic);
        return(NULL);
     }
-    if ((ctxt->input->end - ctxt->input->cur) >= 4) {
-       enc = xmlDetectCharEncoding(ctxt->input->cur, 4);
-       xmlSwitchEncoding(ctxt, enc);
+    int length = ctxt->input->end - ctxt->input->cur;
+    enc = xmlDetectCharEncoding(ctxt->input->cur, length);
+    if (enc != XML_CHAR_ENCODING_NONE) {
+        xmlSwitchEncoding(ctxt, enc);
     }
 
     if (input->filename == NULL)
@@ -13084,7 +13029,6 @@ xmlParseCtxtExternalEntity(xmlParserCtxtPtr ctx, const xmlChar *URL,
     xmlNodePtr newRoot;
     xmlSAXHandlerPtr oldsax = NULL;
     int ret = 0;
-    xmlChar start[4];
     xmlCharEncoding enc;
 
     if (ctx == NULL) return(-1);
@@ -13150,15 +13094,11 @@ xmlParseCtxtExternalEntity(xmlParserCtxtPtr ctx, const xmlChar *URL,
      * plug some encoding conversion routines.
      */
     GROW
-    if ((ctxt->input->end - ctxt->input->cur) >= 4) {
-       start[0] = RAW;
-       start[1] = NXT(1);
-       start[2] = NXT(2);
-       start[3] = NXT(3);
-       enc = xmlDetectCharEncoding(start, 4);
-       if (enc != XML_CHAR_ENCODING_NONE) {
-           xmlSwitchEncoding(ctxt, enc);
-       }
+
+    int length = ctxt->input->end - ctxt->input->cur;
+    enc = xmlDetectCharEncoding(ctxt->input->cur, length);
+    if (enc != XML_CHAR_ENCODING_NONE) {
+        xmlSwitchEncoding(ctxt, enc);
     }
 
     /*
@@ -13293,7 +13233,6 @@ xmlParseExternalEntityPrivate(xmlDocPtr doc, xmlParserCtxtPtr oldctxt,
     xmlNodePtr newRoot;
     xmlSAXHandlerPtr oldsax = NULL;
     xmlParserErrors ret = XML_ERR_OK;
-    xmlChar start[4];
     xmlCharEncoding enc;
 
     if (((depth > 40) &&
@@ -13375,20 +13314,13 @@ xmlParseExternalEntityPrivate(xmlDocPtr doc, xmlParserCtxtPtr oldctxt,
     newRoot->doc = doc;
 
     /*
-     * Get the 4 first bytes and decode the charset
-     * if enc != XML_CHAR_ENCODING_NONE
      * plug some encoding conversion routines.
      */
     GROW;
-    if ((ctxt->input->end - ctxt->input->cur) >= 4) {
-       start[0] = RAW;
-       start[1] = NXT(1);
-       start[2] = NXT(2);
-       start[3] = NXT(3);
-       enc = xmlDetectCharEncoding(start, 4);
-       if (enc != XML_CHAR_ENCODING_NONE) {
-           xmlSwitchEncoding(ctxt, enc);
-       }
+    int length = ctxt->input->end - ctxt->input->cur;
+    enc = xmlDetectCharEncoding(ctxt->input->cur, length);
+    if (enc != XML_CHAR_ENCODING_NONE) {
+        xmlSwitchEncoding(ctxt, enc);
     }
 
     /*
@@ -15132,8 +15064,9 @@ xmlCtxtResetPush(xmlParserCtxtPtr ctxt, const char *chunk,
     if (ctxt == NULL)
         return(1);
 
-    if ((encoding == NULL) && (chunk != NULL) && (size >= 4))
-        enc = xmlDetectCharEncoding((const xmlChar *) chunk, size);
+    if ((encoding == NULL) && (chunk != NULL)) {
+       enc = xmlDetectCharEncoding(BAD_CAST chunk, size);
+    }
 
     buf = xmlAllocParserInputBuffer(XML_CHAR_ENCODING_NONE);
     if (buf == NULL)

commit 602dca621b3dd29b8d10a9e0f0d4e27383677bc2
Author: Olli Pottonen <olli.potto...@iki.fi>
Date:   Sun Jun 28 11:58:08 2015 +1000

    Code cleanup.

diff --git a/HTMLparser.c b/HTMLparser.c
index 9c4ec04..a302500 100644
--- a/HTMLparser.c
+++ b/HTMLparser.c
@@ -3497,7 +3497,6 @@ htmlParseAttribute(htmlParserCtxtPtr ctxt, xmlChar **value) {
  */
 static void
 htmlCheckEncodingDirect(htmlParserCtxtPtr ctxt, const xmlChar *encoding) {
-
     if ((ctxt == NULL) || (encoding == NULL) ||
         (ctxt->options & HTML_PARSE_IGNORE_ENC))
        return;
@@ -3526,27 +3525,6 @@ htmlCheckEncodingDirect(htmlParserCtxtPtr ctxt, const xmlChar *encoding) {
            xmlFree((xmlChar *) ctxt->encoding);
            ctxt->encoding = xmlStrdup(encoding);
        }
-
-       if ((ctxt->input->buf != NULL) &&
-           (ctxt->input->buf->encoder != NULL) &&
-           (ctxt->input->buf->raw != NULL) &&
-           (ctxt->input->buf->buffer != NULL)) {
-           int nbchars;
-           int processed;
-
-           /*
-            * convert as much as possible to the parser reading buffer.
-            */
-           processed = ctxt->input->cur - ctxt->input->base;
-           xmlBufShrink(ctxt->input->buf->buffer, processed);
-           nbchars = xmlCharEncInput(ctxt->input->buf, 1);
-           if (nbchars < 0) {
-               htmlParseErr(ctxt, XML_ERR_INVALID_ENCODING,
-                            "htmlCheckEncoding: encoder error\n",
-                            NULL, NULL);
-           }
-            xmlBufResetInput(ctxt->input->buf->buffer, ctxt->input);
-       }
     }
 }
 
@@ -4953,36 +4931,19 @@ htmlCreateDocParserCtxt(const xmlChar *cur, const char *encoding) {
        return(NULL);
 
     if (encoding != NULL) {
-       xmlCharEncoding enc;
        xmlCharEncodingHandlerPtr handler;
 
        if (ctxt->input->encoding != NULL)
            xmlFree((xmlChar *) ctxt->input->encoding);
        ctxt->input->encoding = xmlStrdup((const xmlChar *) encoding);
 
-       enc = xmlParseCharEncoding(encoding);
-       /*
-        * registered set of known encodings
-        */
-       if (enc != XML_CHAR_ENCODING_ERROR) {
-           xmlSwitchEncoding(ctxt, enc);
-           if (ctxt->errNo == XML_ERR_UNSUPPORTED_ENCODING) {
-               htmlParseErr(ctxt, XML_ERR_UNSUPPORTED_ENCODING,
-                            "Unsupported encoding %s\n",
-                            (const xmlChar *) encoding, NULL);
-           }
+       handler = xmlFindCharEncodingHandler((const char *) encoding);
+       if (handler != NULL) {
+           xmlSwitchToEncoding(ctxt, handler);
        } else {
-           /*
-            * fallback for unknown encodings
-            */
-           handler = xmlFindCharEncodingHandler((const char *) encoding);
-           if (handler != NULL) {
-               xmlSwitchToEncoding(ctxt, handler);
-           } else {
-               htmlParseErr(ctxt, XML_ERR_UNSUPPORTED_ENCODING,
-                            "Unsupported encoding %s\n",
-                            (const xmlChar *) encoding, NULL);
-           }
+           htmlParseErr(ctxt, XML_ERR_UNSUPPORTED_ENCODING,
+                        "Unsupported encoding %s\n",
+                        (const xmlChar *) encoding, NULL);
        }
     }
     return(ctxt);
@@ -6227,8 +6188,6 @@ htmlCreateFileParserCtxt(const char *filename, const char *encoding)
     htmlParserCtxtPtr ctxt;
     htmlParserInputPtr inputStream;
     char *canonicFilename;
-    /* htmlCharEncoding enc; */
-    xmlChar *content, *content_line = (xmlChar *) "charset=";
 
     if (filename == NULL)
         return(NULL);
@@ -6259,16 +6218,15 @@ htmlCreateFileParserCtxt(const char *filename, const char *encoding)
 
     /* set encoding */
     if (encoding) {
-        size_t l = strlen(encoding);
-
-       if (l < 1000) {
-           content = xmlMallocAtomic (xmlStrlen(content_line) + l + 1);
-           if (content) {
-               strcpy ((char *)content, (char *)content_line);
-               strcat ((char *)content, (char *)encoding);
-               htmlCheckEncoding (ctxt, content);
-               xmlFree (content);
-           }
+        xmlCharEncodingHandlerPtr handler;
+       handler = xmlFindCharEncodingHandler((const char *) encoding);
+        if (handler != NULL) {
+           xmlSwitchToEncoding(ctxt, handler);
+           ctxt->charset = XML_CHAR_ENCODING_UTF8;
+       } else {
+           htmlParseErr(ctxt, XML_ERR_UNSUPPORTED_ENCODING,
+                        "htmlCheckEncoding: unknown encoding %s\n",
+                        BAD_CAST encoding, NULL);
        }
     }
 
diff --git a/encoding.c b/encoding.c
index 3f19d71..15a8d25 100644
--- a/encoding.c
+++ b/encoding.c
@@ -1681,11 +1681,11 @@ xmlGetCharEncodingHandler(xmlCharEncoding enc) {
     if (handlers == NULL) xmlInitCharEncodingHandlers();
     switch (enc) {
         case XML_CHAR_ENCODING_ERROR:
-           return(NULL);
         case XML_CHAR_ENCODING_NONE:
-           return(NULL);
         case XML_CHAR_ENCODING_UTF8:
            return(NULL);
+        case XML_CHAR_ENCODING_ASCII:
+           return xmlFindCharEncodingHandler("ASCII");
         case XML_CHAR_ENCODING_UTF16LE:
            return(xmlUTF16LEHandler);
         case XML_CHAR_ENCODING_UTF16BE:
@@ -1722,7 +1722,6 @@ xmlGetCharEncodingHandler(xmlCharEncoding enc) {
             if (handler != NULL) return(handler);
            break;
         case XML_CHAR_ENCODING_UCS4_2143:
-           break;
         case XML_CHAR_ENCODING_UCS4_3412:
            break;
         case XML_CHAR_ENCODING_UCS2:
diff --git a/parserInternals.c b/parserInternals.c
index 642dd60..b3fc5f4 100644
--- a/parserInternals.c
+++ b/parserInternals.c
@@ -1055,38 +1055,23 @@ xmlSwitchEncoding(xmlParserCtxtPtr ctxt, xmlCharEncoding enc)
                ctxt->charset = XML_CHAR_ENCODING_UTF8;
                return(0);
            case XML_CHAR_ENCODING_UTF16LE:
-               break;
            case XML_CHAR_ENCODING_UTF16BE:
+             /* UTF-16 has a built-in handler, so reaching this branch
+              * is unexpected. */
                break;
            case XML_CHAR_ENCODING_UCS4LE:
-               __xmlErrEncoding(ctxt, XML_ERR_UNSUPPORTED_ENCODING,
-                              "encoding not supported %s\n",
-                              BAD_CAST "UCS4 little endian", NULL);
-               break;
            case XML_CHAR_ENCODING_UCS4BE:
-               __xmlErrEncoding(ctxt, XML_ERR_UNSUPPORTED_ENCODING,
-                              "encoding not supported %s\n",
-                              BAD_CAST "UCS4 big endian", NULL);
-               break;
            case XML_CHAR_ENCODING_EBCDIC:
-               __xmlErrEncoding(ctxt, XML_ERR_UNSUPPORTED_ENCODING,
-                              "encoding not supported %s\n",
-                              BAD_CAST "EBCDIC", NULL);
-               break;
            case XML_CHAR_ENCODING_UCS4_2143:
-               __xmlErrEncoding(ctxt, XML_ERR_UNSUPPORTED_ENCODING,
-                              "encoding not supported %s\n",
-                              BAD_CAST "UCS4 2143", NULL);
-               break;
            case XML_CHAR_ENCODING_UCS4_3412:
-               __xmlErrEncoding(ctxt, XML_ERR_UNSUPPORTED_ENCODING,
-                              "encoding not supported %s\n",
-                              BAD_CAST "UCS4 3412", NULL);
-               break;
            case XML_CHAR_ENCODING_UCS2:
+           case XML_CHAR_ENCODING_2022_JP:
+           case XML_CHAR_ENCODING_SHIFT_JIS:
+           case XML_CHAR_ENCODING_EUC_JP:
+           default:
                __xmlErrEncoding(ctxt, XML_ERR_UNSUPPORTED_ENCODING,
                               "encoding not supported %s\n",
-                              BAD_CAST "UCS2", NULL);
+                                BAD_CAST encodingName, NULL);
                break;
            case XML_CHAR_ENCODING_8859_1:
            case XML_CHAR_ENCODING_8859_2:
@@ -1111,23 +1096,6 @@ xmlSwitchEncoding(xmlParserCtxtPtr ctxt, xmlCharEncoding enc)
                }
                ctxt->charset = enc;
                return(0);
-           case XML_CHAR_ENCODING_2022_JP:
-               __xmlErrEncoding(ctxt, XML_ERR_UNSUPPORTED_ENCODING,
-                              "encoding not supported %s\n",
-                              BAD_CAST "ISO-2022-JP", NULL);
-               break;
-           case XML_CHAR_ENCODING_SHIFT_JIS:
-               __xmlErrEncoding(ctxt, XML_ERR_UNSUPPORTED_ENCODING,
-                              "encoding not supported %s\n",
-                              BAD_CAST "Shift_JIS", NULL);
-               break;
-           case XML_CHAR_ENCODING_EUC_JP:
-               __xmlErrEncoding(ctxt, XML_ERR_UNSUPPORTED_ENCODING,
-                              "encoding not supported %s\n",
-                              BAD_CAST "EUC-JP", NULL);
-               break;
-           default:
-               break;
        }
     }
     if (handler == NULL)

commit 4cf5e24cba905cc5ec3dd33c2c46807e5df838d9
Author: Olli Pottonen <olli.potto...@iki.fi>
Date:   Sat Jul 4 08:52:08 2015 +1000

    Implement HTML5 encoding detection algorithm.
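
One step of the HTML5 prescan is pulling the charset out of a meta content value such as "text/html; charset=utf-8". The sketch below shows that step in isolation. It is a standalone illustration, not the code in this patch: sketch_charset_from_content and its helpers are hypothetical names, it works on plain char buffers rather than xmlChar, and it follows the HTML5 rule that an unquoted value ends at whitespace, ';' or the end of the string.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* HTML whitespace as used by the prescan: TAB, LF, FF, CR, SPACE. */
static int sketch_is_space(char c) {
    return c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ';
}

/* Case-insensitive check that the n bytes at s spell the lowercase word w. */
static int sketch_match(const char *s, const char *w, size_t n) {
    size_t i;
    for (i = 0; i < n; i++)
        if (tolower((unsigned char) s[i]) != w[i])
            return 0;
    return 1;
}

/*
 * Find the charset value in s[0..len). Returns 1 and points *out at the
 * value inside s (no copy, not NUL-terminated), or returns 0 if none.
 */
int sketch_charset_from_content(const char *s, size_t len,
                                const char **out, size_t *out_len) {
    size_t i = 0, start;

    assert(s != NULL && out != NULL && out_len != NULL);
    for (;;) {
        /* Find the next "charset" keyword. */
        while (i + 7 <= len && !sketch_match(s + i, "charset", 7))
            i++;
        if (i + 7 > len)
            return 0;
        i += 7;
        while (i < len && sketch_is_space(s[i]))
            i++;
        if (i >= len)
            return 0;
        if (s[i] == '=')
            break;      /* otherwise keep looking for another keyword */
    }
    i++;                /* skip '=' */
    while (i < len && sketch_is_space(s[i]))
        i++;
    if (i >= len)
        return 0;

    if (s[i] == '"' || s[i] == '\'') {
        char quote = s[i++];
        start = i;
        while (i < len && s[i] != quote)
            i++;
        if (i >= len)
            return 0;   /* unterminated quote: no encoding */
    } else {
        start = i;
        while (i < len && !sketch_is_space(s[i]) && s[i] != ';')
            i++;
    }
    *out = s + start;
    *out_len = i - start;
    return 1;
}
```

Because the result is a pointer and a length into the caller's buffer, any later comparison against names such as "UTF-16" has to be length-bounded as well.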

diff --git a/HTMLparser.c b/HTMLparser.c
index a302500..82545cb 100644
--- a/HTMLparser.c
+++ b/HTMLparser.c
@@ -345,6 +345,342 @@ htmlNodeInfoPop(htmlParserCtxtPtr ctxt)
     if (l == 1) b[i++] = (xmlChar) v;                                  \
     else i += xmlCopyChar(l,&b[i],v)
 
+
+/* HTML 5 REC encoding sniffing algorithm */
+
+typedef struct {
+    const xmlChar *cur;
+    const xmlChar *end;
+    xmlChar *res;
+} _encSniffState;
+
+static inline int isHtmlSpace(xmlChar c) {
+    return((c == '\t') || (c == '\n') || (c == '\f') ||
+          (c == '\r') || (c == ' '));
+}
+
+static inline int isAsciiAlpha(xmlChar c) {
+    return(('A' <= c && c <= 'Z') || ('a' <= c && c <= 'z'));
+}
+
+typedef struct {
+    const xmlChar *name;
+    int nameLen;
+    const xmlChar *value;
+    int valueLen;
+} _encSniffAttribute;
+
+/**
+ * encSniffGetAttribute:
+ * Auxiliary function for W3C HTML 5 REC encoding sniffing.
+ *
+ * Return attribute contained in a tag, if any.
+ */
+static _encSniffAttribute * encSniffGetAttribute(_encSniffState *state) {
+    static _encSniffAttribute res;
+    res.value = NULL;
+    res.valueLen = 0;
+
+    const xmlChar *cur = state->cur, *end = state->end;
+    while (cur < end &&
+          (isHtmlSpace(cur[0]) || cur[0] == '/'))
+        cur++;
+    if (cur >= end || cur[0] == '>') {
+        state->cur = cur +1;
+        return(NULL);
+    }
+
+    res.name = cur;
+    while (cur < end) {
+        if ((cur[0] == '/') || (cur[0] == '>')) {
+           res.nameLen = cur - res.name;
+           state->cur = cur; /* leave '/' or '>' for the next call */
+           return(&res);
+       }
+
+       if (isHtmlSpace(cur[0]) || (cur[0] == '=' && cur > res.name))
+           break;
+       cur++;
+    }
+    res.nameLen = cur - res.name;
+
+    if (cur >= end) {
+        state->cur = cur;
+        return(NULL);
+    }
+
+    while (cur < end && isHtmlSpace(cur[0]))
+        cur++;
+    if (cur >= end || cur[0] != '=') {
+        state->cur = cur;
+        return(NULL);
+    }
+    cur++;
+    while (cur < end && isHtmlSpace(cur[0]))
+        cur++;
+
+    if (cur >= end) {
+        state->cur = cur;
+        return(NULL);
+    }
+
+    if ((cur[0] == '\'') || (cur[0] == '"')) {
+        xmlChar quote_char = cur[0];
+       res.value = ++cur;
+
+       while (cur < end && cur[0] != quote_char)
+           cur++;
+
+       res.valueLen = cur - res.value;
+       state->cur = cur +1;
+    } else if(cur[0] != '>') {
+        res.value = cur;
+
+       while (cur < end && !isHtmlSpace(cur[0]) && (cur[0] != '>'))
+           cur++;
+       res.valueLen = cur - res.value;
+       state->cur = cur;
+    }
+    return((cur >= end) ? NULL : &res);
+}
+
+/**
+ * encSniffEncodingFromMeta:
+ * Auxiliary function for W3C HTML 5 REC encoding sniffing.
+ *
+ * Find the encoding in a meta content attribute value such as
+ * "text/html; charset=ascii".
+ */
+static int encSniffEncodingFromMeta(const xmlChar *s, const xmlChar *end,
+                                   const xmlChar **res, int *res_len) {
+    for(;;) {
+        while(s + 7 < end && xmlStrncasecmp(s, BAD_CAST "charset", 7))
+           s++;
+       if (s + 7 >= end)
+           return(0);
+       s += 7;
+
+       while (s < end && isHtmlSpace(s[0]))
+           s++;
+       if (s >= end)
+           return(0);
+
+       if (s[0] == '=')
+           break;
+    }
+    s++;
+
+    while (s < end && isHtmlSpace(s[0]))
+        s++;
+    if (s >= end)
+        return(0);
+
+    const xmlChar *start;
+    if (s[0] == '\'' || s[0] == '"') {
+        xmlChar quote_char = s[0];
+        start = ++s;
+        while (s < end && s[0] != quote_char)
+            s++;
+        if (s >= end) /* unterminated quote: no encoding */
+            return(0);
+    } else {
+        start = s;
+        while (s < end && !isHtmlSpace(s[0]) && s[0] != ';')
+            s++;
+    }
+    while (start < s && isHtmlSpace(start[0]) )
+        start++;
+    *res = start;
+    *res_len = s - start;
+    return(1);
+}
+
+/**
+ * encSniffScanMeta:
+ * Auxiliary function for W3C HTML 5 REC encoding sniffing.
+ *
+ * Scan a meta tag and try to find encoding declaration.
+ */
+static int encSniffScanMeta(_encSniffState *state) {
+    const xmlChar *cur = state->cur, *end = state->end;
+    if (cur + 6 > end ||
+       (cur[0] != '<') ||
+       ((cur[1] != 'm') && (cur[1] != 'M')) ||
+       ((cur[2] != 'e') && (cur[2] != 'E')) ||
+       ((cur[3] != 't') && (cur[3] != 'T')) ||
+       ((cur[4] != 'a') && (cur[4] != 'A')) ||
+       (!isHtmlSpace(cur[5]) && cur[5] != '/'))
+        return 0;
+
+    state->cur += 6;
+    int gotPragma = 0, needPragma = -1;
+    const xmlChar *charset = NULL;
+    int charset_len = 0;
+
+    for(;;) {
+        _encSniffAttribute *attr = encSniffGetAttribute(state);
+
+       if (attr == NULL) {
+           break;
+       } else if ((attr->nameLen == 10) &&
+                  !xmlStrncasecmp(attr->name, BAD_CAST "http-equiv", 10) &&
+                  (attr->valueLen == 12) &&
+                  !xmlStrncasecmp(attr->value, BAD_CAST "content-type", 12)) {
+         gotPragma = 1;
+       } else if ((attr->nameLen == 7) &&
+                  !xmlStrncasecmp(attr->name, BAD_CAST "content", 7) &&
+                  charset == NULL) {
+           if (encSniffEncodingFromMeta(attr->value,
+                                        attr->value + attr->valueLen,
+                                        &charset, &charset_len) ) {
+               needPragma = 1;
+           }
+       } else if ((attr->nameLen == 7) &&
+                  !xmlStrncasecmp(attr->name, BAD_CAST "charset", 7)) {
+         charset = attr->value;
+         charset_len = attr->valueLen;
+         needPragma = 0;
+       }
+    }
+
+    if (needPragma && !gotPragma)
+        charset = NULL;
+
+    if ((charset_len == 6 && !xmlStrncasecmp(charset, BAD_CAST "UTF-16", 6)) ||
+       (charset_len == 8 && !xmlStrncasecmp(charset, BAD_CAST "UTF-16LE", 8)) ||
+       (charset_len == 8 && !xmlStrncasecmp(charset, BAD_CAST "UTF-16BE", 8))) {
+        charset = BAD_CAST "UTF-8";
+       charset_len = 5;
+    }
+
+    state->res = xmlStrndup(charset, charset_len);
+    return(1);
+}
+
+/**
+ * encSniffSkipComment:
+ * Auxiliary function for W3C HTML 5 REC encoding sniffing.
+ *
+ * Skip comment, if any.
+ */
+static int encSniffSkipComment(_encSniffState *state) {
+    const xmlChar *cur = state->cur, *end = state->end;
+    if ((cur + 4 > end) || (cur[0] != '<') || (cur[1] != '!') ||
+       (cur[2] != '-') || (cur[3] != '-'))
+        return(0);
+
+    cur += 2;
+    while (cur + 2 < end &&
+          ((cur[0] != '-') || (cur[1] != '-') || (cur[2] != '>')))
+        cur++;
+
+    /* Step past the closing "-->" (or to the end of the window). */
+    state->cur = cur + 3;
+    return(1);
+}
+
+/**
+ * encSniffSkipTag:
+ * Auxiliary function for W3C HTML 5 REC encoding sniffing.
+ *
+ * Skip element tag, if any.
+ */
+static int encSniffSkipTag(_encSniffState *state) {
+    const xmlChar *cur = state->cur, *end = state->end;
+
+    int startfound =
+      ((cur + 1 < end) && (cur[0] == '<') && isAsciiAlpha(cur[1])) ||
+      ((cur + 2 < end) && (cur[0] == '<') && (cur[1] == '/') &&
+        isAsciiAlpha(cur[2]));
+
+    if (!startfound)
+        return(0);
+
+    while (cur < end && !isHtmlSpace(cur[0]) && cur[0] != '>')
+      cur++;
+    state->cur = cur;
+
+
+    while (state->cur < end && encSniffGetAttribute(state))
+      ;
+
+    return(1);
+}
+
+/**
+ * encSniffSkipMisc:
+ * Auxiliary function for W3C HTML 5 REC encoding sniffing.
+ *
+ * Skip doctype declaration, processing instruction or SGML style
+ * element tag, if any.
+ */
+static int encSniffSkipMisc(_encSniffState *state) {
+    const xmlChar *cur = state->cur, *end = state->end;
+
+    if ((cur + 1 >= end) ||
+       (cur[0] != '<') ||
+       (cur[1] != '!' && cur[1] != '/' && cur[1] != '?')) {
+        return(0);
+    }
+
+    while (cur < end && cur[0] != '>')
+        cur++;
+
+    state->cur = cur;
+    return(1);
+}
+
+/**
+ * encSniffSkipContent:
+ * Auxiliary function for W3C HTML 5 REC encoding sniffing.
+ *
+ * Skip content until the next start tag, comment, PI or
+ * doctype declaration.
+ */
+static void encSniffSkipContent(_encSniffState *state) {
+    ++state->cur;
+    while (state->cur < state->end && state->cur[0] != '<')
+        ++state->cur;
+}
+
+/**
+ * html5FindEncoding:
+ * @ctxt:  the HTML parser context
+ *
+ * W3C HTML 5 Recommendation algorithm to prescan a byte stream to
+ * determine its encoding.
+ *
+ * Returns an encoding string or NULL if not found. The caller must
+ * release the string with xmlFree().
+ */
+static const char *
+html5FindEncoding(xmlParserCtxtPtr ctxt) {
+    if ((ctxt == NULL) || (ctxt->input == NULL) ||
+        (ctxt->input->encoding != NULL) || (ctxt->input->buf == NULL) ||
+        (ctxt->input->buf->encoder != NULL))
+        return(NULL);
+    if ((ctxt->input->cur == NULL) || (ctxt->input->end == NULL))
+        return(NULL);
+
+    const xmlChar *end = ctxt->input->cur + 4096;
+    end = (end < ctxt->input->end) ? end : ctxt->input->end;
+    _encSniffState state = {ctxt->input->cur, end, NULL};
+
+    while (state.cur < end && state.res == NULL) {
+        if (encSniffSkipComment(&state))
+           continue;
+       if (encSniffScanMeta(&state))
+           continue;
+       if (encSniffSkipTag(&state))
+           continue;
+       if (encSniffSkipMisc(&state))
+         continue;
+       encSniffSkipContent(&state);
+    }
+
+    return((const char *) state.res);
+}
+
 /**
  * htmlFindEncoding:
  * @the HTML parser context
@@ -510,10 +846,12 @@ htmlCurrentChar(xmlParserCtxtPtr ctxt, int *len) {
      * Humm this is bad, do an automatic flow conversion
      */
     {
-        xmlChar * guess;
+        xmlChar * guess = NULL;
         xmlCharEncodingHandlerPtr handler;
 
-        guess = htmlFindEncoding(ctxt);
+       if ((ctxt->options & HTML_HTML5_ENC_SNIFF) == 0) {
+            guess = htmlFindEncoding(ctxt);
+       }
         if (guess == NULL) {
             xmlSwitchEncoding(ctxt, XML_CHAR_ENCODING_8859_1);
         } else {
@@ -3574,7 +3912,8 @@ htmlCheckMeta(htmlParserCtxtPtr ctxt, const xmlChar **atts) {
     int http = 0;
     const xmlChar *content = NULL;
 
-    if ((ctxt == NULL) || (atts == NULL))
+    if ((ctxt == NULL) || (atts == NULL) ||
+       (ctxt->options & HTML_HTML5_ENC_SNIFF))
        return;
 
     i = 0;
@@ -6595,6 +6934,10 @@ htmlCtxtUseOptions(htmlParserCtxtPtr ctxt, int options)
         ctxt->options |= HTML_PARSE_NOIMPLIED;
         options -= HTML_PARSE_NOIMPLIED;
     }
+    if (options & HTML_HTML5_ENC_SNIFF) {
+        ctxt->options |= HTML_HTML5_ENC_SNIFF;
+       options -= HTML_HTML5_ENC_SNIFF;
+    }
     ctxt->dictNames = 0;
     return (options);
 }
@@ -6619,6 +6962,13 @@ htmlDoRead(htmlParserCtxtPtr ctxt, const char *URL, const char *encoding,
 
     htmlCtxtUseOptions(ctxt, options);
     ctxt->html = 1;
+
+    int free_encoding = 0;
+    if (options & HTML_HTML5_ENC_SNIFF) {
+        encoding = html5FindEncoding(ctxt);
+       free_encoding = 1;
+    }
+
     if (encoding != NULL) {
         xmlCharEncodingHandlerPtr hdlr;
 
@@ -6630,6 +6980,9 @@ htmlDoRead(htmlParserCtxtPtr ctxt, const char *URL, const char *encoding,
             ctxt->input->encoding = xmlStrdup((xmlChar *)encoding);
         }
     }
+    if (free_encoding && encoding != NULL)
+        xmlFree((xmlChar *) encoding);
+
     if ((URL != NULL) && (ctxt->input != NULL) &&
         (ctxt->input->filename == NULL))
         ctxt->input->filename = (char *) xmlStrdup((const xmlChar *) URL);
diff --git a/include/libxml/HTMLparser.h b/include/libxml/HTMLparser.h
index 551186c..5c06351 100644
--- a/include/libxml/HTMLparser.h
+++ b/include/libxml/HTMLparser.h
@@ -185,7 +185,8 @@ typedef enum {
     HTML_PARSE_NONET   = 1<<11,/* Forbid network access */
     HTML_PARSE_NOIMPLIED= 1<<13,/* Do not add implied html/body... elements */
     HTML_PARSE_COMPACT  = 1<<16,/* compact small text nodes */
-    HTML_PARSE_IGNORE_ENC=1<<21 /* ignore internal document encoding hint */
+    HTML_PARSE_IGNORE_ENC=1<<21,/* ignore internal document encoding hint */
+    HTML_HTML5_ENC_SNIFF= 1<<23 /* use HTML5 encoding sniffing algorithm */
 } htmlParserOption;
 
 XMLPUBFUN void XMLCALL

_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
https://mail.gnome.org/mailman/listinfo/xml
