Re: [PATCH] characters invalid for an encoding

Daniel Rall Tue, 17 May 2005 03:20:11 -0700

On Fri, 2005-05-06 at 14:59 +0100, John Wilson wrote: 
>On 6 May 2005, at 12:03, Jochen Wiedmann wrote:
... 
>>> For maximum interoperability I would suggest we use UTF-8 but use
>>> character references for all values > 0X7F. This means that even if the
>>> other end gets the encoding wrong it will still almost certainly
>>> understand the characters. If the other end does not understand
>>> character encodings it will be very easy to see what the problem is
>>> (which is not quite so easy to do if it mistakes UTF-8 for ISO8859-1,
>>> for example)
>>
>> That is, as far as I can tell, what Daniel's proposed patch does.
>
>Yes, it would appear to do this. However, it also seems to emit invalid
>XML code points as character references (e.g. the NULL character
>would be emitted as &#x0;).


That's right -- it was intentional, as I was unaware of this
restriction, and figured I'd start with the parts it seemed that
everyone agreed on.  :-)
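
To make the agreed-on part concrete, here is a minimal sketch of the
escaping idea: keep the output ASCII-safe by writing every code point
above 0x7F as a character reference. The class and method names are
hypothetical, and the hexadecimal reference format is an assumption
based on the &#x0; example quoted above; this is not the code in
XmlWriter.

```java
// Hypothetical helper for illustration only; not the actual XmlWriter code.
final class CharRefSketch {
    /**
     * Escapes the three markup characters and writes every code point
     * above 0x7F as a hexadecimal character reference.
     */
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            // Work on code points so surrogate pairs become one reference.
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            switch (cp) {
            case '<': out.append("&lt;"); break;
            case '>': out.append("&gt;"); break;
            case '&': out.append("&amp;"); break;
            default:
                if (cp > 0x7f) {
                    // Assumed format: &#x<hex>;
                    out.append("&#x").append(Integer.toHexString(cp)).append(';');
                } else {
                    // ASCII, including control characters, is copied through
                    // unchanged; the control-character question is left out here.
                    out.append((char) cp);
                }
            }
        }
        return out.toString();
    }
}
```

Because the result is pure ASCII, a receiver that guesses ISO-8859-1
instead of UTF-8 still decodes the same bytes to the same characters.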

>I do not believe that the XML spec allows  
>this. I believe that these code points cannot appear in a well formed  
>document in any form. The intent is to allow the consuming  
>application to be 100% sure it never sees these characters.

I did some looking around, and the closest thing I could find
supporting that is an email by Tim Bray:

http://lists.xml.org/archives/xml-dev/199804/msg00502.html

I also found some conformance testing materials against a really old XML
parser from Sun:

http://www.xml.com/1999/09/conformance/reports/report-sun-val.html

I took a look through the spec, but nothing stood out.  John, are there
any portions of the spec that I should be looking at in particular?  The
section on valid characters is really clear that the
majority of control characters can't occur, but I didn't see any
discussion as to why replacing them with character references isn't a
good enough escaping mechanism.  Not trying to be obstructionist -- just
trying to understand.
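
For reference, one reading of XML 1.0 that may bear on the question:
section 2.2 defines the Char production, and section 4.1 carries a
"Legal Character" well-formedness constraint saying that characters
referred to using character references must match Char. Under that
reading, a check could be sketched as below. The class and method names
are hypothetical, and this is not the isValidXMLChar() already in
XmlWriter.

```java
// Hypothetical sketch of the XML 1.0 Char production; illustration only.
final class XmlCharSketch {
    /**
     * Returns true if the code point matches the XML 1.0 Char production:
     * #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
     */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }
}
```

If that reading holds, a code point failing this check could appear in
a well-formed document neither literally nor as a character reference.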


I've committed patches to CVS HEAD and XMLRPC_1_2_BRANCH implementing
everything we've discussed (including test cases), _except_ the blocking
of these suspect control characters.  Attached is a patch which could be
applied to CVS HEAD to block such characters, but if we end up going
that route, it's probably time to optimize the changes I've made
recently to XmlWriter.

Index: src/java/org/apache/xmlrpc/XmlWriter.java
===================================================================
RCS file: /home/cvs/ws-xmlrpc/src/java/org/apache/xmlrpc/XmlWriter.java,v
retrieving revision 1.15
diff -u -r1.15 XmlWriter.java
--- src/java/org/apache/xmlrpc/XmlWriter.java	16 May 2005 22:39:27 -0000	1.15
+++ src/java/org/apache/xmlrpc/XmlWriter.java	16 May 2005 22:50:50 -0000
@@ -429,6 +429,13 @@
                 // outside of the valid range for ASCII, too.
                 if (c > 0x7f || !isValidXMLChar(c))
                 {
+                    if (isDisallowedControlChar(c))
+                    {
+                        throw new XmlRpcException
+                            (0, "Invalid XML character corresponding to " +
+                             "code point " + String.valueOf((int) c));
+                    }
+
                     // Replace the code point with a character reference.
                     writeCharacterReference(c);
                 }
@@ -469,6 +476,31 @@
         }
     }
 
+    /**
+     * John Wilson indicates that some characters simply aren't
+     * allowed in XML documents, even as character references.
+     *
+     * @return Whether the specified character is a control character
+     * which is disallowed in XML.
+     */
+    private static final boolean isDisallowedControlChar(char c)
+    {
+        if (c < 0x20)
+        {
+            switch (c)
+            {
+            case 0x9:  // horizontal tab, '\t'
+            case 0xa:  // line feed, '\n'
+            case 0xd:  // carriage return, '\r'
+                return false;
+
+            default:
+                return true;
+            }
+        }
+        return false;
+    }
+
     protected static void setTypeDecoder(TypeDecoder newTypeDecoder)
     {
         typeDecoder = newTypeDecoder;
