On 5 May 2005, at 22:47, Daniel Rall wrote:

On Thu, 2005-05-05 at 15:24 +0100, John Wilson wrote:

I'm not sure I follow this either :)

Currently we emit an XML declaration which says we are using
ISO8859-1 encoding.


The declaration generated depends upon the encoding in use by XmlWriter,
no?


        write(PROLOG_START);
        write(canonicalizeEncoding(enc));
        write(PROLOG_END);


The problem with allowing arbitrary encodings is that the writer has no idea of the mapping from Unicode code points to the character encoding. That is, there is no way of answering the question "I have a Unicode code point with value X; can I represent it directly in encoding Y?" If the answer to this question is "no" then it has to emit a character reference.


If the writer wants to support arbitrary encodings it has to be given a mechanism for determining when to emit character references. Personally I don't think the flexibility in choosing a character encoding is worth the complexity of supporting it. My view is that the writer should only support UTF-8 and (possibly) UTF-16. These encodings do not require an encoding declaration and can represent all Unicode code points.

For maximum interoperability I would suggest we use UTF-8 but emit character references for all values > 0x7F. This means that even if the other end gets the encoding wrong it will still almost certainly understand the characters. If the other end does not understand character references it will be very easy to see what the problem is (which is not quite so easy if it mistakes UTF-8 for ISO8859-1, for example).
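
Something like this, perhaps (just a sketch with invented names, not the existing XmlWriter; it also escapes the markup characters so the output is usable as character data):

    import java.io.IOException;
    import java.io.Writer;

    // Hypothetical helper (name and shape invented for illustration):
    // character data goes out as pure US-ASCII, with numeric character
    // references for anything above 0x7F.
    public class AsciiSafeWriter {
        private final Writer out;

        public AsciiSafeWriter(Writer out) {
            this.out = out;
        }

        public void writeText(String s) throws IOException {
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                if (c == '<') { out.write("&lt;"); }
                else if (c == '&') { out.write("&amp;"); }
                else if (c == '>') { out.write("&gt;"); }
                else if (c > 0x7F) {
                    int cp = c;
                    // A surrogate pair must become ONE reference to the
                    // supplementary code point, not two references to
                    // the individual halves.
                    if (c >= 0xD800 && c <= 0xDBFF && i + 1 < s.length()) {
                        char lo = s.charAt(i + 1);
                        if (lo >= 0xDC00 && lo <= 0xDFFF) {
                            cp = 0x10000 + ((c - 0xD800) << 10) + (lo - 0xDC00);
                            i++;
                        }
                    }
                    out.write("&#x" + Integer.toHexString(cp) + ";");
                } else {
                    out.write(c);
                }
            }
        }
    }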

Unicode code points in the range 0x00 to 0xFF have the same values as the
ISO8859-1 characters. If we wish to send Unicode code points with values
> 0xFF then we have to emit character references (e.g. &#x1FF;)

If we were to change the encoding to UTF-8 or UTF-16 then we would
never have to emit character references (though we still could if we
wanted to).


Like you say below, we'd still have to emit character references for Unicode code points not allowed in XML documents, yes?

No. Characters which are not allowed in XML documents (e.g. US-ASCII control characters such as NUL) are not allowed even when represented by a character reference.


The XML 1.0 spec forbids some Unicode code points from appearing in a
well formed XML document (only these code points are allowed: #x9 |
#xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] -
see section 2.2 of the spec). Note that US-ASCII control characters
other than HT, CR and NL are not allowed. Using a character reference
doesn't make any difference: <a>&#x0;</a> is not a well formed XML
document and should be rejected by an XML parser (MinML used not to
complain about this - later versions do).
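
For reference, that production transcribes directly into a predicate (a sketch; the method name is mine):

    // The Char production from section 2.2, transcribed directly:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isLegalXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }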


What range are these control characters in (e.g. < 0x20)?

The US-ASCII control characters which are not allowed are all those with values < 0x20, with the exception of 0x09, 0x0A and 0x0D.


There is another little wrinkle with CR and LF. An XML parser is
required to "normalise" line endings (see section 2.11 of the spec).
This normalisation involves replacing the two-character sequence CR NL,
or a lone CR, with a single NL. This normalisation does not occur if
the CR is represented using a character reference.
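
A quick way to see the difference (a throwaway demo against the standard JAXP SAX parser):

    import java.io.ByteArrayInputStream;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.helpers.DefaultHandler;

    // Literal CR NL is normalised to NL, but a CR written as a
    // character reference survives.
    public class NormalisationDemo {
        public static void main(String[] args) throws Exception {
            show("<a>x\r\ny</a>");        // parser reports "x\ny"
            show("<a>x&#xD;&#xA;y</a>");  // parser reports "x\r\ny"
        }

        static void show(String doc) throws Exception {
            final StringBuffer sb = new StringBuffer();
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(doc.getBytes("UTF-8")),
                new DefaultHandler() {
                    public void characters(char[] ch, int start, int len) {
                        sb.append(ch, start, len);
                    }
                });
            System.out.println(visible(sb.toString()));
        }

        // Make CR and NL visible in the output.
        static String visible(String s) {
            StringBuffer b = new StringBuffer();
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                if (c == '\r') b.append("\\r");
                else if (c == '\n') b.append("\\n");
                else b.append(c);
            }
            return b.toString();
        }
    }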

So a correct XML writer should do the following:

1/ refuse to write characters with Unicode code points which are not
allowed in an XML document.


Do you suggest throwing an exception here, or writing a '?' character?

You have to throw an exception. There is no point in sending a message you know that the other end will not be able to understand.


2/ replace characters whose Unicode code points cannot be represented
in the encoding being used with the appropriate character reference.


For any random encoding, does anyone know a good way of determining whether such a character is representable in said encoding?

No - this is a classic deficiency in the Java Writer API. If we had a canRepresent() function then the world would be a better place for XML encoders.
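
For what it's worth, java.nio.charset (JDK 1.4 and later) does expose this, just not on Writer itself: CharsetEncoder.canEncode() answers exactly that question (with the javadoc caveat that it must not be called while an encoding operation is in progress). A sketch of a canRepresent() built on it:

    import java.nio.charset.Charset;
    import java.nio.charset.CharsetEncoder;

    // "Can code point cp be represented directly in this encoding?"
    static boolean canRepresent(CharsetEncoder enc, int cp) {
        if (cp > 0xFFFF) {
            // Supplementary code point: test the surrogate pair as a unit.
            char hi = (char) (0xD800 + ((cp - 0x10000) >> 10));
            char lo = (char) (0xDC00 + ((cp - 0x10000) & 0x3FF));
            return enc.canEncode(new String(new char[] { hi, lo }));
        }
        return enc.canEncode((char) cp);
    }

    // e.g.:
    //   CharsetEncoder enc = Charset.forName("ISO-8859-1").newEncoder();
    //   canRepresent(enc, 0xE9);  // true  - e-acute is in ISO8859-1
    //   canRepresent(enc, 0x1FF); // false - must become &#x1FF;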




3/ replace <, & and > with either the predefined entities (&lt; etc.)
or with a character reference.


We're already replacing them with predefined entities, so we're in good shape here.


4/ replace all CR characters with a character reference.


We do this to keep them from getting normalized by the XML parser, I take it? Previously, we'd write them literally.

Yes - this hasn't caused problems in the past but it could in principle.
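
Putting the four rules together, the per-code-point logic might look like this (again a sketch with invented names, not the existing XmlWriter, leaning on the isLegalXml10Char() and canRepresent() helpers above):

    import java.io.IOException;
    import java.io.Writer;
    import java.nio.charset.CharsetEncoder;

    class EscapingWriter {
        private final Writer out;
        private final CharsetEncoder encoder;

        EscapingWriter(Writer out, CharsetEncoder encoder) {
            this.out = out;
            this.encoder = encoder;
        }

        // One code point of character data, rules 1-4 applied in order.
        void writeDataCharacter(int cp) throws IOException {
            // 1/ refuse code points no XML 1.0 document may contain.
            if (!isLegalXml10Char(cp)) {
                throw new IOException("code point 0x"
                        + Integer.toHexString(cp)
                        + " cannot appear in an XML document");
            }
            // 3/ the markup-significant characters.
            if (cp == '<') { out.write("&lt;"); return; }
            if (cp == '&') { out.write("&amp;"); return; }
            if (cp == '>') { out.write("&gt;"); return; }
            // 4/ a literal CR would be normalised away by the parser.
            if (cp == 0xD) { out.write("&#xD;"); return; }
            // 2/ anything the target encoding cannot represent directly.
            if (!canRepresent(encoder, cp)) {
                out.write("&#x" + Integer.toHexString(cp) + ";");
                return;
            }
            if (cp > 0xFFFF) {        // write a supplementary code point
                out.write(0xD800 + ((cp - 0x10000) >> 10));
                out.write(0xDC00 + ((cp - 0x10000) & 0x3FF));
            } else {
                out.write(cp);
            }
        }
    }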


If we wanted to have the greatest possible chance of interoperating
we should emit no XML encoding declaration and replace code points
with values > 0x7F with character references.


I agree with the part about replacing code points with values > 0x7f with character references (see exchange with Jochen).

Can non-ASCII encodings be determined by the parser using the BOM, or
some such heuristic?  Would we write all non-ASCII encoding as UTF-8?

The XML spec has a section which describes heuristics that can determine many encodings by looking at the first four octets. This is based on the fact that the first character of a well formed XML document must be '<'. See Appendix F of the spec for the full picture (it's a really clever mechanism IMO).
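
Roughly, the first few Appendix F cases look like this (a simplified sketch; the appendix also covers UCS-4, EBCDIC and more):

    // Sniff the first four octets, after XML 1.0 Appendix F.
    // Assumes at least four octets are available.
    static String guessEncoding(byte[] b) {
        int b0 = b[0] & 0xFF, b1 = b[1] & 0xFF, b2 = b[2] & 0xFF, b3 = b[3] & 0xFF;
        if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) return "UTF-8";   // BOM
        if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE";              // BOM
        if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE";              // BOM
        if (b0 == 0x00 && b1 == 0x3C && b2 == 0x00 && b3 == 0x3F)
            return "UTF-16BE";                                        // "<?"
        if (b0 == 0x3C && b1 == 0x00 && b2 == 0x3F && b3 == 0x00)
            return "UTF-16LE";                                        // "<?"
        if (b0 == 0x3C && b1 == 0x3F && b2 == 0x78 && b3 == 0x6D)
            return "ASCII-compatible (read the encoding declaration)"; // "<?xm"
        return "unknown (assume UTF-8)";
    }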




John Wilson
The Wilson Partnership
http://www.wilson.co.uk



