Markus Scherer wrote:
While processing 16-bit Unicode text which is not assumed to be well-formed UTF-16, you can treat (decode) an unpaired surrogate as a mostly-inert surrogate code point. However, you cannot unambiguously encode a surrogate code point in 16-bit text (because you could not distinguish a sequence of lead+trail surrogate code points from one supplementary code point), and therefore it is not allowed to encode surrogate code points in any well-formed UTF-8/16/32. [All of this is discussed in The Unicode Standard, Chapter 3.]
I'm probably missing something here, but I don't agree that it's OK for a consumer of UTF-16 to accept an unpaired surrogate without throwing an error, or converting it to U+FFFD, or otherwise raising a fuss. Unpaired surrogates are ill-formed, and have to be caught and dealt with.
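To make the alternatives concrete, here is a minimal Python sketch of the three ways a consumer could treat an unpaired surrogate in 16-bit text: reject it, replace it with U+FFFD, or pass it through as a surrogate code point, as Markus describes. The function name `decode_utf16_units` and the mode names `"strict"`, `"replace"` and `"lenient"` are my own inventions for illustration, not anything taken from the Standard or from a particular library.

```python
def decode_utf16_units(units, errors="strict"):
    """Decode a sequence of 16-bit code units into a list of code points.

    errors="strict"  : raise ValueError on an unpaired surrogate
    errors="replace" : substitute U+FFFD for each unpaired surrogate
    errors="lenient" : pass the unpaired surrogate through unchanged
    (These mode names are hypothetical, chosen for this sketch.)
    """
    out = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        is_lead = 0xD800 <= u <= 0xDBFF
        # A lead surrogate followed by a trail surrogate is one supplementary
        # code point, in every mode.
        if is_lead and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
            continue
        # Any other code unit in the surrogate range is unpaired.
        if 0xD800 <= u <= 0xDFFF:
            if errors == "strict":
                raise ValueError(f"unpaired surrogate U+{u:04X} at index {i}")
            if errors == "replace":
                u = 0xFFFD
            # errors == "lenient": keep u as a surrogate code point
        out.append(u)
        i += 1
    return out


text = [0x0041, 0xD800, 0x0042]            # 'A', unpaired lead surrogate, 'B'
print(decode_utf16_units(text, "replace"))  # unpaired unit becomes U+FFFD
print(decode_utf16_units(text, "lenient"))  # unpaired unit is passed through
```

Note that even the lenient mode decodes the code units D83D DE00 as the single code point U+1F600, never as two surrogate code points. That is the ambiguity Markus mentions: a lead+trail sequence of surrogate code points has no distinct representation in 16-bit text.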
--
Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s