Re: Utility to report and repair broken surrogate pairs in UTF-16 text

Doug Ewell Thu, 04 Nov 2010 17:57:44 -0700

Markus Scherer wrote:

While processing 16-bit Unicode text which is not assumed to be well-formed UTF-16, you can treat (decode) an unpaired surrogate as a mostly-inert surrogate code point. However, you cannot unambiguously encode a surrogate code point in 16-bit text (because you could not distinguish a sequence of lead+trail surrogate code points from one supplementary code point), and therefore it is not allowed to encode surrogate code points in any well-formed UTF-8/16/32. [All of this is discussed in The Unicode Standard, Chapter 3.]
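The ambiguity described above can be shown with a small Python sketch. The function name encode_utf16_units is hypothetical, and the encoder is deliberately lenient (it passes surrogate code points through unchanged, which a conformant UTF-16 encoder must not do) so that the collision becomes visible:

```python
def encode_utf16_units(code_points):
    """Lenient encoder (hypothetical): maps code points to 16-bit units.

    It does NOT reject surrogate code points; that leniency is the
    assumption that makes the ambiguity observable.
    """
    units = []
    for cp in code_points:
        if cp >= 0x10000:
            # Supplementary code point: split into a lead+trail pair.
            offset = cp - 0x10000
            units.append(0xD800 + (offset >> 10))
            units.append(0xDC00 + (offset & 0x3FF))
        else:
            # BMP code point, including (leniently) lone surrogates.
            units.append(cp)
    return units


one_supplementary = encode_utf16_units([0x10000])
two_surrogates = encode_utf16_units([0xD800, 0xDC00])

# Both inputs yield the same 16-bit units, so a decoder cannot tell
# which sequence of code points was intended.
print(one_supplementary == two_surrogates)
```

Because the two inputs produce identical code units, a decoder has no way to recover the lead+trail surrogate code points as such.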

I'm probably missing something here, but I don't agree that it's OK for a consumer of UTF-16 to accept an unpaired surrogate without throwing an error, or converting it to U+FFFD, or otherwise raising a fuss. Unpaired surrogates are ill-formed, and have to be caught and dealt with.
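As an illustration of "caught and dealt with", here is a minimal Python sketch that reports unpaired surrogates in a sequence of 16-bit code units and replaces them with U+FFFD. The names find_unpaired and repair are hypothetical, and the input is assumed to be a list of integers in the range 0..0xFFFF; this is one possible policy, not the only conformant one.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def is_lead(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_trail(unit):
    return 0xDC00 <= unit <= 0xDFFF


def find_unpaired(units):
    """Return the indices of unpaired surrogates in 16-bit code units."""
    bad = []
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        if is_lead(unit):
            if i + 1 < n and is_trail(units[i + 1]):
                i += 2  # well-formed pair: skip both units
                continue
            bad.append(i)  # lead with no trail following it
        elif is_trail(unit):
            bad.append(i)  # trail with no lead before it
        i += 1
    return bad


def repair(units):
    """Return a copy with each unpaired surrogate replaced by U+FFFD."""
    bad = set(find_unpaired(units))
    return [REPLACEMENT if i in bad else u for i, u in enumerate(units)]


sample = [0x0041, 0xD800, 0x0042, 0xD83D, 0xDE00, 0xDC00]
print(find_unpaired(sample))             # indices of the ill-formed units
print([hex(u) for u in repair(sample)])  # repaired sequence
```

A stricter consumer could raise an exception whenever find_unpaired returns a non-empty list instead of substituting U+FFFD.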


--
Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s
