RE: Utility to report and repair broken surrogate pairs in UTF-16 text

Doug Ewell Fri, 05 Nov 2010 14:09:08 -0700

Asmus Freytag <asmusf at ix dot netcom dot com> wrote:

>> Doing conversion and validation at different stages isn't a great
>> idea; that's how character encodings get involved with security
>> problems.
>
> Note that I am careful not to suggest that (and I'm sure Markus isn't
> either). "Handling" includes much more than code conversion. It
> includes uppercasing, spell checking, sorting, searching, the whole
> lot. Burdening every single one of those tasks with policing the
> integrity of the encoding seems wasteful, and, as I tried to explain,
> puts the error detection in a place where you'll be most likely
> prevented from doing something useful in recovery.


Right, but as I said, those downstream tasks shouldn't be consumers of
UTF-16 code units anyway.  They should be consumers of Unicode code
points, which by definition exclude loose surrogates.
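
To make the boundary concrete, here is a minimal sketch of a decoder that validates surrogate pairing once, at the point where UTF-16 code units become code points, so that later stages never see a loose surrogate. The function name `decode_utf16_units`, the returned error list, and the choice to substitute U+FFFD are all assumptions for illustration; this is not the utility under discussion, and a real repair tool might prefer to report or drop the unit instead.

```python
from typing import Iterable, List, Tuple

REPLACEMENT = 0xFFFD  # assumed repair policy: substitute U+FFFD


def decode_utf16_units(units: Iterable[int]) -> Tuple[List[int], List[int]]:
    """Convert UTF-16 code units to code points (hypothetical helper).

    Returns (code_points, error_indices), where error_indices lists the
    positions of loose surrogates found in the input.
    """
    src = list(units)
    out: List[int] = []
    errors: List[int] = []
    i = 0
    while i < len(src):
        u = src[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: valid only if a low surrogate follows.
            if i + 1 < len(src) and 0xDC00 <= src[i + 1] <= 0xDFFF:
                low = src[i + 1]
                out.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
                i += 2
                continue
            errors.append(i)
            out.append(REPLACEMENT)
        elif 0xDC00 <= u <= 0xDFFF:
            # Low surrogate with no preceding high surrogate.
            errors.append(i)
            out.append(REPLACEMENT)
        else:
            out.append(u)
        i += 1
    return out, errors


if __name__ == "__main__":
    sample = [0x0041, 0xD83D, 0xDE00, 0xD800, 0x0042, 0xDC00]
    points, bad = decode_utf16_units(sample)
    print([hex(p) for p in points], bad)
```

Uppercasing, sorting, searching and the rest would then take the returned list of code points as input, and the error positions stay available at the one place where something useful can still be done about them.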

--
Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s
