On 11/5/2010 7:02 AM, Doug Ewell wrote:
> Asmus Freytag <asmusf at ix dot netcom dot com> wrote:
>>> I'm probably missing something here, but I don't agree that it's OK
>>> for a consumer of UTF-16 to accept an unpaired surrogate without
>>> throwing an error, or converting it to U+FFFD, or otherwise raising a
>>> fuss. Unpaired surrogates are ill-formed, and have to be caught and
>>> dealt with.
>> The question is whether you want every library that handles strings to
>> perform the equivalent of a citizen's arrest, or whether you architect
>> things so that the gatekeepers (border control) police the data stream.
> If you can have upstream libraries check for unpaired surrogates at the
> time they convert UTF-16 to Unicode code points, then your point is well
> taken, because then the downstream libraries are no longer dealing with
> UTF-16, but with code points. Doing conversion and validation at
> different stages isn't a great idea; that's how character encodings get
> involved with security problems.
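
To make that upstream check concrete, here is a minimal sketch of such a
gatekeeper in Python (the function name and the choice to raise an error
are illustrative, not taken from any particular library):

    def utf16_units_to_code_points(units):
        """Convert 16-bit code units to code points, trapping unpaired
        surrogates at the conversion boundary."""
        i, n = 0, len(units)
        while i < n:
            u = units[i]
            if 0xD800 <= u <= 0xDBFF:      # lead surrogate
                if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                    # well-formed pair: combine into one supplementary
                    # code point
                    yield (0x10000 + ((u - 0xD800) << 10)
                                   + (units[i + 1] - 0xDC00))
                    i += 2
                    continue
                raise ValueError("unpaired lead surrogate at index %d" % i)
            if 0xDC00 <= u <= 0xDFFF:      # stray trail surrogate
                raise ValueError("unpaired trail surrogate at index %d" % i)
            yield u                        # ordinary BMP code point
            i += 1

Anything downstream of this generator is, by construction, working with
code points rather than UTF-16; for example,
list(utf16_units_to_code_points([0x0041, 0xD83D, 0xDE00])) yields
[0x41, 0x1F600].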
Note that I am careful not to suggest that every handler validate (and
I'm sure Markus isn't suggesting it either). "Handling" includes much
more than code conversion: it includes uppercasing, spell checking,
sorting, searching, the whole lot.
Burdening every single one of those tasks with policing the integrity of
the encoding seems wasteful and, as I tried to explain, puts the error
detection in a place where you are most likely to be prevented from
doing anything useful by way of recovery.
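
One way to express that division of labor in code (a sketch of the idea,
not anyone's actual API) is to make "already validated" part of the
type, so that the uppercaser, the spell checker, and the sorter never
see raw code units at all:

    class ScalarText:
        """Code points that already passed a validating gatekeeper."""
        def __init__(self, code_points):
            self.code_points = tuple(code_points)  # assumed well-formed

    def to_upper(text):
        # Free to concentrate on its real job (toy ASCII-only casing
        # here); no surrogate policing needed, because the type
        # guarantees the gatekeeper already ran.
        return ScalarText(cp - 0x20 if 0x61 <= cp <= 0x7A else cp
                          for cp in text.code_points)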
Data import or code conversion routines are in a much better place,
architecturally, to offer the user meaningful options for dealing with
corrupted data, from rejecting it outright to attempting repair.
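
In Python terms, the standard codec machinery exposes exactly that range
of options at the conversion gate; the wrapper below is hypothetical,
but the "strict" and "replace" error handlers are real:

    def import_utf16le(data, policy="strict"):
        # policy="strict"  -> reject: raises UnicodeDecodeError
        # policy="replace" -> repair: substitutes U+FFFD for bad units
        return data.decode("utf-16-le", errors=policy)

    bad = b"\x41\x00\x3d\xd8"              # 'A', then an unpaired
                                           # lead surrogate
    print(import_utf16le(bad, "replace"))  # -> 'A\ufffd'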
However, some tasks, such as network identifier matching, are
security-sensitive and must re-validate their input even if the data has
already passed a gatekeeper such as a validating code conversion
routine.
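
A sketch of that re-validation, with a hypothetical identifier matcher;
note that in Python a str really can carry lone surrogates (e.g. via the
surrogateescape error handler), so the extra check is not redundant:

    def identifiers_match(a, b):
        for s in (a, b):
            if any(0xD800 <= ord(ch) <= 0xDFFF for ch in s):
                # ill-formed input at a security boundary is rejected
                # again here, whatever upstream claimed to have checked
                raise ValueError("ill-formed identifier")
        return a == b  # a real matcher would also normalize (e.g. NFC)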
> Corrigendum #1 closed the door on interpretation of invalid UTF-8
> sequences. I'm not sure why the approach to handling UTF-16 should be
> any different.
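
For what it's worth, that door stays closed in practice: a conforming
UTF-8 decoder must reject non-shortest-form sequences, and Python's
strict decoder, for example, does exactly that with the classic overlong
encoding of '/':

    overlong_slash = b"\xc0\xaf"        # overlong encoding of U+002F
    try:
        overlong_slash.decode("utf-8")  # strict by default
    except UnicodeDecodeError as exc:
        print("rejected:", exc)         # invalid start byte 0xc0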