On 11/5/2010 7:02 AM, Doug Ewell wrote:
> Asmus Freytag <asmusf at ix dot netcom dot com> wrote:
>>> I'm probably missing something here, but I don't agree that it's OK
>>> for a consumer of UTF-16 to accept an unpaired surrogate without
>>> throwing an error, or converting it to U+FFFD, or otherwise raising a
>>> fuss. Unpaired surrogates are ill-formed, and have to be caught and
>>> dealt with.
>> The question is whether you want every library that handles strings to
>> perform the equivalent of a citizen's arrest, or whether you architect
>> things so that the gatekeepers (border control) police the data stream.
> If you can have upstream libraries check for unpaired surrogates at the
> time they convert UTF-16 to Unicode code points, then your point is well
> taken, because then the downstream libraries are no longer dealing with
> UTF-16, but with code points. Doing conversion and validation at
> different stages isn't a great idea; that's how character encodings get
> involved with security problems.
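
To make that upstream check concrete, here is a minimal sketch of such a
gatekeeper in Python (the function name and the choice to raise an error
are illustrative, not taken from any particular library):

    def utf16_units_to_code_points(units):
        """Convert 16-bit code units to code points, trapping unpaired
        surrogates at the conversion boundary."""
        i, n = 0, len(units)
        while i < n:
            u = units[i]
            if 0xD800 <= u <= 0xDBFF:      # lead surrogate
                if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                    # well-formed pair: combine into one supplementary
                    # code point
                    yield (0x10000 + ((u - 0xD800) << 10)
                                   + (units[i + 1] - 0xDC00))
                    i += 2
                    continue
                raise ValueError("unpaired lead surrogate at index %d" % i)
            if 0xDC00 <= u <= 0xDFFF:      # stray trail surrogate
                raise ValueError("unpaired trail surrogate at index %d" % i)
            yield u                        # ordinary BMP code point
            i += 1

Anything downstream of this generator is, by construction, working with
code points rather than UTF-16; for example,
list(utf16_units_to_code_points([0x0041, 0xD83D, 0xDE00])) yields
[0x41, 0x1F600].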
Note that I am careful not to suggest that every handler validate (and
I'm sure Markus isn't suggesting it either). "Handling" includes much
more than code conversion: it includes uppercasing, spell checking,
sorting, searching, the whole lot.
Burdening every single one of those tasks with policing the integrity of
the encoding seems wasteful and, as I tried to explain, puts the error
detection in a place where you are most likely to be prevented from
doing anything useful by way of recovery.
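
One way to express that division of labor in code (a sketch of the idea,
not anyone's actual API) is to make "already validated" part of the
type, so that the uppercaser, the spell checker, and the sorter never
see raw code units at all:

    class ScalarText:
        """Code points that already passed a validating gatekeeper."""
        def __init__(self, code_points):
            self.code_points = tuple(code_points)  # assumed well-formed

    def to_upper(text):
        # Free to concentrate on its real job (toy ASCII-only casing
        # here); no surrogate policing needed, because the type
        # guarantees the gatekeeper already ran.
        return ScalarText(cp - 0x20 if 0x61 <= cp <= 0x7A else cp
                          for cp in text.code_points)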
Data import or code conversion routines are in a much better place,
architecturally, to offer the user meaningful options for dealing with
corrupted data, from rejecting it outright to attempting repair.
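
In Python terms, the standard codec machinery exposes exactly that range
of options at the conversion gate; the wrapper below is hypothetical,
but the "strict" and "replace" error handlers are real:

    def import_utf16le(data, policy="strict"):
        # policy="strict"  -> reject: raises UnicodeDecodeError
        # policy="replace" -> repair: substitutes U+FFFD for bad units
        return data.decode("utf-16-le", errors=policy)

    bad = b"\x41\x00\x3d\xd8"              # 'A', then an unpaired
                                           # lead surrogate
    print(import_utf16le(bad, "replace"))  # -> 'A\ufffd'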
However, some tasks, such as network identifier matching, are
security-sensitive and must re-validate their input even if the data has
already passed a gatekeeper such as a validating code conversion
routine.
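
A sketch of that re-validation, with a hypothetical identifier matcher;
note that in Python a str really can carry lone surrogates (e.g. via the
surrogateescape error handler), so the extra check is not redundant:

    def identifiers_match(a, b):
        for s in (a, b):
            if any(0xD800 <= ord(ch) <= 0xDFFF for ch in s):
                # ill-formed input at a security boundary is rejected
                # again here, whatever upstream claimed to have checked
                raise ValueError("ill-formed identifier")
        return a == b  # a real matcher would also normalize (e.g. NFC)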
> Corrigendum #1 closed the door on interpretation of invalid UTF-8
> sequences. I'm not sure why the approach to handling UTF-16 should be
> any different.
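
For what it's worth, that door stays closed in practice: a conforming
UTF-8 decoder must reject non-shortest-form sequences, and Python's
strict decoder, for example, does exactly that with the classic overlong
encoding of '/':

    overlong_slash = b"\xc0\xaf"        # overlong encoding of U+002F
    try:
        overlong_slash.decode("utf-8")  # strict by default
    except UnicodeDecodeError as exc:
        print("rejected:", exc)         # invalid start byte 0xc0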