I'm in general agreement.

1. A Unicode 16-bit string can contain any sequence of 16-bit code units: it might or might not be valid UTF-16.

2. Whenever a process emits a Unicode string that it is *guaranteeing* to be UTF-16, it must catch any unpaired surrogates and fix them (e.g., replace them with U+FFFD); see the sketch after this list.

3. It is a burden on processes to always guarantee UTF-16 conformance, and the vast majority of processing can handle a Unicode string robustly by simply treating unpaired surrogates as UNASSIGNED.

4. Whenever a process accepts a Unicode string and requires that it be UTF-16, it has a couple of choices: if the source is 'trusted' and purports to supply UTF-16, there is no problem; otherwise the process needs to validate the input for safety.
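To make points 2 and 4 concrete, here is a rough sketch of the kind of repair step I mean (Java, with class and method names of my own choosing; not the only way to do it):

// A minimal sketch of point 2: before emitting a string that is guaranteed
// to be well-formed UTF-16, replace every unpaired surrogate with U+FFFD.
public final class Utf16Sanitizer {

    public static final char REPLACEMENT = '\uFFFD';

    /** Returns s unchanged if it is well-formed UTF-16, otherwise a copy with
     *  each unpaired surrogate code unit replaced by U+FFFD. */
    public static String sanitize(String s) {
        StringBuilder out = null;                 // allocated only on first repair
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                if (out != null) out.append(c).append(s.charAt(i + 1));
                i++;                              // well-formed pair: keep both units
            } else if (Character.isSurrogate(c)) {
                if (out == null) out = new StringBuilder(s.substring(0, i));
                out.append(REPLACEMENT);          // unpaired surrogate: repair
            } else {
                if (out != null) out.append(c);
            }
        }
        return out == null ? s : out.toString();
    }
}

A consumer in case 4 can reuse the same scan: if the source isn't trusted, validate (or sanitize) once on the way in, rather than in every downstream operation.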
Mark

*— Il meglio è l’inimico del bene —*

On Fri, Nov 5, 2010 at 11:54, Asmus Freytag <asm...@ix.netcom.com> wrote:

> On 11/5/2010 7:02 AM, Doug Ewell wrote:
>
>> Asmus Freytag <asmusf at ix dot netcom dot com> wrote:
>>
>>>> I'm probably missing something here, but I don't agree that it's OK
>>>> for a consumer of UTF-16 to accept an unpaired surrogate without
>>>> throwing an error, or converting it to U+FFFD, or otherwise raising a
>>>> fuss. Unpaired surrogates are ill-formed, and have to be caught and
>>>> dealt with.
>>>>
>>> The question is whether you want every library that handles strings
>>> perform the equivalent of a citizen's arrest, or whether you architect
>>> things that the gatekeepers (border control) police the data stream.
>>>
>> If you can have upstream libraries check for unpaired surrogates at the
>> time they convert UTF-16 to Unicode code points, then your point is well
>> taken, because then the downstream libraries are no longer dealing with
>> UTF-16, but with code points. Doing conversion and validation at
>> different stages isn't a great idea; that's how character encodings get
>> involved with security problems.
>>
> Note that I am careful not to suggest that (and I'm sure Markus isn't
> either). "Handling" includes much more than code conversion. It includes
> uppercasing, spell checking, sorting, searching, the whole lot. Burdening
> every single one of those tasks with policing the integrity of the encoding
> seems wasteful, and, as I tried to explain, puts the error detection in a
> place where you'll be most likely prevented from doing something useful in
> recovery.
>
> Data import or code conversion routines are in a much better place,
> architecturally, to allow the user meaningful options to deal with corrupted
> data, from rejecting to attempts of repair.
>
> However, some tasks, such as network identifier matching, are
> security-sensitive and must re-validate their input, even if the data has
> already passed a gatekeeper routine such as a validating code conversion
> routine.
>
>> Corrigendum #1 closed the door on interpretation of invalid UTF-8
>> sequences. I'm not sure why the approach to handling UTF-16 should be
>> any different.
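PS: a sketch of what such a validating "gatekeeper" conversion could look like with the JDK's CharsetDecoder, letting the caller choose between rejecting corrupted data and repairing it (class and method names are mine, purely illustrative):

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf16Gatekeeper {

    /** Strict import: throws CharacterCodingException on any ill-formed UTF-16BE,
     *  including unpaired surrogates in the byte stream. */
    static String importStrict(byte[] data) throws CharacterCodingException {
        CharsetDecoder dec = StandardCharsets.UTF_16BE.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return dec.decode(ByteBuffer.wrap(data)).toString();
    }

    /** Lenient import: replaces ill-formed sequences with U+FFFD instead. */
    static String importLenient(byte[] data) {
        CharsetDecoder dec = StandardCharsets.UTF_16BE.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try {
            return dec.decode(ByteBuffer.wrap(data)).toString();
        } catch (CharacterCodingException e) {
            throw new AssertionError(e);          // cannot happen with REPLACE
        }
    }
}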