I'm in general agreement.

1. A Unicode 16-bit string can contain any sequence of 16-bit code units: it might or might not be valid UTF-16.

2. Whenever a process emits a Unicode string that it is *guaranteeing* to be UTF-16, it must catch any unpaired surrogates and fix them (e.g., replace them with U+FFFD); see the sketch after this list.

3. It is a burden on processes to always guarantee UTF-16 conformance, and the vast majority of processing can handle a Unicode string robustly by simply treating unpaired surrogates as UNASSIGNED.

4. Whenever a process accepts a Unicode string and requires that it be UTF-16, it has a couple of choices: if the source is 'trusted' and purports to supply UTF-16, there is no problem; otherwise the process needs to validate the input for safety.
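To make points 2 and 4 concrete, here is a rough sketch of the kind of repair step I mean (Java, with class and method names of my own choosing; not the only way to do it):

// A minimal sketch of point 2: before emitting a string that is guaranteed
// to be well-formed UTF-16, replace every unpaired surrogate with U+FFFD.
public final class Utf16Sanitizer {

    public static final char REPLACEMENT = '\uFFFD';

    /** Returns s unchanged if it is well-formed UTF-16, otherwise a copy with
     *  each unpaired surrogate code unit replaced by U+FFFD. */
    public static String sanitize(String s) {
        StringBuilder out = null;                 // allocated only on first repair
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                if (out != null) out.append(c).append(s.charAt(i + 1));
                i++;                              // well-formed pair: keep both units
            } else if (Character.isSurrogate(c)) {
                if (out == null) out = new StringBuilder(s.substring(0, i));
                out.append(REPLACEMENT);          // unpaired surrogate: repair
            } else {
                if (out != null) out.append(c);
            }
        }
        return out == null ? s : out.toString();
    }
}

A consumer in case 4 can reuse the same scan: if the source isn't trusted, validate (or sanitize) once on the way in, rather than in every downstream operation.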
Mark

*— Il meglio è l’inimico del bene —*

On Fri, Nov 5, 2010 at 11:54, Asmus Freytag <asm...@ix.netcom.com> wrote:

> On 11/5/2010 7:02 AM, Doug Ewell wrote:
>
>> Asmus Freytag <asmusf at ix dot netcom dot com> wrote:
>>
>>>> I'm probably missing something here, but I don't agree that it's OK
>>>> for a consumer of UTF-16 to accept an unpaired surrogate without
>>>> throwing an error, or converting it to U+FFFD, or otherwise raising a
>>>> fuss. Unpaired surrogates are ill-formed, and have to be caught and
>>>> dealt with.
>>>>
>>> The question is whether you want every library that handles strings
>>> perform the equivalent of a citizen's arrest, or whether you architect
>>> things that the gatekeepers (border control) police the data stream.
>>>
>> If you can have upstream libraries check for unpaired surrogates at the
>> time they convert UTF-16 to Unicode code points, then your point is well
>> taken, because then the downstream libraries are no longer dealing with
>> UTF-16, but with code points. Doing conversion and validation at
>> different stages isn't a great idea; that's how character encodings get
>> involved with security problems.
>>
> Note that I am careful not to suggest that (and I'm sure Markus isn't
> either). "Handling" includes much more than code conversion. It includes
> uppercasing, spell checking, sorting, searching, the whole lot. Burdening
> every single one of those tasks with policing the integrity of the encoding
> seems wasteful, and, as I tried to explain, puts the error detection in a
> place where you'll be most likely prevented from doing something useful in
> recovery.
>
> Data import or code conversion routines are in a much better place,
> architecturally, to allow the user meaningful options to deal with corrupted
> data, from rejecting to attempts of repair.
>
> However, some tasks, such as network identifier matching, are
> security-sensitive and must re-validate their input, even if the data has
> already passed a gatekeeper routine such as a validating code conversion
> routine.
>
>> Corrigendum #1 closed the door on interpretation of invalid UTF-8
>> sequences. I'm not sure why the approach to handling UTF-16 should be
>> any different.
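PS: a sketch of what such a validating "gatekeeper" conversion could look like with the JDK's CharsetDecoder, letting the caller choose between rejecting corrupted data and repairing it (class and method names are mine, purely illustrative):

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf16Gatekeeper {

    /** Strict import: throws CharacterCodingException on any ill-formed UTF-16BE,
     *  including unpaired surrogates in the byte stream. */
    static String importStrict(byte[] data) throws CharacterCodingException {
        CharsetDecoder dec = StandardCharsets.UTF_16BE.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return dec.decode(ByteBuffer.wrap(data)).toString();
    }

    /** Lenient import: replaces ill-formed sequences with U+FFFD instead. */
    static String importLenient(byte[] data) {
        CharsetDecoder dec = StandardCharsets.UTF_16BE.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try {
            return dec.decode(ByteBuffer.wrap(data)).toString();
        } catch (CharacterCodingException e) {
            throw new AssertionError(e);          // cannot happen with REPLACE
        }
    }
}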