On Tue, 16 May 2017 11:36:39 -0700
Markus Scherer via Unicode <unicode@unicode.org> wrote:
> Why do we care how we carve up an illegal sequence into subsequences?
> Only for debugging and visual inspection. Maybe some process is using
> illegal, overlong sequences to encode something special (à la Java
> string serialization, "modified UTF-8"), and for that it might be
> convenient too to treat overlong sequences as single errors.

I think that's not quite true. If we are moving back and forth through
a buffer containing corrupt text, we need to make sure that moving
three characters forward and then three characters back leaves us
where we started. That requires internal consistency.

One possible issue is with text input methods that access an
application's backing store. They can issue updates in the form of
'delete 3 characters and insert ...'. However, if the input method is
accessing characters it hasn't written, it's probably misbehaving
anyway. Such commands do rather heavily assume that any relevant
normalisation by the application will be taken into account by the
input method.

I once had a go at fixing an application that was misinterpreting
'delete x characters' as 'delete x UTF-16 code units'. It was a
horrible mess, as the application's interface layer couldn't peek at
the string being edited.

Richard.
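
P.S. To make the consistency point concrete, here is a minimal sketch,
not taken from any existing implementation, of forward and backward
stepping over possibly ill-formed UTF-8. It uses the policy that each
step covers either one well-formed sequence or one byte of an
ill-formed subsequence; the two directions agree because no proper
suffix of a well-formed sequence is itself well-formed. Other policies,
such as maximal subparts, work too, so long as both directions apply
the same one. The byte ranges follow Table 3-7 of the standard.

#include <stddef.h>
#include <stdio.h>

/* Length (1..4) of the well-formed UTF-8 sequence starting at s, given
   at most "avail" bytes, or 0 if s does not begin one (Table 3-7). */
static size_t well_formed_len(const unsigned char *s, size_t avail)
{
    if (avail == 0)
        return 0;
    unsigned char b0 = s[0];
    if (b0 <= 0x7F)
        return 1;
    if (b0 >= 0xC2 && b0 <= 0xDF)
        return (avail >= 2 && s[1] >= 0x80 && s[1] <= 0xBF) ? 2 : 0;
    if (b0 >= 0xE0 && b0 <= 0xEF) {
        unsigned char lo = (b0 == 0xE0) ? 0xA0 : 0x80;  /* no overlongs */
        unsigned char hi = (b0 == 0xED) ? 0x9F : 0xBF;  /* no surrogates */
        return (avail >= 3 && s[1] >= lo && s[1] <= hi &&
                s[2] >= 0x80 && s[2] <= 0xBF) ? 3 : 0;
    }
    if (b0 >= 0xF0 && b0 <= 0xF4) {
        unsigned char lo = (b0 == 0xF0) ? 0x90 : 0x80;  /* no overlongs */
        unsigned char hi = (b0 == 0xF4) ? 0x8F : 0xBF;  /* <= U+10FFFF */
        return (avail >= 4 && s[1] >= lo && s[1] <= hi &&
                s[2] >= 0x80 && s[2] <= 0xBF &&
                s[3] >= 0x80 && s[3] <= 0xBF) ? 4 : 0;
    }
    return 0;  /* 0x80..0xC1 and 0xF5..0xFF never start a sequence */
}

/* Forward: one well-formed sequence, else one byte of the error. */
static size_t next_boundary(const unsigned char *buf, size_t len, size_t pos)
{
    size_t n = well_formed_len(buf + pos, len - pos);
    return pos + (n ? n : 1);
}

/* Backward from a non-zero boundary: at most one well-formed sequence
   can end exactly at pos; if none does, the previous unit is one byte. */
static size_t prev_boundary(const unsigned char *buf, size_t pos)
{
    for (size_t back = (pos < 4) ? pos : 4; back >= 1; back--)
        if (well_formed_len(buf + pos - back, back) == back)
            return pos - back;
    return pos - 1;
}

int main(void)
{
    /* 'A', the overlong sequence C0 AF, then U+00E9 (C3 A9). */
    const unsigned char buf[] = { 'A', 0xC0, 0xAF, 0xC3, 0xA9 };
    size_t len = sizeof buf, pos = 0;

    for (int i = 0; i < 3 && pos < len; i++)
        pos = next_boundary(buf, len, pos);
    printf("after 3 steps forward: %zu\n", pos);   /* 3: A, C0, AF */

    for (int i = 0; i < 3 && pos > 0; i++)
        pos = prev_boundary(buf, pos);
    printf("after 3 steps back:    %zu\n", pos);   /* 0 again */
    return 0;
}

Note that the overlong C0 AF is carved into two one-byte errors in both
directions; carving it as a single error would also be fine, provided
prev_boundary is taught the same rule.

And on the 'delete x characters' anecdote: the conversion the
application should have been doing is roughly the following, assuming
it could see the UTF-16 buffer directly (which, as I said, its
interface layer could not).

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* UTF-16 code units occupied by the last "nchars" characters before
   position "end"; a high+low surrogate pair counts as one character. */
static size_t units_for_chars_before(const uint16_t *s, size_t end,
                                     size_t nchars)
{
    size_t i = end;
    while (nchars > 0 && i > 0) {
        i--;
        if (i > 0 && s[i] >= 0xDC00 && s[i] <= 0xDFFF &&
            s[i - 1] >= 0xD800 && s[i - 1] <= 0xDBFF)
            i--;  /* second half of a surrogate pair: take both units */
        nchars--;
    }
    return end - i;
}

int main(void)
{
    /* 'X' followed by U+1D11E, which is one character but two units. */
    const uint16_t text[] = { 0x0058, 0xD834, 0xDD1E };
    printf("%zu\n", units_for_chars_before(text, 3, 1));  /* 2 */
    printf("%zu\n", units_for_chars_before(text, 3, 2));  /* 3 */
    return 0;
}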