On 15 May 2017, at 11:21, Henri Sivonen via Unicode <[email protected]> wrote:
>
> In reference to:
> http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf
>
> I think Unicode should not adopt the proposed change.

Disagree. An over-long UTF-8 sequence is clearly a single error. Emitting multiple errors there makes no sense.

> ICU uses UTF-16 as its in-memory Unicode representation, so ICU isn't
> representative of implementation concerns of implementations that use
> UTF-8 as their in-memory Unicode representation.
>
> Even though there are notable systems (Win32, Java, C#, JavaScript,
> ICU, etc.) that are stuck with UTF-16 as their in-memory
> representation, which makes concerns of such implementation very
> relevant, I think the Unicode Consortium should acknowledge that
> UTF-16 was, in retrospect, a mistake

You may think that. There are those of us who do not. The fact is that UTF-16 makes sense as a default encoding in many cases. Yes, UTF-8 is more efficient for primarily ASCII text, but that is not the case for other situations and the fact is that handling surrogates (which is what proponents of UTF-8 or UCS-4 usually focus on) is no more complicated than handling combining characters, which you have to do anyway.

> Therefore, despite UTF-16 being widely used as an in-memory
> representation of Unicode and in no way going away, I think the
> Unicode Consortium should be *very* sympathetic to technical
> considerations for implementations that use UTF-8 as the in-memory
> representation of Unicode.

I don’t think the Unicode Consortium should be unsympathetic to people who use UTF-8 internally, for sure, but I don’t see what that has to do with either the original proposal or with your criticism of UTF-16.

[snip]

> If the proposed
> change was adopted, while Draconian decoders (that fail upon first
> error) could retain their current state machine, implementations that
> emit U+FFFD for errors and continue would have to add more state
> machine states (i.e. more complexity) to consolidate more input bytes
> into a single U+FFFD even after a valid sequence is obviously
> impossible.

“Impossible”? Why?
You just need to add some error states (or *an* error state and a counter); it isn’t exactly difficult, and I’m sure ICU isn’t the only library that already did just that *because it’s clearly the right thing to do*.

Kind regards,

Alastair.

--
http://alastairs-place.net
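
As a rough illustration of the "error state and a counter" idea above, here is a minimal Python sketch. It is not ICU's implementation or anyone else's. The names `expected_length` and `decode_lenient` are hypothetical. The sketch assumes the sequence length is taken from the lead byte's bit pattern alone, and that the validity of each consumed chunk can be delegated to Python's strict UTF-8 codec.

```python
REPLACEMENT = "\ufffd"


def expected_length(lead: int) -> int:
    """Length implied by the lead byte's bit pattern alone (0 = not a lead byte).

    Deliberately accepts 0xC0, 0xC1 and 0xF5..0xF7 as lead bytes, so that an
    over-long or out-of-range sequence is consumed as one unit.
    """
    if lead < 0x80:
        return 1
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    return 0  # stray continuation byte, or 0xF8..0xFF


def decode_lenient(data: bytes) -> str:
    """Decode UTF-8, emitting one U+FFFD per structurally delimited bad sequence."""
    out = []
    i = 0
    n = len(data)
    while i < n:
        length = expected_length(data[i])
        if length == 0:
            # Not a lead byte at all: one error for this single byte.
            out.append(REPLACEMENT)
            i += 1
            continue
        # The "counter": how many continuation bytes we still expect.
        remaining = length - 1
        j = i + 1
        while remaining > 0 and j < n and 0x80 <= data[j] <= 0xBF:
            j += 1
            remaining -= 1
        chunk = data[i:j]
        try:
            # The strict codec rejects over-long forms, surrogates,
            # values above U+10FFFF and truncated sequences.
            out.append(chunk.decode("utf-8"))
        except UnicodeDecodeError:
            # The "error state": the whole chunk becomes a single U+FFFD.
            out.append(REPLACEMENT)
        i = j
    return "".join(out)
```

With this sketch, `decode_lenient(b"A\xc0\x80B")` returns `"A\ufffdB"`: the over-long two-byte sequence is reported as one error rather than two.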

