On Mon, May 15, 2017 at 6:37 PM, Alastair Houghton
<alast...@alastairs-place.net> wrote:
> On 15 May 2017, at 11:21, Henri Sivonen via Unicode <unicode@unicode.org>
> wrote:
>>
>> In reference to:
>> http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf
>>
>> I think Unicode should not adopt the proposed change.
>
> Disagree.  An over-long UTF-8 sequence is clearly a single error.  Emitting
> multiple errors there makes no sense.

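
To make the fail-fast case concrete, here is a minimal sketch in Python. It is an illustration only, not taken from the proposal, ICU, or any particular library, and the names decode_fail_fast and _lead_info are hypothetical. The validator goes byte by byte, stops at the first byte that cannot continue a well-formed sequence, emits one U+FFFD for the bytes it has consumed so far, and resumes at the offending byte. Under that design, the over-long sequence E0 80 80 yields three U+FFFD: the E0 lead is rejected as soon as 80 is seen, and each 80 is then a stray continuation byte.

```python
REPLACEMENT = "\ufffd"


def _lead_info(lead):
    """Return (continuation_count, lo, hi) for a lead byte, or None.

    lo..hi is the allowed range for the *second* byte only; it is
    narrower than 80..BF for E0, ED, F0 and F4, which is what rejects
    over-long forms, surrogates and values above U+10FFFF early.
    """
    if 0xC2 <= lead <= 0xDF:
        return 1, 0x80, 0xBF
    if lead == 0xE0:
        return 2, 0xA0, 0xBF
    if lead == 0xED:
        return 2, 0x80, 0x9F
    if 0xE1 <= lead <= 0xEF:
        return 2, 0x80, 0xBF
    if lead == 0xF0:
        return 3, 0x90, 0xBF
    if 0xF1 <= lead <= 0xF3:
        return 3, 0x80, 0xBF
    if lead == 0xF4:
        return 3, 0x80, 0x8F
    return None  # C0, C1, F5..FF and stray continuation bytes


def decode_fail_fast(data: bytes) -> str:
    """Decode UTF-8, emitting one U+FFFD per rejected prefix."""
    out = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:  # ASCII fast path
            out.append(chr(b))
            i += 1
            continue
        info = _lead_info(b)
        if info is None:  # byte can never start a sequence
            out.append(REPLACEMENT)
            i += 1
            continue
        needed, lo, hi = info
        cp = b & (0x3F >> needed)  # payload bits of the lead byte
        j = i + 1
        ok = True
        for _ in range(needed):
            if j >= n or not (lo <= data[j] <= hi):
                ok = False  # fail fast: do not consume data[j]
                break
            cp = (cp << 6) | (data[j] & 0x3F)
            j += 1
            lo, hi = 0x80, 0xBF  # later bytes use the full range
        out.append(chr(cp) if ok else REPLACEMENT)
        i = j  # on failure, resume at the offending byte
    return "".join(out)
```

For example, decode_fail_fast(b"\xe0\x80\x80") returns three U+FFFD and decode_fail_fast(b"\xc0\x80") returns two. Collapsing those into a single error would require the validator to keep reading past the point where it already knows the input is invalid.
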
The currently-specced behavior makes perfect sense when you add error
emission on top of a fail-fast UTF-8 validation state machine.

>> ICU uses UTF-16 as its in-memory Unicode representation, so ICU isn't
>> representative of implementation concerns of implementations that use
>> UTF-8 as their in-memory Unicode representation.
>>
>> Even though there are notable systems (Win32, Java, C#, JavaScript,
>> ICU, etc.) that are stuck with UTF-16 as their in-memory
>> representation, which makes concerns of such implementation very
>> relevant, I think the Unicode Consortium should acknowledge that
>> UTF-16 was, in retrospect, a mistake
>
> You may think that.  There are those of us who do not.

My point is: The proposal seems to arise from the "UTF-16 as the
in-memory representation" mindset. While I don't expect that case to go
away in any way, I think the Unicode Consortium should recognize that
the "UTF-8 as the in-memory representation" case has enough technical
merit that proposals like this should consider the impact on both cases
equally, even though the "UTF-8 as the in-memory representation" case
appears to be the minority case at present. That is, I think it's wrong
to view things only, or even primarily, through the lens of the "UTF-16
as the in-memory representation" case that ICU represents.

--
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/