In reference to: http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf
I think Unicode should not adopt the proposed change. The proposal is to make ICU's spec violation conforming. I think there is both a technical and a political reason why the proposal is a bad idea.

First, the technical reason: ICU uses UTF-16 as its in-memory Unicode representation, so ICU isn't representative of the concerns of implementations that use UTF-8 as their in-memory Unicode representation. Even though there are notable systems (Win32, Java, C#, JavaScript, ICU, etc.) that are stuck with UTF-16 as their in-memory representation, which makes the concerns of such implementations very relevant, I think the Unicode Consortium should acknowledge that UTF-16 was, in retrospect, a mistake (since Unicode grew past 16 bits anyway, making UTF-16 both variable-width *and* ASCII-incompatible; i.e., widening the code units to be ASCII-incompatible didn't buy a constant-width encoding after all) and that when the legacy constraints of Win32, Java, C#, JavaScript, ICU, etc. don't force UTF-16 as the internal Unicode representation, using UTF-8 as the internal Unicode representation is the technically superior design: it is memory-efficient and cache-efficient when dealing with data formats whose syntax is mostly ASCII (e.g. HTML), it forces developers to handle variable-width issues right away, it makes decoding conforming input a matter of mere validation without a copy, and it makes output encoding infinitely fast (no encode step is needed). Therefore, despite UTF-16 being widely used as an in-memory representation of Unicode and in no way going away, I think the Unicode Consortium should be *very* sympathetic to the technical considerations of implementations that use UTF-8 as the in-memory representation of Unicode.

When looking at this issue from the ICU perspective of using UTF-16 as the in-memory representation of Unicode, it's easy to consider the proposed change the easier option for implementations (after all, no change to the ICU implementation is involved!). However, when UTF-8 is the in-memory representation of Unicode and "decoding" UTF-8 input is a matter of *validating* UTF-8, a state machine that rejects a sequence as soon as it's impossible for the sequence to be valid UTF-8 (under the definition that excludes surrogate code points and code points beyond U+10FFFF) makes a whole lot of sense. If the proposed change were adopted, Draconian decoders (those that fail upon the first error) could retain their current state machine, but implementations that emit U+FFFD for errors and continue would have to add more state machine states (i.e. more complexity) to consolidate more input bytes into a single U+FFFD even after a valid sequence has become obviously impossible. When the decision can easily go either way for implementations that use UTF-16 internally, but the options are not equal when using UTF-8 internally, the "UTF-8 internally" case should be decisive. (Especially when, spec-wise, that decision involves no change. I further note that the proposal PDF argues on the level of "feels right" without even discussing the impact on implementations that use UTF-8 internally.)

As a matter of implementation experience, the implementation I've written (https://github.com/hsivonen/encoding_rs) supports both the UTF-16-as-the-in-memory-representation scenario and the UTF-8-as-the-in-memory-representation scenario, and the fail-fast requirement wasn't onerous in the UTF-16 scenario.
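To make the difference concrete, here is a minimal sketch using the Rust standard library rather than encoding_rs itself. The byte sequence F4 90 80 80 is just an illustrative example (it would encode a code point beyond U+10FFFF), and the sketch relies on the standard library's lossy conversion following the current fail-fast recommendation:

fn main() {
    // F4 90 80 80 would encode a code point beyond U+10FFFF, so it is
    // ill-formed UTF-8. A fail-fast validator rejects it at the second byte.
    let bytes = b"\xF4\x90\x80\x80";

    // When the input *is* well-formed, a UTF-8-internal implementation can
    // borrow it as a string without copying; here validation simply fails.
    assert!(std::str::from_utf8(bytes).is_err());

    // Lossy decoding per the current (pre-proposal) recommendation: the
    // decoder gives up as soon as it sees 0x90, because no valid sequence
    // can start with F4 90, so each of the four bytes becomes its own U+FFFD.
    let lossy = String::from_utf8_lossy(bytes);
    assert_eq!(lossy, "\u{FFFD}\u{FFFD}\u{FFFD}\u{FFFD}");
}

Under the behavior the proposal would bless, the decoder would instead have to add states to keep consuming the two trailing 0x80 bytes after 0x90 has already ruled out a valid sequence, so that the whole F4 90 80 80 run collapses into a single U+FFFD.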
Second, the political reason: Now that ICU is a Unicode Consortium project, I think the Unicode Consortium should be particularly sensitive to biases arising from being both the source of the spec and the source of a popular implementation. If the way the Unicode Consortium resolves a discrepancy between ICU behavior and a well-known spec provision (this isn't some obscure corner case, after all) is by changing the spec instead of changing ICU, it looks *really bad* both for the equal footing of ICU vs. other implementations in how the standard is developed and for the reliability of the standard text vs. the ICU source code as the source of truth that other implementors need to pay attention to. This is *especially* so when the change is not neutral for implementations that have made architectural choices that differ from ICU's but were completely valid per the then-existing spec and, in the absence of legacy constraints, superior (i.e. UTF-8 internally instead of UTF-16 internally). I can see the irony of this viewpoint coming from a WHATWG-aligned browser developer, but I note that even the browsers that use ICU for legacy encodings don't use ICU for UTF-8, so the ICU UTF-8 behavior isn't, in fact, the dominant browser UTF-8 behavior. That is, even Blink and WebKit use their own non-ICU UTF-8 decoders. The Web is the environment that's most sensitive to how issues like this are handled, so it would be appropriate for the proposal to survey current browser behavior instead of just saying that ICU "feels right" or is "natural".

--
Henri Sivonen
[email protected]
https://hsivonen.fi/

