On Sat Jun 3 23:09:01 CDT 2017 Markus Scherer wrote:
> I suggest you submit a write-up via http://www.unicode.org/reporting.html
>
> and make the case there that you think the UTC should retract
>
> http://www.unicode.org/L2/L2017/17103.htm#151-C19

The submission has been made: http://www.unicode.org/L2/L2017/17197-utf8-retract.pdf

> Also, since Chromium/Blink/v8 are using ICU, I suggest you submit an ICU
> ticket via http://bugs.icu-project.org/trac/newticket

Although they use ICU for most legacy encodings, they don't use ICU for UTF-8. Hence, the difference between Chrome and ICU in the above write-up.

> and make the case there, too, that you think (assuming you do) that ICU
> should change its handling of illegal UTF-8 sequences.

Whether I think ICU should change isn't quite that simple.

On one hand, a key worry that I have about Unicode changing the long-standing guidance for UTF-8 error handling is that inducing implementations to change (either by the developers feeling that they have to implement the "best practice" or by others complaining when "best practice" isn't implemented) is wasteful and a potential source of bugs. In that sense, I feel I shouldn't ask ICU to change, either.

On the other hand, I care about implementations of the WHATWG Encoding Standard being compliant and it appears that Node.js is on track to exposing ICU's UTF-8 decoder via the WHATWG TextDecoder API: https://github.com/nodejs/node/pull/13644 . Additionally, this episode of ICU behavior getting cited in a proposal to change the guidance in the Unicode Standard is a reason why I'd be happier if ICU followed the Unicode 10-and-earlier / WHATWG behavior, since there wouldn't be the risk of ICU's behavior getting cited as a different reference as happened with the proposal to change the guidance for Unicode 11.

Still, I'm not affiliated with the Node.js implementation, so I'm a bit worried that if I filed an ICU bug on Node's behalf, I'd be engaging in the kind of behavior towards ICU that I don't want to see towards other implementations, including the one I've written, in response to the new pending Unicode 11 guidance (which I'm requesting be retracted). So at this time I haven't filed an ICU bug on Node's behalf and have instead mentioned the difference between ICU and the WHATWG spec when my input on testing the Node TextDecoder implementation was sought (https://github.com/nodejs/node/issues/13646#issuecomment-308084459).

>> But the matter at hand is decoding potentially-invalid UTF-8 input
>> into a valid in-memory Unicode representation, so later processing is
>> somewhat a red herring as being out of scope for this step. I do agree
>> that if you already know that the data is valid UTF-8, it makes sense
>> to work from the bit pattern definition only.
>
> No, it's not a red herring. Not every piece of software has a neat "inside"
> with all valid text, and with a controllable surface to the "outside".

Fair enough. However, I don't think this supports adopting the ICU behavior as "best practice" when looking at a prominent real-world example of such a system.

The Go programming language is an example of a system that post-dates UTF-8, is even designed by the same people as UTF-8, and where strings in memory are potentially-invalid UTF-8, i.e. there isn't a clear distinction with UTF-8 on the outside and UTF-8 on the inside. (In contrast to e.g. Rust, where the type system maintains a clear distinction between byte buffers and strings, and strings are guaranteed-valid UTF-8.) Go bakes UTF-8 error handling into the language spec by specifying per-code point iteration over potentially-invalid in-memory UTF-8 buffers. See item 2 in the list at https://golang.org/ref/spec#For_range .

The behavior baked into the language is one REPLACEMENT CHARACTER per bogus byte, which is neither the Unicode 10-and-earlier "best practice" nor the ICU behavior. However, it is closer to the Unicode 10-and-earlier "best practice" than to the ICU behavior. (It differs from the Unicode 10-and-earlier behavior only for truncated sequences that form a prefix of a valid sequence.)

(To be clear, I'm not saying that the guidance in the Unicode Standard should be changed to match Go, either. I'm just saying that Go is an example of a prominent system with ambiguous inside and outside for UTF-8 and it exhibits behavior closer to Unicode 10 than to ICU and, therefore, is not a data point in favor of adopting the ICU behavior.)

-- 
Henri Sivonen
hsivo...@mozilla.com