On 5/15/2017 11:50 PM, Henri Sivonen via Unicode wrote:
On Tue, May 16, 2017 at 1:16 AM, Shawn Steele via Unicode
<unicode@unicode.org> wrote:
I’m not sure how the discussion of “which is better” relates to the
discussion of ill-formed UTF-8 at all.
Clearly, the "which is better" issue is distracting from the
underlying issue. I'll clarify what I meant on that point and then
move on:

I acknowledge that UTF-16 as the internal memory representation is the
dominant design. However, UTF-8 as the internal memory
representation is *such a good design* (when legacy constraints permit)
that, *despite it not being the current dominant design*, I think the
Unicode Consortium should be fully supportive of UTF-8 as the internal
memory representation and not treat UTF-16 as the internal
representation as the one true way of doing things that gets
considered when speccing stuff.
There are cases where it is prohibitive to transcode external data from UTF-8 to any other format as a precondition to doing any work. In these situations processing has to be done in UTF-8, effectively making that the in-memory representation. I've encountered this issue on separate occasions, both in my own code and in code I reviewed for clients.
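To make that concrete, here is a minimal sketch (mine, in Rust, purely illustrative) of the kind of processing that can stay in UTF-8: counting scalar values in a single pass over the bytes, with no transcoding step. It assumes the input has already been validated; how ill-formed input should be counted is, of course, exactly the question at hand.

fn count_scalars(utf8: &[u8]) -> usize {
    // Every byte that is not a continuation byte (0b10xxxxxx) begins a
    // new code point, so one pass over the raw bytes is enough.
    utf8.iter().filter(|&&b| (b & 0xC0) != 0x80).count()
}

fn main() {
    let input = "naïve résumé – processed in place";
    // For well-formed UTF-8 this agrees with a scalar-value count done
    // on a decoded string.
    assert_eq!(count_scalars(input.as_bytes()), input.chars().count());
    println!("{} scalar values", count_scalars(input.as_bytes()));
}
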

I therefore think that Henri has a point when he's concerned about tacit assumptions favoring one memory representation over another, but I think the way he raises this point is needlessly antagonistic.
....At the very least a proposal should discuss the impact on the "UTF-8
internally" case, which the proposal at hand doesn't do.

This is a key point. It may not be directly relevant to any other modifications to the standard, but the larger point is not to make assumptions about how people implement the standard (or any of its algorithms).
(Moving on to a different point.)

The matter at hand isn't, however, a new green-field (in terms of
implementations) issue to be decided but a proposed change to a
standard that has many widely-deployed implementations. Even when
observing only "UTF-16 internally" implementations, I think it would
be appropriate for the proposal to include a review of what existing
implementations, beyond ICU, do.
I would like to second this as well.

The level of documented review of existing implementation practices tends to be thin (at least thinner than should be required for changing long-established edge cases or recommendations, let alone core conformance requirements).

Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick
test with three major browsers that use UTF-16 internally and have
independent (of each other) implementations of UTF-8 decoding
(Firefox, Edge and Chrome) shows agreement on the current spec: there
is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line,
6 on the second, 4 on the third and 6 on the last line). Changing the
Unicode standard away from that kind of interop needs *way* better
rationale than "feels right".
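For anyone who wants to reproduce this kind of comparison outside a browser, the following sketch (again mine, and the byte sequences are illustrative rather than the ones on Henri's test page) counts the U+FFFD replacements a lossy UTF-8 decoder emits. Rust's String::from_utf8_lossy follows the maximal-subpart policy; other libraries may differ, which is precisely what a proposal review should document.

fn replacement_count(bytes: &[u8]) -> usize {
    // Decode lossily and count the U+FFFD REPLACEMENT CHARACTERs emitted.
    String::from_utf8_lossy(bytes)
        .chars()
        .filter(|&c| c == '\u{FFFD}')
        .count()
}

fn main() {
    // Illustrative ill-formed inputs: a lone continuation byte, a
    // truncated 4-byte sequence, and an overlong encoding of '/'.
    let samples: [&[u8]; 3] = [b"\x80", b"\xF0\x9F\x92", b"\xC0\xAF"];
    for bytes in samples {
        println!("{:02X?} -> {} U+FFFD", bytes, replacement_count(bytes));
    }
}
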
It would be good if the UTC could work out some minimal requirements for evaluating proposals for changes to properties and algorithms, much like the existing criteria for encoding new code points.
A./