On 5/15/2017 11:50 PM, Henri Sivonen via Unicode wrote:
On Tue, May 16, 2017 at 1:16 AM, Shawn Steele via Unicode
<unicode@unicode.org> wrote:
I’m not sure how the discussion of “which is better” relates to the
discussion of ill-formed UTF-8 at all.
Clearly, the "which is better" issue is distracting from the
underlying issue. I'll clarify what I meant on that point and then
move on:

I acknowledge that UTF-16 as the internal memory representation is the
dominant design. However, UTF-8 as the internal memory
representation is *such a good design* (when legacy constraints permit)
that, *despite it not being the current dominant design*, I think the
Unicode Consortium should be fully supportive of UTF-8 as the internal
memory representation and not treat UTF-16 as the internal
representation as the one true way of doing things that gets
considered when speccing stuff.
There are cases where it is prohibitive to transcode external data from UTF-8 to any other format as a precondition to doing any work. In these situations processing has to be done in UTF-8, effectively making that the in-memory representation. I've encountered this issue on separate occasions, both in my own code and in code I reviewed for clients.
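To make that concrete, here is a minimal sketch (mine, in Rust, purely illustrative) of the kind of processing that can stay in UTF-8: counting scalar values in a single pass over the bytes, with no transcoding step. It assumes the input has already been validated; how ill-formed input should be counted is, of course, exactly the question at hand.

fn count_scalars(utf8: &[u8]) -> usize {
    // Every byte that is not a continuation byte (0b10xxxxxx) begins a
    // new code point, so one pass over the raw bytes is enough.
    utf8.iter().filter(|&&b| (b & 0xC0) != 0x80).count()
}

fn main() {
    let input = "naïve résumé – processed in place";
    // For well-formed UTF-8 this agrees with a scalar-value count done
    // on a decoded string.
    assert_eq!(count_scalars(input.as_bytes()), input.chars().count());
    println!("{} scalar values", count_scalars(input.as_bytes()));
}
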

I therefore think that Henri has a point when he's concerned about tacit assumptions favoring one memory representation over another, but I think the way he raises this point is needlessly antagonistic.
....At the very least a proposal should discuss the impact on the "UTF-8
internally" case, which the proposal at hand doesn't do.

This is a key point. It may not be directly relevant to any other modifications to the standard, but the larger point is not to make assumptions about how people implement the standard (or any of its algorithms).
(Moving on to a different point.)

The matter at hand isn't, however, a new green-field (in terms of
implementations) issue to be decided but a proposed change to a
standard that has many widely-deployed implementations. Even when
observing only "UTF-16 internally" implementations, I think it would
be appropriate for the proposal to include a review of what existing
implementations, beyond ICU, do.
I would like to second this as well.

The level of documented review of existing implementation practices tends to be thin (at least thinner than should be required for changing long-established edge cases or recommendations, let alone core conformance requirements).

Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick
test with three major browsers that use UTF-16 internally and have
independent (of each other) implementations of UTF-8 decoding
(Firefox, Edge and Chrome) shows agreement on the current spec: there
is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line,
6 on the second, 4 on the third and 6 on the last line). Changing the
Unicode standard away from that kind of interop needs *way* better
rationale than "feels right".
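For anyone who wants to reproduce this kind of comparison outside a browser, the following sketch (again mine, and the byte sequences are illustrative rather than the ones on Henri's test page) counts the U+FFFD replacements a lossy UTF-8 decoder emits. Rust's String::from_utf8_lossy follows the maximal-subpart policy; other libraries may differ, which is precisely what a proposal review should document.

fn replacement_count(bytes: &[u8]) -> usize {
    // Decode lossily and count the U+FFFD REPLACEMENT CHARACTERs emitted.
    String::from_utf8_lossy(bytes)
        .chars()
        .filter(|&c| c == '\u{FFFD}')
        .count()
}

fn main() {
    // Illustrative ill-formed inputs: a lone continuation byte, a
    // truncated 4-byte sequence, and an overlong encoding of '/'.
    let samples: [&[u8]; 3] = [b"\x80", b"\xF0\x9F\x92", b"\xC0\xAF"];
    for bytes in samples {
        println!("{:02X?} -> {} U+FFFD", bytes, replacement_count(bytes));
    }
}
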
It would be good if the UTC could work out some minimal requirements for evaluating proposals for changes to properties and algorithms, much like the existing criteria for encoding new code points.
A./