On Tue, May 16, 2017 at 10:22 AM, Asmus Freytag <[email protected]> wrote:
> but I think the way he raises this point is needlessly antagonistic.
I apologize. My level of dismay at the proposal's ICU-centricity overcame me.

On Tue, May 16, 2017 at 10:42 AM, Alastair Houghton <[email protected]> wrote:
> That would be true if the in-memory representation had any effect on what
> we’re talking about, but it really doesn’t.

If the internal representation is UTF-16 (or UTF-32), it is a likely design to have a variable into which the scalar value of the current code point is accumulated during UTF-8 decoding. In such a scenario, it can be argued as "natural" to first operate according to the general structure of UTF-8 and then inspect what ended up in the accumulation variable, ruling out non-shortest forms, values above the Unicode range and surrogate values after the fact.

When the internal representation is UTF-8, only UTF-8 validation is needed, and it's natural to have a fail-fast validator, which *doesn't necessarily need such a scalar value accumulator at all*. The construction at http://bjoern.hoehrmann.de/utf-8/decoder/dfa/, when used as a UTF-8 validator, is the best illustration that a UTF-8 validator doesn't necessarily look like a "natural" UTF-8 to UTF-16 converter at all.

>>> Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick
>>> test with three major browsers that use UTF-16 internally and have
>>> independent (of each other) implementations of UTF-8 decoding
>>> (Firefox, Edge and Chrome) shows agreement on the current spec: there
>>> is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line,
>>> 6 on the second, 4 on the third and 6 on the last line). Changing the
>>> Unicode standard away from that kind of interop needs *way* better
>>> rationale than "feels right".
>
> In what sense is this “interop”?

In the sense that prominent independent implementations do the same externally observable thing.

> Under what circumstance would it matter how many U+FFFDs you see?

Maybe it doesn't, but I don't think the burden of proof should be on the person advocating keeping the spec and major implementations as they are. If anything, those arguing for a change of the spec in the face of browsers, OpenJDK, Python 3 (and, likely, "etc.") agreeing with the current spec should show why it's important to have a different number of U+FFFDs than the spec's "best practice" calls for now.

> If you’re about to mutter something about security, consider this: security
> code *should* refuse to compare strings that contain U+FFFD (or at least
> should never treat them as equal, even to themselves), because it has no way
> to know what that code point represents.

In practice, the Web Platform, for example, doesn't allow for refusing to operate on input that contains a U+FFFD, so the focus is mainly on making sure that U+FFFDs are placed well enough to prevent bad outcomes under normal operations. Typically, the number of U+FFFDs doesn't matter for that purpose, but when browsers agree on the number of U+FFFDs, changing that number should have an overwhelmingly strong rationale. A security reason could be such a rationale, but to my knowledge no security motivation for fewer U+FFFDs has been shown.

> Would you advocate replacing
>
>   e0 80 80
>
> with
>
>   U+FFFD U+FFFD U+FFFD (1)
>
> rather than
>
>   U+FFFD (2)

I advocate (1), most simply because that's what Firefox, Edge and Chrome do *in accordance with the currently-recommended best practice*, and, less simply, because it makes sense in the presence of a fail-fast UTF-8 validator, as sketched below.
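To make that concrete, here is a minimal sketch (illustrative only, not code taken from any of the implementations mentioned above) of a fail-fast validator that emits one U+FFFD per rejection, i.e. per maximal subpart, and then restarts at the byte that caused the failure. The last two lines compare it with Python 3's built-in decoder, which, as noted, follows the current best practice:

    # Illustrative sketch only: count the U+FFFDs that a fail-fast
    # validator following the current best practice (one U+FFFD per
    # maximal subpart of an ill-formed subsequence) would emit.
    def count_replacements(data: bytes) -> int:
        i, count, n = 0, 0, len(data)
        while i < n:
            b = data[i]
            if b <= 0x7F:                       # ASCII
                i += 1
                continue
            # Sequence length and the permitted range for the second byte,
            # per Table 3-7 of the Unicode Standard.
            if 0xC2 <= b <= 0xDF:
                length, lo, hi = 2, 0x80, 0xBF
            elif b == 0xE0:
                length, lo, hi = 3, 0xA0, 0xBF  # excludes overlongs
            elif 0xE1 <= b <= 0xEC or 0xEE <= b <= 0xEF:
                length, lo, hi = 3, 0x80, 0xBF
            elif b == 0xED:
                length, lo, hi = 3, 0x80, 0x9F  # excludes surrogates
            elif b == 0xF0:
                length, lo, hi = 4, 0x90, 0xBF  # excludes overlongs
            elif 0xF1 <= b <= 0xF3:
                length, lo, hi = 4, 0x80, 0xBF
            elif b == 0xF4:
                length, lo, hi = 4, 0x80, 0x8F  # excludes > U+10FFFF
            else:                               # 80..C1, F5..FF: never a lead
                count += 1
                i += 1
                continue
            # Fail fast at the first out-of-range continuation byte and
            # resume *at* that byte; this is what yields one U+FFFD per
            # maximal subpart.
            j = i + 1
            while j < i + length and j < n and lo <= data[j] <= hi:
                lo, hi = 0x80, 0xBF             # later continuations
                j += 1
            if j == i + length:
                i = j                           # well-formed sequence
            else:
                count += 1                      # one U+FFFD per failure
                i = j
        return count

    broken = b"\xe0\x80\x80"
    print(count_replacements(broken))                                # 3, i.e. option (1)
    print(broken.decode("utf-8", errors="replace").count("\ufffd"))  # also 3 in Python 3

The point of the sketch is that nothing in it ever computes a scalar value for the broken sequence; the count of U+FFFDs simply falls out of where validation fails and restarts.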
I think the burden of proof to show an overwhelmingly good reason to change should, at this point, be on whoever proposes doing it differently than what the current widely-implemented spec says.

> It’s pretty clear what the intent of the encoder was there, I’d say, and
> while we certainly don’t want to decode it as a NUL (that was the source of
> previous security bugs, as I recall), I also don’t see the logic in insisting
> that it must be decoded to *three* code points when it clearly only
> represented one in the input.

As noted previously, the logic is that you generate a U+FFFD whenever a fail-fast validator fails.

> This isn’t just a matter of “feels nicer”. (1) is simply illogical
> behaviour, and since behaviours (1) and (2) are both clearly out there today,
> it makes sense to pick the more logical alternative as the official
> recommendation.

Again, the current best practice makes perfect logical sense in the context of a fail-fast UTF-8 validator. Moreover, it doesn't look like both behaviours are "out there" equally when major browsers, OpenJDK and Python 3 agree. (I expect I could find more prominent implementations of the currently-stated best practice, but I feel I shouldn't have to.)

From my experience working on Web standards and implementing them, I think it's a bad idea to change something to be "more logical" when the change would move away from browser consensus.

-- 
Henri Sivonen
[email protected]
https://hsivonen.fi/

