On Tue, May 16, 2017 at 10:22 AM, Asmus Freytag <[email protected]> wrote:
> but I think the way he raises this point is needlessly antagonistic.
I apologize. My level of dismay at the proposal's ICU-centricity overcame me.

On Tue, May 16, 2017 at 10:42 AM, Alastair Houghton <[email protected]> wrote:
> That would be true if the in-memory representation had any effect on what
> we’re talking about, but it really doesn’t.

If the internal representation is UTF-16 (or UTF-32), it is a likely design to have a variable into which the scalar value of the current code point is accumulated during UTF-8 decoding. In such a scenario, it can be argued as "natural" to first operate according to the general structure of UTF-8 and then inspect what ended up in the accumulation variable, ruling out non-shortest forms, values above the Unicode range and surrogate values after the fact.

When the internal representation is UTF-8, only UTF-8 validation is needed, and it's natural to have a fail-fast validator, which *doesn't necessarily need such a scalar value accumulator at all*. The construction at http://bjoern.hoehrmann.de/utf-8/decoder/dfa/, when used as a UTF-8 validator, is the best illustration that a UTF-8 validator doesn't necessarily look like a "natural" UTF-8 to UTF-16 converter at all.

>>> Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick
>>> test with three major browsers that use UTF-16 internally and have
>>> independent (of each other) implementations of UTF-8 decoding
>>> (Firefox, Edge and Chrome) shows agreement on the current spec: there
>>> is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line,
>>> 6 on the second, 4 on the third and 6 on the last line). Changing the
>>> Unicode standard away from that kind of interop needs *way* better
>>> rationale than "feels right".
>
> In what sense is this “interop”?

In the sense that prominent independent implementations do the same externally observable thing.

> Under what circumstance would it matter how many U+FFFDs you see?

Maybe it doesn't, but I don't think the burden of proof should be on the person advocating keeping the spec and major implementations as they are. If anything, those arguing for a change of the spec in the face of browsers, OpenJDK, Python 3 (and, likely, "etc.") agreeing with the current spec should show why it's important to have a different number of U+FFFDs than the spec's "best practice" calls for now.

> If you’re about to mutter something about security, consider this: security
> code *should* refuse to compare strings that contain U+FFFD (or at least
> should never treat them as equal, even to themselves), because it has no way
> to know what that code point represents.

In practice, the Web Platform, for example, doesn't allow for refusing to operate on input that contains a U+FFFD, so the focus is mainly on making sure that U+FFFDs are placed well enough to prevent bad outcomes under normal operations. Typically, the number of U+FFFDs doesn't matter for that purpose, but when browsers agree on the number of U+FFFDs, changing that number should have an overwhelmingly strong rationale. A security reason could be such a rationale, but to my knowledge no security motivation for fewer U+FFFDs has been shown.

> Would you advocate replacing
>
>   e0 80 80
>
> with
>
>   U+FFFD U+FFFD U+FFFD (1)
>
> rather than
>
>   U+FFFD (2)

I advocate (1), most simply because that's what Firefox, Edge and Chrome do *in accordance with the currently-recommended best practice*, and, less simply, because it makes sense in the presence of a fail-fast UTF-8 validator, as sketched below.
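To make that concrete, here is a minimal sketch (illustrative only, not code taken from any of the implementations mentioned above) of a fail-fast validator that emits one U+FFFD per rejection, i.e. per maximal subpart, and then restarts at the byte that caused the failure. The last two lines compare it with Python 3's built-in decoder, which, as noted, follows the current best practice:

    # Illustrative sketch only: count the U+FFFDs that a fail-fast
    # validator following the current best practice (one U+FFFD per
    # maximal subpart of an ill-formed subsequence) would emit.
    def count_replacements(data: bytes) -> int:
        i, count, n = 0, 0, len(data)
        while i < n:
            b = data[i]
            if b <= 0x7F:                       # ASCII
                i += 1
                continue
            # Sequence length and the permitted range for the second byte,
            # per Table 3-7 of the Unicode Standard.
            if 0xC2 <= b <= 0xDF:
                length, lo, hi = 2, 0x80, 0xBF
            elif b == 0xE0:
                length, lo, hi = 3, 0xA0, 0xBF  # excludes overlongs
            elif 0xE1 <= b <= 0xEC or 0xEE <= b <= 0xEF:
                length, lo, hi = 3, 0x80, 0xBF
            elif b == 0xED:
                length, lo, hi = 3, 0x80, 0x9F  # excludes surrogates
            elif b == 0xF0:
                length, lo, hi = 4, 0x90, 0xBF  # excludes overlongs
            elif 0xF1 <= b <= 0xF3:
                length, lo, hi = 4, 0x80, 0xBF
            elif b == 0xF4:
                length, lo, hi = 4, 0x80, 0x8F  # excludes > U+10FFFF
            else:                               # 80..C1, F5..FF: never a lead
                count += 1
                i += 1
                continue
            # Fail fast at the first out-of-range continuation byte and
            # resume *at* that byte; this is what yields one U+FFFD per
            # maximal subpart.
            j = i + 1
            while j < i + length and j < n and lo <= data[j] <= hi:
                lo, hi = 0x80, 0xBF             # later continuations
                j += 1
            if j == i + length:
                i = j                           # well-formed sequence
            else:
                count += 1                      # one U+FFFD per failure
                i = j
        return count

    broken = b"\xe0\x80\x80"
    print(count_replacements(broken))                                # 3, i.e. option (1)
    print(broken.decode("utf-8", errors="replace").count("\ufffd"))  # also 3 in Python 3

The point of the sketch is that nothing in it ever computes a scalar value for the broken sequence; the count of U+FFFDs simply falls out of where validation fails and restarts.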
I think the burden of proof to show an overwhelmingly good reason to change should, at this point, be on whoever proposes doing it differently than what the current widely-implemented spec says.

> It’s pretty clear what the intent of the encoder was there, I’d say, and
> while we certainly don’t want to decode it as a NUL (that was the source of
> previous security bugs, as I recall), I also don’t see the logic in insisting
> that it must be decoded to *three* code points when it clearly only
> represented one in the input.

As noted previously, the logic is that you generate a U+FFFD whenever a fail-fast validator fails.

> This isn’t just a matter of “feels nicer”. (1) is simply illogical
> behaviour, and since behaviours (1) and (2) are both clearly out there today,
> it makes sense to pick the more logical alternative as the official
> recommendation.

Again, the current best practice makes perfect logical sense in the context of a fail-fast UTF-8 validator. Moreover, it doesn't look like both behaviours are "out there" equally when major browsers, OpenJDK and Python 3 agree. (I expect I could find more prominent implementations of the currently-stated best practice, but I feel I shouldn't have to.)

From my experience working on Web standards and implementing them, I think it's a bad idea to change something to be "more logical" when the change would move away from browser consensus.

-- 
Henri Sivonen
[email protected]
https://hsivonen.fi/

