On Tue, May 16, 2017 at 12:42 AM Alastair Houghton <
alast...@alastairs-place.net> wrote:

> If you’re about to mutter something about security, consider this:
> security code *should* refuse to compare strings that contain U+FFFD (or at
> least should never treat them as equal, even to themselves), because it has
> no way to know what that code point represents.
>

That causes security problems of its own: if an object (file, database
element, etc.) gets a name containing U+FFFD, it becomes impossible to
reference. An IEEE 754 NaN does not compare equal to itself, and that is a
perpetual source of confusion for programmers.
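
To make both points concrete, here is a minimal Python sketch. StrictName
and lookup are hypothetical names of mine, not from any library; the class
just assumes the "never equal if it contains U+FFFD" rule proposed above.

```python
nan = float("nan")
nan_equals_itself = (nan == nan)  # IEEE 754: NaN compares unequal to everything


class StrictName:
    """Hypothetical name type: never equal if either side contains U+FFFD."""

    def __init__(self, text: str) -> None:
        self.text = text

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, StrictName):
            return NotImplemented
        if "\ufffd" in self.text or "\ufffd" in other.text:
            return False  # the proposed rule: refuse to call these equal
        return self.text == other.text

    def __hash__(self) -> int:
        return hash(self.text)


def lookup(store: dict, text: str):
    """Look up a name built from fresh input, as a later request would."""
    return store.get(StrictName(text))


files = {
    StrictName("notes.txt"): "ordinary file",
    StrictName("report\ufffd.txt"): "file with a replacement character",
}

print(nan_equals_itself)                  # the float case
print(lookup(files, "notes.txt"))         # ordinary name is found
print(lookup(files, "report\ufffd.txt"))  # same text, but never "equal"
```

The entry is stored, but a name rebuilt from the same text can no longer
reach it. (CPython's dict checks object identity before calling __eq__, so
only lookups through a fresh object fail, which is the realistic case.)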


> Would you advocate replacing
>
>   e0 80 80
>
> with
>
>   U+FFFD U+FFFD U+FFFD     (1)
>
> rather than
>
>   U+FFFD                   (2)
>
> It’s pretty clear what the intent of the encoder was there, I’d say, and
> while we certainly don’t want to decode it as a NUL (that was the source of
> previous security bugs, as I recall), I also don’t see the logic in
> insisting that it must be decoded to *three* code points when it clearly
> only represented one in the input.
>

In this case it's pretty clear, but I don't see it as a general rule.  Any
rule has to handle e0 e0 e0 or 80 80 80, or any variety of charset,
mojibake, or random binary data. 88 A0 8B D4 is Chinese text in UTF-16, but
I'm not going to insist that it be replaced with U+FFFD U+FFFD just because
it's clear (to me) that it was meant as two characters.
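
For what it's worth, here is a small sketch of what one existing decoder
does with these byte sequences. It uses CPython's built-in UTF-8 codec with
errors="replace", which is only one possible policy and not necessarily the
one anybody in this thread is arguing for; count_replacements is a helper
name I made up.

```python
def count_replacements(raw: bytes) -> int:
    """Decode as UTF-8, substituting U+FFFD for errors, and count the U+FFFDs."""
    return raw.decode("utf-8", errors="replace").count("\ufffd")


samples = {
    "e0 80 80": bytes.fromhex("e08080"),       # overlong encoding of NUL
    "e0 e0 e0": bytes.fromhex("e0e0e0"),       # three lead bytes
    "80 80 80": bytes.fromhex("808080"),       # three stray continuation bytes
    "88 a0 8b d4": bytes.fromhex("88a08bd4"),  # the UTF-16 example
}

for label, raw in samples.items():
    print(label, "->", count_replacements(raw), "replacement character(s)")

# The last sample read as big-endian UTF-16 instead of UTF-8.
as_utf16 = bytes.fromhex("88a08bd4").decode("utf-16-be")
print(len(as_utf16), "code points when read as UTF-16BE")
```

The decoder applies the same per-byte rule to every sample; it makes no
attempt to guess how many characters the producer "meant".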
