On Tue, May 16, 2017 at 1:45 AM Alastair Houghton <[email protected]> wrote:
> That’s true anyway; imagine the database holds raw bytes, that just happen
> to decode to U+FFFD. There might seem to be *two* names that both contain
> U+FFFD in the same place. How do you distinguish between them?

If the database holds raw bytes, then the name is a byte string, not a
Unicode string, and can't contain U+FFFD at all. It's a relatively easy rule
to make and enforce that a string in a database is a validly formatted
string; I would hope that most SQL servers do in fact reject malformed UTF-8
strings. On the other hand, I'd expect that an SQL server would accept
U+FFFD in a Unicode string.

> I don’t see a problem; the point is that where a structurally valid UTF-8
> encoding has been used, albeit in an invalid manner (e.g. encoding a number
> that is not a valid code point, or encoding a valid code point as an
> over-long sequence), a single U+FFFD is appropriate. That seems a
> perfectly sensible rule to adopt.

It seems like a perfectly arbitrary rule to adopt; I'd like to assume that
the only source of such UTF-8 data is willful attempts to break security,
and in that case, how is this a win? Non-attack sources of broken data are
much more likely to be the result of mixing UTF-8 with other character
encodings or raw binary data.
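For illustration, here is a sketch of what CPython 3's decoder does with the
two cases quoted above, assuming the stock 'replace' error handler (Python
replaces each maximal subpart of an ill-formed sequence, rather than the
whole structurally UTF-8-shaped run):

    # Over-long encoding of '/' (0xC0 0xAF): 0xC0 is never a valid lead
    # byte, so each byte is replaced separately -> two U+FFFDs.
    print(b'\xc0\xaf'.decode('utf-8', errors='replace'))

    # Encoded surrogate U+D800 (0xED 0xA0 0x80): structurally shaped like
    # UTF-8 but not a valid code point -> three U+FFFDs here, where the
    # single-U+FFFD rule argued for above would yield just one.
    print(b'\xed\xa0\x80'.decode('utf-8', errors='replace'))

So the two policies are observably different, which is exactly why the
choice of rule matters for round-tripping and comparison.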

