On Tue, 16 May 2017 20:08:52 +0900 "Martin J. Dürst via Unicode" <[email protected]> wrote:
> I agree with others that ICU should not be considered to have a
> special status, it should be just one implementation among others.
> [The next point is a side issue, please don't spend too much time on
> it.] I find it particularly strange that at a time when UTF-8 is
> firmly defined as up to 4 bytes, never including any bytes above
> 0xF4, the Unicode consortium would want to consider recommending that
> <FD 81 82 83 84 85> be converted to a single U+FFFD. I note with
> agreement that Markus seems to have thoughts in the same direction,
> because the proposal (17168-utf-8-recommend.pdf) says "(I suppose
> that lead bytes above F4 could be somewhat debatable.)".

The undesirable sidetrack, I suppose, is worrying about how many planes
will be required for emoji.

However, it does make the point that, while some practices may be
better than others, there isn't necessarily a best practice.

The English of the proposal is unclear - the text would benefit from
showing some maximal subsequences (poor terminology - some of us are
used to non-contiguous subsequences). When he writes, "For UTF-8,
recommend evaluating maximal subsequences based on the original
structural definition of UTF-8, without ever restricting trail bytes to
less than 80..BF", I am pretty sure he means "For UTF-8, recommend
evaluating maximal subsequences based on the original structural
definition of UTF-8, with the only restriction on trailing bytes beyond
the number of them being that they must be in the range 80..BF".

Thus Philippe's example of "E0 E0 C3 89" would be converted, with an
error flagged, to the sequence of scalar values FFFD FFFD C9.

This may make a UTF-8 system usable if it tries to use something like
non-characters as understood before CLDR was caught publishing them as
an essential part of text files.

Richard.
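
For illustration only, the reading given above can be sketched in Python. This is a minimal sketch under stated assumptions, not the proposal's own algorithm and not ICU's code: the function names are hypothetical, "original structural definition" is taken to mean the old layout with lead bytes C0..FD and sequences of up to 6 bytes, and each maximal run of a lead byte plus its following trail bytes (80..BF) that does not decode to a valid scalar value is replaced by a single U+FFFD.

```python
REPLACEMENT = 0xFFFD

# Smallest scalar value that legitimately needs a sequence of each length.
MIN_VALUE = {2: 0x80, 3: 0x800, 4: 0x10000}


def expected_length(lead):
    """Length implied by a lead byte in the old 6-byte layout, or 0 if none."""
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    if 0xF8 <= lead <= 0xFB:
        return 5
    if 0xFC <= lead <= 0xFD:
        return 6
    return 0


def decode_structural(data):
    """Decode bytes to a list of scalar values (hypothetical helper).

    Assumption: trail bytes are restricted only to 80..BF and to the count
    implied by the lead byte; each ill-formed run yields one U+FFFD.
    """
    out = []
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                      # ASCII passes through
            out.append(b)
            i += 1
            continue
        need = expected_length(b)
        if need == 0:                     # lone trail byte, or FE/FF
            out.append(REPLACEMENT)
            i += 1
            continue
        # Take as many trail bytes as are present, up to the implied count.
        j = i + 1
        while j < n and j < i + need and 0x80 <= data[j] <= 0xBF:
            j += 1
        seq = data[i:j]
        i = j
        if len(seq) == need and need <= 4:
            cp = seq[0] & (0xFF >> (need + 1))
            for t in seq[1:]:
                cp = (cp << 6) | (t & 0x3F)
            well_formed = (
                cp >= MIN_VALUE[need]             # not overlong
                and cp <= 0x10FFFF                # within the code space
                and not 0xD800 <= cp <= 0xDFFF    # not a surrogate
            )
            if well_formed:
                out.append(cp)
                continue
        out.append(REPLACEMENT)           # one U+FFFD for the whole run
    return out


print([format(cp, "04X") for cp in decode_structural(bytes.fromhex("E0E0C389"))])
print([format(cp, "04X") for cp in decode_structural(bytes.fromhex("FD8182838485"))])
```

Under these assumptions "E0 E0 C3 89" yields FFFD FFFD C9, and <FD 81 82 83 84 85> yields a single FFFD. A decoder that instead restricts the trail bytes according to the lead byte would give more than one FFFD for a sequence such as E0 80 80, which is where the two practices differ.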

