On Tue, Dec 20, 2016 at 8:59 AM, Ken Whistler <[email protected]> wrote:
> You found the resulting text in TUS 9.0, p. 126 - 129. The origin of the
> text there about best practices for using U+FFFD was the discussion and
> resolution of PRI #121 in August, 2008:
>
> http://www.unicode.org/review/pr-121.html
>

Yes.

However, some of the discussion in this thread is due to details that were not spelled out in the PRI. There is basically a 2a and a 2b, while the examples in PRI #121 work the same in both variants.

2a. As Richard said, "The natural logic is to read the requisite number of continuation bytes, converting the whole to a codepoint value, and then check that the codepoint value is allowed in UTF-8. Obviously one also has to check that the requisite continuation bytes are present."
This naturally treats overlong sequences, surrogate-code-point sequences, and 5/6-byte sequences (and prefixes thereof) as single errors. (I suppose that lead byte above F4 could be somewhat debatable.)
(This is what ICU does for UTF-8.)

2b. The text in the standard represents the workings of a state machine that walks strictly valid sequences. Overlong/surrogate/etc. sequences become multiple errors.
(This is what ICU converters do for multi-byte charsets like Shift-JIS and GB 18030.)

In my opinion, 2a. "feels right" for UTF-8, because of the history and mechanics of the encoding, and 2b. is a good fit for MBCS where concepts like overlong sequences don't exist. (And for GB 18030 you do have to walk a validity state machine, you can't just look at the lead byte.)

markus
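To make the 2a/2b difference concrete, here is a minimal Python sketch of both policies. It is an illustration only, not ICU's code: the names `decode_2a`, `decode_2b` and `second_byte_range` are hypothetical, and treating lead bytes F5..FD as starting 4/5/6-byte forms in 2a is an assumption about the "debatable" case mentioned above. The 2b byte ranges follow Table 3-7 (Well-Formed UTF-8 Byte Sequences) of the Unicode Standard.

```python
REPLACEMENT = "\ufffd"

def decode_2a(data: bytes) -> str:
    """2a: read the trail bytes the lead byte asks for, then validate the value."""
    out, i, n = [], 0, len(data)
    while i < n:
        lead = data[i]
        i += 1
        if lead < 0x80:
            out.append(chr(lead))
            continue
        # Trail-byte count and payload bits are taken from the lead byte alone.
        if 0xC0 <= lead <= 0xDF:
            need, cp = 1, lead & 0x1F
        elif 0xE0 <= lead <= 0xEF:
            need, cp = 2, lead & 0x0F
        elif 0xF0 <= lead <= 0xF7:
            need, cp = 3, lead & 0x07
        elif 0xF8 <= lead <= 0xFB:
            need, cp = 4, lead & 0x03
        elif 0xFC <= lead <= 0xFD:
            need, cp = 5, lead & 0x01
        else:  # stray continuation byte (80..BF) or FE/FF: one error
            out.append(REPLACEMENT)
            continue
        got = 0
        while got < need and i < n and 0x80 <= data[i] <= 0xBF:
            cp = (cp << 6) | (data[i] & 0x3F)
            i += 1
            got += 1
        # Smallest value that legitimately needs this many trail bytes.
        min_cp = (0x80, 0x800, 0x10000, 0x200000, 0x4000000)[need - 1]
        ok = (got == need and min_cp <= cp <= 0x10FFFF
              and not 0xD800 <= cp <= 0xDFFF)
        # Overlong, surrogate, out-of-range and truncated forms: ONE error each.
        out.append(chr(cp) if ok else REPLACEMENT)
    return "".join(out)

def second_byte_range(lead: int):
    """Return (trail count, lowest, highest allowed second byte), or None."""
    if 0xC2 <= lead <= 0xDF: return 1, 0x80, 0xBF
    if lead == 0xE0:         return 2, 0xA0, 0xBF  # excludes overlong forms
    if lead == 0xED:         return 2, 0x80, 0x9F  # excludes surrogates
    if 0xE1 <= lead <= 0xEF: return 2, 0x80, 0xBF
    if lead == 0xF0:         return 3, 0x90, 0xBF  # excludes overlong forms
    if lead == 0xF4:         return 3, 0x80, 0x8F  # excludes > U+10FFFF
    if 0xF1 <= lead <= 0xF3: return 3, 0x80, 0xBF
    return None  # C0, C1, F5..FF and stray trail bytes never start a sequence

def decode_2b(data: bytes) -> str:
    """2b: walk only strictly valid sequences; stop at the first bad byte."""
    out, i, n = [], 0, len(data)
    while i < n:
        lead = data[i]
        i += 1
        if lead < 0x80:
            out.append(chr(lead))
            continue
        row = second_byte_range(lead)
        if row is None:
            out.append(REPLACEMENT)
            continue
        need, lo, hi = row
        cp = lead & (0xFF >> (need + 2))
        ok = True
        for _ in range(need):
            if i < n and lo <= data[i] <= hi:
                cp = (cp << 6) | (data[i] & 0x3F)
                i += 1
                lo, hi = 0x80, 0xBF  # later trail bytes use the full range
            else:
                ok = False  # offending byte is NOT consumed; it is rescanned
                break
        out.append(chr(cp) if ok else REPLACEMENT)
    return "".join(out)
```

With these definitions, the surrogate form ED A0 80 gives one U+FFFD under `decode_2a` and three under `decode_2b`, while input that contains only truncated sequences and stray trail bytes (for example F0 90 80 followed by ASCII) comes out the same from both, which is why examples of that kind cannot tell the two variants apart.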

