On Thu, 1 Jun 2017 19:19:51 -0700 Ken Whistler via Unicode <unicode@unicode.org> wrote:
> > and therefore should start a
> > sequence of 6 characters.
>
> That is completely false, and has nothing to do with the current
> definition of UTF-8.
>
> The current, normative definition of UTF-8, in the Unicode Standard,
> and in ISO/IEC 10646:2014, and in RFC 3629 (which explicitly
> "obsoletes and replaces RFC 2279") states clearly that 0xFC cannot
> start a sequence of anything identifiable as UTF-8.

TUS Chapter 3 is like the Augean Stables. It is a complete mess as a
standards document, imputing mental states to computing processes.

Table 3-7, for example, should be a consequence of a 'definition' that
UTF-8 only represents Unicode scalar values and excludes 'non-shortest
forms'. Instead, the exclusion of the sequence <ED A0 80> is presented
as a brute definition, rather than as a consequence of 0xD800 not being
a Unicode scalar value. Likewise, 0xFC fails to be legal because any
sequence it started would encode either a 'non-shortest form' or a
value that is not a Unicode scalar value.

The two approaches differ only in presentation; the outcome as to what
is permitted is the same. What differs is whether the rules are
comprehensible. A comprehensible definition is more likely to be
implemented correctly. Where the presentation does make a practical
difference is in how malformed sequences are naturally handled.

Richard.
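
As a rough illustration of the definition-first presentation argued for
above, here is a minimal Python sketch. It assumes the old RFC 2279 bit
pattern (sequences of up to 6 bytes) as the raw decoding step and then
applies just two rules: shortest form, and Unicode scalar values only.
The names MIN_VALUE, is_scalar_value, sequence_length and decode_one are
hypothetical, chosen for this example; nothing here is taken from the
text of the standard, and it checks one complete sequence at a time
rather than scanning a stream.

```python
# Smallest value that genuinely needs each sequence length in the
# RFC 2279 bit pattern; anything below it is a non-shortest form.
MIN_VALUE = {1: 0x0, 2: 0x80, 3: 0x800, 4: 0x10000, 5: 0x200000, 6: 0x4000000}


def is_scalar_value(cp):
    """Unicode scalar values: 0..0x10FFFF excluding surrogates D800..DFFF."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def sequence_length(lead):
    """Length implied by the lead byte's bit pattern, or 0 if not a lead byte."""
    if lead < 0x80:
        return 1          # 0xxxxxxx
    if lead < 0xC0:
        return 0          # 10xxxxxx is a continuation byte
    if lead < 0xE0:
        return 2          # 110xxxxx
    if lead < 0xF0:
        return 3          # 1110xxxx
    if lead < 0xF8:
        return 4          # 11110xxx
    if lead < 0xFC:
        return 5          # 111110xx
    if lead < 0xFE:
        return 6          # 1111110x
    return 0              # 0xFE and 0xFF never appear


def decode_one(data):
    """Return the scalar value encoded by all of `data`, or None if ill-formed."""
    if not data:
        return None
    n = sequence_length(data[0])
    if n == 0 or len(data) != n:
        return None
    # Payload bits of the lead byte.
    value = data[0] if n == 1 else data[0] & (0x7F >> n)
    for b in data[1:]:
        if b & 0xC0 != 0x80:
            return None   # not a continuation byte
        value = (value << 6) | (b & 0x3F)
    if value < MIN_VALUE[n]:
        return None       # rule 1: non-shortest form
    if not is_scalar_value(value):
        return None       # rule 2: not a Unicode scalar value
    return value
```

With these two rules, <ED A0 80> is rejected because it decodes to
0xD800, and every 6-byte sequence starting with 0xFC is rejected because
it is either a non-shortest form or encodes a value of at least
0x4000000, which is above 0x10FFFF. No separate table of permitted byte
ranges is consulted.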