On 6/1/2017 2:39 PM, Richard Wordingham via Unicode wrote:
> You were implicitly invited to argue that there was no need to handle
> 5 and 6 byte invalid sequences.
Well, working from the *current* specification:
FC 80 80 80 80 80
and
FF FF FF FF FF FF
are equal trash, uninterpretable as *anything* in UTF-8.
By definition D93b, either sequence of bytes, if encountered by a
conformant UTF-8 conversion process, would be interpreted as a sequence
of 6 maximal subparts of an ill-formed subsequence. Whatever your
particular strategy for conversion fallbacks for uninterpretable
sequences, it ought to treat either one of those trash sequences the
same, in my book.
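
To make the "6 maximal subparts" claim concrete, here is a minimal Python sketch of the splitting rule, assuming the well-formed byte ranges of Table 3-7 in the current standard. The names trail_ranges and split_utf8 are made up for illustration; this is not any particular library's converter, just one way to check the count by hand.

```python
def trail_ranges(lead):
    """Allowed (lo, hi) ranges for the bytes following `lead`.

    Returns None when `lead` can never start a well-formed sequence.
    Ranges follow the well-formed UTF-8 byte sequence table (assumed here).
    """
    cont = (0x80, 0xBF)
    if lead <= 0x7F:
        return []
    if 0xC2 <= lead <= 0xDF:
        return [cont]
    if lead == 0xE0:
        return [(0xA0, 0xBF), cont]
    if lead == 0xED:
        return [(0x80, 0x9F), cont]
    if 0xE1 <= lead <= 0xEF:
        return [cont, cont]
    if lead == 0xF0:
        return [(0x90, 0xBF), cont, cont]
    if lead == 0xF4:
        return [(0x80, 0x8F), cont, cont]
    if 0xF1 <= lead <= 0xF3:
        return [cont, cont, cont]
    return None  # 80..C1 and F5..FF, which covers both FC and FF


def split_utf8(data):
    """Split bytes into (chunk, is_well_formed) pairs.

    Each ill-formed chunk is one maximal subpart: either a truncated
    prefix of a well-formed sequence, or a single byte.
    """
    out = []
    i = 0
    while i < len(data):
        ranges = trail_ranges(data[i])
        if ranges is None:
            out.append((data[i:i + 1], False))
            i += 1
            continue
        j = i + 1
        for lo, hi in ranges:
            if j < len(data) and lo <= data[j] <= hi:
                j += 1
            else:
                break
        out.append((data[i:j], j - i == 1 + len(ranges)))
        i = j
    return out


for raw in (bytes.fromhex("FC8080808080"), bytes.fromhex("FFFFFFFFFFFF")):
    parts = split_utf8(raw)
    print(raw.hex(" "), "->", len(parts), "subparts")
```

Both inputs come out as six one-byte ill-formed chunks, so a fallback that substitutes one U+FFFD per maximal subpart emits six replacement characters for either sequence. Neither FC nor FF is a possible lead byte, so nothing after it gets grouped with it.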
I don't see a good reason to build in special logic to treat FC 80 80 80
80 80 as somehow privileged as a unit for conversion fallback, simply
because *if* UTF-8 were defined as the Unix gods intended (which it
ain't no longer) then that sequence *could* be interpreted as an
out-of-bounds scalar value (which it ain't) on spec that the codespace
*might* be extended past 10FFFF at some indefinite time in the future
(which it won't).
--Ken