Re: Running out of code points, redux (was: Re: Feedback on the proposal...)

Ken Whistler via Unicode Thu, 01 Jun 2017 17:16:44 -0700


On 6/1/2017 2:39 PM, Richard Wordingham via Unicode wrote:

You were implicitly invited to argue that there was no need to handle
5 and 6 byte invalid sequences.


Well, working from the *current* specification:

FC 80 80 80 80 80
and
FF FF FF FF FF FF

are equal trash, uninterpretable as *anything* in UTF-8.

By definition D39b, either sequence of bytes, if encountered by a conformant UTF-8 conversion process, would be interpreted as a sequence of 6 maximal subparts of an ill-formed subsequence. Whatever your particular strategy for conversion fallbacks for uninterpretable sequences, it ought to treat either one of those trash sequences the same, in my book.
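A quick way to see this in practice is the sketch below. It assumes CPython 3's built-in UTF-8 decoder with errors="replace", which to my knowledge emits one U+FFFD per maximal subpart; the variable names are purely illustrative, and other decoders may apply a different fallback policy.

```python
# Illustrative sketch: decode both "trash" sequences with Python's
# built-in UTF-8 codec, substituting U+FFFD for each undecodable piece.
# Assumption: the decoder replaces one maximal subpart at a time.

trash_a = bytes.fromhex("FC 80 80 80 80 80")  # old-style 6-byte form
trash_b = bytes.fromhex("FF FF FF FF FF FF")  # never valid in any UTF-8

decoded_a = trash_a.decode("utf-8", errors="replace")
decoded_b = trash_b.decode("utf-8", errors="replace")

# Count the replacement characters produced for each input.
count_a = decoded_a.count("\ufffd")
count_b = decoded_b.count("\ufffd")

print(count_a, count_b, decoded_a == decoded_b)
```

If the decoder behaves as assumed, neither input gets special treatment: each byte is its own ill-formed unit and the two results are indistinguishable.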

I don't see a good reason to build in special logic to treat FC 80 80 80 80 80 as somehow privileged as a unit for conversion fallback, simply because *if* UTF-8 were defined as the Unix gods intended (which it ain't no longer) then that sequence *could* be interpreted as an out-of-bounds scalar value (which it ain't) on spec that the codespace *might* be extended past 10FFFF at some indefinite time in the future (which it won't).


--Ken
