Re: UTF-8 ill-formed question

Otto Stolz Sun, 16 Dec 2012 04:25:40 -0800

Hello,

On 2012-12-15, Philippe Verdy wrote:

> But there's still a bug (or request for enhancement) for your Pocket
> converters:
>
> - For UTF-16 you correctly exclude the range U+D800..U+DFFF (surrogates)
> from the sets of convertible codepoints.
>
> - But you don't exclude this range in the case of your UTF-8 and UTF-32
> "magic encoders" which could forget this case. Of course your encoder would
> create distinct sequences for these code points, but they are not valid
> UTF-8 or valid UTF-32 encodings.


Only the UTF-16 variant is really *my* “magic pocket encoder” (MPE);
the author is named on each of the three.

I would not demand more from those MPEs than converting
a valid UCS character to a valid, and equivalent, UTF
sequence – and to illustrate the underlying algorithm.
I guess, originally, they were meant as jokes – partially,
at least; I have used them as a didactic device, in my
beginner's lecture in Unicode.
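
To make the point about surrogates concrete, here is a minimal Python
sketch. It is not the algorithm printed on any of the MPEs; the names
is_scalar_value, encode_utf8 and encode_utf32be are invented for this
illustration. It only shows where a surrogate check would sit in a
UTF-8 or UTF-32 encoder, under the assumption that the input is a
single code point given as an integer.

```python
def is_scalar_value(cp: int) -> bool:
    """True if cp is a Unicode scalar value: in range and not a surrogate."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def encode_utf8(cp: int) -> bytes:
    """Encode one code point as UTF-8, rejecting surrogates."""
    if not is_scalar_value(cp):
        raise ValueError(f"U+{cp:04X} is not a Unicode scalar value")
    if cp < 0x80:        # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def encode_utf32be(cp: int) -> bytes:
    """Encode one code point as big-endian UTF-32, rejecting surrogates."""
    if not is_scalar_value(cp):
        raise ValueError(f"U+{cp:04X} is not a Unicode scalar value")
    return cp.to_bytes(4, "big")
```

Without the is_scalar_value check, the three-byte branch would still
produce a distinct byte sequence for, say, U+D800, but that sequence
would be ill-formed UTF-8.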

Clearly, Mike Ayers made the point that the UTF-32 encoding
is nothing but a simple shortcut (in the terms of its two
predecessors). His one-row-only MPE expresses this quite
aptly, and any additional branch would spoil the impression.

The reason I excluded the surrogates from my UTF-16 MPE
was really that I needed additional space for the user’s
guide on the reverse side.

Cheers,
  Otto Stolz
