On 18 Sep 2013, at 04:57, Stephan Stiller <[email protected]> wrote:
> In what way does UTF-16 "use" surrogate code points? An encoding form is a
> mapping. Let's look at this mapping:
> • One inputs scalar values (not surrogate code points).
> • The encoding form will output a short sequence of encoding
> form–specific code units. (Various voices on this list have stated that
> these should never be called code points.)
> • The algorithm mapping from input to output doesn't make use of
> surrogate code points. (Even though the Glossary states, under "Surrogate
> Code Point", that they are "for use by UTF-16".) The only "use" is
> indirect, through awareness of the position and size of the range of code
> points that are not scalar values.

What you are observing is indeed a flaw in the construction of UTF-16. As
you mention, the cleaner way is to define character numbers and then,
separately, a way to translate them into a binary format. That is how the
original UTF-8, the UNIX version, worked. The current construction is
legacy, so there is not much to be done about it. Use UTF-8 or UTF-32 if
you can.
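To make the distinction concrete, here is a minimal sketch (in Python, my
own illustration, not the standard's sample code) of UTF-16 as a pure
mapping from scalar values to code units. Note that surrogate code points
are rejected as input; the surrogate range enters only through the bit
patterns of the output code units:

    # UTF-16 encoding form as a mapping from Unicode scalar values to
    # 16-bit code units. Surrogate code points (U+D800..U+DFFF) are not
    # valid inputs; their range shows up only in the output code units.
    def utf16_encode(scalar: int) -> list[int]:
        if 0xD800 <= scalar <= 0xDFFF:
            raise ValueError("surrogate code points are not scalar values")
        if not 0 <= scalar <= 0x10FFFF:
            raise ValueError("outside the Unicode codespace")
        if scalar < 0x10000:
            return [scalar]                  # BMP: a single code unit
        v = scalar - 0x10000                 # 20 bits to distribute
        return [0xD800 | (v >> 10),         # lead surrogate code unit
                0xDC00 | (v & 0x3FF)]       # trail surrogate code unit

    # U+1D11E MUSICAL SYMBOL G CLEF -> ['0xd834', '0xdd1e']
    print([hex(u) for u in utf16_encode(0x1D11E)])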
Hans
