On 18 Sep 2013, at 04:57, Stephan Stiller <[email protected]> wrote:

> In what way does UTF-16 "use" surrogate code points? An encoding form is a 
> mapping. Let's look at this mapping:
>       • One inputs scalar values (not surrogate code points).
>       • The encoding form will output a short sequence of encoding 
> form–specific code units. (Various voices on this list have stated that these 
> should never be called code points.)
>       • The algorithm mapping from input to output doesn't make use of 
> surrogate code points. (Even though the Glossary states, under "Surrogate 
> Code Point", that they are "for use by UTF-16".) The only "use" is indirect, 
> through awareness of the position and size of the range of code points that 
> are not scalar values (the surrogates).

What you observe is indeed a flaw in the construction of UTF-16. As you note, 
the cleaner approach is to define character numbers (scalar values) first, and 
then, separately, a way to translate them into a binary format. This is how 
the original UTF-8, the version developed for UNIX, worked. The current 
construction is legacy, so there is not much to be done about it. Use UTF-8 or 
UTF-32 if you can.
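
To make this concrete, here is a minimal sketch of the UTF-16 encoding form 
as a pure function from scalar values to code units (Python; the function 
name is mine, not from the standard). The surrogate range enters only through 
the constants on the output side:

    # A minimal sketch of the UTF-16 encoding form: a mapping from Unicode
    # scalar values to 16-bit code units.  Surrogate-range values occur
    # only on the output side, as code units; they are never valid input.
    def utf16_encode(scalar):
        if not (0 <= scalar <= 0x10FFFF) or 0xD800 <= scalar <= 0xDFFF:
            raise ValueError("not a Unicode scalar value")
        if scalar <= 0xFFFF:            # BMP: a single code unit
            return [scalar]
        u = scalar - 0x10000            # 20 bits, split 10/10
        return [0xD800 | (u >> 10),     # lead (high) surrogate code unit
                0xDC00 | (u & 0x3FF)]   # trail (low) surrogate code unit

    # Example: utf16_encode(0x1D11E) == [0xD834, 0xDD1E]
    #          (U+1D11E MUSICAL SYMBOL G CLEF)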

Hans