On 9/19/2013 6:32 AM, Hans Aberg wrote:
On 18 Sep 2013, at 04:57, Stephan Stiller <[email protected]> wrote:

In what way does UTF-16 "use" surrogate code points? An encoding form is a 
mapping. Let's look at this mapping:
        • One inputs scalar values (not surrogate code points).
        • The encoding form outputs a short sequence of encoding
form–specific code units. (Various voices on this list have stated that these
should never be called code points.)
        • The algorithm mapping input to output doesn't make use of surrogate
code points. (Even though the Glossary states, under "Surrogate Code Point",
that they are "for use by UTF-16".) The only "use" is indirect, through
awareness of the positioning and size of the range of non-scalar-value code
points, as the sketch below illustrates.
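
For concreteness, a rough sketch of that mapping in Python (the function name
is mine, purely illustrative). The point to notice is that surrogate values
appear only in the output code units; they are never consumed as input:

# The UTF-16 encoding form: a Unicode scalar value -- any code point
# outside the surrogate range U+D800..U+DFFF -- maps to one or two
# 16-bit code units. The surrogate range enters only as an offset in
# the arithmetic below.

def utf16_encode(scalar: int) -> list[int]:
    """Map one Unicode scalar value to its UTF-16 code unit sequence."""
    if not (0 <= scalar <= 0x10FFFF) or 0xD800 <= scalar <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if scalar <= 0xFFFF:
        return [scalar]                        # BMP: a single code unit
    v = scalar - 0x10000                       # 20 bits to distribute
    return [0xD800 + (v >> 10),                # high (lead) surrogate unit
            0xDC00 + (v & 0x3FF)]              # low (trail) surrogate unit

assert utf16_encode(0x1F600) == [0xD83D, 0xDE00]   # U+1F600 GRINNING FACE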
What you observe is in fact a mistake in the construction of UTF-16. As you
mention, the correct way is to define character numbers, plus a way to
translate them into binary format. This is how the original UTF-8, the UNIX
version, worked. The current construction is legacy, so there is not much to
do about it. Use UTF-8 or UTF-32 if you can.
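
As a point of comparison, here is the same kind of direct number-to-bytes
translation for UTF-8, again only a sketch (function name illustrative;
input validation omitted for brevity, and the modern <= U+10FFFF range is
assumed):

# The character number itself drives the byte layout, with no
# special-case ranges in the algorithm.

def utf8_encode(scalar: int) -> bytes:
    """Map one Unicode scalar value directly to its UTF-8 byte sequence."""
    if scalar < 0x80:                          # 1 byte: ASCII
        return bytes([scalar])
    if scalar < 0x800:                         # 2 bytes
        return bytes([0xC0 | (scalar >> 6),
                      0x80 | (scalar & 0x3F)])
    if scalar < 0x10000:                       # 3 bytes
        return bytes([0xE0 | (scalar >> 12),
                      0x80 | ((scalar >> 6) & 0x3F),
                      0x80 | (scalar & 0x3F)])
    return bytes([0xF0 | (scalar >> 18),       # 4 bytes
                  0x80 | ((scalar >> 12) & 0x3F),
                  0x80 | ((scalar >> 6) & 0x3F),
                  0x80 | (scalar & 0x3F)])

assert utf8_encode(0x20AC) == b'\xe2\x82\xac'  # U+20AC EURO SIGN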


The legacy issue was the existence of UCS-2 in parallel with UTF-16.

A./
