In a message dated 2001-02-16 7:56:12 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

>  It's clearer, but misses what I understand to be the absolutely crucial
>  distinction between a code point (correctly defined) and a code unit
>  (mentioned by Mark but not by Doug). For what a code unit is, see
>  http://www.unicode.org/unicode/reports/tr17

I didn't mention code units because, embarrassingly, I am still having a hard 
time telling the difference between code points and code units.  I have read 
UTR #17 many times and am still somewhat confused.  I'll try again.

>  I would question whether 'surrogate code points' are really code points. In
>  the sense that they are a subset of 'code points' as defined, I guess they
>  are; but they are not only unlike every other code point in that they "do
>  not directly represent characters", they are explicitly and inexorably
>  disqualified from so doing, being reserved for use, in pairs, as UTF-16
>  code units. (Which is what Mark said, of course.)

I think they would still be code points, just like 0xFFFE and 0xFFFF (and now 
others), which are guaranteed never to be characters, for a different reason.
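A small sketch (my own illustration, not from anyone's message) of the 
distinction being discussed: a code point is an integer in the Unicode 
codespace, while a code unit is a fixed-width chunk of a particular encoding 
form such as UTF-16. The surrogate values are code points that serve only as 
UTF-16 code units, in pairs:

```python
# Code point U+1D11E (MUSICAL SYMBOL G CLEF) lies above the BMP, so in
# UTF-16 it is represented by TWO code units, a surrogate pair.
cp = 0x1D11E

# Standard UTF-16 surrogate-pair computation:
v = cp - 0x10000
high = 0xD800 + (v >> 10)    # high (lead) surrogate
low = 0xDC00 + (v & 0x3FF)   # low (trail) surrogate
print(hex(high), hex(low))   # -> 0xd834 0xdd1e

# The surrogate values 0xD800..0xDFFF are still code points (they lie in
# the codespace), but -- like the noncharacters 0xFFFE and 0xFFFF -- they
# are guaranteed never to represent characters on their own.
```

So one code point can correspond to one or two UTF-16 code units, which is 
exactly why the two terms cannot be collapsed into one.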

>  Looked at in this way, surely it makes it clearer that the transcoding of a
>  surrogate (code point) into UTF-8 is an abomination.
>  
>  Simplification is all very well, but it can be taken too far, as when
>  important distinctions are lost.

Yes, that is true.  I might have known better than to respond to a "cut the 
mumbo-jumbo" post.  Einstein said, "Everything should be made as simple as 
possible, but not one bit simpler," and I think that is especially true when 
working with standards and specifications, where precise and unambiguous 
wording is crucial.

-Doug Ewell
 Fullerton, California
