Re: "A Programmer's Introduction to Unicode"

Alastair Houghton Tue, 14 Mar 2017 01:56:07 -0700

On 14 Mar 2017, at 02:03, Richard Wordingham <[email protected]> 
wrote:
> 
> On Mon, 13 Mar 2017 19:18:00 +0000
> Alastair Houghton <[email protected]> wrote:
> 
>> IMO, returning code points by index is a mistake.  It over-emphasises
>> the importance of the code point, which helps to continue the notion
>> in some developers’ minds that code points are somehow “characters”.
>> It also leads to people unnecessarily using UCS-4 as an internal
>> representation, which seems to have very few advantages in practice
>> over UTF-16.
> 
> The problem is that UTF-16 based code can very easily overlook the
> handling of surrogate pairs, and one can very easily get confused over
> what string lengths mean.


Yet the same problem exists for UCS-4; code based on it can just as easily
overlook the handling of combining characters.  As for string lengths, lengths
in code points are no more meaningful than lengths in UTF-16 code units.
They don’t tell you anything about the number of user-visible characters; or
anything about the width the string will take up if rendered on the display
(even in a fixed-width font); or anything about the number of glyphs that a
given string might be transformed into by glyph mapping.  The *only* thing the
length of a Unicode string will tell you is the number of code units.
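
To make that concrete, here is a rough Python sketch comparing the different
"lengths" of a few strings.  The helper names (utf16_units,
count_non_combining) are made up for illustration, and counting non-combining
code points is only a crude stand-in for real grapheme cluster segmentation
(UAX #29), which the Python standard library does not provide.

```python
import unicodedata


def utf16_units(s: str) -> int:
    """Number of UTF-16 code units needed to encode s (hypothetical helper)."""
    return len(s.encode("utf-16-le")) // 2


def count_non_combining(s: str) -> int:
    """Crude approximation of user-visible characters: count code points
    whose canonical combining class is 0.  This is NOT full UAX #29
    grapheme segmentation; it only illustrates the point."""
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


samples = {
    "e + combining acute accent": "e\u0301",
    "musical symbol G clef (U+1D11E)": "\U0001D11E",
    "plain ASCII": "abc",
}

for label, s in samples.items():
    # len(s) in Python 3 counts code points, i.e. UCS-4 code units.
    print(f"{label}: code points={len(s)}, "
          f"UTF-16 units={utf16_units(s)}, "
          f"approx. visible={count_non_combining(s)}")
```

The first sample has two code points (and two UTF-16 code units) but displays
as one character; the second has one code point but two UTF-16 code units.
Neither length matches what the user sees in both cases.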

Kind regards,

Alastair.

--
http://alastairs-place.net
