On 14 Mar 2017, at 02:03, Richard Wordingham <[email protected]> wrote:
>
> On Mon, 13 Mar 2017 19:18:00 +0000
> Alastair Houghton <[email protected]> wrote:
>
>> IMO, returning code points by index is a mistake. It over-emphasises
>> the importance of the code point, which helps to continue the notion
>> in some developers’ minds that code points are somehow “characters”.
>> It also leads to people unnecessarily using UCS-4 as an internal
>> representation, which seems to have very few advantages in practice
>> over UTF-16.
>
> The problem is that UTF-16 based code can very easily overlook the
> handling of surrogate pairs, and one very easily get confused over what
> string lengths mean.

Yet the same problem exists for UCS-4; UCS-4 based code can just as easily overlook the handling of combining characters.

As for string lengths, string lengths in code points are no more meaningful than string lengths in UTF-16 code units. They don’t tell you anything about the number of user-visible characters; or anything about the width the string will take up if rendered on the display (even in a fixed-width font); or anything about the number of glyphs that a given string might be transformed into by glyph mapping. The *only* thing the length of a Unicode string will tell you is the number of code units.

Kind regards,

Alastair.

--
http://alastairs-place.net
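
To make the point about string lengths concrete, here is a minimal Python sketch. The helper names are made up for illustration, and the "user-visible character" count is only a crude approximation that skips combining marks; it is not proper grapheme cluster segmentation, which needs the rules in UAX #29.

```python
import unicodedata


def utf16_code_units(s: str) -> int:
    # Hypothetical helper: each UTF-16 code unit is 2 bytes, so encode
    # without a BOM and halve the byte count.
    return len(s.encode("utf-16-le")) // 2


def code_points(s: str) -> int:
    # Python's len() on a str counts code points.
    return len(s)


def approx_user_characters(s: str) -> int:
    # Crude approximation (an assumption of this sketch): count only code
    # points whose canonical combining class is 0, so combining marks are
    # folded into the preceding base character.
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


# "e" + COMBINING ACUTE ACCENT (U+0301) + GRINNING FACE (U+1F600),
# which a reader would see as two characters.
sample = "e\u0301\U0001F600"

print(utf16_code_units(sample), code_points(sample), approx_user_characters(sample))
# prints: 4 3 2
```

Neither the UTF-16 length (4) nor the code point length (3) matches what a reader would count (2), and none of the three numbers says anything about rendered width or glyph count.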

