Re: How does Python Unicode treat surrogates?

Rick McGowan Mon, 25 Jun 2001 15:28:13 -0700

Marc-Andre Lemburg wrote:

> Do you have references which we could look at
> to determine which of these boundary kinds would actually be
> useful in daily programming ?

Two things are extremely useful in daily programming.  One is to get a
"character", whether it's encoded as a surrogate pair or not; the other is
to get a base character together with all of its associated combining marks.

It's useful to find the range covered by a "character" at some given  
index.  That allows the programmer to easily write an increment loop:

        while (index i is valid) {
            c = next_char_at_Index [i] of string s;
            i += lengthOfChar_at_Index [i] of string s;
            // do something with c...
        }

or similar...
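
For concreteness, here is a minimal Python sketch of that loop.  It assumes
the string is held as a list of UTF-16 code units (plain integers); the
names char_length_at, char_at and iter_chars are hypothetical helpers made
up for illustration, not existing Python APIs.

```python
def char_length_at(units, i):
    """Number of UTF-16 code units used by the character starting at index i."""
    is_high = 0xD800 <= units[i] <= 0xDBFF
    has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
    # Only a high surrogate immediately followed by a low surrogate forms a pair.
    return 2 if is_high and has_low else 1


def char_at(units, i):
    """Code point of the character starting at index i."""
    if char_length_at(units, i) == 2:
        # Combine the surrogate pair into one supplementary code point.
        return 0x10000 + ((units[i] - 0xD800) << 10) + (units[i + 1] - 0xDC00)
    # BMP character, or an unpaired surrogate returned as-is.
    return units[i]


def iter_chars(units):
    """Yield each character (as a code point) in a sequence of code units."""
    i = 0
    while i < len(units):
        yield char_at(units, i)
        i += char_length_at(units, i)
```

An unpaired surrogate is simply passed through as a single unit here; a
real implementation might prefer to flag it as an error.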

In a similar vein, it's useful to find the "range" covered by the combining
character sequence, or "locale-independent grapheme", at a given index.
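
A rough sketch of that range lookup in Python follows.  It assumes a string
indexed by code point and treats any character whose general category
starts with "M" as a combining mark; combining_sequence_end and
combining_sequences are hypothetical names for illustration, and the
simplification ignores other chunking rules (Hangul jamo, for example).

```python
import unicodedata


def combining_sequence_end(text, i):
    """Index just past the character at i and any combining marks that follow it."""
    j = i + 1
    # General categories Mn, Mc and Me are the combining marks.
    while j < len(text) and unicodedata.category(text[j]).startswith("M"):
        j += 1
    return j


def combining_sequences(text):
    """Yield each base character together with its trailing combining marks."""
    i = 0
    while i < len(text):
        j = combining_sequence_end(text, i)
        yield text[i:j]
        i = j
```

With this, the range covered at index i is simply the half-open interval
from i to combining_sequence_end(text, i).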

Please also see the FAQ pages on combining marks and Unicode Technical
Report #18, section 3.3:

http://www.unicode.org/unicode/reports/tr18/#Locale-Independent Graphemes

Work is now under way on a more precise definition of such useful chunking
units.

I would also take a look at the specifications for NSString and NSText in
Apple's Cocoa environment.  Python already has some of these operations
built in, of course.

        Rick
