Peter Kirk <peterkirk at qaya dot org> wrote:

>> The "wcslen" has nothing whatsoever to do with the Unicode standard,
>> but it has everything to do with the *C* standard. And, according to
>> the C standard, "wcslen" must simply count the number of "wchar_t"
>> array elements from the location pointed to by its argument up to the
>> first "wchar_t" element whose value is L'\0'. Full stop.
>
> OK, as a C function handling wchar_t arrays it is not expected to
> conform to Unicode. But if it is presented as a function available to
> users for handling Unicode text, for determining how many characters
> (as defined by Unicode) are in a string, it should conform to Unicode,
> including C9.
wcslen() is very definitely presented as a function for counting
_code_units_. You can't even rely on it to count Unicode characters
accurately if wchar_t is 16 bits wide, because each supplementary
character requires two code units (a high surrogate plus a low
surrogate). Programmers rely on primitive functions like wcslen() to do
what they do very rapidly, and not to change their meaning in new
versions of the language standard.

It would be very handy to have a suite of C functions that normalize
their input string to any of NFC, NFD, NFKC, or NFKD, or that compare
strings or measure their length taking normalization into account, but
those would have to be all-new functions.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/

