Re: Non-ascii string processing?

Peter Kirk Mon, 06 Oct 2003 06:10:19 -0700

On 06/10/2003 03:09, Marco Cimarosti wrote:

Doug Ewell wrote:

Depends on what "processing" you are talking about. Just to cite the most obvious case, passing a non-ASCII, UTF-8 string to byte-oriented strlen() will fail dramatically.
Why? The purpose of strlen() is counting the number of *bytes* needed to
store a certain string, and this works just as well for UTF-8 as it does for
SBCS's or DBCS's.
What strlen() cannot do is counting the number of *characters* in a string.
But who cares? I can imagine very few situations where such
information would be useful.
_ Marco

This depends on what kind of operations you want to perform on the text. Of course, if you are concerned only with storage and transmission of the text, you don't need to count characters rather than bytes, except that, as you mention in another posting, you may need to avoid splitting strings in the middle of characters. There is actually a very simple algorithm to avoid that: never split immediately before a byte of the form 10xxxxxx (a continuation byte). But if you want to render the text, the rendering system needs to split the text into characters at some point. And if you want to do the kinds of processing which I as a linguist am interested in, you absolutely need to work with characters rather than bytes, and it can be very important to know the number of characters in a string, although this number may get confused by normalisation issues.
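
To make the two points concrete, here is a minimal sketch in C of the splitting rule and of a character count, assuming the input is valid, NUL-terminated UTF-8. The function names (utf8_is_continuation, utf8_split_point, utf8_count_code_points) are made up for illustration and are not from any library. Note that the count is of code points, so a precomposed and a decomposed form of the same letter will give different results, which is the normalisation issue mentioned above.

```c
#include <stddef.h>
#include <string.h>

/* Nonzero if b is a UTF-8 continuation byte, i.e. has the form 10xxxxxx. */
int utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/*
 * Return the largest offset <= limit at which s (len bytes, assumed to be
 * valid UTF-8) can be split without cutting a character in half.
 * The rule: never split immediately before a continuation byte.
 */
size_t utf8_split_point(const char *s, size_t len, size_t limit)
{
    size_t i = limit < len ? limit : len;

    /* Back up while the byte that would start the second piece is 10xxxxxx. */
    while (i > 0 && i < len && utf8_is_continuation((unsigned char)s[i]))
        i--;
    return i;
}

/*
 * Count code points in a NUL-terminated UTF-8 string (assumed valid).
 * Every byte that is not a continuation byte starts a new code point.
 */
size_t utf8_count_code_points(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++)
        if (!utf8_is_continuation((unsigned char)*s))
            n++;
    return n;
}
```

For example, for the string "na\xC3\xAFve" (the word "naïve" with a precomposed ï), strlen() reports 6 bytes while utf8_count_code_points() reports 5, and asking for a split at byte offset 3 moves the split back to offset 2 so that the two bytes of ï stay together.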

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/
