On 06/10/2003 03:09, Marco Cimarosti wrote:

Doug Ewell wrote:


Depends on what "processing" you are talking about. Just to cite the
most obvious case, passing a non-ASCII, UTF-8 string to byte-oriented
strlen() will fail dramatically.



Why? The purpose of strlen() is to count the number of *bytes* needed to store a string, and this works just as well for UTF-8 as it does for SBCSs or DBCSs.

What strlen() cannot do is count the number of *characters* in a string.
But who cares? I can imagine very few situations where such
information would be useful.
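
To make the byte/character distinction concrete, here is a minimal sketch in C. The function name utf8_count_code_points is hypothetical (it is not a standard library call), and the sketch assumes the input is valid UTF-8. It counts code points, which are not necessarily the same as user-perceived characters once combining marks are involved.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count the code points in a NUL-terminated string.
 * Assumes the string is valid UTF-8. Every byte that is NOT a
 * continuation byte (bit pattern 10xxxxxx) starts a new code point. */
size_t utf8_count_code_points(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For the string "na\xC3\xAFve" (the word "naive" with i-diaeresis encoded as the two bytes C3 AF), strlen() returns 6 while utf8_count_code_points() returns 5.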

_ Marco




This depends on what kind of operations you want to perform on the text. Of course, if you are concerned only with storage and transmission of the text, you don't need to count characters rather than bytes, except that, as you mention in another posting, you may need to avoid splitting strings in the middle of characters (and there is actually a very simple algorithm to avoid that: never split before a byte of the form 10xxxxxx). But if you want to render the text, the rendering system needs to split the text into characters at some point. And if you want to do the kinds of processing on the text that I as a linguist am interested in, you absolutely need to work with characters rather than bytes, and it can be very important to know the number of characters in a string, although this number may be confused by normalisation issues.
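
The splitting rule mentioned above (never split before a byte of the form 10xxxxxx) could be sketched in C roughly as follows. The function name utf8_safe_split and its parameters are hypothetical, chosen only for illustration, and the sketch assumes the buffer holds valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return the largest offset <= limit at which a
 * buffer of len bytes can be split without cutting a UTF-8 sequence
 * in half. Assumes s holds valid UTF-8. */
size_t utf8_safe_split(const char *s, size_t len, size_t limit)
{
    size_t pos;

    assert(s != NULL);
    if (limit >= len)
        return len;          /* nothing to split: the whole buffer fits */

    pos = limit;
    /* s[pos] would become the first byte of the second piece. If it is
     * a continuation byte (10xxxxxx), back up to the lead byte. */
    while (pos > 0 && ((unsigned char)s[pos] & 0xC0) == 0x80)
        --pos;
    return pos;
}
```

For the four bytes of "a\xC3\xA9z", a requested limit of 2 falls between C3 and A9, so the function backs up and returns 1; limits of 1 and 3 are already on character boundaries and are returned unchanged.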

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




