Re: Non-ascii string processing?

jon Mon, 06 Oct 2003 07:57:29 -0700

> > a word like "élite" is always counted as five characters, regardless
> > that it might be encoded as six Unicode "characters".
> 
> I assume that everybody on this list knows that you count characters
> only after a proper normalization... (like many operations on Unicode
> texts).


A word like "élite" will be counted as either five or six things, depending on just
what the things are in a given context. Whether you call those things "characters" or
not is another matter.

Normalisation might result in that string being five or six Unicode characters in
length, depending on the normalisation form used (five in NFC, six in NFD). Even though
NFC would mean that characters and grapheme clusters coincide in this case, that does
not apply to all uses of combining characters, so counting the characters of
NFC-normalised text is not a reliable way to count what a reader would perceive as
characters.
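
A minimal sketch of the point above, in Python and assuming only the standard
unicodedata module. The helper names are hypothetical, and approx_clusters is only a
rough stand-in for real grapheme-cluster segmentation (it just skips code points with a
non-zero combining class), not an implementation of the Unicode segmentation rules.

```python
import unicodedata


def code_points(text, form):
    """Number of Unicode code points in text after normalising to the given form."""
    return len(unicodedata.normalize(form, text))


def approx_clusters(text):
    """Rough, assumed approximation of a grapheme-cluster count:
    count only code points whose canonical combining class is 0."""
    nfc = unicodedata.normalize("NFC", text)
    return sum(1 for ch in nfc if unicodedata.combining(ch) == 0)


word = "\u00e9lite"                   # "élite" with a precomposed e-acute
nfc_len = code_points(word, "NFC")    # e-acute stays one code point
nfd_len = code_points(word, "NFD")    # e-acute becomes "e" + U+0301

# "q" followed by COMBINING TILDE has no precomposed form, so NFC cannot
# merge the pair and the code point count stays above the cluster count.
q_tilde = "q\u0303"
q_nfc_len = code_points(q_tilde, "NFC")
q_clusters = approx_clusters(q_tilde)
```

Here nfc_len and nfd_len differ for the same word, and q_nfc_len exceeds q_clusters,
which is the case where an NFC character count stops matching what a reader sees.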

However, a byte count is probably of even less use to an end user (except in so
far as disk space and download times go, and then a rough estimate would serve their
purposes). Both byte counts and Unicode-character counts have uses within the
implementation of higher-level functionality, and as such both are required.
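
To show how the two implementation-level counts diverge, here is a small Python sketch.
The sizes helper is hypothetical, and the choice of UTF-8 and UTF-16 (little-endian, no
byte-order mark) is an assumption made purely for illustration.

```python
import unicodedata


def sizes(text):
    """Code point count and encoded byte counts for one string."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        # "utf-16-le" is used so that no byte-order mark is added to the count
        "utf16_bytes": len(text.encode("utf-16-le")),
    }


word_nfc = unicodedata.normalize("NFC", "\u00e9lite")
word_nfd = unicodedata.normalize("NFD", word_nfc)

nfc_sizes = sizes(word_nfc)
nfd_sizes = sizes(word_nfd)
```

The same five-letter word gives a different figure for each combination of
normalisation form and encoding, which is why neither count on its own is something to
show an end user.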

> > 3) That is a very silly count anyway. If you want to have an idea of the
> > "size" of a document, lines or words are much more useful units.

To estimate the column-inches that will be used, characters are much more useful than
words, and far more useful than lines (which will vary according to column width, font,
justification algorithm, etc.).
