"Theodore H. Smith" <[EMAIL PROTECTED]> writes: >>> It's because code points have variable lengths in bytes, so >>> extracting individual characters is almost meaningless > > Same with UTF-16 and UTF-32. A character is multiple code-points, > remember? (decomposed chars?)
> Nope. I've done tons of UTF-8 string processing. I've even done a case
> insensitive word-frequency measuring algorithm on UTF-8. It runs
> blastingly fast, because I can do the processing with bytes.

Ah, so first you say that "a character" means "a base code point plus a
number of combining code points", and then you admit that your program
actually processes strings in terms of even lower-level units: bytes of
the UTF-8 encoding? Why don't you treat a string as a sequence of "base
code point with combining code points" items? Answer: because this
grouping is often irrelevant, as in your example of word statistics.
Grouping into code points is more important: Unicode algorithms are
typically described in terms of code points.

> It just requires you to understand the actual logic of UTF-8 well
> enough to know that you can treat it as bytes, most of the time.

When I implemented the word boundary algorithm from Unicode, I was glad
that I could do it in terms of UTF-32 and ISO-8859-1 instead of UTF-8,
even though I do understand the logic of UTF-8.

> As for isspace... sure there is a UTF-8 non-byte space.

I don't understand. If a string is exposed as a sequence of UTF-8 units,
it makes no sense to ask whether a particular unit is a space, and it
makes no sense to ask this about a whole string either. It would have to
be a function which works in terms of some iterator over strings. Well,
some things do work in terms of positions inside strings, for example
word boundaries. But people are used to thinking about isspace as a
property of a *character*, whatever exactly the language means by this
concept. In my language it means a Unicode code point, which keeps the
concept of a string, as seen by the language, conceptually simple.

> My case insensitive utf-8 word frequency counter (which runs
> blastingly fast) however didn't find this to be any problem. It
> dealt with non-single byte all sorts of word breaks :o)
>
> It appears to run at about 3MB/second on my laptop, which involves
> for every word, doing a word check on the entire previous collection
> of words.

I happen to have written a case-insensitive word frequency counter as an
example in my language, to test some Unicode algorithms. It uses the word
boundary algorithm to delimit words; a segment between boundaries must
include a character of class L* or N* in order to be counted as a word.
It maintains subcounts of the case-sensitive forms of each
case-insensitive word (implemented as a hash table of hash tables of
integers). It converts input using iconv(), i.e. from an arbitrary locale
encoding supported by the system. It was not written with speed in mind.
It has 24 lines, 10 of which format the output (statistics about the 20
most common words).

http://cvs.sourceforge.net/viewcvs.py/kokogut/kokogut/tests/WordStat.ko?view=markup

It's written in a dynamically typed language, with dynamic dispatches and
higher-order functions everywhere, where all values except small integers
are pointers, and strings are immutable. Each line is divided into words
separately; a subsequence of spaces is materialized as a string object
before the program checks that there are neither letters nor numbers in
it and thus it's not a word. It processed 4.8MB in 3.2s on my machine
(Athlon 2000, 1.25GHz), which I think is good enough under these
conditions. The input happens to be ASCII (a mailbox), but the program
didn't know beforehand that it was ASCII.

-- 
   __("<         Marcin Kowalczyk
   \__/       [EMAIL PROTECTED]
    ^^     http://qrnik.knm.org.pl/~qrczak/
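
As an illustration of the byte-level approach discussed above, here is a
minimal sketch in Python (an assumption for illustration, not code from
either poster). It relies on the fact that every byte of a multi-byte
UTF-8 sequence has its high bit set, so ASCII delimiters such as space or
newline can never occur inside an encoded non-ASCII character:

    from collections import Counter

    def word_counts_utf8(data: bytes) -> Counter:
        """Count words in UTF-8 text by splitting on ASCII whitespace bytes.

        Splitting on bytes never slices a multi-byte sequence apart, but it
        recognizes only ASCII whitespace and ASCII case folding; Unicode
        spaces and non-ASCII case mappings are ignored, which is exactly
        the limitation raised above.
        """
        counts = Counter()
        for word in data.split():      # splits on ASCII whitespace bytes only
            counts[word.lower()] += 1  # bytes.lower() folds ASCII letters only
        return counts

    text = "Zażółć gęślą jaźń zażółć GĘŚLĄ".encode("utf-8")
    for word, n in word_counts_utf8(text).most_common():
        print(word.decode("utf-8"), n)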
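
And a rough Python analogue (again an assumption, not a translation of
WordStat.ko) of the structure described above: a hash table keyed by the
case-insensitive form of a word, whose values are hash tables counting
each case-sensitive spelling, working on decoded code points. A "word"
here is any maximal run of letters or numbers, a simplification of the
rule that a segment between word boundaries must contain a character of
class L* or N*:

    import sys
    import unicodedata
    from collections import defaultdict

    def is_word_char(ch):
        # True for code points of general category L* (letters) or N* (numbers).
        return unicodedata.category(ch)[0] in ("L", "N")

    def word_stats(lines):
        # folded word -> case-sensitive form -> count
        stats = defaultdict(lambda: defaultdict(int))
        for line in lines:
            word = []
            for ch in line + "\n":     # trailing sentinel flushes the last word
                if is_word_char(ch):
                    word.append(ch)
                elif word:
                    form = "".join(word)
                    stats[form.casefold()][form] += 1
                    word = []
        return stats

    stats = word_stats(sys.stdin)
    top = sorted(stats.items(), key=lambda kv: -sum(kv[1].values()))[:20]
    for folded, forms in top:
        print(sum(forms.values()), folded, dict(forms))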

