Here are some things I think.
If you really aren't processing anything but the ASCII characters within
your strings, like "<" and ">" in your example, you can probably get
away with keeping your existing byte-oriented code. At least you won't
get false matches on the ASCII characters (this was a primary design
goal of UTF-8).
Yes, and in fact, UTF-8 doesn't generate any false matches when searching for a valid UTF-8 string within another valid UTF-8 string.
In fact, if there is UTF8 between the < and >, the processing works just fine.
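A quick sketch of that no-false-match property (the strings and delimiters here are assumed examples, not from the original posts): a byte-oriented search on UTF-8 can never match in the middle of a multi-byte character, so plain byte functions work both for the ASCII delimiters and for non-ASCII needles.

```python
# Byte-oriented processing of UTF-8, no decoding needed.
text = "tag <héllo wörld> end".encode("utf-8")

# Searching for the ASCII delimiters works byte-by-byte, because ASCII
# bytes (0x00-0x7F) never occur inside a multi-byte UTF-8 sequence:
start = text.find(b"<")
end = text.find(b">")
inner = text[start + 1:end]

# The slice between the delimiters is itself valid UTF-8:
print(inner.decode("utf-8"))

# A non-ASCII needle can't false-match mid-character either, since every
# valid UTF-8 sequence is self-delimiting (lead bytes vs continuation bytes):
needle = "wörld".encode("utf-8")
print(text.find(needle) != -1)
```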
However, if your goal is to simplify processing of arbitrary UTF-8 text,
including non-ASCII characters, I haven't found a better way than to
read in the UTF-8, convert it on the fly to UTF-32, and THEN do your
processing on the fixed-width UTF-32. That way you don't have to do one
thing for Basic Latin characters and something else for the rest.
Well, I can do most processing just fine, as I said. I only have a problem with lexical string processing (A = �) or spell checking. And in fact, lexical string processing is already so complex that it probably won't make much difference whether I use UTF-32 or UTF-8, because of combining characters and the like.
You will probably hear from some very prominent Unicode people that
converting to UTF-16 is better, because "most" characters are in the
BMP, for which UTF-16 uses half as much memory. But this approach
doesn't really solve the variable-width problem -- it merely moves it,
from "ASCII vs. non-ASCII" to "BMP vs. non-BMP." Unless you are keeping
large amounts of text in memory, or are working with a small device such
as a handheld, the extra size of UTF-32 compared to UTF-16 is unlikely
to be a big problem, and you have the advantage of dealing with a
fixed-width representation for the entire Unicode code space.
Unfortunately, I'm more concerned about the speed of converting UTF-8 to UTF-32 and back, because usually I can process my UTF-8 with byte functions.
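For a sense of what that conversion cost looks like, here is a minimal UTF-8 to UTF-32 decoder (a hypothetical helper of my own, not anything from the posts above; it assumes well-formed input and does no validation). Every non-ASCII character costs a branch plus a shift-and-or per continuation byte, which is the overhead being weighed against staying byte-oriented.

```python
def utf8_to_utf32(data: bytes) -> list[int]:
    """Decode well-formed UTF-8 bytes into a list of code points (UTF-32)."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                  # 1-byte sequence (ASCII)
            cp, n = b, 1
        elif b >> 5 == 0b110:         # 2-byte sequence: 110xxxxx
            cp, n = b & 0x1F, 2
        elif b >> 4 == 0b1110:        # 3-byte sequence: 1110xxxx
            cp, n = b & 0x0F, 3
        else:                         # 4-byte sequence: 11110xxx (input assumed valid)
            cp, n = b & 0x07, 4
        # Fold in the 6 payload bits of each continuation byte (10xxxxxx):
        for j in range(1, n):
            cp = (cp << 6) | (data[i + j] & 0x3F)
        out.append(cp)
        i += n
    return out

print(utf8_to_utf32("A€".encode("utf-8")))  # [65, 8364]
```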
All of this assumes that you don't have multi-character processing issues, like combining characters and normalization, or culturally appropriate sorting, in which case your character processing WILL be more complex than ASCII no matter which CES you use.
Yes. Actually, I haven't yet seen any reason not to use byte-oriented functions for UTF-8, for now. Thanks for trying, though!
Maybe someone whose native language isn't English, and who spends a lot of time writing string-processing code, could help me with suggestions for tasks that need character-oriented modes (like lexical processing, a = �, and spell checking)?

