Here are some things I think.
If you really aren't processing anything but the ASCII characters within
your strings, like "<" and ">" in your example, you can probably get
away with keeping your existing byte-oriented code. At least you won't
get false matches on the ASCII characters (this was a primary design
goal of UTF-8).
Yes, and in fact, UTF-8 doesn't generate any false matches when searching for a valid UTF-8 string within another valid UTF-8 string.
In fact, if there is UTF8 between the < and >, the processing works just fine.
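A quick sketch of that no-false-match property (the strings and delimiters here are assumed examples, not from the original posts): a byte-oriented search on UTF-8 can never match in the middle of a multi-byte character, so plain byte functions work both for the ASCII delimiters and for non-ASCII needles.

```python
# Byte-oriented processing of UTF-8, no decoding needed.
text = "tag <héllo wörld> end".encode("utf-8")

# Searching for the ASCII delimiters works byte-by-byte, because ASCII
# bytes (0x00-0x7F) never occur inside a multi-byte UTF-8 sequence:
start = text.find(b"<")
end = text.find(b">")
inner = text[start + 1:end]

# The slice between the delimiters is itself valid UTF-8:
print(inner.decode("utf-8"))

# A non-ASCII needle can't false-match mid-character either, since every
# valid UTF-8 sequence is self-delimiting (lead bytes vs continuation bytes):
needle = "wörld".encode("utf-8")
print(text.find(needle) != -1)
```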
However, if your goal is to simplify processing of arbitrary UTF-8 text,
including non-ASCII characters, I haven't found a better way than to
read in the UTF-8, convert it on the fly to UTF-32, and THEN do your
processing on the fixed-width UTF-32. That way you don't have to do one
thing for Basic Latin characters and something else for the rest.
Well, I can do most processing just fine, as I said. I only have a problem with lexical string processing (A = �) or spell checking. And in fact, lexical string processing is already so complex that it probably won't make much difference whether I use UTF-32 or UTF-8, because of combining characters and the like.
You will probably hear from some very prominent Unicode people that
converting to UTF-16 is better, because "most" characters are in the
BMP, for which UTF-16 uses half as much memory. But this approach
doesn't really solve the variable-width problem -- it merely moves it,
from "ASCII vs. non-ASCII" to "BMP vs. non-BMP." Unless you are keeping
large amounts of text in memory, or are working with a small device such
as a handheld, the extra size of UTF-32 compared to UTF-16 is unlikely
to be a big problem, and you have the advantage of dealing with a
fixed-width representation for the entire Unicode code space.
Unfortunately, I'm more concerned about the speed of converting UTF-8 to UTF-32 and back, because usually I can process my UTF-8 with byte functions.
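For a sense of what that conversion cost looks like, here is a minimal UTF-8 to UTF-32 decoder (a hypothetical helper of my own, not anything from the posts above; it assumes well-formed input and does no validation). Every non-ASCII character costs a branch plus a shift-and-or per continuation byte, which is the overhead being weighed against staying byte-oriented.

```python
def utf8_to_utf32(data: bytes) -> list[int]:
    """Decode well-formed UTF-8 bytes into a list of code points (UTF-32)."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                  # 1-byte sequence (ASCII)
            cp, n = b, 1
        elif b >> 5 == 0b110:         # 2-byte sequence: 110xxxxx
            cp, n = b & 0x1F, 2
        elif b >> 4 == 0b1110:        # 3-byte sequence: 1110xxxx
            cp, n = b & 0x0F, 3
        else:                         # 4-byte sequence: 11110xxx (input assumed valid)
            cp, n = b & 0x07, 4
        # Fold in the 6 payload bits of each continuation byte (10xxxxxx):
        for j in range(1, n):
            cp = (cp << 6) | (data[i + j] & 0x3F)
        out.append(cp)
        i += n
    return out

print(utf8_to_utf32("A€".encode("utf-8")))  # [65, 8364]
```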
All of this assumes that you don't have multi-character processing issues, like combining characters and normalization, or culturally appropriate sorting, in which case your character processing WILL be more complex than ASCII no matter which CES you use.
Yes. Actually, I haven't yet seen any reason not to use byte-oriented functions for UTF-8, for now. Thanks for trying, though!
Maybe someone whose native language isn't English, and who spends a lot of time writing string-processing code, could help me with suggestions for tasks that need character-oriented modes (like lexical processing, a = �, and spell checking)?

