> > If you really aren't processing anything but the ASCII characters
> > within
> > your strings, like "<" and ">" in your example,
> > you can probably get
> > away with keeping your existing byte-oriented code. At least you won't
> > get false matches on the ASCII characters (this was a primary design
> > goal of UTF-8).
>
> Yes, and in fact, UTF8 doesn't generate any false matches when
> searching for a valid UTF8 string within another valid UTF8 string.
However it will generate false misses when searching for a valid UTF-8 string within
an invalid UTF-8 string. In important cases this can lead to severe security issues:
for example, if you were doing the searching to filter disallowed sequences (say
"<script" in an HTML filter or "../" in a URI filter) and the UTF-8 is later converted
by a tolerant converter, then the disallowed sequences can be sneaked past the filter
by sending invalid UTF-8. There have certainly been cases of this in the past (IIS, for
example, could be fooled into accessing files outside of the webroot).
Hence your search function must either include a check for invalid UTF-8 or be used
only in a situation where you know that this won't cause problems (either because
invalid UTF-8 will raise an error elsewhere, or because there are no possible security
problems from such data). In particular, if it is part of a library that might be used
elsewhere, there could be problems, as the user of the library might assume you are
doing more checking than you are and neglect to check him- or herself.
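To make that concrete, here is a rough sketch in C (the file name and exact byte values
are just for illustration) of how an invalid overlong encoding can slip past a
byte-oriented filter: the sequences C0 AE and C0 AF are not valid UTF-8, but a lenient
decoder may well map them back to '.' and '/':
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* "../" written with invalid overlong sequences: '.' as C0 AE, '/' as C0 AF.
       This is not valid UTF-8, but a tolerant decoder may still read it as "../". */
    const char *evil = "\xC0\xAE\xC0\xAE\xC0\xAF" "secret.txt";

    /* A byte-oriented filter finds no literal "../" and lets the string through. */
    if (strstr(evil, "../") == NULL)
        printf("filter saw nothing suspicious\n");
    return 0;
}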
> Unfortunately, I'm more concerned about the speed of converting the
> UTF8 to UTF32, and back. This is because usually, I can process my UTF8
> with byte functions.
This is a "swings and roundabouts" situation. Granted, dealing with a large array or
transmitting a stream of 8-bit units will generally be faster than dealing with a
similarly sized stream of 32-bit units (they will be similarly sized if they mainly
contain ASCII data - and even the worst-case scenario for UTF-8 won't be larger than the
equivalent UTF-32 for valid Unicode characters). At the same time, though, dealing with
a single 32-bit unit is generally faster than dealing with a single 8-bit unit on most
modern machines; the 8-bit unit will generally be converted to and from 32-bit or
larger units anyway - so if you have an average of 1.2 octets per character in UTF-8
(say the text is mainly ASCII) you are really dealing with 1.2 times as many
32-bit units as if you used UTF-32. If you are coming closer to an average of 4 octets
per character in UTF-8 then you are quadrupling the number of 32-bit units to process,
as well as adding possible conversion overhead.
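To give an idea of what the conversion involves, here is a rough sketch of the decode
step for one character, UTF-8 to a 32-bit code point (the function name is invented,
and it assumes the input is already valid UTF-8 - a real converter would also have to
reject truncated sequences, overlong forms and surrogates):
#include <stdint.h>

static const char *decode_one(const char *s, uint32_t *cp)
{
    unsigned char c = (unsigned char)*s++;
    int extra;

    if (c < 0x80)      { *cp = c;        extra = 0; }  /* ASCII, one octet     */
    else if (c < 0xE0) { *cp = c & 0x1F; extra = 1; }  /* two-octet sequence   */
    else if (c < 0xF0) { *cp = c & 0x0F; extra = 2; }  /* three-octet sequence */
    else               { *cp = c & 0x07; extra = 3; }  /* four-octet sequence  */

    while (extra-- > 0)  /* fold in the 10xxxxxx continuation octets */
        *cp = (*cp << 6) | ((unsigned char)*s++ & 0x3F);
    return s;
}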
The effects of this on processing efficiency are going to depend on just what you are
doing with the characters, and what optimisations can be applied (whether by the
programmer or the compiler). For some operations UTF-8 can be considerably less
efficient than UTF-32.
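Random access is one such operation: in UTF-32 the n-th character is a constant-time
array index, while in UTF-8 it needs a linear scan because character boundaries are not
at fixed offsets. A rough sketch (name invented, valid input assumed):
#include <stddef.h>

const char *utf8_index(const char *s, size_t n)
{
    while (*s) {
        if ((*s & 0xC0) != 0x80 && n-- == 0) /* count lead bytes only */
            return s;
        ++s;
    }
    return NULL; /* fewer than n+1 characters */
}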
It also depends on how much UTF-8 "hides" the properties you are dealing with.
On the one hand, the character-based strlen mentioned in this thread is easy to write
for UTF-8:
size_t charlen(const char* str){//assumes valid UTF-8
    size_t ret = 0;
    while (*str)
        if ((*str++ & 0xC0) != 0x80)//count every octet except 10xxxxxx continuations
            ++ret;
    return ret;
}
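For comparison, here is a sketch of the UTF-32 equivalent, assuming a zero-terminated
array of 32-bit units (uint32_t from <stdint.h>):
size_t charlen32(const uint32_t* str){//one unit per character, no decoding needed
    size_t ret = 0;
    while (*str++)
        ++ret;
    return ret;
}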
How this compares with the UTF-32 equivalent will vary. Note that the UTF-8 version
still has validity issues. Generally, though, UTF-8 doesn't have many problems with
this. On the other hand, while it is certainly possible to use UTF-8 to do the property
lookup needed for most functionality that treats Unicode as more than just a bunch of
21-bit numbers encoded in various ways, it is easier and more efficient (often
including the memory size of the program) to do much of it with UTF-16 or UTF-32.
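To show what I mean, here is a rough sketch of the kind of two-stage table lookup used
for character properties (the table names and sizes are invented; real tables would be
generated from the Unicode Character Database). With UTF-32 the code point is a direct
index; with UTF-8 you would first have to decode it:
#include <stdint.h>

extern const uint16_t category_index[0x1100];  /* 0x110000 / 256 block numbers */
extern const uint8_t  category_block[][256];   /* per-block property values    */

uint8_t general_category(uint32_t cp)
{
    return category_block[category_index[cp >> 8]][cp & 0xFF];
}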