> > If you really aren't processing anything but the ASCII characters
> > within
> > your strings, like "<" and ">" in your example,
> > you can probably get
> > away with keeping your existing byte-oriented code. At least you won't
> > get false matches on the ASCII characters (this was a primary design
> > goal of UTF-8).
>
> Yes, and in fact, UTF8 doesn't generate any false matches when
> searching for a valid UTF8 string within another valid UTF8 string.
However it will generate false misses when searching for a valid UTF-8 string within
an invalid UTF-8 string. In important cases this can lead to severe security issues:
for example, if you were doing the searching to filter disallowed sequences (say
"<script" in an HTML filter or "../" in a URI filter) and the UTF-8 is later converted
by a tolerant converter, then the disallowed sequences can be sneaked past the filter
by sending invalid UTF-8. There have certainly been cases of this in the past (IIS, for
example, could be fooled into accessing files outside of the webroot).
Hence your search function must either include a check for invalid UTF-8 or be used
only in a situation where you know that this won't cause problems (either because
invalid UTF-8 will raise an error elsewhere, or because there are no possible security
problems from such data). In particular, if it is part of a library that might be used
elsewhere, there could be problems, as the user of the library might assume you are
doing more checking than you are and neglect to check him- or herself.
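To make that concrete, here is a rough sketch in C (the file name and exact byte values
are just for illustration) of how an invalid overlong encoding can slip past a
byte-oriented filter: the sequences C0 AE and C0 AF are not valid UTF-8, but a lenient
decoder may well map them back to '.' and '/':
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* "../" written with invalid overlong sequences: '.' as C0 AE, '/' as C0 AF.
       This is not valid UTF-8, but a tolerant decoder may still read it as "../". */
    const char *evil = "\xC0\xAE\xC0\xAE\xC0\xAF" "secret.txt";

    /* A byte-oriented filter finds no literal "../" and lets the string through. */
    if (strstr(evil, "../") == NULL)
        printf("filter saw nothing suspicious\n");
    return 0;
}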
> Unfortunately, I'm more concerned about the speed of converting the
> UTF8 to UTF32, and back. This is because usually, I can process my UTF8
> with byte functions.
This is a "swings and roundabouts" situation. Granted, dealing with a large array or
transmitting a stream of 8-bit units will generally be faster than dealing with a
similarly sized stream of 32-bit units (they will be similarly sized if they mainly
contain ASCII data - and even the worst-case scenario for UTF-8 won't be larger than the
equivalent UTF-32 for valid Unicode characters). At the same time, though, dealing with
a single 32-bit unit is generally faster than dealing with a single 8-bit unit on most
modern machines; the 8-bit unit will generally be converted to and from 32-bit or
larger units anyway - so if you have an average of 1.2 octets per character in UTF-8
(say the text is mainly ASCII) you are really dealing with 1.2 times as many
32-bit units as if you used UTF-32. If you are coming closer to an average of 4 octets
per character in UTF-8 then you are quadrupling the number of 32-bit units to process,
as well as adding possible conversion overhead.
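To give an idea of what the conversion involves, here is a rough sketch of the decode
step for one character, UTF-8 to a 32-bit code point (the function name is invented,
and it assumes the input is already valid UTF-8 - a real converter would also have to
reject truncated sequences, overlong forms and surrogates):
#include <stdint.h>

static const char *decode_one(const char *s, uint32_t *cp)
{
    unsigned char c = (unsigned char)*s++;
    int extra;

    if (c < 0x80)      { *cp = c;        extra = 0; }  /* ASCII, one octet     */
    else if (c < 0xE0) { *cp = c & 0x1F; extra = 1; }  /* two-octet sequence   */
    else if (c < 0xF0) { *cp = c & 0x0F; extra = 2; }  /* three-octet sequence */
    else               { *cp = c & 0x07; extra = 3; }  /* four-octet sequence  */

    while (extra-- > 0)  /* fold in the 10xxxxxx continuation octets */
        *cp = (*cp << 6) | ((unsigned char)*s++ & 0x3F);
    return s;
}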
The effects of this on processing efficiency are going to depend on just what you are
doing with the characters, and what optimisations can be applied (whether by the
programmer or the compiler). For some operations UTF-8 can be considerably less
efficient than UTF-32.
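Random access is one such operation: in UTF-32 the n-th character is a constant-time
array index, while in UTF-8 it needs a linear scan because character boundaries are not
at fixed offsets. A rough sketch (name invented, valid input assumed):
#include <stddef.h>

const char *utf8_index(const char *s, size_t n)
{
    while (*s) {
        if ((*s & 0xC0) != 0x80 && n-- == 0) /* count lead bytes only */
            return s;
        ++s;
    }
    return NULL; /* fewer than n+1 characters */
}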
It also depends on how much UTF-8 "hides" the properties you are dealing with.
On the one hand, the character-based strlen mentioned in this thread is easy to write
for UTF-8:
size_t charlen(const char* str){//assumes valid UTF-8
    size_t ret = 0;
    while (*str)
        if ((*str++ & 0xC0) != 0x80)//count every octet except 10xxxxxx continuations
            ++ret;
    return ret;
}
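For comparison, here is a sketch of the UTF-32 equivalent, assuming a zero-terminated
array of 32-bit units (uint32_t from <stdint.h>):
size_t charlen32(const uint32_t* str){//one unit per character, no decoding needed
    size_t ret = 0;
    while (*str++)
        ++ret;
    return ret;
}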
How this compares with the UTF-32 equivalent will vary. Note that the UTF-8 version
still has validity issues. Generally, though, UTF-8 doesn't have many problems with
this. On the other hand, while it is certainly possible to use UTF-8 to do the property
lookup needed for most functionality that treats Unicode as more than just a bunch of
21-bit numbers encoded in various ways, it is easier and more efficient (often
including the memory size of the program) to do much of it with UTF-16 or UTF-32.
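To show what I mean, here is a rough sketch of the kind of two-stage table lookup used
for character properties (the table names and sizes are invented; real tables would be
generated from the Unicode Character Database). With UTF-32 the code point is a direct
index; with UTF-8 you would first have to decode it:
#include <stdint.h>

extern const uint16_t category_index[0x1100];  /* 0x110000 / 256 block numbers */
extern const uint8_t  category_block[][256];   /* per-block property values    */

uint8_t general_category(uint32_t cp)
{
    return category_block[category_index[cp >> 8]][cp & 0xFF];
}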