From: "Asmus Freytag" <[EMAIL PROTECTED]>
A simplistic model of the 'cost' for UTF-16 over UTF-32 would consider

1) 1 extra test per character (to see whether it's a surrogate)

2) special handling every 100 to 1000 characters (say 10 instructions)

3) additional cost of accessing 16-bit registers (per character)

4) reduction in cache misses (each the equivalent of many instructions)

5) reduction in disk access (each the equivalent of many, many
instructions)
(...)
For 4 and 5, the multiplier is somewhere in the 100s or 1000s for each
occurrence, depending on the architecture. Their relative weight depends
not only on cache sizes, but also on how many other instructions per
character are performed. For text scanning operations, their cost
does predominate with large data sets.
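
To make points 1 and 2 concrete: the extra test is just a range check on each 16-bit code unit, and the "special handling" is a branch taken only when a lead surrogate is actually encountered. A minimal sketch in C (my own illustration, assuming well-formed UTF-16 input):

    #include <stddef.h>
    #include <stdint.h>

    /* Count the code points in a well-formed UTF-16 buffer.
       Hypothetical example, not taken from any particular library. */
    size_t count_code_points_utf16(const uint16_t *s, size_t len)
    {
        size_t count = 0;
        for (size_t i = 0; i < len; i++) {
            count++;
            /* Point 1: one extra range test per code unit. */
            if ((s[i] & 0xFC00) == 0xD800)
                i++;    /* Point 2: lead surrogate, skip the trail unit. */
        }
        return count;
    }

The equivalent UTF-32 loop is simply "count = len", which is where the per-character saving of UTF-32 comes from.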

I tend to disagree with you on points 4 and 5: cache misses and disk accesses (more commonly discussed under "data locality" in performance work) really favor UTF-16 over UTF-32, simply because UTF-16 will be more compact for almost every text you need to process. The only exception is a text containing only characters from a script *not present at all* in the BMP (and even that excludes Han: although there are tons of ideographs outside the BMP, they are almost never used on their own, but appear only occasionally among large numbers of conventional Han characters in the BMP).


Given that those scripts are all historic, or were encoded for technical purposes with very specific usage, the large majority of texts will not contain significant numbers of characters outside the BMP, so surrogate pairs in UTF-16 will remain the exception. And in all cases, even for a text made entirely of characters outside the BMP, UTF-16 cannot be larger than UTF-32.
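
The per-code-point arithmetic is easy to check: UTF-16 needs 2 bytes for a BMP code point and 4 for a supplementary one, while UTF-32 always needs 4. A tiny illustrative comparison in C (hypothetical helper names, not from any library):

    #include <stddef.h>
    #include <stdint.h>

    /* Bytes needed to encode one code point; cp is assumed to be a
       valid Unicode scalar value. Hypothetical helpers, for illustration only. */
    static size_t utf16_bytes(uint32_t cp) { return cp < 0x10000 ? 2 : 4; }
    static size_t utf32_bytes(uint32_t cp) { (void)cp; return 4; }

For every code point utf16_bytes(cp) <= utf32_bytes(cp), so a text that stays in the BMP packs twice as many characters into each cache line or disk block under UTF-16 as under UTF-32, which is exactly why points 4 and 5 favor it.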

The only case where UTF-16 would be worse than UTF-32 is the internal representation of strings in memory on architectures where a 16-bit code unit cannot actually be stored in 16 bits: for example, if memory is not individually addressable below units of at least 32 bits, and the CPU is very inefficient when working with 16-bit bitfields within 32-bit memory units or registers, because of the extra shift and mask operations needed to pack and unpack two 16-bit bitfields in a single 32-bit memory cell.
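
On such a word-addressable machine, reading the n-th UTF-16 code unit out of a buffer of 32-bit cells would need explicit shifting and masking on every access. A rough sketch in C of what the compiler or programmer would have to generate (purely illustrative, little-endian packing assumed):

    #include <stdint.h>

    /* Two 16-bit code units packed into each 32-bit cell; fetching unit n
       costs a shift and a mask on every access (illustration only). */
    static uint16_t get_utf16_unit(const uint32_t *cells, uint32_t n)
    {
        uint32_t word  = cells[n >> 1];        /* which 32-bit cell       */
        unsigned shift = (n & 1u) ? 16 : 0;    /* which half of that cell */
        return (uint16_t)((word >> shift) & 0xFFFF);
    }

Whether that overhead matters depends on whether the architecture can hide the shift and mask in its load path; if it cannot, the memory saving of UTF-16 gets paid back in extra ALU work.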

I doubt that such an architecture would be very successful, given that too many standard protocols depend on working with data streams made of 8-bit bytes: on such an architecture, all data I/O would have to store each 8-bit byte in its own addressable 32-bit memory cell, which would be a very poor use of the available main memory (the machine would need much more RAM to reach equivalent I/O performance, and even the very costly fast cache RAM would have to grow considerably, meaning higher hardware costs).

So even on such 32-bit-only (or 64-bit-only...) architectures (where, for example, the C datatype "char" would be 32 or 64 bits wide), the CPU would provide efficient instructions for packing and unpacking bytes in 32-bit (or 64-bit) memory cells, or would at least do so at the register level, with instructions that work efficiently on such bitfields.




