Asmus Freytag wrote:
A simplistic model of the 'cost' for UTF-16 over UTF-32 would consider

1) 1 extra test per character (to see whether it's a surrogate)

In my experience tuning a fair amount of UTF-16 software, this test takes pretty close to zero time. All modern processors have branch prediction and pipelining tricks that effectively hide the cost of a predictable branch in a tight loop. Supplementary characters should generally be rare enough that the extra time to process them when they are encountered is not statistically significant.
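
For illustration, here is a minimal sketch (mine, not from the original message) of the kind of inner loop under discussion, assuming a char16_t buffer of known length; the surrogate test is a single predictable comparison per code unit, and the pair-handling path runs only when a supplementary character actually shows up:

    #include <cstddef>

    // Count code points in a UTF-16 buffer. The lead-surrogate test is one
    // compare that is almost never taken, so branch prediction makes it
    // nearly free; the extra work for a pair happens only on the rare
    // occasions a supplementary character is encountered.
    std::size_t countCodePoints(const char16_t *s, std::size_t length) {
        std::size_t count = 0;
        for (std::size_t i = 0; i < length; ++i) {
            ++count;
            char16_t c = s[i];
            if (c >= 0xD800 && c <= 0xDBFF) {                   // lead surrogate?
                if (i + 1 < length &&
                    s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) { // trail surrogate
                    ++i;                                        // consume the pair
                }
            }
        }
        return count;
    }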



2) special handling every 100 to 1000 characters (say 10 instructions)

3) additional cost of accessing 16-bit registers (per character)

4) reduction in cache misses (each the equivalent of many instructions)

This is a big deal. In plowing through lots of text data with relatively simple processing, the cost appears to be heavily tied to the required memory bandwidth, assuming reasonably carefully written code.
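
To put rough numbers on that (my own back-of-the-envelope figures, not from the message): for mostly-BMP text, UTF-16 needs about 2 bytes per character against UTF-32's fixed 4, so a simple scan touches roughly half as many cache lines:

    #include <cstddef>
    #include <cstdio>

    // Rough working-set comparison for scanning N mostly-BMP characters,
    // assuming a typical 64-byte cache line. UTF-32 touches roughly twice
    // the memory, and therefore roughly twice the cache lines, of UTF-16.
    int main() {
        const std::size_t nChars = 1000000;       // 1M characters
        const std::size_t cacheLine = 64;
        std::size_t utf16Bytes = nChars * 2;      // ignoring rare surrogate pairs
        std::size_t utf32Bytes = nChars * 4;
        std::printf("UTF-16: %zu bytes, ~%zu cache lines\n",
                    utf16Bytes, utf16Bytes / cacheLine);
        std::printf("UTF-32: %zu bytes, ~%zu cache lines\n",
                    utf32Bytes, utf32Bytes / cacheLine);
        return 0;
    }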



5) reduction in disk access (each the equivalent of many, many instructions)

For many operations, e.g. string length, both 1 and 2 are no-ops,
so you need to apply a reduction factor based on the mix of operations
you do perform, say 50%-75%.
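
A small sketch of that string-length case (not from the original message), assuming a NUL-terminated char16_t string: when length is measured in code units, the loop is identical to what a UTF-32 version would do, so the surrogate test from item 1 simply never runs.

    #include <cstddef>

    // Length in UTF-16 code units. No surrogate test is needed: items 1 and 2
    // are no-ops for this operation, which is why a reduction factor applies
    // when averaging over a realistic mix of operations.
    std::size_t lengthInCodeUnits(const char16_t *s) {
        std::size_t n = 0;
        while (s[n] != 0) {
            ++n;
        }
        return n;
    }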

For many processors, item 3 is not an issue.

For 4 and 5, the multiplier is somewhere in the 100s or 1000s per
occurrence, depending on the architecture. Their relative weight depends
not only on cache sizes, but also on how many other instructions per
character are performed. For text scanning operations over large data
sets, their cost predominates.



--
     Andy Heninger
     [EMAIL PROTECTED]



