> From: Asmus Freytag [mailto:[EMAIL PROTECTED]] 
> Sent: Sunday, September 23, 2001 02:24 AM

> The typical situation involves cases where large data sets are cached in
> memory, for immediate access. Going to UTF-32 reduces the cache
> effectively by a factor of two, with no comparable increase in processing
> efficiency to balance out the extra cache misses. This is because each
> cache miss is orders of magnitude more expensive than a cache hit.

        For this situation you have a good point.  For others, however, the
extra data space of UTF-32 is bound to cost less than having to check every
character for special meaning (i.e., a surrogate) before passing it on.
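
        To make that concrete, here is a minimal C sketch of the per-unit
check UTF-16 forces on the caller, next to the plain array index UTF-32
allows.  The function names and signatures are illustrative only, not taken
from any particular library.

    #include <stdint.h>
    #include <stddef.h>

    /* Sketch only.  Advance one code point in a UTF-16 buffer: every unit
     * must be tested for the surrogate range before it can be passed on. */
    static size_t utf16_next(const uint16_t *s, size_t i, size_t len,
                             uint32_t *out)
    {
        uint16_t hi = s[i];
        if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < len) {
            uint16_t lo = s[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                *out = 0x10000 + (((uint32_t)(hi - 0xD800) << 10)
                                  | (lo - 0xDC00));
                return i + 2;       /* consumed a surrogate pair */
            }
        }
        *out = hi;                  /* BMP character (or unpaired surrogate) */
        return i + 1;
    }

    /* In UTF-32 the same access is a plain array index, no test needed. */
    static uint32_t utf32_at(const uint32_t *s, size_t i)
    {
        return s[i];
    }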

> For specialized data sets (heavy in ascii) keeping such a cache in UTF-8
> might conceivably reduce cache misses further to a point where on the fly
> conversion to UTF-16 could get amortized. However, such an optimization
> is not robust, unless the assumption is due to the nature of the data
> (e.g. HTML) as opposed to merely their source (US). In the latter case,
> such an architecture scales badly with change in market.

        Maybe, maybe not.  Latin characters are in heavy use wherever
computers are, at least for now.
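
        For what it is worth, the on-the-fly conversion described above is
easy to sketch.  Assuming well-formed input and an ASCII-heavy cache, the
code path that would have to be amortized looks roughly like this (the
function name is illustrative, not from any particular library):

    #include <stdint.h>
    #include <stddef.h>

    /* Sketch only: convert a well-formed UTF-8 buffer to UTF-16 on the fly.
     * ASCII-heavy data takes the one-byte fast path almost every time,
     * which is what makes the cache-in-UTF-8 scheme attractive.
     * Returns the number of UTF-16 units written; dst must be big enough. */
    static size_t utf8_to_utf16(const uint8_t *src, size_t n, uint16_t *dst)
    {
        size_t i = 0, o = 0;
        while (i < n) {
            uint8_t b = src[i];
            uint32_t cp;
            if (b < 0x80) {                   /* fast path: ASCII */
                cp = b;
                i += 1;
            } else if (b < 0xE0) {            /* 2-byte sequence */
                cp = ((b & 0x1F) << 6) | (src[i+1] & 0x3F);
                i += 2;
            } else if (b < 0xF0) {            /* 3-byte sequence */
                cp = ((b & 0x0F) << 12) | ((src[i+1] & 0x3F) << 6)
                   | (src[i+2] & 0x3F);
                i += 3;
            } else {                          /* 4-byte sequence */
                cp = ((uint32_t)(b & 0x07) << 18)
                   | ((uint32_t)(src[i+1] & 0x3F) << 12)
                   | ((src[i+2] & 0x3F) << 6) | (src[i+3] & 0x3F);
                i += 4;
            }
            if (cp < 0x10000) {
                dst[o++] = (uint16_t)cp;
            } else {                          /* encode as surrogate pair */
                cp -= 0x10000;
                dst[o++] = (uint16_t)(0xD800 + (cp >> 10));
                dst[o++] = (uint16_t)(0xDC00 + (cp & 0x3FF));
            }
        }
        return o;
    }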

> [The decision to use UTF-16, on the other hand, is much more robust,
> because the code paths that deal with surrogate pairs will be exercised
> with low frequency, due to the deliberate concentration of nearly all
> modern-use characters into the BMP (i.e. the first 64K).]

        Funny.  You see robustness, I see latent bugs due to rarely
exercised code paths.
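
        Case in point, sketched in C: a naive truncation routine passes
every BMP-only test, and the unpaired-surrogate failure only shows up once
the rare path is finally exercised.  Both functions below are illustrative,
not taken from any real code base:

    #include <stdint.h>
    #include <stddef.h>

    /* Sketch only.  Naive truncation to at most 'max' UTF-16 units.  It
     * never looks at the data, so on BMP-only input the flaw lies dormant:
     * when the cut falls between a high and a low surrogate, the result
     * ends in an unpaired surrogate. */
    static size_t truncate_naive(size_t len, size_t max)
    {
        return len < max ? len : max;     /* may split a surrogate pair */
    }

    /* Corrected version: back up one unit if the cut would strand a high
     * surrogate that starts a pair. */
    static size_t truncate_safe(const uint16_t *s, size_t len, size_t max)
    {
        size_t cut = len < max ? len : max;
        if (cut > 0 && cut < len &&
            s[cut - 1] >= 0xD800 && s[cut - 1] <= 0xDBFF)
            cut -= 1;                     /* keep the pair intact */
        return cut;
    }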


/|/|ike
