Carl W. Brown writes:
> If you implement an array that is directly indexed by Unicode code point it
> would have to have 1114111 entries.  (I love the number)  I don't think that
> many applications can afford to have over a megabyte of storage per byte of
> table width.  If nothing else it would be an array of addresses pointing to
> valid entries that would take about 4.5 MB.  Because the new planes are
> sparsely populated you can segment your table.  In this case you have no
> real advantage using UTF-32.

That wasn't my point: obviously one would not create a lookup table
using raw Unicode values.

But if I have a text string, and that string is encoded in UTF-16, and
I want to access Unicode character values, then I cannot index that
string in constant time.

To find character n I have to walk the 16-bit code units from the start
of the string, accounting for surrogate pairs as I go. If I use UTF-32 I
don't need to do that. This very issue came up during the discussion of
how to handle surrogates in Python.
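
A rough sketch of the difference, in Python. The helper names
(utf16_units, utf32_units, code_point_at) are made up for illustration,
and plain lists of integers stand in for the raw 16-bit and 32-bit
buffers; this is not how any particular library does it:

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints (hypothetical helper)."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def utf32_units(text):
    """Return the UTF-32 code units of text as a list of ints (hypothetical helper)."""
    data = text.encode("utf-32-le")
    return list(struct.unpack("<%dI" % (len(data) // 4), data))


def code_point_at(units, n):
    """Return the n-th code point (0-based) of a UTF-16 buffer.

    This walks the buffer from the start, so the cost grows with n.
    """
    i = 0
    count = 0
    while i < len(units):
        u = units[i]
        # A high surrogate followed by a low surrogate encodes one code point.
        if (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            step = 2
        else:
            cp = u
            step = 1
        if count == n:
            return cp
        count += 1
        i += step
    raise IndexError("code point index out of range")


text = "a\U00010400b"          # U+10400 lies outside the BMP
units16 = utf16_units(text)    # four 16-bit units: 'a', surrogate pair, 'b'
units32 = utf32_units(text)    # three 32-bit units, one per code point

third_via_walk = code_point_at(units16, 2)  # must step over the pair first
third_via_index = units32[2]                # a single indexed load
```

Note that units16[2] on its own is the low surrogate, not the third
character, which is exactly the problem with indexing UTF-16 directly.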

> I thought that Basis Technology was developed using UCS-2.  Have you
> converted to full UTF-16 support or are you thinking of changing?

The current shipping version of Rosette uses UCS-2 internally. Current
development is focusing on UTF-16 and UTF-32 support.

    -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Sr. Sinostringologist                              http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"
