Re: unidata is big

Theo Veenker Wed, 24 Apr 2002 02:04:08 -0700

andreas palsson wrote:
> 
> Hi.
> 
> I would just like to know if someone could give me a tip on how to
> structure all the unicode-information in memory?
> 
> All the UNIDATA does contain quite a bit of information and I can't see
> any obvious method which is memory-efficient and gives fast access.


You might want to evaluate some of the open source libraries 
mentioned under "Enabled Products" on the Unicode site. For my
own lib (http://www.let.uu.nl/~Theo.Veenker/personal/projects/ucp/)
I've created a separate table builder tool for each property or
mapping. The tools organize data in planes, and for each plane
all possible trie setups are determined (about 80 combinations
of one, two or three stage tables). Then the cheapest setup
is used. This still requires over 230 KB to store all data
(except character names and comments) from the following files:
UnicodeData.txt, EastAsianWidth.txt, LineBreak.txt, ArabicShaping.txt,
Scripts.txt, Blocks.txt, SpecialCasing.txt, CaseFolding.txt,
BidiMirroring.txt, PropList.txt, DerivedCoreProperties.txt,
DerivedNormalizationProperties.txt, and DerivedJoiningType.txt.
For some mappings I've stored 32-bit code points where 16 bits
would have been enough, but I decided API uniformity is more
important than memory efficiency.
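
To make the "try every setup, keep the cheapest" idea concrete, here is a
minimal sketch in Python. It is not code from my library: the function
names and the toy property are made up for illustration, it only tries
two-stage tables for a single plane, and it counts table entries rather
than bytes.

```python
from typing import Dict, List, Tuple

PLANE_SIZE = 0x10000  # code points per Unicode plane


def build_two_stage(values: List[int], shift: int) -> Tuple[List[int], List[int]]:
    """Split `values` into blocks of 2**shift entries and share identical blocks.

    Returns (stage1, stage2), where stage1 maps a block number to the
    offset of that block's data inside stage2.
    """
    block_len = 1 << shift
    stage1: List[int] = []
    stage2: List[int] = []
    seen: Dict[tuple, int] = {}  # block contents -> offset into stage2
    for start in range(0, len(values), block_len):
        block = tuple(values[start:start + block_len])
        if block not in seen:
            seen[block] = len(stage2)
            stage2.extend(block)
        stage1.append(seen[block])
    return stage1, stage2


def lookup(stage1: List[int], stage2: List[int], shift: int, cp: int) -> int:
    """Two array reads: find the block offset, then index into the block."""
    mask = (1 << shift) - 1
    return stage2[stage1[cp >> shift] + (cp & mask)]


def cheapest_two_stage(values: List[int]):
    """Try every block size and keep the setup with the fewest table entries."""
    best = None
    for shift in range(1, 16):
        stage1, stage2 = build_two_stage(values, shift)
        cost = len(stage1) + len(stage2)  # entries, not bytes (simplification)
        if best is None or cost < best[0]:
            best = (cost, shift, stage1, stage2)
    return best


# Toy property for plane 0 (hypothetical data): 1 for ASCII letters, else 0.
plane0 = [1 if (0x41 <= cp <= 0x5A or 0x61 <= cp <= 0x7A) else 0
          for cp in range(PLANE_SIZE)]
cost, shift, stage1, stage2 = cheapest_two_stage(plane0)
print(f"block size 2**{shift}, {cost} entries instead of {PLANE_SIZE}")
```

A real builder would also weigh the entry width of each stage and
compare against one-stage and three-stage layouts, which is where the
larger number of combinations comes from.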

I wouldn't worry too much about memory efficiency; it's irrelevant
these days. Even your mobile phone has enough memory to store all
Unicode data 10 to 20 times over. The same goes for lookup speed: all
you have to do to get it fast is wait (a few seasons).

Theo
