Re: Ternary search trees for Unicode dictionaries

Philippe Verdy Tue, 18 Nov 2003 05:27:26 -0800

From: "Mark Davis" <[EMAIL PROTECTED]>
> We tend to use tries, which have very good performance characteristics.
See
> "bits of unicode" on my site: www.macchiato.com.


You'll find many references of books, and papers available in PDF format on:
http://citeseer.nj.nec.com/147026.html

Various aspects are studied in the list of references, including performance
and data compression methods.

Unicode is not an issue in this approach, but most of the referenced
algorithms where based on 8-bit bytes, so for them UTF-8 is a good choice,
but SCSU-8 could be used as well, provided that it does not break too much
the redundancy of substrings for efficient detection and storage of
non-prefix common substrings.

An issue with SCSU-8 or even UTF-8 is that it still allows performing
compares within the search/browse algorithm on something not better than
just the internal binary ordering of encoded sequences (which is really
distinct from the normal collation order you'd find in a classic
dictionnary).

So yes, the choice of an encoding appropriate to the language you're
collating in the dictionnary is relevant and important prior to map the
encoded strings in any tree structure.

Re: Ternary search trees for Unicode dictionaries

Reply via email to