From: "Mark Davis" <[EMAIL PROTECTED]> > We tend to use tries, which have very good performance characteristics. See > "bits of unicode" on my site: www.macchiato.com.
You'll find many references of books, and papers available in PDF format on: http://citeseer.nj.nec.com/147026.html Various aspects are studied in the list of references, including performance and data compression methods. Unicode is not an issue in this approach, but most of the referenced algorithms where based on 8-bit bytes, so for them UTF-8 is a good choice, but SCSU-8 could be used as well, provided that it does not break too much the redundancy of substrings for efficient detection and storage of non-prefix common substrings. An issue with SCSU-8 or even UTF-8 is that it still allows performing compares within the search/browse algorithm on something not better than just the internal binary ordering of encoded sequences (which is really distinct from the normal collation order you'd find in a classic dictionnary). So yes, the choice of an encoding appropriate to the language you're collating in the dictionnary is relevant and important prior to map the encoded strings in any tree structure.