From: "Markus Scherer" <[EMAIL PROTECTED]> > > Nothing forbids to code each n-ary node as a binary tree for faster > > searches. > > Or, more simply and obviously, do a binary search on the line array in your node if the node > contains more than just a few items.
A binary search in a vector implicitly creates such a binary tree. That is what most B-tree implementations do: nodes of variable size are grouped into vectors that fill pages of constant size, and this is how virtually all SQL engines build their indexes. There are many options for tuning how the nodes are encoded, but a trie can also be stored in a B-tree with very fast access, and become quite compact that way. The algorithms needed to feed the B-tree and then compact it to its minimal size, with a page fill factor of 100%, exist in many RDBMS engines and database toolkits (look, for example, at the various sources that work on dBase(tm) or other VSAM sorted tables); while feeding the dictionary entries you can simply use the minimum fill factor of 50%, and compact afterwards (a bulk-load sketch follows at the end of this message).

Another way to compact the dictionary is to use an LZW-like string lookup dictionary, except that you do not limit the size of its internal table (also sketched below). With this approach you need no prior choice of encoding, and it can keep the trie stored according to the language's collation rules: the LZW lookup dictionary is a separate heap whose internal ordering is unrelated, and it may be kept coded with UTF-16 code units.

It is true that using SCSU in a trie requires a stable compressor. I think this is a design issue of SCSU: the standard defines no standard compressor that produces a predictable compressed string which could be compared directly. SCSU would have benefited from a "reference implementation" whose output could be compared by exact binary matching; as it stands, you cannot compare two SCSU-compressed strings without first decompressing them to some UTF. That does not mean SCSU cannot be used in a trie that stores a lexical dictionary. Without it you can of course use BOCU, but it compresses strings less.

There is, however, another option: use a re-encoder that takes the LZW lookup dictionary initially built from some UTF such as UTF-32, and applies a statistical Huffman or arithmetic coding to each coded character, where the frequency of each letter determines its numeric value in the generated encoding. You then get a bijective encoding in which every UTF-32 code unit is replaced by its Huffman rank, the lowest ranks being encoded with fewer bits. The trie then no longer stores the characters directly, but either a bit position in the LZW dictionary heap or a character coded with a variable number of bits. The correspondence between UTF-32 code units and Huffman/arithmetic codes is kept in a simple vector of code units, sorted by encoded bit size and statistical rank. Finally, the number of code units in that correspondence vector determines the initial offset applied when encoding the compressed LZW index of each substring stored in a trie node: each trie node stores only one index, where the lowest values reference code units in the correspondence table and the higher values reference substrings stored at the indicated bit position in the LZW heap (a sketch of this ranking and indexing also follows below). This allows a variable-length encoding of those indices, if needed, to compress each node of the trie, which is finally mapped onto the B-tree. The correspondence vector of unique UTF-32 code units can itself be compacted to use fewer bits per code unit if needed; this does not affect the storage of the LZW heap, or of the trie in the B-tree.
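As a rough illustration of the "feed at 50% fill, then compact to 100%" idea, here is a sketch, under my own simplifying assumptions, of a bottom-up bulk load that packs already-sorted keys into completely full pages. The page size, the node layout, and the function names are arbitrary choices of mine, and a real B-tree would keep separator keys between children rather than routing on the first key of each page:

    PAGE_SIZE = 4   # keys per page, kept tiny so the example stays readable

    def bulk_load(sorted_keys):
        # Leaf level: completely full pages of keys, no children.
        level = [(sorted_keys[i:i + PAGE_SIZE], None)
                 for i in range(0, len(sorted_keys), PAGE_SIZE)]
        # Build parent levels until a single root page remains.
        while len(level) > 1:
            parent = []
            for i in range(0, len(level), PAGE_SIZE):
                group = level[i:i + PAGE_SIZE]
                routing = [page[0][0] for page in group]   # first key of each child page
                parent.append((routing, group))
            level = parent
        return level[0]

    def search(node, key):
        keys, children = node
        if children is None:
            return key in keys
        # Descend into the rightmost child whose first key is <= key.
        i = 0
        while i + 1 < len(keys) and keys[i + 1] <= key:
            i += 1
        return search(children[i], key)

    root = bulk_load(sorted(["abricot", "chat", "cheval", "chien",
                             "poire", "pomme", "prune", "raisin"]))
    assert search(root, "chien") and not search(root, "chou")

The point of the bulk load is only that, once the word list is complete and sorted, every page can be filled to 100% instead of the half-full pages you get from incremental insertion.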
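For the LZW-like string lookup dictionary, here is a sketch of what I mean, with an unbounded table that is never reset and is shared by all dictionary entries, so identical substrings across entries receive the same code; the class and method names are illustrative only:

    class UnboundedLZW:
        def __init__(self):
            self.table = {}        # string -> code
            self.strings = []      # code -> string (the shared "heap")

        def _code_for(self, s):
            code = self.table.get(s)
            if code is None:
                code = len(self.strings)
                self.table[s] = code
                self.strings.append(s)
            return code

        def encode(self, text):
            # Classic LZW step, except the table is shared between calls
            # and is never reset or size-limited.
            out = []
            w = ""
            for ch in text:
                if ch not in self.table:
                    self._code_for(ch)          # admit new code units on the fly
                if w + ch in self.table:
                    w += ch
                else:
                    out.append(self.table[w])
                    self._code_for(w + ch)
                    w = ch
            if w:
                out.append(self.table[w])
            return out

        def decode(self, codes):
            # Decoding is a plain lookup because the whole table is kept.
            return "".join(self.strings[c] for c in codes)

    lzw = UnboundedLZW()
    coded = {w: lzw.encode(w) for w in ("rendre", "revendre", "rendez-vous")}
    assert all(lzw.decode(c) == w for w, c in coded.items())

Because the table is persisted with the dictionary rather than rebuilt from a stream, the codes can be stored directly in the trie nodes and decoded independently of each other.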
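And here is a sketch, again with made-up names, of the frequency-ranking and single-index scheme described above: code units are ranked by their frequency in the word list, a trie slot then holds one integer that is either a rank (values below the table size) or an LZW code offset past the table size, and a simple byte-oriented variable-length integer coding stands in for the Huffman/arithmetic coding, giving the smallest values the fewest bytes:

    from collections import Counter

    def build_rank_table(words):
        # Code units sorted by descending frequency; the most frequent gets rank 0.
        freq = Counter(ch for w in words for ch in w)
        units = [u for u, _ in freq.most_common()]
        return {u: r for r, u in enumerate(units)}, units

    def encode_slot(unit_rank, code_unit=None, lzw_code=None):
        # One integer per trie slot: a code-unit rank, or an LZW code offset
        # past the end of the correspondence table.
        if lzw_code is None:
            return unit_rank[code_unit]
        return len(unit_rank) + lzw_code

    def decode_slot(value, unit_rank, units):
        if value < len(unit_rank):
            return ("unit", units[value])
        return ("substring", value - len(unit_rank))

    def varint(n):
        # 7-bit groups with a continuation bit: the smallest values (i.e. the
        # most frequent code units) take the fewest bytes.
        out = bytearray()
        while True:
            group = n & 0x7F
            n >>= 7
            out.append(group | (0x80 if n else 0))
            if not n:
                return bytes(out)

    words = ["rendre", "revendre", "rendez"]
    unit_rank, units = build_rank_table(words)
    assert unit_rank["e"] < unit_rank["z"]            # 'e' is more frequent than 'z'
    assert len(varint(encode_slot(unit_rank, "e"))) == 1
    assert decode_slot(encode_slot(unit_rank, lzw_code=3), unit_rank, units) == ("substring", 3)

A true Huffman or arithmetic coder would assign variable numbers of bits rather than bytes, but the partition of the value space (ranked code units below the table size, substring references above it) is the same.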