Philippe Verdy wrote:
Unicode is not an issue in this approach, but most of the referenced
algorithms where based on 8-bit bytes, so for them UTF-8 is a good choice,

Either UTF-8 or UTF-16 would probably work, depending on the rest of the application. UTF-16 may be better because it fits with many Unicode libraries and APIs that work with 16-bit Unicode strings.


but SCSU-8 could be used as well, provided that it does not break too much
the redundancy of substrings for efficient detection and storage of
non-prefix common substrings.

There is no such thing as SCSU-8.


SCSU is probably not an option because it is not a deterministic encoding. A small change in the encoder will result in a very different dictionary structure, and/or a failure to find a word. In order to use SCSU, you would have to write, control and stabilize your own encoder.

BOCU-1 would work if you always start looking up words from their beginning, not subwords from the middle of a word.

UTF-16 is probably best.

An issue with SCSU-8 or even UTF-8 is that it still allows performing
compares within the search/browse algorithm on something not better than
just the internal binary ordering of encoded sequences (which is really
distinct from the normal collation order you'd find in a classic
dictionnary).

If you need linguistically correct comparisons according to some sort order, then you could use the collation elements, or parts of them, for a word instead of the characters. http://oss.software.ibm.com/icu/userguide/Collate_ServiceArchitecture.html#Collation_Element_Iterator


markus




Reply via email to