On Wed, 14 Oct 2015 00:28:26 +0100 Daniel Bünzli <[email protected]> wrote:
> If UTF-32 feels wasteful there are various smart ways of providing
> direct indexing at a reasonable cost if you are in a language that
> has minimal support for datatype definition and abstraction.

I can't find a good one that's been published.  The Elias-Fano
encoding for UTF-8 indexing works out at 3 to 5 bits per character
even without extending it to achieve 'constant time' access, the
limiting extremes being English and Ugaritic.  (Most SMP scripts use
a lot of ASCII.)

For genuine UTF-8 text I can happily get the memory requirement down
to 1.031 bits per character.  I exploit the fact that one can easily
advance character by character through a UTF-8 string, but limit
myself to 5 advances.  The 0.031 part of that figure only comes in
for strings longer than a thousand characters, and could be reduced
to 0.002 with some extra processing.  There's a lot of redundancy in
the positions.
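
A toy sketch of the idea in Swift (my illustration only: it keeps the
sampled byte offsets as plain integers, whereas the bit counts above
assume the positions are stored in a compressed, Elias-Fano-style
form; well-formed UTF-8 is assumed):

    // Record the byte offset of every 6th code point, so any code
    // point can be reached with one table lookup and at most five
    // forward advances through the UTF-8 bytes.
    struct SampledUTF8Index {
        static let sampleEvery = 6
        let bytes: [UInt8]
        let samples: [Int]   // byte offset of every 6th code point

        init(_ s: String) {
            let b = Array(s.utf8)
            var offsets: [Int] = []
            var byte = 0
            var codePoint = 0
            while byte < b.count {
                if codePoint % SampledUTF8Index.sampleEvery == 0 {
                    offsets.append(byte)
                }
                byte += SampledUTF8Index.sequenceLength(b[byte])
                codePoint += 1
            }
            bytes = b
            samples = offsets
        }

        // Length of a UTF-8 sequence, read from its lead byte.
        static func sequenceLength(_ lead: UInt8) -> Int {
            if lead < 0x80 { return 1 }
            if lead < 0xE0 { return 2 }
            if lead < 0xF0 { return 3 }
            return 4
        }

        // Byte offset of code point k: one lookup, then at most
        // five forward advances.
        func byteOffset(ofCodePoint k: Int) -> Int {
            var byte = samples[k / SampledUTF8Index.sampleEvery]
            for _ in 0 ..< (k % SampledUTF8Index.sampleEvery) {
                byte += SampledUTF8Index.sequenceLength(bytes[byte])
            }
            return byte
        }
    }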

> Note that the Swift programming language seems to have gone even
> further than I would have: their notion of character is a grapheme
> cluster tested for equality using canonical equivalence and that's
> what they index in their strings, see [1]. Don't know how well that
> works in practice as I personally never used it; but it feels like
> the ultimate Unicode string model you want to provide to the
> zero-knowledge Unicode programmer (at least for alphabetic scripts).

It doesn't quite work.  For Thai at least, deleting backwards should
delete just a combining mark rather than the whole grapheme cluster.
I couldn't find any provision for this in Swift.  There is also the
question (irrelevant for Thai) of whether this deletion should be
done in NFC or NFD.

Having backward deletion remove only a combining mark also makes
sense for the International Phonetic Alphabet, as well as for the
Thai script used alphabetically (as is often done for Pali) and for
the Lao script - the modern Lao writing system is formally an
alphabet.
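
To illustrate the Thai point in Swift (my own example, not anything
in Swift's documentation; the string is U+0E01 KO KAI followed by
U+0E49 MAI THO, which Swift treats as a single Character): deleting
by Character takes out the whole cluster, and one has to drop to the
unicodeScalars view by hand to remove just the tone mark.

    // Deleting backwards at the Character (grapheme cluster) level
    // removes the consonant together with its tone mark.
    var byCluster = "\u{0E01}\u{0E49}"
    byCluster.removeLast()
    print(byCluster.isEmpty)              // true

    // Deleting backwards at the Unicode scalar level removes only
    // the tone mark, leaving the base consonant.
    var byScalar = "\u{0E01}\u{0E49}"
    let trimmed = byScalar.unicodeScalars.dropLast()
    byScalar = String(String.UnicodeScalarView(trimmed))
    print(byScalar == "\u{0E01}")         // true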

Richard.