On Tue, 13 Oct 2015 16:09:16 +0100, Daniel Bünzli <[email protected]> wrote (under topic heading 'Counting Codepoints'):
> I don't understand why people still insist on programming with
> Unicode at the encoding level rather than at the scalar value level.
> Deal with encoding errors and sanitize your inputs at the IO boundary
> of your program and then simply work with scalar values internally.

If you are referring to indexing, I suspect the issue is performance.
UTF-32 feels wasteful, and if the underlying character text is UTF-8 or UTF-16 we need an auxiliary array to convert character number to byte offset if we are to have O(1) access time. This auxiliary array can be compressed chunk by chunk, but the larger the chunk, the greater the maximum access time.

The way it could work is a bit strange, because this auxiliary array is redundant. For example, you could use it to record the location of every 4th or every 5th codepoint, so that the UTF-8 offset variation fits in 4 bits, or of every 15th codepoint for UTF-16. Access could proceed by looking up the index entry for the relevant chunk, then adding up nibbles to find the nearest recorded location within the chunk, and then stepping through the basic character storage itself to finally reach the intermediate points. (I doubt this is an original idea, but I couldn't find it expressed anywhere. It probably performs horribly for short strings.)

Perhaps you are merely suggesting that people work with a character iterator, or in C refrain from doing integer arithmetic on pointers into strings.

Richard.
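
For concreteness, a rough sketch in C of how such an index might look. The group size of 5 codepoints per nibble, the chunk size of 16 groups, and every name here (cp_index, cp_to_byte, and so on) are illustrative assumptions rather than anything from the post; the nibbles are stored one per byte instead of being packed two to a byte, and the text is assumed to be valid UTF-8, its errors having been dealt with at the IO boundary.

/* Sketch of the chunked auxiliary index described above.  Assumptions
 * (none of them from the post): groups of GROUP = 5 codepoints, so a
 * group's UTF-8 length is 5..20 bytes and (length - 5) fits in one
 * nibble; CHUNK = 16 groups per exactly-indexed chunk; valid UTF-8. */
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define GROUP 5            /* codepoints per nibble             */
#define CHUNK 16           /* groups per exactly-indexed chunk  */

/* Byte length of the UTF-8 sequence starting at lead byte b. */
static size_t seq_len(unsigned char b)
{
    if (b < 0x80) return 1;
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    return 4;
}

typedef struct {
    const unsigned char *text;
    size_t ncp;            /* total codepoints                       */
    uint32_t *chunk_off;   /* byte offset of each chunk's start      */
    uint8_t  *nibbles;     /* (group byte length - GROUP), one per
                              byte here for simplicity               */
} cp_index;

/* Build the index in one pass over the text. */
static void cp_index_build(cp_index *ix, const unsigned char *text, size_t nbytes)
{
    ix->text = text;
    ix->chunk_off = malloc((nbytes / (GROUP * CHUNK) + 2) * sizeof *ix->chunk_off);
    ix->nibbles   = malloc(nbytes / GROUP + 2);
    size_t byte = 0, cp = 0, group_start = 0;
    while (byte < nbytes) {
        if (cp % (GROUP * CHUNK) == 0)
            ix->chunk_off[cp / (GROUP * CHUNK)] = (uint32_t)byte;
        if (cp % GROUP == 0)
            group_start = byte;
        byte += seq_len(text[byte]);
        cp++;
        if (cp % GROUP == 0)  /* record the finished group's extra bytes */
            ix->nibbles[cp / GROUP - 1] = (uint8_t)(byte - group_start - GROUP);
    }
    ix->ncp = cp;
}

/* Byte offset of codepoint i: one chunk-offset read, a short nibble
 * scan, then at most GROUP - 1 steps through the text itself. */
static size_t cp_to_byte(const cp_index *ix, size_t i)
{
    size_t chunk = i / (GROUP * CHUNK);
    size_t byte  = ix->chunk_off[chunk];
    size_t group = i / GROUP;
    for (size_t g = chunk * CHUNK; g < group; g++)
        byte += GROUP + ix->nibbles[g];
    for (size_t k = 0; k < i % GROUP; k++)
        byte += seq_len(ix->text[byte]);
    return byte;
}

With these parameters a lookup costs one chunk-offset read, at most fifteen nibble additions, and at most four decoding steps through the text itself, and once the nibbles are packed two to a byte the index adds well under a byte per codepoint.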

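And for the last paragraph's alternative, a minimal scalar-value iterator in C, again only a sketch: it assumes the UTF-8 has already been validated at the IO boundary, so callers see only scalar values and never byte offsets. The names cp_iter and cp_next are, again, made up for the example.

#include <stddef.h>
#include <stdint.h>

typedef struct { const unsigned char *p, *end; } cp_iter;

/* Decode the next scalar value, or return -1 at the end of the text.
 * Assumes the bytes in [p, end) are already-validated UTF-8. */
static int32_t cp_next(cp_iter *it)
{
    if (it->p >= it->end) return -1;
    unsigned char b = *it->p++;
    if (b < 0x80) return b;                       /* 1-byte sequence    */
    int extra = b < 0xE0 ? 1 : b < 0xF0 ? 2 : 3;  /* continuation bytes */
    int32_t cp = b & (0x3F >> extra);             /* bits of lead byte  */
    while (extra--) cp = (cp << 6) | (*it->p++ & 0x3F);
    return cp;
}

/* Usage:
 *   cp_iter it = { (const unsigned char *)s, (const unsigned char *)s + n };
 *   for (int32_t cp; (cp = cp_next(&it)) >= 0; ) ... work with cp ... */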
