Re: Why Work at Encoding Level?

Daniel Bünzli Tue, 13 Oct 2015 16:34:08 -0700

On Tuesday, 13 October 2015 at 23:37, Richard Wordingham wrote:
> If you are referring to indexing, I suspect the issue is performance.
> UTF-32 feels wasteful, and if the underlying character text is UTF-8 or
> UTF-16 we need an auxiliary array to convert character number to byte
> offset if we are to have O(1) time for access.


If UTF-32 feels wasteful, there are various smart ways of providing direct
indexing at a reasonable cost, provided you are in a language that has minimal
support for datatype definition and abstraction.
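
To make this concrete, here is a rough sketch of one such approach, chosen
purely for illustration: keep the text in UTF-8 and store the byte offset of
every stride-th code point, so that a lookup jumps to the nearest sample and
scans at most stride - 1 code points forward. The class name IndexedUtf8 and
the stride parameter are hypothetical and do not come from any particular
library.

```python
class IndexedUtf8:
    """UTF-8 text plus a sparse code point -> byte offset table (hypothetical)."""

    def __init__(self, text: str, stride: int = 32):
        self.data = text.encode("utf-8")
        self.stride = stride
        self.length = len(text)  # number of code points
        self.samples = []        # byte offset of code points 0, stride, 2*stride, ...
        count = 0
        for offset, byte in enumerate(self.data):
            if byte & 0xC0 != 0x80:  # not a continuation byte: a code point starts here
                if count % stride == 0:
                    self.samples.append(offset)
                count += 1

    @staticmethod
    def _width(lead: int) -> int:
        """Length in bytes of the UTF-8 sequence that starts with this lead byte."""
        if lead < 0x80:
            return 1
        if lead < 0xE0:
            return 2
        if lead < 0xF0:
            return 3
        return 4

    def _offset(self, index: int) -> int:
        """Byte offset of code point `index`: jump to a sample, then scan forward."""
        offset = self.samples[index // self.stride]
        for _ in range(index % self.stride):
            offset += self._width(self.data[offset])
        return offset

    def __len__(self) -> int:
        return self.length

    def __getitem__(self, index: int) -> str:
        if not 0 <= index < self.length:
            raise IndexError(index)
        start = self._offset(index)
        return self.data[start:start + self._width(self.data[start])].decode("utf-8")
```

With a fixed stride the access cost is bounded by a constant, and the table
costs one integer per stride code points instead of one per code point.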

Also, I personally find indexing to be rarely useful in string processing, so it
may not be the operation you want to optimize for. Having iterator-like
functions, as you suggest, and a datatype to represent substrings often seems a
better fit than doing index arithmetic.
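
As a minimal sketch of that style (the names Sub and fields are made up for
this example), a substring can be a small value that remembers its base string
and its bounds, and a splitting function can hand such values out so that the
caller never computes an index:

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class Sub:
    """A substring as a (base, start, stop) triple; hypothetical name."""
    base: str
    start: int
    stop: int

    def __str__(self) -> str:
        # Materialize the view only when the text itself is needed.
        return self.base[self.start:self.stop]

    def __iter__(self) -> Iterator[str]:
        # Iterate over the code points of the view without copying it.
        for i in range(self.start, self.stop):
            yield self.base[i]


def fields(text: str, sep: str) -> Iterator[Sub]:
    """Split text on a single-character separator, yielding substring views."""
    start = 0
    for pos, ch in enumerate(text):
        if ch == sep:
            yield Sub(text, start, pos)
            start = pos + 1
    yield Sub(text, start, len(text))
```

All the index bookkeeping stays inside fields; client code only iterates over
the resulting substrings and converts them with str() when needed.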

Note that the Swift programming language seems to have gone even further than I
would have: their notion of character is a grapheme cluster tested for equality
using canonical equivalence, and that is what they index in their strings, see
[1]. I don't know how well that works in practice, as I have never used it
myself, but it feels like the ultimate Unicode string model you want to provide
to the zero-knowledge Unicode programmer (at least for alphabetic scripts).
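
To illustrate only the canonical-equivalence half of that model, here is a
small sketch using the standard unicodedata module. It is not Swift's
implementation, the helper name canonically_equal is made up for this example,
and it does not attempt grapheme cluster segmentation.

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# The two spellings differ code point by code point...
plain_equal = precomposed == decomposed
# ...but are the same text under canonical equivalence.
canon_equal = canonically_equal(precomposed, decomposed)
```

Here plain_equal is False and canon_equal is True: the two strings have
different lengths in code points (1 and 2) yet denote the same character to a
user, which is the distinction a grapheme-based string model hides.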

Best,

Daniel

[1] https://developer.apple.com/library/ios/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html
