> Now - a count of DEFAULT GRAPHEME CLUSTERs might be useful (for example, 
> for display on a console which uses fixed-width fonts). Indeed, a whole 
> class of DEFAULT GRAPHEME CLUSTER handling functions might come in very 
> handy indeed. Bytes are useful. Default grapheme clusters are useful. 
> But a "character"? What's the point?

Because characters are a useful intermediary between bytes and grapheme 
clusters. Such an intermediary may be entirely wrapped by code stepping from octets 
to grapheme clusters, or exposed by an API that higher-level code uses to 
produce the grapheme clusters, since characters are the lowest level an API can expose 
while remaining encoding neutral (hence that is the level at which XML APIs expose 
CDATA, element names, etc.). The alternative would be a straight mapping between 
octets and grapheme clusters...
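To make the two layers concrete, here is a minimal Python sketch of the octets → characters → clusters pipeline. The function names are my own, and the clustering rule is a deliberately simplified stand-in (base character plus trailing combining marks) for full UAX #29 default grapheme clustering, which also handles Hangul jamo, ZWJ sequences, and so on:

```python
import unicodedata

def decode_utf8(data: bytes) -> str:
    # Layer 1: octets -> characters (code points). This is the only
    # encoding-dependent step; everything above it is encoding neutral.
    return data.decode("utf-8")

def clusters(text: str) -> list[str]:
    # Layer 2: characters -> grapheme clusters (simplified): a combining
    # mark joins the preceding base character's cluster.
    out: list[str] = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out

# "e" followed by COMBINING ACUTE ACCENT: 3 octets, 2 characters, 1 cluster.
print(clusters(decode_utf8(b"e\xcc\x81a")))  # ['é', 'a']
```

Note that the character layer is exactly where an XML parser would hand off element names and CDATA: it knows nothing about clusters, and the caller knows nothing about UTF-8.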

> But then, a default grapheme cluster might theoretically require up to 
> 16 Unicode characters. (Maybe more, I don't know). Even bit-packed to 21 
> bits per character, that still gives us 336 bits. So I conclude that our 
> string processing functions could go a lot faster if only we'd all use 
> UTF-336. Er....?

Certainly if we allow for linguistic improbabilities (such as C with two graves, an 
acute and a couple of Hebrew vowel points) there would be no limit. Allowing for 
linguistic improbabilities has the advantage of making it more likely that we also 
cover genuine linguistic edge cases.
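The improbable cluster above is perfectly legal Unicode, and a quick count shows why no fixed-width "UTF-336" can ever be safe (the specific mark code points are my choice of graves, acute, and Hebrew points):

```python
# "C" + two combining graves + a combining acute + two Hebrew points:
# linguistically absurd, structurally one default grapheme cluster.
s = "C" + "\u0300\u0300\u0301" + "\u05B0\u05B1"

print(len(s))                  # 6 code points
print(len(s.encode("utf-8")))  # 11 octets, still displaying as one cluster
```

Nothing in the standard caps the number of combining marks, so any per-cluster size limit is a guess.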
