> As long as you are sure that this will not leak out into the outside
> world, you are free to use the UTF-8 mechanism internally to represent
> any type of 31-bit data you like, including this private replacement for
> allkeys.txt. (You do know about allkeys.txt, don't you? And the fact
> that UCA is heavily customizable?)

Yes, I know allkeys.txt, and the fact that UCA is highly customizable. It is still too complex to handle many languages consistently, and I prefer having rules that define a hierarchy tree of languages for sorting or collating, so that a single reset of a language root moves all its collation keys along with the related characters that are normally collated with them, even if they are not used in the typical orthography of that language.

Also, UCA still does not order all the characters in the [variable] section very precisely: they are mostly sorted by script type and then by code point, but many of them could be rearranged next to related characters.

> It would seem to make sense primarily for retaining ASCII compatibility
> and representing smaller values in fewer bytes than larger values, so
> you would want to be sure these are your design goals too.

Unfortunately, this is IMPOSSIBLE! I need code positions between successive ASCII positions. All I can do is preserve 1 byte for the ASCII character in the encoding scheme for the code position, but other bytes will be prepended and appended. Due to this constraint, any ASCII character will really be represented by at least 3 bytes. This is not intended for interchange of text, just for internal representation during processing, for lookup tables, or to extract some binary-coded character properties (I have more properties than those listed in Unicode, simply because I have inserted properties needed for UCA and tailored collation).

> But things like this do have a tendency to leak into the outside world,
> and if this ever happens with your collation keys, you will have
> unleashed something like CESU-8 that fails the "duck test": it walks and
> talks like UTF-8, but it's not. Be sure this won't leak out.

It won't, simply because this encoding is strictly for internal processing, as an intermediate step.
It is not efficient enough to make it a true encoding, simply because it uses 1 code per function, instead of packing several functions into bitfields. As I have not determined the correct size of these bitfields, I need some intermediate solution to pack them a little, and the UTF-8 TES (not the UTF-8 CES used by Unicode) is convenient for now, until I change it to a better encoding, which may or may not leak out (I am not sure that I need to make the encoding accessible from an interface, except for debugging).

After all, the intermediate tables computed by the ICU builder are completely internal, and their format is not guaranteed to be supported elsewhere: these tables use their own encoding and conventions, and are strictly bound to the internal implementation of the ICU runtime. It is the same for me.
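
To make the "UTF-8 mechanism for 31-bit data" concrete, here is a minimal sketch in Python of the original 1-to-6-byte UTF-8 bit layout used as a plain integer transform. The names encode31 and decode31 are hypothetical, and this is not the actual internal format discussed above (which adds bytes around the ASCII byte); it only illustrates the byte-packing step. One property that matters for collation keys: as long as shortest forms are used, comparing the encoded bytes gives the same order as comparing the integers.

```python
# Sketch only: the original UTF-8 layout (1 to 6 bytes, values up to
# 2**31 - 1) applied to arbitrary integers. Names are hypothetical.

# (largest value, lead-byte marker, number of trailing bytes)
_RANGES = [
    (0x7F, 0x00, 0),
    (0x7FF, 0xC0, 1),
    (0xFFFF, 0xE0, 2),
    (0x1FFFFF, 0xF0, 3),
    (0x3FFFFFF, 0xF8, 4),
    (0x7FFFFFFF, 0xFC, 5),
]


def encode31(value: int) -> bytes:
    """Encode a 31-bit integer in the shortest UTF-8-style form."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("value does not fit in 31 bits")
    for limit, lead, trail in _RANGES:
        if value <= limit:
            out = bytearray(trail + 1)
            # Fill trailing bytes from the end, 6 payload bits each.
            for i in range(trail, 0, -1):
                out[i] = 0x80 | (value & 0x3F)
                value >>= 6
            out[0] = lead | value
            return bytes(out)
    raise AssertionError("unreachable")


def decode31(data: bytes) -> int:
    """Decode one sequence produced by encode31.

    Overlong forms are not rejected here; an internal table builder
    would presumably only ever see its own shortest-form output.
    """
    if not data:
        raise ValueError("empty input")
    first = data[0]
    if first < 0x80:
        if len(data) != 1:
            raise ValueError("unexpected trailing bytes")
        return first
    for _, lead, trail in _RANGES[1:]:
        payload_mask = (~lead & 0xFF) >> 1
        if (first & ~payload_mask & 0xFF) == lead:
            if len(data) != trail + 1:
                raise ValueError("wrong sequence length")
            value = first & payload_mask
            for b in data[1:]:
                if b & 0xC0 != 0x80:
                    raise ValueError("bad trailing byte")
                value = (value << 6) | (b & 0x3F)
            return value
    raise ValueError("invalid lead byte")
```

For example, encode31(0x41) is the single byte b"A", while encode31(0x7FFFFFFF) takes the full six bytes.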