Resending (from Gmail), because the Unicode list rejected the SMTP server of my mail provider (Spamcop is defective).
Nothing forbids you to create new serializations of Unicode; you may even design yours so that it is a conforming process (meaning that it preserves *all* valid Unicode code points, without affecting their relative order, except possibly their normalization form as long as you preserve the canonical equivalences). If creating a Unicode conforming process is a goal, your serialization will also have to preserve all PUAs and all valid code points of the BMP, with the only exception of isolated/unordered surrogates. Once this condition is met, nothing forbids you to reuse some of the binary encoding space in your serialization scheme, including the space used by PUAs and surrogates (after all, the UTF-16 encoding scheme already does that).

According to your description, your scheme will use 8-bit code units, so you don't have to worry about byte order when serializing code units into bytes. But one goal will probably not be met by your encoding: there will be no separation of byte values marking where each encoded sequence starts and ends. The consequence is that your encoding will not be processable from an arbitrary position, nor backward, because you cannot determine where the sequence encoding a single code point starts. Fast search algorithms (Boyer-Moore and similar) therefore won't be usable for fast plain-text searches, and the whole encoded stream will be processable only from its start and only in the forward direction (there is a similar problem with compressed streams that store compressed text as a single stream instead of a sequence of blocks with resynchronization points).

If you intend to use such an encoding, it may be advisable for it to contain an easily recognizable sequence, unique in all directions and recognizable from any starting point in the stream, that acts as a synchronization point and that itself represents *no* code point at all. With such a resynchronization point (which may be freely inserted between encoded blocks of valid code points), a fast search algorithm is still usable; the resynchronization sequence may also be used as padding to fill the last bytes of data blocks of some fixed size (e.g. blocks of 4 KB, meaning that searches can be started at least from any position that is a multiple of 4 KB).

Currently, UTF-8, UTF-16 and UTF-32 do not contain any resynchronization sequence, but the separation of code unit values still allows a stream to be scanned from any starting position, plus or minus a very small number of code units around that position (not exceeding a handful of bytes), to find the start of an encoded code point: this means that fast-search algorithms work easily with them, with a very small impact on performance.

If performance is really a target, you also have to consider the order of magnitude of the volume of text to process: if lots of text needs to be processed, it is certainly not the encoding itself that will limit performance but the overall size of the text, and a good compression scheme will minimize the number of I/Os (or, equivalently, memory page swaps, or misses in fast memory caches or disk caches).
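As an illustration of how the standard UTFs allow scanning from any starting position, here is a minimal sketch (Python; the function name is mine, purely for illustration) of how a scanner can resynchronize on a UTF-8 stream from an arbitrary byte offset, which is what keeps fast-search algorithms usable:

    def utf8_sync_forward(buf: bytes, pos: int) -> int:
        # Skip trailing bytes (10xxxxxx): UTF-8 reserves 0x80-0xBF for them,
        # so at most 3 bytes are skipped from any starting offset.
        while pos < len(buf) and (buf[pos] & 0xC0) == 0x80:
            pos += 1
        return pos

    data = "héllo".encode("utf-8")      # b'h\xc3\xa9llo'
    print(utf8_sync_forward(data, 2))   # 3 -- offset 2 is the trailing byte of 'é'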
When the texts to process are in fact very small, and are just a small part of larger data sets, there's absolutely no reason to use anything other than the very simple UTFs like UTF-8, UTF-16 or UTF-32, even if these schemes are not very space efficient: these texts will not be compressed alone, but as part of the whole dataset containing them, in order to gain more performance. For example, XML documents still use UTF-8 by default, and it is still very efficient, even for large datasets, because these documents are compressed as a whole for storage or transmission, and possibly split into blocks of fixed size (it's also possible in XML to include padding to make sure that all data blocks keep a constant size, but generic compression schemes often contain their own resynchronization bytes for this purpose).

In other words, there's generally no need to use anything other than the basic standardized UTFs: trying to work at the code-point level, instead of on larger texts or datasets, is not the best approach to improve performance, when you can gain a lot more by working on the larger datasets containing these texts. Any new encoding scheme will instead defeat the many other optimizations that have gone into interoperable libraries. Unicode does not standardize more than the three encoding schemes (and their byte-order variants for those whose code units are larger than bytes), and I think that is plenty. But nothing forbids you to create others, and Unicode has published some informative annexes about such schemes for applications needing compression.

What is true is that UTF-8 was initially designed to allow the encoding of code points up to 31 bits; some inefficiency remains because not all of its encoding space is actually used. We could certainly have a UTF using byte-sized code units, fully compatible with ASCII (1 byte for ASCII), that would encode all other valid code points into sequences of 2 or 3 bytes, while still making it possible to tell start bytes from trailing bytes (for fast-search compatibility). Four-byte sequences could be avoided, but they are still needed with UTF-8 because of its inefficient use of the valid encoding space, given that valid code points are now restricted to the first 17 planes, i.e. a bit more than 20 bits.

Note that you said you'll need to encode 20 bits: this is wrong if you need full Unicode process conformance (17 planes, minus the surrogate space, is still more than 20 bits' worth of code points). If such conformance is not met, your scheme will not be usable with the many libraries and applications that are designed to work only with Unicode conforming processes (which also assume that all PUAs, in the BMP or in the last 2 planes, are effectively preserved).
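To make the arithmetic behind that last remark explicit, here is a quick check (the figures are those of the Unicode standard: 17 planes of 65,536 code points, 2,048 surrogates):

    planes, plane_size = 17, 0x10000     # 17 planes of 65,536 code points each
    surrogates = 0xE000 - 0xD800         # U+D800..U+DFFF: 2,048 code points
    valid = planes * plane_size - surrogates
    print(valid)                         # 1,112,064 scalar values to preserve
    print(2 ** 20)                       # 1,048,576 -- 20 bits do not cover them all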
Philippe.

> Message of 02/06/10 07:47
> From: "Kannan Goundan"
> To: [email protected]
> Cc:
> Subject: Least used parts of BMP.
>
> I'm trying to come up with a compact encoding for Unicode strings for
> data serialization purposes. The goals are fast read/write and small
> size.
>
> The plan:
>   1. BMP code points are encoded as two bytes (0x0000-0xFFFF, minus surrogates).
>   2. Non-BMP code points are encoded as three bytes.
>      - The first two bytes are code points from the BMP's UTF-16 surrogate
>        range (11 bits of data).
>      - The next byte provides an additional 8 bits of data.
>
> Unfortunately, this doesn't quite work because it only gives me 19
> bits to encode non-BMP code points, but I need 20 bits. To solve this
> problem, I'm planning on stealing a bit of code space from the BMP
> private-use area. If I did, then:
>   - I'd get the bits needed to encode the non-BMP in 3 bytes.
>   - The stolen code points of the private-use area would now have to be
>     encoded using 3 bytes.
>
> I chose the private-use area because I assumed it would be the least
> commonly used, so having these code points require 3 bytes instead of
> 2 bytes wasn't that big a deal. Does this sound reasonable? Do
> people suggest a different section of the BMP to steal from, or a
> different encoding altogether?
>
> Thanks for reading.
> -- Kannan
>
> P.S. I actually have two encodings. One is similar to UTF-8 in that
> it's ASCII-biased. The encoding described above is intended for
> non-ASCII-biased data. The programmer selects which encoding to use
> based on what he thinks the data looks like.
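The 19-versus-20-bit gap described in the quoted plan can be checked the same way (an illustrative calculation only):

    surrogate_prefixes = 0xE000 - 0xD800   # 2,048 surrogate code points -> 11 bits
    trailing_values    = 256               # one extra byte -> 8 bits
    print(surrogate_prefixes * trailing_values)   # 524,288  (2**19) reachable values
    print(0x110000 - 0x10000)                     # 1,048,576 (2**20) non-BMP code points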

