> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On behalf of Doug Ewell
> Sent: Sunday, 23 November 2003 22:06
> To: Unicode Mailing List
> Cc: [EMAIL PROTECTED]; Jungshik Shin
> Subject: Re: Ternary search trees for Unicode dictionaries
>
> Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:
>
> > For Korean text, I have found that representation with "defective"
> > syllables was performing better through SCSU. I mean here decomposing
> > the TLV syllables of the NFC form into T and LV, and TL into T and L,
> > i.e. with partial decomposition.
> > ...
> > With this constraint, Korean no longer acts like Han, and the
> > precombined arrangement of LV syllables saves much in the SCSU
> > window; gains are also significant for other binary compressors
> > like LZW on any UTF scheme, and even with Huffman or arithmetic coding
> > of UTF-16*/UTF-32* schemes.
>
> This seems reasonable, except that you have to transform the text from
> its original representation to this special, compression-friendly
> format.
OK, this is a transform, but the result is still canonically equivalent to the source text. Transformation between canonically equivalent strings is safe (at least for Korean Hangul), and it is exactly what any normalizer performs.

> Data to be compressed will not come pre-packaged in this
> partially decomposed form, but will likely be either fully composed
> syllables or fully decomposed jamos. So you really have to perform two
> layers of transformation, one to prepare the data for compression and
> another to actually compress it, and of course you must do the same
> thing in reverse to decompress the data.
>
> This adds complexity, but is sometimes worth the effort.

No more complex than implementing the standard normalization. In any case, most normalizers handle Hangul syllables algorithmically, so it only takes a couple of modified lines to add a check for possible LV/LVT composition in the NFC normalizer, or to decompose LVT into LV plus T (but leave LV precomposed) in the NFD normalizer; see the short sketch further below.

> The Burrows-Wheeler block-sorting approach, for example, achieves very
> good results by adding a preprocessing step before "conventional"
> Huffman or arithmetic compression.

I don't know its details, but as I understand it, the block-sorting step reorders the codes of a block so that codes occurring in similar contexts are grouped together, which roughly flattens their local distribution statistics. The reordered codes are then remapped to small values by recency, from where Huffman/arithmetic encoding of these partially flattened codes creates the variable-length codes with a bit more accuracy than if the intermediate codes were left unsorted; the interesting point is that the result contains long runs of frequent small codes (mostly zero bits), so it can be recompressed by a second Huffman/arithmetic pass or by an RLE pass.

I have also seen compression based on ADPCM coding of audio PCM data (i.e. compressing differences between successive codes), and some other techniques of this kind are accessible in SCSU.

It would be interesting to know whether good SCSU compressors have been developed, i.e. how they compare when they choose between the available coding options: switching to Unicode mode, temporarily quoting a single Unicode character, keeping the most frequent codes reachable from the current 8-bit window, switching between dynamic windows, redefining windows, resetting to the default windows... I'm sure that many compressors have been attempted, but not tuned far enough to reach the best result, so it is still too soon to compare the achievable performance of BOCU-1 and SCSU.

After all, the "deflate" compression method in zlib is public, and all ZIP tools can inflate deflated files, but they are not all equal in terms of effective compression ratio: the fine tuning of a compressor between processing speed and result size requires a lot of effort, and this is where algorithms are proprietary and often patented, even though the decompressor itself is free and mostly identical across products, with very few design choices beyond minor optimizations for decompression speed. So we should wait some time until those who developed the best tunings for "deflate" try their techniques on SCSU.
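In case a concrete illustration helps for that last point: here is a minimal sketch in Java (the class and method names are mine; java.util.zip.Deflater is the standard JDK class) showing that the same deflate stream format can be produced with very different speed/size trade-offs, while any decoder reads both outputs the same way.

import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

public class DeflateTuning {

    // Compress the same bytes at a given effort level and return the
    // compressed size. Every level produces a valid deflate stream that
    // any inflater can decode; only the size and the encoding time differ.
    static int deflatedSize(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            sb.append("\uD55C\uAE00 text sample, \uD55C\uAE00 text sample. ");
        }
        byte[] sample = sb.toString().getBytes(StandardCharsets.UTF_8);
        System.out.println("fastest level: " + deflatedSize(sample, Deflater.BEST_SPEED) + " bytes");
        System.out.println("best level:    " + deflatedSize(sample, Deflater.BEST_COMPRESSION) + " bytes");
    }
}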
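And coming back to the Hangul transform I describe at the top of this message, here is a minimal sketch of the partial decomposition I have in mind (the class and method names are mine): every precomposed LVT syllable is split into the precomposed LV syllable plus the trailing jamo, using the standard Hangul syllable arithmetic, while LV syllables and all other characters are left untouched. This is exactly the first step of the canonical decomposition, so the output is canonically equivalent to the input.

public class PartialHangulDecomposition {

    // Hangul syllable arithmetic constants from the Unicode Standard.
    private static final int S_BASE  = 0xAC00;  // first precomposed syllable
    private static final int T_BASE  = 0x11A7;  // one before the first trailing jamo
    private static final int T_COUNT = 28;      // 27 trailing jamo + "no trailing"
    private static final int S_COUNT = 11172;   // number of precomposed syllables

    // Split each LVT syllable into its precomposed LV syllable followed by
    // the trailing jamo; LV syllables and non-Hangul characters are copied
    // unchanged, so the result is canonically equivalent to the input.
    static String partiallyDecompose(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            int sIndex = c - S_BASE;
            if (sIndex >= 0 && sIndex < S_COUNT && sIndex % T_COUNT != 0) {
                out.append((char) (S_BASE + (sIndex - sIndex % T_COUNT))); // LV part
                out.append((char) (T_BASE + sIndex % T_COUNT));            // trailing jamo
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+D55C (LVT) becomes U+D558 (LV) + U+11AB (trailing jamo);
        // U+AC00 (LV) is left precomposed.
        System.out.println(partiallyDecompose("\uD55C\uAC00"));
    }
}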
The current ICU implementation is not too bad, but after looking at some generated sequences, it is not always fully optimal, and I can still find shorter codes manually in examples that are not hard to build, notably in documents mixing languages or scripts. (I have not looked closely at the case of Korean, but I have looked at the currently excellent results for Russian and Greek; some optimizations remain possible for Latin-based languages that use more accents and use the Latin Extended letters relatively more frequently.)

Of course, as BOCU-1 and SCSU are considered encoding schemes, they cannot themselves alter the represented Unicode strings into canonically equivalent ones. But I'm quite sure that if a process wishes to create strings that it knows will be normalized when read (or whose reader will ignore differences between canonically equivalent strings, something that all Unicode-based processes should respect), that process could take this into account in its BOCU-1 or SCSU encoder by allowing the encoder to change the normalization form as long as canonical equivalence is kept. For BOCU-1 this is quite simple, as it is a pipeline of separate tasks with few options, but in SCSU the normalization form of the input string interacts with the SCSU compressor during its run, and offers more alternatives to the compressor.

This allowed modification of the text is not destructive even in the case of security signatures (as long as these signatures are created on input data in NFC or NFD form, and not on arbitrary unnormalized forms): the signed text may be decoded and renormalized to the form expected by the signature before computing its digest.
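To illustrate that last point: as long as the signature was computed on, say, the NFC form of the text, a receiver of a canonically equivalent but differently composed stream can simply renormalize before checking the digest. A minimal sketch in Java, using the standard java.text.Normalizer and MessageDigest classes (the class and method names are otherwise mine):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.text.Normalizer;

public class NormalizedDigest {

    // Digest the NFC form of the text, so that any canonically equivalent
    // representation (for example the partially decomposed Hangul form used
    // for compression) yields the same digest after renormalization.
    static byte[] nfcDigest(String text) throws Exception {
        String nfc = Normalizer.normalize(text, Normalizer.Form.NFC);
        return MessageDigest.getInstance("SHA-256")
                .digest(nfc.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws Exception {
        String composed = "\uD55C\uAE00";                    // two LVT/LV syllables
        String partiallyDecomposed = "\uD558\u11AB\uAE00";   // LV + trailing jamo + LV
        // Prints "true": both strings normalize to the same NFC form.
        System.out.println(java.util.Arrays.equals(
                nfcDigest(composed), nfcDigest(partiallyDecomposed)));
    }
}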