Markus Scherer <markus dot scherer at jtcsv dot com> wrote:

>> BOCU-1 might solve this problem, but multiplying and dividing by 243
>> doesn't sound faster than UTF-8 bit-shifting.  (I'm still amazed by
>> the claim in UTN #6 that converting Hindi text between UTF-16 and
>> BOCU-1 took only 45% as long as converting it between UTF-16 and
>> UTF-8.)
>
> "claim"?  That hurts...
>
> I did measure these things, and the numbers in the table are all from
> my measurements.  I also included the type of machine I used, etc.
> (http://www.unicode.org/notes/tn6/#Performance)

Certainly I would never accuse Markus of falsifying these statistics.
The word "claim" was not meant in the sense of "unsubstantiated claim."
It did startle me that converting to BOCU-1 and SCSU could be TWICE as
fast as converting to UTF-8, unless the I/O cost of writing two or
three bytes is *much* higher than that of writing only one.

> The reason why BOCU-1 (and SCSU) is often faster than UTF-8 is that
> BOCU-1 goes into single-byte mode for small scripts like Hindi.
> Single-byte mode only performs a subtraction, no div/mod or even bit-
> shifting, and writes/reads only one byte per character.  It is also
> optimized in ICU with a tight inner loop.

I'll have to see how my encoder and decoder perform when I finish them.
They're currently written for simplicity, not speed.

> UTF-8 is useful because it's simple, and supported just about
> everywhere - but it's otherwise hardly optimal for anything.

As John said, it's all about ASCII transparency, together with no false
positives for "ASCII bytes" in non-Basic Latin characters.

> If you want high-speed, compact encoding, use SCSU.  If you want good
> speed, compact encoding, and binary order and/or MIME compatibility,
> use BOCU-1.  Make sure that both sides of the wire know what's going
> across.

Always.  And especially in the case of BOCU-1, since it's not
ASCII-transparent -- although heuristic detection of BOCU-1 should be
straightforward and very reliable.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/
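
To illustrate the ASCII-transparency point above, here is a minimal
Python sketch.  It is not taken from ICU or from UTN #6; the function
name and the sample string are made up for the illustration.  It checks
that ASCII characters keep their single-byte values in UTF-8 and that
every byte of a non-ASCII character is 0x80 or above, so no "false
ASCII" bytes appear.  It also shows that a Devanagari letter takes
three bytes in UTF-8.

```python
def is_ascii_transparent(text: str) -> bool:
    """Check the UTF-8 form of text for ASCII transparency.

    Hypothetical helper for illustration only.
    """
    for ch in text:
        encoded = ch.encode("utf-8")
        if ord(ch) < 0x80:
            # An ASCII character must encode as the same single byte.
            if encoded != bytes([ord(ch)]):
                return False
        else:
            # No byte of a non-ASCII character may look like ASCII.
            if any(b < 0x80 for b in encoded):
                return False
    return True


# Assumed sample: a Hindi word (Devanagari) followed by ASCII text.
sample = "\u0928\u092e\u0938\u094d\u0924\u0947, world"

# Bytes used by one Devanagari letter (U+0928) in UTF-8.
na_utf8 = "\u0928".encode("utf-8")

print(is_ascii_transparent(sample))
print(len(na_utf8))
```

This says nothing about speed; it only shows the byte-level property
that a single-byte-mode encoding such as BOCU-1 gives up in exchange
for writing one byte per character.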

