> If I have some all-kana documents ..., is there an
> extension of UTF-8 that will allow me to strip off the redundant "this is
> kana" byte from most of the kana?

No.

> After the first few thousand kana, it
> might be like, "Yeah, we get it already! It's kana! It's KANA!! You can
> stop reminding us now!!"

If I decide to emulate the Buddha and fill text files with a million DEVANAGARI OM symbols in a row, each instance is still U+0950, whether represented in UTF-16 or UTF-8 (or UTF-32, for that matter). Stop thinking in terms of bytes and start thinking in terms of characters.

For that matter, say you were reading the genetic code:

ATG, Methionine; ATG, Methionine; ATG, Methionine; ATG, Methionine; ATG, Methionine; ATG, Methionine; ATG, Methionine; ATG, Methionine; ATG, Methionine; ATG, Methionine; ATG, Methionine; ATG, Methionine;...

Yeah, we get it already! It's methionine! It's METHIONINE!! You can stop reminding us now!!

A code is what it is.

> This goes too for Hebrew, Greek, etc.

What you are looking for are text compression algorithms. See UTS #6, A Standard Compression Scheme for Unicode.

--Ken
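
To make the encoding-versus-compression distinction concrete, here is a minimal Python sketch. It assumes a made-up all-hiragana sample string, and it uses zlib, a general-purpose compressor from the standard library, purely as a stand-in: it is not the SCSU scheme described in UTS #6. The sketch shows that every kana costs the same number of bytes in a given encoding form no matter how many precede it, and that the redundancy is removed by a separate compression layer while the characters themselves stay unchanged.

```python
import zlib

# Hypothetical sample: a short hiragana string repeated to mimic an all-kana document.
sample = "ひらがなのぶんしょう" * 200

# Each encoding form spends a fixed number of bytes per kana:
# 3 in UTF-8 and 2 in UTF-16, regardless of what came before.
utf8 = sample.encode("utf-8")
utf16 = sample.encode("utf-16-le")

# A compression layer sits on top of the encoding. zlib is an assumption
# chosen for illustration; SCSU (UTS #6) is a different, Unicode-specific scheme.
packed = zlib.compress(utf8)

# Decompressing and decoding gives back exactly the same characters.
restored = zlib.decompress(packed).decode("utf-8")

print("characters:      ", len(sample))
print("UTF-8 bytes:     ", len(utf8))
print("UTF-16 bytes:    ", len(utf16))
print("zlib(UTF-8) size:", len(packed))
print("round trip ok:   ", restored == sample)

# The OM example from the message: a million copies are still a million U+0950.
om = "\u0950" * 1_000_000
print("distinct code points in OM text:", {hex(ord(c)) for c in set(om)})
```

The encoded sizes grow linearly with the text, while the compressed size does not. That is the job of a compression algorithm, not of an "extension of UTF-8".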

