> If I have some all-kana documents ..., is there an 
> extension of UTF-8 that will allow me to strip off the redundant "this is 
> kana" byte from most of the kana? 

No.

> After the first few thousand kana, it 
> might be like, "Yeah, we get it already! It's kana! It's KANA!! You can 
> stop reminding us now!!"

If I decide to emulate the Buddha and fill text files with a million
DEVANAGARI OM symbols in a row, each instance is still U+0950, whether
represented in UTF-16 or UTF-8 (or UTF-32, for that matter).

Stop thinking in terms of bytes and start thinking in terms of
characters.
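
To make that concrete, here is a minimal Python sketch (an
illustration only; the variable names are arbitrary). It shows that
the byte count changes with the encoding form, while the characters
themselves do not:

```python
om = "\u0950"            # DEVANAGARI OM, one code point
text = om * 5            # five characters, however they are stored

for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
    data = text.encode(encoding)
    # Bytes per character differ by encoding form: 3, 2, and 4.
    print(encoding, len(data), om.encode(encoding).hex(" "))
    # Decoding always gives back the same five U+0950 characters.
    assert data.decode(encoding) == text

code_points = {f"U+{ord(ch):04X}" for ch in text}
print(len(text), code_points)   # 5 {'U+0950'}
```

The lead bytes in UTF-8 are part of how each character is encoded,
not a removable annotation.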

For that matter, say you were reading the genetic code:
ATG, Methionine; ATG, Methionine; ATG, Methionine; ATG, Methionine;
ATG, Methionine; ATG, Methionine; ATG, Methionine; ATG, Methionine;
ATG, Methionine; ATG, Methionine; ATG, Methionine; ATG, Methionine;...

Yeah, we get it already! It's methionine! It's METHIONINE!! You can
stop reminding us now!!

A code is what it is.

> 
> This goes too for Hebrew, Greek, etc.

What you are looking for is a text compression algorithm. See UTS #6,
A Standard Compression Scheme for Unicode (SCSU).
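
As a rough sketch of compression layered on top of UTF-8: Python's
standard library has no SCSU codec, so the example below swaps in
zlib, a general-purpose compressor, purely as a stand-in. The sample
string is made up, and it is far more repetitive than real text, so
real documents will shrink much less.

```python
import zlib

# Hypothetical all-kana sample: 8 kana repeated 1000 times.
kana = "ひらがなカタカナ" * 1000

raw = kana.encode("utf-8")      # 3 bytes per kana in UTF-8
packed = zlib.compress(raw)     # zlib here, NOT SCSU

# Compression is lossless: the same characters come back out.
assert zlib.decompress(packed).decode("utf-8") == kana

print(len(kana), "characters")
print(len(raw), "bytes as UTF-8")
print(len(packed), "bytes after zlib")
```

The encoding stays plain UTF-8; the redundancy is removed by a
separate compression step and restored on decompression.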

--Ken

