Re: length of text by different languages

Yung-Fong Tang Thu, 06 Mar 2003 15:31:28 -0800

Francois Yergeau wrote:

[EMAIL PROTECTED] wrote:

I remember there was a study showing that, although UTF-8 encodes each 
Japanese/Chinese character in 3 bytes, Japanese/Chinese usually uses 
FEWER characters in writing to communicate information than 
alphabet-based languages.


Can anyone point me to such research?


I don't know of anything that is exactly what you want, but I vaguely remember a paper given
at a Unicode conference long ago that compared various translations of the
charter (or some such) of the Voice of America in a couple or three
encodings.  Hmmmm, let's see....  could be this:

http://www.unicode.org/iuc/iuc9/Friday2.html#b3
Reuters Compression Scheme for Unicode (RCSU) 
Misha Wolf

Yeah, that could be it. I got a hard copy, and it looks like Fig. 2 is the one I am looking for.


No paper online, alas.  I remember that Chinese was a clear winner in terms
of # of characters.  In fact, I kind of remember that Chinese was so much
denser that it still won after RCSU (now SCSU) compression, which would mean
that a Han character contains more than twice as much info on average as a
Latin letter as used in (say) English.

This is all on pretty shaky ground, distant memories.  Perhaps Misha still
has the figures (if that's in fact the right paper).
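
For anyone who wants to check this kind of claim on their own parallel texts, a rough approach is to compare character counts with encoded byte counts. The Python sketch below is only an illustration: the helper name size_report and the two sample strings are my own assumptions, not data from the paper. Python's standard library has no SCSU codec, so UTF-16 is used here as a rough stand-in for the 2-bytes-per-Han-character cost.

```python
def size_report(text: str) -> dict:
    """Return the character count and encoded byte counts for text."""
    return {
        "chars": len(text),                      # number of Unicode code points
        "utf8": len(text.encode("utf-8")),       # 1 byte per ASCII letter, 3 per BMP Han character
        "utf16": len(text.encode("utf-16-le")),  # 2 bytes per BMP character, no byte order mark
    }


# Hypothetical sample pair, chosen only for illustration.
samples = {
    "English": "Hello, world",
    "Chinese": "你好，世界",
}

for name, text in samples.items():
    print(name, size_report(text))
```

On this toy pair the Chinese string takes more bytes in UTF-8 but far fewer characters. Two short strings prove nothing about the languages in general, so the comparison only becomes meaningful on real translations of the same document.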
