I remember there was a study showing that although UTF-8 encodes each Japanese/Chinese character in 3 bytes, Japanese/Chinese writing usually uses FEWER characters to communicate the same information than alphabet-based languages.

Can anyone point me to such research? Martin, do you have a paper about that?

I would like to find out the average ratio between
English,
German,
French,
Japanese,
Chinese,
Korean

in terms of the number of characters, and in terms of the bytes needed to encode in UTF-8.
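
To make the two metrics concrete, here is a quick Python illustration (the sample words are just hypothetical examples, not data from any study):

    # Character count vs. UTF-8 byte count for the same word in two scripts.
    english = "God"    # 3 characters, all ASCII
    japanese = "\u795e"  # "神", 1 CJK ideograph

    print(len(english), len(english.encode("utf-8")))    # 3 characters, 3 bytes
    print(len(japanese), len(japanese.encode("utf-8")))  # 1 character, 3 bytes

The two counts diverge only for non-ASCII text, which is exactly the trade-off in question.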

If such research has not been done, one way to get a result would be to take translated Bibles for these languages from the SWORD project, strip out the XML tags to leave the pure text, and measure the sizes; a sketch of this follows below. Since all the Bible translations communicate the same information and the volume is large enough, that could be a good way to find the result. Of course, the markup needs to be taken out to reduce the noise.
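
A rough Python sketch of that measurement, under some assumptions: the file names are hypothetical, and the tag-stripping regex is a crude placeholder. SWORD modules actually carry OSIS/ThML markup, so a real run should export plain text with a proper parser or the SWORD tools rather than a regex:

    import re

    def measure(path):
        """Strip XML-style tags and report character and UTF-8 byte counts."""
        with open(path, encoding="utf-8") as f:
            text = f.read()
        text = re.sub(r"<[^>]+>", "", text)        # crude tag stripping
        text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace noise
        chars = len(text)
        bytes_utf8 = len(text.encode("utf-8"))
        return chars, bytes_utf8

    # Hypothetical file names: one exported Bible text per language.
    for lang, path in [("English", "kjv.txt"), ("German", "luther.txt"),
                       ("Japanese", "jpn.txt"), ("Chinese", "chi.txt")]:
        chars, nbytes = measure(path)
        print(f"{lang}: {chars} chars, {nbytes} UTF-8 bytes, "
              f"{nbytes / chars:.2f} bytes/char")

Dividing each language's totals by the English totals would then give the average ratios asked about above.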
