I agree, the right approach is to look at some real data. And it is best to look not at raw byte proportions per character, but at real UTF-8 text with equivalent translated content. A number of translated pages, with the following sizes in bytes, are linked from:
http://www.unicode.org/unicode/standard/WhatIsUnicode.html

09,618 s-chinese.html
09,682 t-chinese.html
10,110 esperanto.html
10,279 maltese.html
10,475 icelandic.html
10,632 czech.html
10,660 welsh.html
10,808 danish.html
10,856 swedish.html
10,863 polish.html
10,864 spanish.html
10,955 interlingua.html
11,000 italian.html
11,038 lithuanian.html
11,044 portuguese.html
11,096 romanian.html
11,106 german.html
11,134 korean.html
11,281 french.html
11,462 japanese.html
13,892 persian.html
14,808 WhatIsUnicode.html*
14,028 greek.html
14,632 russian.html
15,218 hindi.html
15,853 deseret.html
16,069 georgian.html
18,185 arabic.html*

Hindi is about the same as Greek or Russian, and about 37% more than German. But notice that when we look at the figures, it would appear that the Unicode consortium is favoring Chinese over all European languages!

Yet as has been pointed out, even the comparisons here are not really representative in terms of total web content, since they have so few graphics. With a higher proportion of images to text, the differences in the text size are completely swamped.

Mark

* The Arabic page has a lot of crufty HTML carried over from MS Word; otherwise I would expect it to take about the same room as Persian.
* The English page (WhatIsUnicode.html) has an overstated byte count, since it has the index on it.

—————

Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο πάντα — Ὁμήρου Μαργίτῃ
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]

http://www.macchiato.com

----- Original Message -----
From: "David Starner" <[EMAIL PROTECTED]>
To: "Aman Chawla" <[EMAIL PROTECTED]>
Cc: "Unicode" <[EMAIL PROTECTED]>
Sent: Sunday, January 20, 2002 22:27
Subject: Re: Devanagari

> On Mon, Jan 21, 2002 at 12:39:58AM -0500, Aman Chawla wrote:
> > > What's your point in continuing this? Most of the people on this list
> > > already know how UTF-8 can expand the size of non-English text.
> >
> > The issue was originally brought up to gather opinion from members of this
> > list as to whether UTF-8 or ISCII should be used for creating Devanagari web
> > pages. The point is not to criticise Unicode but to gather opinions of
> > informed persons (list members) and determine what is the best encoding for information
> > interchange in South-Asian scripts...
>
> That's sort of like going into an Islamic shrine and asking who the one
> true god is. The answer they will give is predictable, and arguing
> about the answer will start to annoy people, especially if you don't
> seem to be listening.
>
> And you don't seem to be listening. The factor is not a factor of 3.
> UTF-16, which IE supports (I believe) and Netscape 6 supports, will give
> you a constant factor of 2. If you use UTF-8, HTML markup will
> make the factor considerably smaller, and if you have many graphics,
> their size will easily dwarf that of the text.
>
> For a comparison, yahoo.com sans graphics is 20k, 6k of text and 14k of
> HTML. A Devanagari page, therefore, should be about 32k, a factor of 1.5,
> not 3.
>
> --
> David Starner - [EMAIL PROTECTED], dvdeug/jabber.com (Jabber)
> Pointless website: http://dvdeug.dhis.org
> When the aliens come, when the deathrays hum, when the bombers bomb,
> we'll still be freakin' friends. - "Freakin' Friends"
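
For anyone who wants to check the arithmetic in this thread, here is a rough Python sketch. The function names (utf8_bytes_per_char, page_factor) are made up for illustration. The page sizes come from the list of translated pages, and the 6k text / 14k HTML split comes from the quoted message. The 3x text expansion is an assumption: it treats every character of text as becoming a 3-byte Devanagari character, which overstates things since spaces, digits and punctuation stay at one byte.

```python
# Rough check of the size figures discussed in this thread.
# All function names here are hypothetical, chosen for illustration.

def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per code point in text."""
    return len(text.encode("utf-8")) / len(text)

def page_factor(text_kb: float, markup_kb: float, text_expansion: float) -> float:
    """Whole-page growth when only the text part expands.

    The markup is ASCII, so it keeps its size in UTF-8.
    """
    original = text_kb + markup_kb
    expanded = text_kb * text_expansion + markup_kb
    return expanded / original

# Byte counts taken from the list of translated pages.
hindi_bytes = 15218
german_bytes = 11106

# Devanagari code points take 3 bytes each in UTF-8; ASCII takes 1.
devanagari_ratio = utf8_bytes_per_char("हिन्दी")
ascii_ratio = utf8_bytes_per_char("Unicode")

# Hindi page relative to the German page.
hindi_vs_german = hindi_bytes / german_bytes

# 6k of text and 14k of HTML, with text assumed to triple (an upper bound).
yahoo_factor = page_factor(6, 14, 3)

print(devanagari_ratio, ascii_ratio)
print(round(hindi_vs_german, 2))
print(yahoo_factor)
```

With these inputs the whole-page factor works out to 32/20 = 1.6, close to the "about 1.5" quoted, and the Hindi/German ratio comes to roughly 1.37, matching the 37% figure.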

