I agree; the right approach is to look at some real data. And the best way
is to look not at raw byte proportions per character, but at real UTF-8 text
with equivalent translated content. A number of translated pages, with the
following sizes in bytes, are linked from:

http://www.unicode.org/unicode/standard/WhatIsUnicode.html

09,618            s-chinese.html
09,682            t-chinese.html
10,110            esperanto.html
10,279            maltese.html
10,475            icelandic.html
10,632            czech.html
10,660            welsh.html
10,808            danish.html
10,856            swedish.html
10,863            polish.html
10,864            spanish.html
10,955            interlingua.html
11,000            italian.html
11,038            lithuanian.html
11,044            portuguese.html
11,096            romanian.html
11,106            german.html
11,134            korean.html
11,281            french.html
11,462            japanese.html
13,892            persian.html
14,028            greek.html
14,632            russian.html
14,808            WhatIsUnicode.html*
15,218            hindi.html
15,853            deseret.html
16,069            georgian.html
18,185            arabic.html*

Hindi is in the same range as Greek or Russian (4% to 8% larger), and about
37% larger than German. But notice that, going by these figures, it would
appear that the Unicode Consortium is favoring Chinese over all European
languages!
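
For anyone who wants to check the ratios, here is a small Python sketch that
computes page sizes relative to German from a few of the byte counts listed
above. The dictionary and function names are my own, purely for illustration.

```python
# Byte counts copied from the list above (a subset of the pages).
SIZES = {
    "s-chinese": 9618,
    "german": 11106,
    "greek": 14028,
    "russian": 14632,
    "hindi": 15218,
}

def relative_size(page, baseline="german"):
    """Return the size of `page` as a ratio of the baseline page's size."""
    return SIZES[page] / SIZES[baseline]

# Print each page's size relative to German, two decimal places.
for name in SIZES:
    print(f"{name:10s} {relative_size(name):.2f}")
```

Any other page in the list could serve as the baseline; German is used here
only because it is the comparison made above.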

Yet, as has been pointed out, even these comparisons are not really
representative of web content as a whole, since these pages have so few
graphics. With a higher proportion of images to text, the differences in
text size are completely swamped.
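
The swamping effect is simple arithmetic. As a rough sketch, using the
figures from David's message quoted below (6k of text, 14k of markup, and
the worst-case factor of 3 for the text), the whole page goes from 20k to
32k. The 80k of images in the second call is a made-up number for
illustration only, and the function name is my own.

```python
def page_growth(text_bytes, other_bytes, text_factor):
    """Overall page growth when only the text portion grows by text_factor."""
    before = text_bytes + other_bytes
    after = text_bytes * text_factor + other_bytes
    return after / before

# 6k text and 14k markup, with the text tripled: 32k / 20k.
print(page_growth(6, 14, 3))

# Hypothetical: the same page plus 80k of images, which do not grow.
print(page_growth(6, 14 + 80, 3))
```

The more non-text bytes a page carries, the closer the overall ratio gets
to 1, whatever the expansion of the text itself.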

Mark

* The Arabic page has a lot of crufty HTML carried over from MS Word;
otherwise I would expect it to take about the same room as Persian.
* The English page (WhatIsUnicode.html) has an overstated byte count, since
it has the index on it.

—————

Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο 
πάντα — Ὁμήρου Μαργίτῃ
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]

http://www.macchiato.com

----- Original Message -----
From: "David Starner" <[EMAIL PROTECTED]>
To: "Aman Chawla" <[EMAIL PROTECTED]>
Cc: "Unicode" <[EMAIL PROTECTED]>
Sent: Sunday, January 20, 2002 22:27
Subject: Re: Devanagari


> On Mon, Jan 21, 2002 at 12:39:58AM -0500, Aman Chawla wrote:
> > > What's your point in continuing this? Most of the people on this list
> > > already know how UTF-8 can expand the size of non-English text.
> >
> > The issue was originally brought up to gather opinion from members of
> > this list as to whether UTF-8 or ISCII should be used for creating
> > Devanagari web pages. The point is not to criticise Unicode but to
> > gather opinions of informed persons (list members) and determine what
> > is the best encoding for information interchange in South-Asian
> > scripts...
>
> That's sort of like going into an Islamic shrine and asking who the one
> true god is. The answer they will give is predictable, and arguing
> about the answer will start to annoy people, especially if you don't
> seem to be listening.
>
> And you don't seem to be listening. The factor is not a factor of 3.
> UTF-16, which IE supports (I believe) and Netscape 6 supports, will give
> you a constant factor of 2. If you use UTF-8, HTML markup will
> make the factor considerably smaller, and if you have many graphics,
> their size will easily dwarf that of the text.
>
> For a comparison, yahoo.com sans graphics is 20k, 6k of text and 14k of
> HTML. A Devanagari page, therefore, should be about 32k, a factor of 1.6,
> not 3.
>
> --
> David Starner - [EMAIL PROTECTED], dvdeug/jabber.com (Jabber)
> Pointless website: http://dvdeug.dhis.org
> When the aliens come, when the deathrays hum, when the bombers bomb,
> we'll still be freakin' friends. - "Freakin' Friends"
>
>

