On Sat, 6 Dec 2003, Doug Ewell wrote:

> Peter Jacobi <peter underscore jacobi at gmx dot net> wrote:
>
> > Some tests: In Mozilla 1.4.1 the characters fall apart and in IE5.5
> > the style expands to the entire orthographic syllable.
> > Unicode test page: http://www.jodelpeter.de/i18n/tamil/markup-uc.htm
> > TSCII test page: http://www.jodelpeter.de/i18n/tamil/markup-tscii.htm
>
> BTW, your "Unicode test page" is marked:
>
> <meta http-equiv="Content-Type"
> content="text/html; charset=ISO-8859-1">

Peter uses NCRs, so it doesn't matter, does it? (Even in that case, I
prefer to tag the page as 'UTF-8'.) Anyway, he should have used the
'lang' attribute to help browsers pick up fonts. In the two pages above,
simply adding 'lang="ta"' to <table ....> would suffice. In xref-uc.htm,
if he wants finer-grained control, he can just globally replace
'<span class="glyph">&#....</span>' with
'<span lang="ta" class="glyph">&#....</span>'.

> while your TSCII test page is marked "x-user-defined". I'm not sure
> what either of those declarations accomplishes.

TSCII is not recognized by most browsers (it's not registered with
IANA) [1]. 'x-user-defined' means that, to view the page, one has to
configure one's browser to use a Tamil 'custom encoded' [2] font (in
TSCII/TAM? encoding) when rendering 'x-user-defined' pages. Most browsers
have an option to set fonts for 'x-user-defined'. It's certainly better
than tagging the page as 'iso-8859-1' or 'windows-1252'.

> > After seeing this effect at its source, it's now clear why you can't
> > style individual Tamil characters in a word processor, when using
> > Unicode (whereas you can do so, in legacy encodings).
>
> This is browser behavior, not word processor behavior, and certainly not
> an inherent defect in the Unicode logical-order model. Display engines
> need to do a better job of applying style to individual reordrant
> glyphs, that's all.

You're right. Anyway, this is an interesting challenge for
layout/rendering engines. In the case of Korean Hangul (as Philippe
wrote), it's even more so because, unlike Indic scripts [3], it has
multiple canonically equivalent representations (and others that are not
canonically equivalent in the Unicode sense but are nonetheless
'equivalent' in a certain sense).

Jungshik

[1] http://bugzilla.mozilla.org/show_bug.cgi?id=186463
[2] 'Custom' (or 'hack') encoded: a Windows-1252, Symbol or MacRoman
    cmap is used to store Tamil glyphs (or glyphs for other Indic
    scripts). Needless to say, we want to leave these fonts behind and
    move on.
[3] As is well known, there are a few letters in Indic scripts for which
    there are two canonically equivalent representations.
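
To make the canonical-equivalence point concrete, here is a minimal Python sketch using the standard 'unicodedata' module. The sample strings are my own choice of illustration and are not taken from the test pages: the Hangul syllable U+D55C versus its conjoining jamo sequence, and the Tamil letter TA with vowel sign O (U+0BCA) versus the two-part sequence U+0BC6 U+0BBE. The helper name 'canonically_equivalent' is hypothetical.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Return True if a and b share the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Hangul: one precomposed syllable vs. leading/vowel/trailing conjoining jamo.
hangul_precomposed = "\uD55C"
hangul_jamo = "\u1112\u1161\u11AB"

# Tamil: TA + vowel sign O vs. TA + vowel sign E + vowel sign AA.
tamil_single = "\u0BA4\u0BCA"
tamil_split = "\u0BA4\u0BC6\u0BBE"

if __name__ == "__main__":
    for label, a, b in [
        ("Hangul", hangul_precomposed, hangul_jamo),
        ("Tamil", tamil_single, tamil_split),
    ]:
        # The code point sequences differ, yet the strings are canonically equivalent.
        print(label, len(a), len(b), a == b, canonically_equivalent(a, b))
```

A layout engine that styles "individual characters" has to decide which of these sequences it is looking at, or normalize first; the sketch only shows that the choice exists, not how any particular browser handles it.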

