Re: Transcoding Tamil in the presence of markup

Jungshik Shin Sun, 07 Dec 2003 07:09:41 -0800

On Sat, 6 Dec 2003, Doug Ewell wrote:

> Peter Jacobi <peter underscore jacobi at gmx dot net> wrote:
>
> > Some tests: In Mozilla 1.4.1 the characters fall apart and in IE5.5
> > the style expands to the entire orthographic syllable.
> > Unicode test page: http://www.jodelpeter.de/i18n/tamil/markup-uc.htm
> > TSCII test page: http://www.jodelpeter.de/i18n/tamil/markup-tscii.htm
>
> BTW, your "Unicode test page" is marked:
>
> <meta http-equiv="Content-Type"
>  content="text/html; charset=ISO-8859-1">


  Peter uses NCRs, so the declared charset doesn't matter here
(although even in that case I'd prefer to tag the page as 'UTF-8').
Anyway, he should have used the 'lang' attribute to help browsers pick
fonts. In the two pages above, simply adding 'lang="ta"' to
<table ....> would suffice. In xref-uc.htm, if he wants fine-grained
control, he can just globally replace
'<span class="glyph">&#....</span>' with
'<span lang="ta" class="glyph">&#....</span>'.
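
  To illustrate why the charset label is harmless for a page written
entirely in NCRs, here is a rough Python sketch. It is not what Peter's
pages actually do; the helper names to_ncr and add_lang are made up for
this example, and the sample string (KA + VOWEL SIGN O) is my own choice.

```python
import html


def to_ncr(text: str) -> str:
    """Replace every non-ASCII character with a decimal NCR (hypothetical helper)."""
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)


def add_lang(markup: str, lang: str = "ta") -> str:
    """Globally add a lang attribute to <span class="glyph"> (hypothetical helper)."""
    return markup.replace('<span class="glyph">',
                          f'<span lang="{lang}" class="glyph">')


tamil = "\u0b95\u0bca"          # TAMIL LETTER KA + TAMIL VOWEL SIGN O
ncr = to_ncr(tamil)             # pure ASCII: "&#2965;&#3018;"

# The NCR form is ASCII only, so it serializes to the same bytes
# whether the page is labelled ISO-8859-1 or UTF-8.
same_bytes = ncr.encode("iso-8859-1") == ncr.encode("utf-8")

# A browser (here: html.unescape) recovers the original characters.
round_trip = html.unescape(ncr)

tagged = add_lang(f'<span class="glyph">{ncr}</span>')
```

  The lang attribute does not change the characters at all; it only
gives the browser a hint for font selection.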


> while your TSCII test page is marked "x-user-defined".  I'm not sure
> what either of those declarations accomplishes.

   TSCII is not recognized by most browsers (it's not registered with
IANA) [1]. 'x-user-defined' means that to view the page one has to
configure one's browser to use a Tamil 'custom encoded' [2] font
(in TSCII/TAM? encoding) when rendering 'x-user-defined' pages.
Most browsers have an option to set fonts for 'x-user-defined'. It's
certainly better than tagging it as 'iso-8859-1' or 'windows-1252'.

> > After seeing this effect at its source, it's now clear why you can't
> > style individual Tamil characters in a word processor, when using
> > Unicode (whereas you can do so, in legacy encodings).
>
> This is browser behavior, not word processor behavior, and certainly not
> an inherent defect in the Unicode logical-order model.  Display engines
> need to do a better job of applying style to individual reordrant
> glyphs, that's all.

  You're right. Anyway, this is an interesting challenge for
layout/rendering engines. In the case of Korean Hangul (as Philippe
wrote), it's even more so because, unlike Indic scripts [3], it has
multiple canonically equivalent representations (and others that are
not canonically equivalent in the Unicode sense but nonetheless
'equivalent' in a certain sense).
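
  For the curious, canonical equivalence can be checked with Python's
standard unicodedata module. This is only a sketch: the function name
canonically_equivalent is invented here, and the doubled-jamo pair is my
assumption about one kind of "equivalent in a certain sense" case, not
necessarily the one meant above.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Hangul: precomposed syllable GA vs. conjoining jamo KIYEOK + A.
hangul_syllable = "\uac00"
hangul_jamo = "\u1100\u1161"

# Tamil: VOWEL SIGN O vs. VOWEL SIGN E + VOWEL SIGN AA (cf. footnote [3]).
tamil_o = "\u0bca"
tamil_o_parts = "\u0bc6\u0bbe"

# Assumed example of "equivalent" but NOT canonically equivalent:
# SSANGKIYEOK as one jamo vs. two KIYEOK jamo in a row.
ssang_single = "\u1101"
ssang_double = "\u1100\u1100"

results = {
    "hangul": canonically_equivalent(hangul_syllable, hangul_jamo),
    "tamil": canonically_equivalent(tamil_o, tamil_o_parts),
    "ssangkiyeok": canonically_equivalent(ssang_single, ssang_double),
}
```

  A renderer that wants to style part of a syllable has to cope with
every one of these spellings, which is why normalizing first is
attractive.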

   Jungshik

[1]  http://bugzilla.mozilla.org/show_bug.cgi?id=186463

[2] 'Custom' (or 'hack') encoded: a Windows-1252, Symbol or MacRoman cmap
    is used to store Tamil glyphs (or other glyphs for other Indic scripts).
    Needless to say, we want to leave these fonts behind and move on.

[3] As is well known, there are a few letters for which there are two
   canonically equivalent representations in Indic scripts.
