Re: Transcoding Tamil in the presence of markup

Christopher John Fynn Sat, 06 Dec 2003 15:50:07 -0800

In Unicode, U+0BBE, U+0BC6 and U+0BCA are all dependent vowel signs.

IE is probably treating a base character and any dependent vowels as a single
unit. Since in some fonts a base character + combining vowel mark might be
displayed by a single ligature glyph, it makes sense to apply the formatting of
a base character to any dependent combining characters as well.
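
As a rough illustration of that kind of grouping (this is only a sketch, not
how IE actually does it; the function name clusters is made up for the
example), each base character can be collected together with the combining
marks that follow it, using the Unicode general category:

```python
import unicodedata

def clusters(text):
    """Group each base character with the combining marks that follow it."""
    out = []
    for ch in text:
        # General categories Mn, Mc and Me are combining marks;
        # the Tamil dependent vowel signs used here are Mc.
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch
        else:
            out.append(ch)
    return out

# lA, le, lo from the original message, in Unicode logical order
units = clusters("\u0BB2\u0BBE\u0BB2\u0BC6\u0BB2\u0BCA")
```

Formatting applied to the first character of each unit would then naturally
extend to the whole unit. Real shaping engines use more elaborate syllable
rules than this.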


In Mozilla you may be completely breaking the font lookups by separately
formatting the different parts of a conjunct.

In legacy glyph-based Tamil encodings there was a simple one-to-one
correspondence between characters and glyphs, so it was straightforward to
apply different formatting to different characters.
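
Going from those glyph codes to Unicode, by contrast, needs reordering and
merging. A minimal sketch, assuming only the three TSCII byte values quoted in
Peter's message below (\xC4, \xA1, \xA7); the function name tscii_to_unicode
and the table layout are made up for illustration, and a real converter needs
the full TSCII table and must cope with markup between the bytes:

```python
# Only the bytes quoted in this thread are covered; this is not a full table.
CONSONANTS = {0xC4: "\u0BB2"}  # TSCII \xC4 -> TAMIL LETTER LA
E_SIGN = 0xA7    # glyph for vowel sign e, stored BEFORE the consonant
AA_SIGN = 0xA1   # glyph for vowel sign aa, stored AFTER the consonant

def tscii_to_unicode(data: bytes) -> str:
    """Convert visual-order TSCII bytes to logical-order Unicode."""
    out = []
    i = 0
    while i < len(data):
        # An optional prefixed e glyph comes first in visual order.
        prefix_e = data[i] == E_SIGN
        if prefix_e:
            i += 1
        if i >= len(data) or data[i] not in CONSONANTS:
            raise ValueError("unsupported byte sequence in this sketch")
        consonant = CONSONANTS[data[i]]
        i += 1
        # An optional aa glyph follows the consonant.
        has_aa = i < len(data) and data[i] == AA_SIGN
        if has_aa:
            i += 1
        if prefix_e and has_aa:
            sign = "\u0BCA"   # e glyph + aa glyph together form vowel sign o
        elif prefix_e:
            sign = "\u0BC6"
        elif has_aa:
            sign = "\u0BBE"
        else:
            sign = ""
        # Unicode stores the vowel sign after the consonant, whatever its
        # visual position.
        out.append(consonant + sign)
    return "".join(out)
```

With the "lo" example, three separately styleable TSCII bytes collapse into
two Unicode characters, so there is no longer one character per glyph to hang
the formatting on.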

--
Christopher J. Fynn



----- Original Message ----- 
From: "Peter Jacobi" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Saturday, December 06, 2003 6:39 PM
Subject: Transcoding Tamil in the presence of markup


> Dear All,
>
> I am attempting to transcode Tamil text (in legacy 8-bit encodings, which
> are in visual glyph order, being heirs of the Tamil typewriter) into Unicode
> (which uses the 'logical' order invented by ISCII):
> http://www.jodelpeter.de/i18n/tamil/xref-uc.htm
>
> When I thought my converter was ready, I had a severe collision
> with reality when I tried it on some webpages.
>
> The problem: in the legacy encoding you can style individual characters,
> which not only breaks my simple converter, but which may have no
> good equivalent in Unicode anyway. See this example:
> (all legacy encoded Tamil is shown using C-style escape, Unicode Tamil as
> NCR)
>
> Converting unstyled text
> from TSCII
>  lA \xC4\xA1
>  le \xA7\xC4
>  lo \xA7\xC4\xA1
> to Unicode
>  lA &#x0BB2;&#x0BBE;
>  le &#x0BB2;&#x0BC6;
>  lo &#x0BB2;&#x0BCA;
>
> Now the consonant l should get a distinct color:
> In TSCII:
>  lA <span style='color:#00f'>\xC4</span>\xA1
>  le \xA7<span style='color:#00f'>\xC4</span>
>  lo \xA7<span style='color:#00f'>\xC4</span>\xA1
>
> In Unicode:
>  lA <span style='color:#00f'>&#x0BB2;</span>&#x0BBE;
>  le <span style='color:#00f'>&#x0BB2;</span>&#x0BC6;
>  lo <span style='color:#00f'>&#x0BB2;</span>&#x0BCA;
>
> It is easy to see that a simple n:m mapping cannot make this conversion.
> It is not that easy to judge whether this is the desired conversion at all,
> or what the receiving software should do with it.
>
> Some tests: In Mozilla 1.4.1 the characters fall apart and in IE5.5 the
> style expands to the entire orthographic syllable.
> Unicode test page: http://www.jodelpeter.de/i18n/tamil/markup-uc.htm
> TSCII test page: http://www.jodelpeter.de/i18n/tamil/markup-tscii.htm
>
> After seeing this effect at its source, it's now clear why you can't style
> individual Tamil characters in a word processor when using Unicode (whereas
> you can do so in legacy encodings).
>
> It's hard to promote Unicode when things that have worked in the past
> stop working.
>
> Any insights?
>
> Regards,
> Peter Jacobi
