Transcoding Tamil in the presence of markup

Peter Jacobi Sat, 06 Dec 2003 12:31:26 -0800

Dear All,

I am attempting to transcode Tamil text (in legacy 8-bit encodings, which
are in visual glyph order, being heirs of the Tamil typewriter) into Unicode
(which uses the 'logical' order invented by ISCII):
http://www.jodelpeter.de/i18n/tamil/xref-uc.htm


When I thought my converter was ready, I had a severe collision
with reality when I tried it on some web pages.

The problem: in the legacy encoding you can style individual characters,
which not only breaks my simple converter, but may also have no
good equivalent in Unicode anyway. See this example.
(All legacy-encoded Tamil is shown using C-style escapes, Unicode Tamil as
NCRs.)

Converting unstyled text 
from TSCII
 lA \xC4\xA1
 le \xA7\xC4
 lo \xA7\xC4\xA1
to Unicode
 lA &#x0BB2;&#x0BBE;
 le &#x0BB2;&#x0BC6;
 lo &#x0BB2;&#x0BCA;
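
A minimal Python sketch of the reordering these three examples require.
It is an illustration only, not the converter linked above: the tables
hold just the byte values that appear in the examples (0xC4, 0xA1, 0xA7),
and the names CONSONANTS, PRE_VOWELS, POST_VOWELS, TWO_PART and
tscii_to_unicode are hypothetical. A real converter would need the full
TSCII table.

```python
# Byte values below are taken from the examples in this post only.
CONSONANTS = {0xC4: "\u0BB2"}    # l
POST_VOWELS = {0xA1: "\u0BBE"}   # vowel sign written after the consonant
PRE_VOWELS = {0xA7: "\u0BC6"}    # vowel sign written before the consonant

# A prefix sign plus a suffix sign around one consonant form one
# two-part vowel sign in Unicode.
TWO_PART = {("\u0BC6", "\u0BBE"): "\u0BCA"}


def tscii_to_unicode(data: bytes) -> str:
    """Convert visual-order bytes to logical-order Unicode (sketch)."""
    out = []
    pending = None  # prefix vowel sign waiting for its consonant
    for b in data:
        if b in PRE_VOWELS:
            pending = PRE_VOWELS[b]
        elif b in CONSONANTS:
            out.append(CONSONANTS[b])
            if pending is not None:
                # Logical order: the consonant first, then the vowel sign.
                out.append(pending)
                pending = None
        elif b in POST_VOWELS:
            sign = POST_VOWELS[b]
            if out and (out[-1], sign) in TWO_PART:
                # Merge the prefix and suffix parts into one vowel sign.
                out[-1] = TWO_PART[(out[-1], sign)]
            else:
                out.append(sign)
        else:
            raise ValueError(f"byte 0x{b:02X} is not in the sketch tables")
    if pending is not None:
        raise ValueError("prefix vowel sign without a following consonant")
    return "".join(out)
```

Note that the converter has to look both backwards and forwards from the
consonant, which is what makes markup inside the syllable awkward.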

Now the consonant l should get a distinct color:
In TSCII:
 lA <span style='color:#00f'>\xC4</span>\xA1
 le \xA7<span style='color:#00f'>\xC4</span>
 lo \xA7<span style='color:#00f'>\xC4</span>\xA1

In Unicode:
 lA <span style='color:#00f'>&#x0BB2;</span>&#x0BBE;
 le <span style='color:#00f'>&#x0BB2;</span>&#x0BC6;
 lo <span style='color:#00f'>&#x0BB2;</span>&#x0BCA;

It is easy to see that a simple n:m mapping cannot perform this conversion,
because the markup sits between bytes that must be reordered or merged.
It is not as easy to judge whether this is the desired conversion at all,
or what the receiving software should do with it.
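
One cautious option, sketched here in Python under stated assumptions, is
to detect markup that cuts through a syllable and flag it instead of
converting it blindly. The function name markup_splits_syllable is
hypothetical, the input is assumed to be the page decoded as Latin-1 (so
each TSCII byte becomes one character), and only the two vowel-sign bytes
from the examples above are known to it.

```python
import re

# Vowel-sign bytes from the examples in this post, as Latin-1 characters.
PRE_VOWELS = {"\xA7"}    # written before its consonant
POST_VOWELS = {"\xA1"}   # written after its consonant

TAG = re.compile(r"(<[^>]*>)")


def markup_splits_syllable(html: str) -> bool:
    """Return True if a tag separates a vowel sign from its consonant."""
    # With a capturing group, re.split puts text runs at even indices
    # and the tags themselves at odd indices.
    parts = TAG.split(html)
    for i in range(0, len(parts), 2):
        run = parts[i]
        if not run:
            continue
        tag_before = i > 0
        tag_after = i + 1 < len(parts)
        # A run that starts with a suffix sign right after a tag has
        # been cut off from the consonant it belongs to.
        if tag_before and run[0] in POST_VOWELS:
            return True
        # A run that ends with a prefix sign right before a tag has
        # been cut off from the consonant that follows.
        if tag_after and run[-1] in PRE_VOWELS:
            return True
    return False
```

All three styled TSCII lines above are flagged by this check, while the
unstyled forms, or a span wrapped around a whole syllable, are not.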

Some tests: In Mozilla 1.4.1 the characters fall apart and in IE5.5 the
style expands to the entire orthographic syllable.
Unicode test page: http://www.jodelpeter.de/i18n/tamil/markup-uc.htm
TSCII test page: http://www.jodelpeter.de/i18n/tamil/markup-tscii.htm

After seeing this effect at its source, it's now clear why you can't style
individual Tamil characters in a word processor when using Unicode (whereas
you can do so in legacy encodings).

It's hard to promote Unicode when things that have worked in the past
stop working.

Any insights?

Regards,
Peter Jacobi




