Dear All,

I am attempting to transcode Tamil text from legacy 8-bit encodings (which are in visual glyph order, being heirs of the Tamil typewriter) into Unicode (which uses the 'logical' order invented by ISCII): http://www.jodelpeter.de/i18n/tamil/xref-uc.htm
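The reordering involved can be sketched roughly as follows. This is a minimal Python sketch, not the actual converter: the function name visual_to_logical and the table names are hypothetical, and the three byte values are only those used in the examples in this message, so a real converter would need the complete TSCII table. The idea is that a vowel sign drawn to the left of its consonant comes first in the byte stream but must follow the consonant in Unicode, and that a left part plus a right part around one consonant fuse into a single two-part vowel sign.

```python
# Byte values below are taken only from the examples in this message;
# they are NOT a complete TSCII table.
CONSONANTS = {0xC4: "\u0BB2"}      # TAMIL LETTER LA
PREFIX_SIGNS = {0xA7: "\u0BC6"}    # TAMIL VOWEL SIGN E, drawn left of the consonant
SUFFIX_SIGNS = {0xA1: "\u0BBE"}    # TAMIL VOWEL SIGN AA, drawn right of the consonant

# (left part, right part) pairs that fuse into one two-part vowel sign.
TWO_PART = {("\u0BC6", "\u0BBE"): "\u0BCA"}  # E + AA -> TAMIL VOWEL SIGN O


def visual_to_logical(data: bytes) -> str:
    """Reorder visual-order bytes into logical-order Unicode (sketch)."""
    out = []
    pending = None  # a prefix vowel sign still waiting for its consonant
    for b in data:
        if b in PREFIX_SIGNS:
            if pending is not None:
                out.append(pending)      # stray sign with no consonant
            pending = PREFIX_SIGNS[b]
        elif b in CONSONANTS:
            out.append(CONSONANTS[b])
            if pending is not None:
                out.append(pending)      # move the sign behind the consonant
                pending = None
        elif b in SUFFIX_SIGNS:
            sign = SUFFIX_SIGNS[b]
            if out and (out[-1], sign) in TWO_PART:
                out[-1] = TWO_PART[(out[-1], sign)]  # fuse the two parts
            else:
                out.append(sign)
        else:
            if pending is not None:
                out.append(pending)
                pending = None
            # Pass ASCII through; anything unmapped becomes U+FFFD.
            out.append(chr(b) if b < 0x80 else "\uFFFD")
    if pending is not None:
        out.append(pending)
    return "".join(out)


for raw in (b"\xC4\xA1", b"\xA7\xC4", b"\xA7\xC4\xA1"):
    print(visual_to_logical(raw))
```

Note that the sketch works on a bare byte stream only. It has no notion of markup sitting between the bytes of one syllable.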
Just when I thought my converter was ready, I had a severe collision with reality when I tried it on some web pages. The problem: in the legacy encoding you can style individual characters, which not only breaks my simple converter, but may have no good equivalent in Unicode anyway.

See this example (all legacy-encoded Tamil is shown using C-style escapes, Unicode Tamil as NCR).

Converting unstyled text from TSCII

  lA \xC4\xA1
  le \xA7\xC4
  lo \xA7\xC4\xA1

to Unicode

  lA லா
  le லெ
  lo லொ

Now the consonant l should get a distinct color.

In TSCII:

  lA <span style='color:#00f'>\xC4</span>\xA1
  le \xA7<span style='color:#00f'>\xC4</span>
  lo \xA7<span style='color:#00f'>\xC4</span>\xA1

In Unicode:

  lA <span style='color:#00f'>ல</span>ா
  le <span style='color:#00f'>ல</span>ெ
  lo <span style='color:#00f'>ல</span>ொ

It is easy to see that a simple n:m mapping cannot make this conversion, because the vowel sign has to move across the span boundary. It is less easy to judge whether this is the desired conversion at all, or what the receiving software should do with it.

Some tests: in Mozilla 1.4.1 the characters fall apart, and in IE5.5 the style expands to the entire orthographic syllable.

Unicode test page: http://www.jodelpeter.de/i18n/tamil/markup-uc.htm
TSCII test page: http://www.jodelpeter.de/i18n/tamil/markup-tscii.htm

After seeing this effect at its source, it's now clear why you can't style individual Tamil characters in a word processor when using Unicode (whereas you can do so in legacy encodings).

It's hard to promote Unicode when things that have worked in the past stop working.

Any insights?

Regards,
Peter Jacobi

