On Friday, April 27, <anbu at peoplestring dot com> wrote:
> In addition I had a few more questions, of which the one below is the
> most significant:
>
> What if one had to send a text in multiple scripts, like in the case
> of a text and its translation in the same message?
>
> I thought maybe a new transition format or a new character encoding
> would be suitable.

As a test, I took the first sentence from Article 1 of the UDHR (an increasingly common benchmark), and used Google Translate to derive the Hindi and Tamil equivalents:

All human beings are born free and equal in dignity and rights.
सभी मनुष्य स्वतंत्र और गरिमा और अधिकारों में बराबर पैदा होते हैं.
எல்லா மனிதர்களும் இலவச மற்றும் கௌரவம் மற்றும் உரிமைகள் சம பிறக்கின்றன.

(I don't vouch for the correctness of these translations; if you know Hindi or Tamil and disagree with them, please provide your own.)

This is 84 characters from the Basic Latin block (including spaces used in all three languages), 53 from Devanagari, and 62 from Tamil.

I encoded the resulting text in SCSU, with each line terminating in CRLF and with the U+FEFF signature (0E FE FF) at the beginning. The Devanagari passage is encoded as one byte per Unicode character, preceded by a single SC4 tag byte to select window 4, which is predefined to the Devanagari block. The Tamil passage is also encoded as one byte per character, preceded by a two-byte SD3 tag to define a window into the Tamil block and select it.
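
For anyone who wants to see the window mechanics described above, here is a minimal Python sketch. It is not the encoder I used and not a complete SCSU implementation: the function name scsu_encode_simple and the define_slot parameter are made up for illustration, it stays in single-byte mode, and it rejects anything it cannot handle (control characters other than TAB, LF, and CR, and code points at or above U+3400). It assumes the default dynamic window offsets and the SQU, SCn, and SDn tag values from UTS #6.

```python
# Default start offsets of SCSU dynamic windows 0..7 (UTS #6).
# Window 4 starts at U+0900, the Devanagari block.
DEFAULT_WINDOWS = [0x0080, 0x00C0, 0x0400, 0x0600,
                   0x0900, 0x3040, 0x30A0, 0xFF00]
SC0 = 0x10  # SCn = SC0 + n: select dynamic window n (one byte)
SD0 = 0x18  # SDn = SD0 + n: define and select window n (tag + offset byte)
SIGNATURE = b"\x0e\xfe\xff"  # SQU followed by U+FEFF as a UTF-16 code unit


def scsu_encode_simple(text, signature=True, define_slot=3):
    """Encode text in SCSU single-byte mode (illustrative subset only)."""
    windows = list(DEFAULT_WINDOWS)
    active = 0  # SCSU starts with dynamic window 0 selected
    out = bytearray(SIGNATURE if signature else b"")
    for ch in text:
        cp = ord(ch)
        # ASCII, TAB, LF, and CR pass through unchanged in single-byte mode.
        if 0x20 <= cp <= 0x7F or cp in (0x09, 0x0A, 0x0D):
            out.append(cp)
            continue
        if not 0x80 <= cp < 0x3400:
            raise ValueError(f"U+{cp:04X} is outside this sketch's scope")
        if not windows[active] <= cp < windows[active] + 0x80:
            for n, start in enumerate(windows):
                if start <= cp < start + 0x80:
                    out.append(SC0 + n)  # an existing window covers it
                    active = n
                    break
            else:
                # No window covers the character: define one on the
                # enclosing 128-character boundary (offset = byte * 0x80).
                windows[define_slot] = cp & ~0x7F
                out += bytes([SD0 + define_slot, windows[define_slot] >> 7])
                active = define_slot
        # Characters in the active window map to bytes 0x80..0xFF.
        out.append(0x80 + cp - windows[active])
    return bytes(out)


if __name__ == "__main__":
    # "सभी." followed by the Tamil letter E (U+0B8E), no signature.
    print(scsu_encode_simple("\u0938\u092d\u0940.\u0b8e", signature=False).hex(" "))
```

With these assumptions, the first Devanagari letter costs one extra byte (SC4, 0x14) and the first Tamil letter costs two (SD3 plus the offset byte 0x17, i.e. 0x0B80 / 0x80). Spaces and the full stop never force a window switch, which is why mixed-script lines stay at one byte per character.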

The total size of these three lines of text in SCSU, including signature and CRLF, is 211 bytes. That's probably about as good as any non-general-purpose Unicode compression encoding can achieve, and better than most. I'm curious how well Anbu's proprietary encoding will stack up.
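
The 211-byte figure can be rebuilt from the counts given above. The breakdown below is my reconstruction from those counts, and the UTF-8 and UTF-16 totals are added only as a yardstick; they assume each encoding carries its own signature and that every Devanagari and Tamil character here takes three bytes in UTF-8 and two in UTF-16. The function name size_tally is made up for this sketch.

```python
def size_tally(latin=84, devanagari=53, tamil=62, lines=3):
    """Return byte totals for the three-line sample in several encodings."""
    crlf = 2 * lines          # each line ends in CR LF, one byte each in SCSU
    scsu = (3                 # signature 0E FE FF
            + latin + crlf    # ASCII and CR LF pass through as single bytes
            + 1 + devanagari  # SC4 tag, then one byte per character
            + 2 + tamil)      # SD3 tag and offset byte, then one byte each
    utf8 = 3 + latin + crlf + 3 * (devanagari + tamil)
    utf16 = 2 + 2 * (latin + crlf + devanagari + tamil)
    return {"SCSU": scsu, "UTF-8": utf8, "UTF-16": utf16}


if __name__ == "__main__":
    for name, size in size_tally().items():
        print(f"{name}: {size} bytes")
```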

--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell