On Friday, April 27, <anbu at peoplestring dot com> wrote:
> In addition I had a few more questions, of which the one below is the
> most significant:
> What if one had to send a text in multiple scripts, like in the case
> of a text and its translation in the same message?
> I thought maybe a new transition format or a new character encoding
> would be suitable.
As a test, I took the first sentence from Article 1 of the UDHR (an
increasingly common benchmark), and used Google Translate to derive the
Hindi and Tamil equivalents:
All human beings are born free and equal in dignity and rights.
सभी मनुष्य स्वतंत्र और गरिमा और अधिकारों में बराबर पैदा होते हैं.
எல்லா மனிதர்களும் இலவச மற்றும் கௌரவம் மற்றும் உரிமைகள் சம பிறக்கின்றன.
(I don't vouch for the correctness of these translations; if you know
Hindi or Tamil and disagree with them, please provide your own.)
This comes to 84 characters from the Basic Latin block (including the
spaces and full stops used in all three languages, but not the line
terminators), 53 from Devanagari, and 62 from Tamil.
I encoded the resulting text in SCSU, with each line terminated by CRLF
and with the U+FEFF signature (0E FE FF) at the beginning. Basic Latin
characters, including the spaces and CRLF, pass through as one byte
each. The Devanagari passage is encoded as one byte per Unicode
character, preceded by a single SC4 tag byte (14) to select window 4,
whose default position is the Devanagari block. The Tamil passage is
also encoded as one byte per character, preceded by a two-byte SD3 tag
(1B 17) to define a window at the start of the Tamil block and select
it.
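
For anyone who wants to check the byte sequences, the steps above can
be sketched in Python roughly as follows. This is a minimal sketch, not
a conformant SCSU encoder: it stays in single-byte mode, handles only
BMP characters, and makes no attempt at optimal compression. The name
scsu_encode and the fixed REDEFINE_SLOT policy are assumptions made for
illustration, chosen so that the output mirrors the SC4 and SD3 tags
described above.

```python
SQU = 0x0E  # quote a single UTF-16 code unit (two bytes follow)
SC0 = 0x10  # SC0..SC7: select dynamic window 0..7
SD0 = 0x18  # SD0..SD7: define dynamic window 0..7 and select it

# Default start positions of the eight dynamic windows.
DEFAULT_WINDOWS = (0x0080, 0x00C0, 0x0400, 0x0600,
                   0x0900, 0x3040, 0x30A0, 0xFF00)

# Assumption: always redefine window 3 when a new window is needed.
REDEFINE_SLOT = 3


def scsu_encode(text: str) -> bytes:
    """Encode text in SCSU single-byte mode (illustrative subset only)."""
    windows = list(DEFAULT_WINDOWS)
    active = 0  # window 0 is active at the start of a stream
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        # Tab, LF, CR and 0x20..0x7F pass through unchanged.
        if cp in (0x09, 0x0A, 0x0D) or 0x20 <= cp <= 0x7F:
            out.append(cp)
            continue
        if cp < 0x20 or cp > 0xFFFF:
            raise ValueError(f"U+{cp:04X} is outside this sketch's subset")
        if not windows[active] <= cp < windows[active] + 0x80:
            # Look for an already positioned window that covers cp.
            slot = next((n for n, start in enumerate(windows)
                         if start <= cp < start + 0x80), None)
            if slot is not None:
                out.append(SC0 + slot)
            elif cp < 0x3400:
                # Define a window on a 0x80 boundary; the offset byte
                # is the window start divided by 0x80.
                slot = REDEFINE_SLOT
                windows[slot] = cp & ~0x7F
                out += bytes([SD0 + slot, cp >> 7])
            else:
                # No window available: quote the code unit directly.
                out += bytes([SQU, cp >> 8, cp & 0xFF])
                continue
            active = slot
        # One byte per character inside the active window.
        out.append(0x80 + cp - windows[active])
    return bytes(out)
```

With this sketch, scsu_encode("\ufeff\u0905").hex(" ") gives
"0e fe ff 14 85": the signature, the SC4 tag, and one byte for
DEVANAGARI LETTER A.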
The total size of these three lines of text in SCSU, including signature
and CRLF, is 211 bytes. That's probably about as good as any
non-general-purpose Unicode compression encoding can achieve, and better
than most. I'm curious how well Anbu's proprietary encoding will stack
up.
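
The 211-byte figure can be checked with a little arithmetic, sketched
below in Python. The character counts are the ones given above; the
function names are hypothetical, and the UTF-8 and UTF-16 figures are
not from the original measurement but are added only for scale,
assuming the same signature and CRLF line endings.

```python
# Character counts as stated in the message.
LATIN, DEVANAGARI, TAMIL = 84, 53, 62
LINES = 3  # each line ends in CRLF (two characters)


def scsu_size() -> int:
    signature = 3      # 0E FE FF
    crlf = 2 * LINES   # CR and LF are one byte each
    tags = 1 + 2       # SC4, then SD3 plus its offset byte
    return signature + LATIN + DEVANAGARI + TAMIL + crlf + tags


def utf8_size() -> int:
    signature = 3      # EF BB BF
    # ASCII is one byte; Devanagari and Tamil are three bytes each.
    return signature + LATIN + 2 * LINES + 3 * (DEVANAGARI + TAMIL)


def utf16_size() -> int:
    # Every character here is in the BMP: one 16-bit code unit each,
    # including the U+FEFF signature.
    return 2 * (1 + LATIN + DEVANAGARI + TAMIL + 2 * LINES)


print(scsu_size(), utf8_size(), utf16_size())  # 211 438 412
```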
--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell