Re: Fwd: Re: Unicode, SMS and year 2012

Doug Ewell Sat, 28 Apr 2012 20:56:02 -0700

On Friday, April 27, <anbu at peoplestring dot com> wrote:

In addition I had a few more questions, of which the one below is the
most significant:


What if one had to send a text in multiple scripts, like in the case
of a text and its translation in the same message?

I thought maybe a new transition format or a new character encoding
would be suitable.

As a test, I took the first sentence from Article 1 of the UDHR (an increasingly common benchmark), and used Google Translate to derive the Hindi and Tamil equivalents:


All human beings are born free and equal in dignity and rights.
सभी मनुष्य स्वतंत्र और गरिमा और अधिकारों में बराबर पैदा होते हैं.
எல்லா மனிதர்களும் இலவச மற்றும் கௌரவம் மற்றும் உரிமைகள் சம பிறக்கின்றன.

(I don't vouch for the correctness of these translations; if you know Hindi or Tamil and disagree with them, please provide your own.)

This is 84 characters from the Basic Latin block (including spaces used in all three languages), 53 from Devanagari, and 62 from Tamil.
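
A tally like this can be reproduced with a few lines of Python. This is only a rough sketch: the function name count_by_block is made up for illustration, and the block ranges are the ones given in the Unicode code charts. The result depends on the exact code point sequence (for example, how a vowel sign happens to be spelled) and on whether line terminators are counted, so a pasted copy of the text may not yield exactly the figures above.

```python
from collections import Counter

# Block ranges from the Unicode code charts (inclusive).
BLOCKS = {
    "Basic Latin": (0x0000, 0x007F),
    "Devanagari": (0x0900, 0x097F),
    "Tamil": (0x0B80, 0x0BFF),
}

def count_by_block(text):
    """Count code points per block; count_by_block is a hypothetical helper."""
    counts = Counter()
    for ch in text:
        cp = ord(ch)
        for name, (lo, hi) in BLOCKS.items():
            if lo <= cp <= hi:
                counts[name] += 1
                break
        else:
            # Anything outside the three blocks of interest.
            counts["other"] += 1
    return counts

# Tiny usage example: one letter from each script, separated by spaces.
print(count_by_block("a \u0915 \u0ba4"))
```

Spaces and the full stop fall in Basic Latin, which is why that count includes punctuation from the Hindi and Tamil lines as well.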

I encoded the resulting text in SCSU, with each line terminating in CRLF and with the U+FEFF signature (0E FE FF) at the beginning. The Devanagari passage is encoded as one byte per Unicode character, preceded by a single SC4 tag byte to select window 4, which is predefined to the Devanagari block. The Tamil passage is also encoded as one byte per character, preceded by a two-byte SD3 tag to define a window into the Tamil block and select it.
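
For anyone who wants to see the bytes, here is a minimal sketch of the strategy just described. It is not a general SCSU encoder and not necessarily how the file above was produced: the function name encode_scsu_sketch is invented, it handles only ASCII, Devanagari, Tamil and whatever window is currently active, and it always redefines window 3 with SD3 when returning to Tamil rather than reselecting it. The tag values (SQU = 0E, SC4 = 14, SD3 = 1B) and the window offset rule (argument byte times 0x80) are taken from UTS #6.

```python
SQU = 0x0E  # quote a single Unicode character (used here for the signature)
SC4 = 0x14  # select dynamic window 4 (SC0 is 0x10)
SD3 = 0x1B  # define and select dynamic window 3 (SD0 is 0x18)

DEVANAGARI = 0x0900  # default offset of dynamic window 4
TAMIL = 0x0B80       # start of the Tamil block; 0x0B80 // 0x80 == 0x17

def encode_scsu_sketch(text):
    """Hypothetical helper: single-byte-mode SCSU for ASCII, Devanagari, Tamil."""
    out = bytearray([SQU, 0xFE, 0xFF])  # U+FEFF signature
    active = 0x0080                     # window 0 is active initially
    for ch in text:
        cp = ord(ch)
        # TAB, LF, CR and 0x20..0x7F pass through unchanged in single-byte mode.
        if cp in (0x09, 0x0A, 0x0D) or 0x20 <= cp <= 0x7F:
            out.append(cp)
            continue
        if not (active <= cp < active + 0x80):
            if DEVANAGARI <= cp < DEVANAGARI + 0x80:
                out.append(SC4)                   # one tag byte
                active = DEVANAGARI
            elif TAMIL <= cp < TAMIL + 0x80:
                out += bytes([SD3, TAMIL // 0x80])  # tag byte plus offset byte
                active = TAMIL
            else:
                raise ValueError("not handled by this sketch: U+%04X" % cp)
        # Characters in the active window are 0x80 plus their offset into it.
        out.append(0x80 + (cp - active))
    return bytes(out)

print(encode_scsu_sketch("A\u0915\r\n\u0ba4").hex(" "))
# 0e fe ff 41 14 95 0d 0a 1b 17 a4
```

Note that the spaces and full stops inside the Hindi and Tamil lines need no window switch, since ASCII passes through regardless of which dynamic window is active.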

The total size of these three lines of text in SCSU, including signature and CRLF, is 211 bytes. That's probably about as good as any non-general-purpose Unicode compression encoding can achieve, and better than most. I'm curious how well Anbu's proprietary encoding will stack up.
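
The 211-byte figure follows directly from the counts above, as the arithmetic below shows. The UTF-8 and UTF-16 sizes are not from the original measurement; they are computed here from the same counts purely for comparison, assuming a signature and CRLF line ends in each case and three bytes per Devanagari or Tamil character in UTF-8.

```python
# Code point counts as stated in the post (line terminators not included).
LATIN, DEVANAGARI, TAMIL = 84, 53, 62
LINES = 3  # each line ends in CRLF

def scsu_size():
    signature = 3      # 0E FE FF
    crlf = 2 * LINES   # CR and LF are one byte each
    sc4 = 1            # select predefined Devanagari window
    sd3 = 2            # define and select a Tamil window
    return signature + LATIN + crlf + sc4 + DEVANAGARI + sd3 + TAMIL

def utf8_size():
    # 3-byte signature, 1 byte per ASCII character, 3 bytes per Indic character.
    return 3 + LATIN + 2 * LINES + 3 * (DEVANAGARI + TAMIL)

def utf16_size():
    # 2 bytes per code unit; every character here is in the BMP.
    return 2 * (1 + LATIN + 2 * LINES + DEVANAGARI + TAMIL)

print(scsu_size(), utf8_size(), utf16_size())
# 211 438 412
```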


--
Doug Ewell | Thornton, Colorado, USA

http://www.ewellic.org | @DougEwell
