In Tamil -- sometimes, they are digits; other times, just numbers

Early last year, Raymond Chen <http://blogs.msdn.com/oldnewthing> talked
about how Char.IsDigit matches more than just 0 through
9<http://blogs.msdn.com/oldnewthing/archive/2004/03/09/86555.aspx> and
later last year I talked about Crossing the *DIGIT*al
divide<http://blogs.msdn.com/michkap/archive/2004/12/01/272864.aspx>.
But in both cases the conversation is limited to digits, and not the wide
world of numbers which includes a lot more than just different ways
of saying 0123456789.

The distinction between digits and numbers in Unicode is an important one,
since the formatting and parsing of numeric values is highly dependent on
whether a number acts like the ASCII digits 0 - 9 or not.

Now the bulk of the modern number systems use the same Arabic-Indic
system conventions to which software developers are accustomed, but others
do exist, some of which still see use today.

As an example people can relate to, most of us are aware of the Roman
numeral system where there is no zero and you sometimes have to use a lot of
addition and subtraction in a deterministic manner (such that any time a
smaller number comes before a larger one, the smaller one is subtracted;
otherwise if they are the same value or the larger one comes first, it is
added). Thus *Ⅰ* is one, *Ⅲ* is three, *Ⅳ* is 4, *Ⅴ* is 5, and so on.
Although it is not used too much, it is still commonly seen in the credits
of movies and television shows for the copyright date (e.g. *MCMLXXXIX* for
1989). Many people who are not used to Roman numerals breathed a sigh of
relief at the year 2000 since *MM* is so much easier to read....
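The add-or-subtract rule described above can be sketched in a few lines of Python. This is only an illustration: the name roman_to_int is made up for the example, it works on the plain Latin letters rather than the dedicated Roman numeral characters, and it does not reject malformed input.

```python
# Values of the basic Roman numeral letters (plain Latin letters here,
# not the dedicated Unicode Roman numeral characters).
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(text: str) -> int:
    """Subtract a symbol that comes right before a larger one; otherwise add it."""
    values = [ROMAN_VALUES[ch] for ch in text.upper()]
    total = 0
    for i, value in enumerate(values):
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value  # smaller before larger: subtract
        else:
            total += value  # same size, or larger first: add
    return total


print(roman_to_int("MCMLXXXIX"))  # 1989
print(roman_to_int("MM"))         # 2000
```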

It is of note that the Roman Numerals are encoded in Unicode even though
they can all be represented as existing letters. The primary reason for this
is that there are character properties associated with each encoded
character, and these properties are used by many implementations of Unicode
to get actual work done. Therefore, the letter V
(U+0056<http://www.fileformat.info/info/unicode/char/0056/index.htm>,
LATIN CAPITAL LETTER V) has a General Category of
Lu<http://www.fileformat.info/info/unicode/category/Lu/list.htm>(Letter,
Uppercase) while
*Ⅴ*(U+2164 <http://www.fileformat.info/info/unicode/char/2164/index.htm>,
ROMAN NUMERAL FIVE) has a general category of
Nl<http://www.fileformat.info/info/unicode/category/Nl/list.htm>(Number,
Letter).
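One way to see these properties for yourself is Python's standard unicodedata module, used here only as a convenient lens on the Unicode Character Database (it is not what Windows or the .NET Framework use internally). The helper name describe is invented for this sketch.

```python
import unicodedata


def describe(ch: str):
    """Return (character name, General Category, numeric value or None)."""
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.numeric(ch, None),  # None when there is no numeric value
    )


print(describe("V"))       # the ordinary letter, U+0056
print(describe("\u2164"))  # the Roman numeral character, U+2164
```

The letter comes back as category Lu with no numeric value, while U+2164 comes back as Nl with a numeric value of 5.0.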

And yes, even that claim falls apart a little since the hexadecimal digits
ABCDEF are not separately encoded, for reasons of backwards compatibility
with decades of existing practice on computers, which is not the case with
Roman numerals. Even the argument for having encoded the Roman numerals is a
little specious since for the most part they have not been encoded and when
they are the style never seems to be consistent typographically. Though YMMV
since you may have better fonts than I do! Try "*ⅯⅭⅯⅬⅩⅩⅩⅨ*" for the test....

*All of this goes to show that Unicode is a very complex standard. In the
end, Unicode can always do what it needs to do without fear of the
occasional contradiction, since there will always be some precedent with
which to be consistent. :-)*

Ethiopic numbers are based on a different system, one that can
really wreak havoc with a formatting/parsing architecture like that in
Windows or the .NET Framework if you try to bring Ethiopic data in without
writing code to do the work (just like with Roman numerals). I'll talk about
Ethiopic numbers another time....

Yet another system, the one I will talk about here, is that of Tamil
numerals. It is an additive and positional system (unlike Roman
numerals, there is no subtraction involved) that has no zero but includes
characters for 10, 100, and 1000.

In the traditional system the number 3,782 would be represented as ௩௲௭௱௮௰௨
(literally Three-Thousand(s)-Seven-Hundred(s)-Eight-Ten(s)-Two, or
மூன்று-ஆயிரத்து-எழு-நூற்று-எண்-பத்து-இரண்டு in Tamil).

At least since the early 1800s, however, usage of the Tamil numerals as
digits has been more and more common. Thus the number 3,782 would often be
represented as ௩௭௮௨ (literally 3782).
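The practical difference between the two spellings of 3,782 shows up as soon as you hand them to a digit-oriented parser. The sketch below uses Python, whose built-in int() accepts any run of Unicode decimal digits; the variable names are just for the example.

```python
import unicodedata

# Modern style: TAMIL DIGIT THREE, SEVEN, EIGHT, TWO
modern = "\u0be9\u0bed\u0bee\u0be8"
# Traditional style: 3, THOUSAND, 7, HUNDRED, 8, TEN, 2
traditional = "\u0be9\u0bf2\u0bed\u0bf1\u0bee\u0bf0\u0be8"

# The modern form is made only of decimal digits, so int() can parse it.
print(modern.isdecimal(), int(modern))

# The traditional form mixes digits with the signs for 10, 100 and 1000:
# every character is numeric, but the string is not a decimal number.
print(traditional.isdecimal(), traditional.isnumeric())
print([unicodedata.numeric(ch) for ch in traditional])
```

Calling int() on the traditional string raises ValueError, which is the digits-versus-numbers distinction in miniature.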

The following table gives a bunch of different numbers and how they are
represented in both the older, more traditional style and in the "modern"
style where they act as digits. Note that the table is treating U+0be6 as
TAMIL DIGIT ZERO even though it is not being added to Unicode until version
4.1. Up until now the ASCII DIGIT ZERO was used as needed, as I do in the
table below for display purposes, and if you want to represent these numbers
before Unicode 4.1 is released you should likely use U+0030 (DIGIT ZERO).
The *modern Tamil* column uses the LOCALE_SGROUPING setting of Tamil....
Arabic-Indic Digit | old style Tamil | modern Tamil | old style Tamil code points | modern Tamil code points for number
0 | (*not available*) | 0 | (*not available*) | 0be6
1 | ௧ | ௧ | 0be7 | 0be7
2 | ௨ | ௨ | 0be8 | 0be8
3 | ௩ | ௩ | 0be9 | 0be9
4 | ௪ | ௪ | 0bea | 0bea
5 | ௫ | ௫ | 0beb | 0beb
6 | ௬ | ௬ | 0bec | 0bec
7 | ௭ | ௭ | 0bed | 0bed
8 | ௮ | ௮ | 0bee | 0bee
9 | ௯ | ௯ | 0bef | 0bef
10 | ௰ | ௧0 | 0bf0 | 0be7 0be6
11 | ௰௧ | ௧௧ | 0bf0 0be7 | 0be7 0be7
12 | ௰௨ | ௧௨ | 0bf0 0be8 | 0be7 0be8
13 | ௰௩ | ௧௩ | 0bf0 0be9 | 0be7 0be9
14 | ௰௪ | ௧௪ | 0bf0 0bea | 0be7 0bea
15 | ௰௫ | ௧௫ | 0bf0 0beb | 0be7 0beb
16 | ௰௬ | ௧௬ | 0bf0 0bec | 0be7 0bec
17 | ௰௭ | ௧௭ | 0bf0 0bed | 0be7 0bed
18 | ௰௮ | ௧௮ | 0bf0 0bee | 0be7 0bee
19 | ௰௯ | ௧௯ | 0bf0 0bef | 0be7 0bef
100 | ௱ | ௧00 | 0bf1 | 0be7 0be6 0be6
156 | ௱௫௰௬ | ௧௫௬ | 0bf1 0beb 0bf0 0bec | 0be7 0beb 0bec
200 | ௨௱ | ௨00 | 0be8 0bf1 | 0be8 0be6 0be6
300 | ௩௱ | ௩00 | 0be9 0bf1 | 0be9 0be6 0be6
1,000 | ௲ | ௧,000 | 0bf2 | 0be7 0be6 0be6 0be6
1,001 | ௲௧ | ௧,00௧ | 0bf2 0be7 | 0be7 0be6 0be6 0be7
1,040 | ௲௪௰ | ௧,0௪0 | 0bf2 0bea 0bf0 | 0be7 0be6 0bea 0be6
8,000 | ௮௲ | ௮,000 | 0bee 0bf2 | 0bee 0be6 0be6 0be6
10,000 | ௰௲ | ௧0,000 | 0bf0 0bf2 | 0be7 0be6 0be6 0be6 0be6
70,000 | ௭௰௲ | ௭0,000 | 0bed 0bf0 0bf2 | 0bed 0be6 0be6 0be6 0be6
90,000 | ௯௰௲ | ௯0,000 | 0bef 0bf0 0bf2 | 0bef 0be6 0be6 0be6 0be6
100,000 [1] | ௱௲ | ௧,00,000 | 0bf1 0bf2 | 0be7 0be6 0be6 0be6 0be6 0be6
800,000 | ௮௱௲ | ௮,00,000 | 0bee 0bf1 0bf2 | 0bee 0be6 0be6 0be6 0be6 0be6
1,000,000 [2] | ௰௱௲ | ௧0,00,000 | 0bf0 0bf1 0bf2 | 0be7 0be6 0be6 0be6 0be6 0be6 0be6
9,000,000 | ௯௰௱௲ | ௯0,00,000 | 0bef 0bf0 0bf1 0bf2 | 0bef 0be6 0be6 0be6 0be6 0be6 0be6
10,000,000 [3] | ௱௱௲ | ௧,00,00,000 | 0bf1 0bf1 0bf2 | 0be7 0be6 0be6 0be6 0be6 0be6 0be6 0be6
100,000,000 [4] | ௰௱௱௲ | ௧0,00,00,000 | 0bf0 0bf1 0bf1 0bf2 | 0be7 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6
1,000,000,000 [5] | ௱௱௱௲ | ௧,00,00,00,000 | 0bf1 0bf1 0bf1 0bf2 | 0be7 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6
10,000,000,000 [6] | ௲௱௱௲ | ௧0,00,00,00,000 | 0bf2 0bf1 0bf1 0bf2 | 0be7 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6
100,000,000,000 [7] | ௰௲௱௱௲ | ௧,00,00,00,00,000 | 0bf0 0bf2 0bf1 0bf1 0bf2 | 0be7 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6
1,000,000,000,000 [8] | ௱௲௱௱௲ | ௧0,00,00,00,00,000 | 0bf1 0bf2 0bf1 0bf1 0bf2 | 0be7 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6
100,000,000,000,000 [9] | ௱௱௲௱௱௲ | ௧0,00,00,00,00,00,000 | 0bf1 0bf1 0bf2 0bf1 0bf1 0bf2 | 0be7 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6

*1 - a.k.a. Lakh
2 - a.k.a. 10 Lakhs
3 - a.k.a. crore
4 - a.k.a. 10 crore
5 - a.k.a. 100 crore
6 - a.k.a. thousand crore
7 - a.k.a. 10 thousand crore
8 - a.k.a. lakh crore
9 - a.k.a. crore crore*
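The grouping in the modern column (the last three digits together, then pairs) can be reproduced with a short Python sketch. It assumes the grouping pattern is what Windows writes as "3;2;0" for LOCALE_SGROUPING, which I have not verified for the Tamil locale specifically. The function names are invented for the example, and zero is left as the ASCII digit, as in the table.

```python
# Map ASCII 1..9 to TAMIL DIGIT ONE..NINE (U+0BE7..U+0BEF); zero stays ASCII.
TO_TAMIL = str.maketrans("123456789", "".join(chr(0x0BE7 + i) for i in range(9)))


def group_indic(digits: str, sep: str = ",") -> str:
    """Group a digit string as the last three digits, then pairs."""
    head, tail = digits[:-3], digits[-3:]
    groups = []
    while len(head) > 2:
        groups.insert(0, head[-2:])  # peel off pairs from the right
        head = head[:-2]
    if head:
        groups.insert(0, head)       # leftover one or two leading digits
    groups.append(tail)
    return sep.join(groups)


def modern_tamil(n: int) -> str:
    """Format a non-negative integer in the modern style used by the table."""
    return group_indic(str(n)).translate(TO_TAMIL)


print(group_indic("10000000"))  # 1,00,00,000
print(modern_tamil(1040))
```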

Some examples of both types of usage:

   - Modern practice, using Tamil digits for chapter numbers: mozi varalARu,
   by munucAmi varatarAcan, published by The South India Saiva Siddhanta Works
   Publishing Society, Tinnevelly, Limited, November 1954, p. 357-358 (page
   numbers from 14th Edition, December 1996).
   - Traditional practice, using the older format (and the source for large
   parts of the table above!): iniya tamiz ilakkaNam by yokisri cuttAnan~ta
   pAratiyAr, published by Kavitha Publications, p. 201-204. (You can see the
   scanned source of some of it
   here<http://www.geocities.com/Athens/5180/numeral.html>.)

Note that the traditional form is not currently handled by any code in
either Windows or the .NET Framework, though it is sometimes seen in even
modern contexts such as calendars. The system is not too complicated and
figuring out the algorithm to parse or format with it seems like the sort of
thing that would make an interesting Microsoft interview question. Though
perhaps I will post some potential solutions another day....
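For what it is worth, here is one possible Python sketch of that exercise. It is a guess at the rules based only on the table rows up to 800,000: a multiplier of one is left implicit, and the thousands multiplier is itself written with the below-one-thousand rule. It deliberately stops short of one million, since the lakh and crore compounds follow patterns this sketch does not attempt. The names to_traditional and from_traditional are made up.

```python
DIGITS = "".join(chr(0x0BE7 + i) for i in range(9))  # TAMIL DIGIT ONE..NINE
TEN, HUNDRED, THOUSAND = "\u0bf0", "\u0bf1", "\u0bf2"
DIGIT_VALUES = {ch: i for i, ch in enumerate(DIGITS, start=1)}
SIGN_VALUES = {TEN: 10, HUNDRED: 100}


def _below_thousand(n: int) -> str:
    out = []
    for unit, sign in ((100, HUNDRED), (10, TEN)):
        count, n = divmod(n, unit)
        if count:
            # A multiplier of one is implicit: 100 is just the hundred sign.
            out.append((DIGITS[count - 1] if count > 1 else "") + sign)
    if n:
        out.append(DIGITS[n - 1])
    return "".join(out)


def to_traditional(n: int) -> str:
    """Format 1..999,999 in the traditional style (assumed rules, see above)."""
    if not 0 < n < 1_000_000:
        raise ValueError("this sketch only handles 1..999,999")
    thousands, rest = divmod(n, 1000)
    head = ""
    if thousands:
        head = (_below_thousand(thousands) if thousands > 1 else "") + THOUSAND
    return head + _below_thousand(rest)


def from_traditional(text: str) -> int:
    """Parse a string in the form produced by to_traditional."""
    total = group = pending = 0
    for ch in text:
        if ch in DIGIT_VALUES:
            pending = DIGIT_VALUES[ch]
        elif ch in SIGN_VALUES:
            group += (pending or 1) * SIGN_VALUES[ch]
            pending = 0
        elif ch == THOUSAND:
            total += ((group + pending) or 1) * 1000
            group = pending = 0
        else:
            raise ValueError(f"unexpected character {ch!r}")
    value = total + group + pending
    # Round-trip check: reject anything the formatter would not have written.
    if to_traditional(value) != text:
        raise ValueError("not in the form this sketch understands")
    return value


print(from_traditional(to_traditional(3782)))  # 3782
```

The round-trip check at the end of from_traditional is a cheap way to keep the parser strict: a string such as the table's 10 lakh form is rejected rather than silently misread.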



*Special thanks to Sivaraj Doddannan, Dr. N. Ganesan, and Working Group 02
of INFITT (of which they are both members) for helping to dig up the
excellent resources for Tamil numbers. INFITT (International Forum for
Information Technology in Tamil) is a liaison member of Unicode and has been
instrumental in providing character addition and usage reports to help
finish up the Tamil block in Unicode.*


*This post brought to you by* "௧௨௩௪௫௬௭௮௯"
*(U+0be7<http://www.fileformat.info/info/unicode/char/0be7/index.htm>-
U+0bef <http://www.fileformat.info/info/unicode/char/0bef/index.htm>, a.k.a.
TAMIL DIGIT ONE - TAMIL DIGIT NINE)
and they all welcome their new compadre U+0be6, which is coming soon to a
Unicode near you!*

-- 
"Great changes may not happen right away, but with effort even the difficult
may become easy" - Bill Blackman

Vanakkam Subbu
