Re: Romanized Singhala - Think about it again

Jean-François Colson Tue, 17 Jul 2012 11:19:37 -0700

Le 17/07/12 02:43, Naena Guru a écrit :
> Jean, sorry I am late. I used spare time as and when I got it.
>

> On Sun, Jul 8, 2012 at 10:20 PM, Jean-François Colson <[email protected]>wrote:

>
>     Le 09/07/12 01:29, Naena Guru a écrit :
>>     Jean-François,
>>

>> Let me approximate it in Romanized Singhala: 'jyó-frósvaa'. Just trying...

> I don’t know how that transcription should be pronounced, but in IPA, Jean-François is /ʒɑ̃.fʁɑ̃.swa/.

>
> They came as rectangles (in XP).

That’s not surprising. Windows XP is an old, out-of-date system with, by default, a very limited set of fonts. But nothing prevents you from downloading some additional free fonts.

> They showed correctly in your message inside Firefox running in Puppy Linux, but where I am replying, it shows a reversed Euro like character

That is surprising. Which font includes a reversed euro sign?

> in place of the a-umlaut.
I didn’t use any umlauts, only two tildes.

> This again illustrates how hazardous it is for characters outside Latin-1.

It illustrates how hazardous it is to use such an old OS as Windows XP.

>
> I can only approximate the first letter as English j+y,
English j + y????? I don’t know that, neither in English nor in French.

/ʒ/ is the French j. It is not the English j plus something but rather the English j minus something.
The English j is /dʒ/. That’s an affricate, i.e. roughly a sound which begins as a plosive and evolves to end as a fricative.
The French j is a fricative. Its nearest approximation in English is the z in azure or the s in leisure.

/ɑ̃/ is not an /ɑ/ with umlaut but an /ɑ/ with tilde. It is pronounced as the a in the English word “car” but with a nasal quality, i.e. some air passes through the nose.

Jean is a homophone of gens, which you can hear here:
http://fr.wiktionary.org/wiki/gens#Prononciation
The speaker has recorded “des gens”, so focus your attention on the second syllable.


/f/ is pronounced as in English.

/ʁ/ is the French r, but there are several varieties of r among the French dialects, so using the English r instead is not a big problem.


/s/ and /w/ are pronounced as in English.

/a/ is very similar to /ɑ/. It is the beginning of the diphthong in the English word “sky”.


> which is same on Singhala. The rest is pretty close, I think.
>
>
>
>>
>>     Thank you for your interest. See inline responses.
>>

>> On Thu, Jul 5, 2012 at 7:35 AM, Jean-François Colson<[email protected]> wrote:

>>
>>         Le 05/07/12 10:02, Naena Guru a écrit :
>>>
>>>

>>> On Wed, Jul 4, 2012 at 11:33 PM, Philippe Verdy<[email protected]> wrote:

>>>

>>> Anyway, consider the solutions already proposed in Sinhalese Wikipedia. There are various solutions proposed, including several input methods supported there. But the purpose of these solutions is always to generate Sinhalese texts perfectly encoded with Unicode and nothing else.
>>>

>>> Thank you for the kind suggestion. The problem is Unicode Sinhala does not perfectly support Singhala!

>>         What’s wrong? Are there missing letters?
>>
>>     Many, many.
>>
>>>         The solution is for Sinhala not for Unicode!
>>         Or rather for Sinhala by Unicode.
>>
>>     Sure, if you want to do it with proper deliberation.
>>
>>
>>

>>> I am not saying Unicode has a bad intention but an ill-conceived product.

>>         What precisely is ill-conceived?
>>
>>     Anglo-centric thinking is what is wrong.
>     ?????
>

> Letters have no direct relation to speech -- very few. In Singhala (perhaps as in French too as someone said?) you write what you say. In Singhala, the exception is a clearly understood rule set about how to pronounce short 'a' -- whether muted or not.

> Therefore, the approach should have been to encode the vowels, diphthongs and consonants as base letters. I assigned the acute accent to the 'ng' sound and the umlaut to the guttural H in Sanskrit, but they could be assigned independent codepoints.

>
>
>
>>      Let me take you on the scenic route:
>>

>> Number of letters in Singhala is only theoretical. In the case of Singhala orthography, the actually used number depends on the Sanskrit vocabulary.

> Do you mean there are many conjunct consonants, sometimes with a separate glyph?

> Yes, many. There are three orthographies. Singhala does not have CCs at all. Sanskrit has a lot. Pali has touch letters in addition to what Sanskrit has. Modern Singhala is mixed Singhala and Sanskrit. With Unicode Sinhala, you need to know which ones join and provide the ZWJ and hope and pray that the font has the CC. Often they are absent.

So I guess your problem could be solved by providing new fonts with better support for conjunct consonants. What you did for 8-bit Sinhala, you could do for Unicode Sinhala too.
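As a minimal sketch of what "provide the ZWJ" means in practice: as far as I understand the current Sinhala encoding, a conjunct is requested by placing a zero-width joiner after the al-lakuna (the 'hal kiriima' sign) between two consonants. The choice of the letters ක and ෂ below is only an illustration, and whether a joined glyph actually appears still depends on the font.

```python
# Illustrative sketch only: the consonant pair is an arbitrary example,
# and rendering of the conjunct depends entirely on the font in use.
KA = "\u0D9A"          # Sinhala letter ka
SSA = "\u0DC2"         # Sinhala letter ssa
AL_LAKUNA = "\u0DCA"   # the vowel-killing sign ('hal kiriima')
ZWJ = "\u200D"         # zero-width joiner


def code_points(text):
    """Return the code points of text as a readable 'U+XXXX ...' string."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# Two separate consonants: the first one carries a visible al-lakuna.
plain = KA + AL_LAKUNA + SSA

# Same consonants, but the ZWJ asks the font for the conjunct form.
conjunct = KA + AL_LAKUNA + ZWJ + SSA

print(code_points(plain))
print(code_points(conjunct))
```

The two strings differ only by the invisible U+200D, which is why the typist has to know in advance which pairs are supposed to join.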


> Then SLS1134 gives wrong advice too.
Could you explain in detail what is wrong in their advice?

>

> In Devanagari, they’re made by typing two or more consonants separated by halants. Isn’t that possible with Sinhala?

> Yes. That is possible, which means the fonts need lookup tables that provide the CCs in the Private Use Area of the font.

Isn’t that what you did in your 8-bit Sinhala font?

>

> BTW, 'halant' means the last consonant in a word. 'hal' means consonant. What Unicode calls halant (and virama) is the sign that indicates a letter is a (free-standing) consonant. virama means the marker of a consonant at end of a sentence. (Sigh, cry)

> In Singhala, we call the sign 'hal kiriima' = the sign which makes it a hal and is not location specific.

>
>
>
>>     My test font has about 1500.

> IIRC there are only 191 code-points in Latin-1. Or 96? I’m not sure whether ASCII is part of Latin-1.

> These are the two that I use to look up letters of the Latin-1 repertoire:

> http://www.unicode.org/charts/PDF/U0000.pdf
> http://www.unicode.org/charts/PDF/U0080.pdf
But there are not “about 1500” code points there.

>
>
>

>> I need much more. Pali orthography enforces touch-letter rule (see later). Modern Singhala, meaning 1st century onwards, is an admixture of Singhala and Sanskrit. Pali is not mixed into Singhala text (except to quote like a foreign language).

>>

>> Generally, the Unicode approach is to treat the consonants as base shapes. Then the vowel signs are added around them. The vowel signs have their own codepoints. We hit upon the first problem here because there are two possible codepoints for each single-mora vowel, double-mora vowel and each diphthong. We lose all hopes of traditional search, replace etc. It complicates collation too.

> In my first name, I usually type the ç as U+00E7 because I have a Ç key on my keyboard, but using the two code-points U+0063 U+0327 (c + combining cedilla) would be correct too, and in many programs the search function won’t find that ç if I’m looking for the single-code-point ç.

>
> You are tight!
Thanks

> Notepad does not find ç. It is certainly a bug. I tried Geany text editor and Abiword in Linux. They both are okay. I think Notepad forgets those letters in the Latin 1 Supplement. This is not a big deal.

Not really a smaller one than improving the Sinhalese fonts if they miss some conjuncts.
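The ç example can be sketched in a few lines, assuming Python's standard unicodedata module. The two spellings of the name compare unequal as raw strings, but become identical once both are normalized to the same form (NFC here); the helper name find_normalized is only an illustration.

```python
import unicodedata

# Same visible name, two different code point sequences.
precomposed = "Fran\u00E7ois"    # ç as the single code point U+00E7
decomposed = "Franc\u0327ois"    # c (U+0063) + combining cedilla (U+0327)


def nfc(text):
    """Normalize text to Unicode Normalization Form C (composed)."""
    return unicodedata.normalize("NFC", text)


def find_normalized(haystack, needle):
    """Search after normalizing both sides; returns an index or -1."""
    return nfc(haystack).find(nfc(needle))


print(precomposed == decomposed)            # naive comparison
print(nfc(precomposed) == nfc(decomposed))  # comparison after NFC
print(find_normalized(decomposed, "\u00E7"))
```

A search function that skips the normalization step behaves like the naive comparison, which is the failure described above.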

> I remember that VOLT guys took awhile to realize that Open Type 'simple' script includes those.

>
>
>
>>

>> Then they went up the gum tree of the notion that the Singhala consonant is actually a consonant with a vowel inside it -- an absurdity -- the Abugida theory. They then added two ligatures without normalizing them. Singhala has 15 ligatures in that category of ligatures. They included upadhmAnIya and left out jihvAmUlIya.

> Can’t those ligatures be typed as separate signs with a zero-width joiner? I’ve seen there’s one on Sinhalese keyboards.

>
> Yes, you can but only to get a few.

Could you describe in detail the missing conjuncts, which letters should generate them and what they should look like?
Also, it would be very useful to send those descriptions to the designers of Sinhalese fonts.

> It hurts the orthography, Jean. People forget orthography when it is in disuse. It is better to make the CCs and let the user prevent their formation by means of ZWNJ where they do not want the joining.

Oh! I see. It would be possible to make input methods which work as you want and generate Sinhalese characters as they are used today.

It is not possible to change the behaviour of Sinhalese characters: that would mangle all presently existing texts.


> I implemented this in my font. It is easier for input thanh
>
>
>
>>
>>     That concludes that Unicode Sinhala is not grammar compliant.
>     What do you mean?
>

> That the rules of orthography are ignored. e.g., just two conjuncts (conjoints) out of a group of 15

15 or about 1500?

> in their class inside the standard and others missing.

You’re not alone. In the South of Belgium and the North of France, there are local languages which, in some orthographies, use the ring above the vowels A and E. Å and å are available as precomposed characters. E̊ and e̊ are not and must be composed of E/e + U+030A (combining ring above). That’s not a big problem.
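A short sketch of that difference, assuming Python's unicodedata module and the combining ring above (U+030A): normalization composes a + ring into the single character å, while e + ring stays as two code points because no precomposed form exists.

```python
import unicodedata

RING_ABOVE = "\u030A"  # COMBINING RING ABOVE

# NFC composes a sequence whenever a precomposed character exists.
a_ring = unicodedata.normalize("NFC", "a" + RING_ABOVE)
e_ring = unicodedata.normalize("NFC", "e" + RING_ABOVE)

print(len(a_ring), [f"U+{ord(ch):04X}" for ch in a_ring])
print(len(e_ring), [f"U+{ord(ch):04X}" for ch in e_ring])
```

Both results display as one letter with a ring, given a font with proper mark positioning; only the stored sequence differs.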

I think the right question is “Why were those two CCs included as precomposed characters?”, not “Why were the other ones forgotten?”.


> There is no consistency in this code block.
Just forget the useless characters if they are a mistake of the past.

> This is why I say, just encode the vowels and consonants. You comply with the grammar only if you start with the phoneme set. That is what HK Sanskrit, IAST Sanskrit and PTS Pali did. They are backward compatible with older text due to this.

>
>
>

>> This is the first requirement. It is not Unicode compliant because it has canonicals that are not normalized.

> Can’t you describe in detail those missing “canonicals” to make a proposal?

>
> I can, but that is the wrong approach.
Why is it wrong?

>
> Unicode correctly advised not to introduce canonicals trivially.
What do you mean?

> Then they gave a deadline to register them. The whole idea is to differentiate the text of one language from the other with these national ligatures. Those languages that had Latin canonical forms are justified in declaring codepoints for them. Whereas, Singhala artificially made two 'canonicals'.

Do you mean there’s a historic way to write Sinhala and a second way, introduced by the persons who submitted the Sinhalese Unicode proposal years ago?

Is that what you call “canonicals”?

> Now they have to be normalized down to base codepoints.
>

> Besides, no change can ever happen because of the dog-in-the-mangereffect.

I don’t understand. What is that strange effect?

> If you insist, the alternative is to define Singhala in a new code block without canonicals

Without canonicals????? Could you reformulate that please? I don’t understand. Which meaning of canonical (http://dictionary.reference.com/browse/canonical) are you using here?

> and the codepoints are purely for vowels, diphthongs and consonants. Next you make fonts to whatever depth of complexity that you want. OpenType allows this.

I’m not a font designer, but can’t that be done with the present Sinhalese encoding?

> Then those who care about orthography can make font templates for traditional orthographies. The user simply types as they speak and the font responds with the combinations intended by the font author.

Without changing anything, it would be possible to type as you speak with a well-designed input method.

> I tested it for nine years with various kinds of users. It is in my (proof-of-concept) font.

Is it possible to download your font?

> There are few instances where (pure) Singhala words that have two consonants together, such as, kan|ða කන්ද and aþ|þa අත්ත. There are also names which are like hyphenated ones that have three vowels together (e.g. සුමිත්‍ර‌ආරච්චි -> sumiþra‌|aaracci.). In those instances, the user can intervene ZWNJ. The Bar represents ZWNJ in the RS examples. Users know obvious Singhala words and names.

>> It has a jumble of Singhala letters and signs and duplicates the same phoneme.

>
>     Duplicates the same phoneme…

> Do you mean there are characters which are encoded twice? In this case, you can ignore one of them.

> Do you mean vowels and diphthongs can be written both as separate letters and as diacritics after a consonant? In this case, that works like in Indic scripts. I don’t see anything wrong in that.
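To make the "separate letter versus diacritic" point concrete, here is a minimal sketch using Python's unicodedata module. It takes one vowel (short i) as an example: the independent letter is used when no consonant precedes it, the dependent sign is attached to a consonant. The helper contains_vowel_i is hypothetical and only shows what a vowel-aware search would have to do.

```python
import unicodedata

INDEPENDENT_I = "\u0D89"  # independent vowel letter, used without a consonant
VOWEL_SIGN_I = "\u0DD2"   # dependent vowel sign, attached to a consonant
KA = "\u0D9A"             # consonant letter ka

after_consonant = KA + VOWEL_SIGN_I  # the syllable "ki"

# The two code points have different general categories:
# a letter (Lo) versus a nonspacing mark (Mn).
for ch in (INDEPENDENT_I, VOWEL_SIGN_I):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))


def contains_vowel_i(text):
    """Hypothetical helper: match the vowel in either of its two encodings."""
    return INDEPENDENT_I in text or VOWEL_SIGN_I in text


print(INDEPENDENT_I in after_consonant)   # a plain search for the letter
print(contains_vowel_i(after_consonant))  # a search aware of both forms
```

This is the same arrangement as in Devanagari and other Indic scripts; a search by phoneme has to look for both forms, while a search by written character does not.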

> What is wrong in that is in the struggle you would have to go through for text processing like search and replace and sorting.

?????

> It is very straight forward in RS:

> See the sample names getting sorted according to English and Singhala here:

> http://www.lovatasinhala.com/liyanna.php#sort
> Meanings of the buttons left to right:
> [erase] [show sample] [sort-Singhala] [sort-English]
>
>
>
>>     It is good only for the trash can.
>     Isn’t that an exaggeration?
>
> Kind of.

> Jean, it can only be redefined. First, if you tinker it, you orphan the present text made with the flawed one. If you define it elsewhere, then you can write a routine to convert the flawed one to the new one. That is what I did in the case of romanized Singhala.

>
>
>
>>
>>     Here are the considerations for a successful encoding:
>>     A consonant is called a 'hal akSara'. Check the Sanskrit dictionary:
>> http://www.sanskrit-lexicon.uni-koeln.de/cgi-bin/tamil/recherche
>>

>> In the pre-printing tradition, adjoining consonants either had standard ligatures or they were written touching each other to indicate they are digraphs or trigraphs. The vowel signs surround these. When a consonant occurred at the end of a word, that was flagged by the halant sign. 'halant' means hal at end.

>>

>> With the advent of letterpress printing, touch letter technique became difficult or impossible to implement. So, now we have a new concept of 'antara' hal -- interior consonant. The modern orthography first honors the Sanskrit ligature rules and then drops the touch-letter rule. The ligatures are described in the following books:
>> A.M. Gunasekara - A Comprehensive Grammar of the Sinhalese Language (1891) - pp 16-18, Rev Theodore G. Perera - The Sinhala Language (1932) -- 57 - 58.

OK. I’ll buy them, but not immediately: I’m currently unemployed and my budget is very limited.

>>

>> There is only one way that Singhala could be digitized to do justice to the continuation of its writing system and to smoothly support the past. That is to define the Vowels, diphthongs, consonants and prenasals as individual codepoints. You can give a codepoint each for the anusvara (ng sound) and visarga (the guttural postfix on vowels)

>     That’s what has been done in Unicode.
>
> No, Jean, no. What it has as 'consonants' are not consonants.
What are they?

> It has two codepoints for each vowel...
Where is the problem?

> (Please don't make me repeat)
Sorry

>
>

>> as the earlier Sanskrit transliterations did or you could provide one codepoint each for the modified vowels.

>     That’s the Unicode dependent vowel signs.
>

> What I meant by modified vowels is the set of letters you get when you add the anusvara and visarga, five in each.

>
>
>
>>     I did this latter in romanized Singhala.
>     What exactly did you do?
>
> anusvara:RS [á í ú é ó]  (PTS Pali and HKS use M for anusvara)
> visarga: [ä ï ü ë ö] (HK-Sanskrit uses H and IAST dot below h)
Aren’t those characters supported yet?
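A quick check of that, as a sketch in Python: the ten accented vowels listed above are all ordinary Latin-1 Supplement characters, so each one already has its own code point and fits in a single ISO-8859-1 byte. The variable names are only illustrative.

```python
# The ten letters from the two lists above, written with escapes so the
# precomposed code points are unambiguous.
ANUSVARA_VOWELS = "\u00E1\u00ED\u00FA\u00E9\u00F3"  # á í ú é ó
VISARGA_VOWELS = "\u00E4\u00EF\u00FC\u00EB\u00F6"   # ä ï ü ë ö

for ch in ANUSVARA_VOWELS + VISARGA_VOWELS:
    encoded = ch.encode("latin-1")  # raises UnicodeEncodeError if unsupported
    print(ch, f"U+{ord(ch):04X}", encoded.hex())
```

No exception is raised for any of them, so nothing new has to be encoded for these letters as Latin characters.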

>
>
>
>
>
>>
>>

>>> The fault is with Lankan technocrats that took the proposal as it was given and ever since prevented public participation. My solution is 'perfectly encoded with Unicode'.

>>         No. It’s an 8-bit character set independent of Unicode.
>>
>>     To think 8-bit is outside Unicode is wrong.
>     Could you explain it?
>
> I'll try.
>

> I am wrong technically because there is a leading zero byte added to each SBCS character to make it UCS-2.

Only if you use the UCS-2 or UTF-16 Big Endian encodings (in Little Endian the zero byte comes second). But that’s not the problem.

My main concern is the name you are using.

I won’t deny your right to make a Sinhalese 8-bit font: that could be useful in some old applications which can’t handle Unicode text, such as the MS-DOS window on Windows.
But when you say it is Unicode compliant, that’s not true because in Unicode, those code-points are used mostly for Latin letters.
When you say it is compliant with ISO-8859-1, that’s not true because ISO-8859-1 is used mostly for Latin letters.
Perhaps you could call it SiSCII (Sinhalese Standard Code for Information Interchange). ☺

Or perhaps SACII (Sinhalese Alternate Code for Information Interchange). ☺

> In the world of putting things into use, what is significant to me is,

> "16-bit characters, however, are not compatible with many currentapplications and protocols" -- RFC 2044 (UTF-8).

> My constant question is how is this working? Well, if you use ISO-8859-1, it works. But ISO-8859 is SBCS. But those characters are in the Unicode standard, and all characters are Unicode. SBCS have zeros added to make them all same size with others in UCS-2,

?????

SBCS are Single Byte Character Sets. There is no need to add zeros and SBCS are not used in UCS-2.


> and that makes sense, but it doesn't, because UCS-4 and UTF-16,
?????

> but I don't know about that, may be Chinese -- pulling my hair, running up and down the stairs.

UCS-2, UCS-4, UTF-16, UTF-8, UTF-7 are encodings, i.e. ways to record Unicode code-points in memory, but the Unicode numbers remain the same and those encodings don’t require separate fonts.
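The distinction between a code point and its encodings can be sketched with Python's built-in codecs. The function name encode_all is only illustrative; the character is ç (U+00E7), whose number never changes while its byte representation does.

```python
CODECS = ("latin-1", "utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-7")


def encode_all(text):
    """Return the byte sequence of text under each codec in CODECS."""
    return {codec: text.encode(codec) for codec in CODECS}


cedilla = "\u00E7"  # the single code point U+00E7
for codec, data in encode_all(cedilla).items():
    # Decoding with the same codec always gives back the same code point.
    assert data.decode(codec) == cedilla
    print(f"{codec:10} {data.hex(' ')}")
```

The zero byte appears only in the 16-bit and 32-bit forms, and its position depends on the byte order; in ISO-8859-1 and UTF-8 there is none.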


>
> It occurs to me HK Sanskrit works. It is pure US-ASCII.
What exactly is HK-Sanskrit?

> Does IAST work?  No. Why? It has the Added Bonus characters.
http://en.wikipedia.org/wiki/IAST

But that’s a Latin transliteration. It was not meant to encode Sanskrit texts in Devanagari but rather to provide a readable version of the text to users of Latin alphabet languages who didn’t learn to read Devanagari.

> How about transliterating into iso-8859-1? That is the space between HK and IAST. It works. Romanizing Singhala into iso-8859-1 works!

And then…?

>

> I do not know how time proven applications made believing that a character is a one-byte data type still work,

Most of them appeared before Unicode and were made by people using SBCS to write their own languages.

> but they were written before 'wide-character' happened. They still work. Reading... MSDN discussions, blogs and so on. iso-8859-1 works in spite of Unicode.

As well as many other SBCS.

> Windows-1252 is the best -- the oasis.

No. CP1252 is the best for Windows users who only use the Western Latin alphabet.


There are many other SBCS which are not compatible with it.

And Windows CP-1252, ISO-8859-1 and ISO 8859-15 are three different character sets.
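A small sketch of how the three sets differ, using Python's built-in codecs: the same byte value decodes to different characters depending on which set is assumed. The helper name decode_everywhere is only illustrative.

```python
CODECS = ("cp1252", "iso8859-1", "iso8859-15")


def decode_everywhere(byte_value):
    """Decode one byte under each of the three character sets."""
    raw = bytes([byte_value])
    return {codec: raw.decode(codec) for codec in CODECS}


# 0x80: euro sign in CP-1252, a C1 control character in both ISO sets.
# 0xA4: currency sign in CP-1252 and ISO-8859-1, euro sign in ISO-8859-15.
# 0xBD: one-half fraction in the first two, the French ligature in ISO-8859-15.
for value in (0x80, 0xA4, 0xBD):
    decoded = decode_everywhere(value)
    print(f"0x{value:02X}", {codec: f"U+{ord(ch):04X}" for codec, ch in decoded.items()})
```

A font or text file built for one of these sets therefore cannot simply be declared to be in another.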


>
> I hope you now understand.
>
>
>
>>     8-bit character set is the core of Unicode.
>     I’m very surprised, voiceless.
>

> Okay, Jean. You win. I am speechless too wondering why people have to go through all the rigmarole of perfecting a Unicode code block to get to a thing called Plain Text and then continue to struggle making it work with all the sweat, blood and money to finally end up with a crippled system just to protect the reputation of a handful of academics and technocrats when transliteration is very simple and works instantly.

…

>

>> It is the best place to park any language because it is the most stable part of the Unicode character database.

> I wouldn’t like to type in Japanese with 8-bit fonts, changing the font for every new glyph. I tried it once on Windows 3.1… It was very time-consuming.

> You caught the slip! I did not make one qualification. When I write in Singhala, I make a point to say that this does not include CJKV. Didn't do it here. I accept my fault. (Actually, I have to fine my translator)

>
>
>
>>
>>
>>>

>>> Yes there may remain some issues with older OSes that have limited support for standard OpenType layout tables. But there's now no problem at all since Windows XP SP2. Windows 7 has the full support, and for those users that have still not upgraded from Windows XP, Windows 8 will be ready in next August with an upgrade cost of about US$ 40 in US (valid offer currently advertized for all users upgrading from XP or later), and certainly even less for users in India and Sri Lanka.
>>>
>>>         The above are not any of my complaints.

>>> Per Capita Income in Sri Lanka $2400. They are content with cell phones. The practical place for computers is the Internet Cafe. Linux is what the vast majority needs.

>>>
>>>

>>> And standard Unicode fonts with free licences are already available for all systems (not just Linux for which they were initially developed);
>>>
>>>         Yes, only 4 rickety ones. Who is going to buy them anyway?
>>         Why would you buy them if they’re free?
>>
>>     Brilliant!
>>
>>
>>

>>> Still Iskoola Pota made by Microsoft by copying a printed font is the best. You check the Plain Text by mixing Singhala and Latin in the Arial Unicode MS font to see how pretty Plain text looks. They spent $2 or 20 million for someone to come and teach them how to make fonts. (Search ICTA.lk). Staying friendly with them is profitable. World bank backs you up too.

>>> Sometime in 1990s when I was in Lanka, I tried to select a PC for my printer brother. We wanted to buy Adobe, Quark Express etc. The store keeper gave a list and asked us to select the programs. Knowing that they are expensive, I asked him first to tell me how much they cost. He said that he will install anything we wanted for free! The same trip coming back, in Zurich, the guys tried to give me an illicit copy of Windows OS in appreciation for installing German and Italian (or French?) code pages on their computers.

>>>
>>>             there even exists solutions for older versions of iPhone
>>>             4. OR on Android smartphones and tablets.
>>>

>>> Mine works in them with no special solution. It works anywhere that supports Open Type -- no platform discrimination

>>         Is there any platform discrimination with Unicode Sinhala?
>>
>>     You mean Apple / Windows / Linux?
>     I mean what you meant one line above.
>
> Okay, whatever.
>
>
>

>> Not really, but Microsoft was ahead of others. They all just support the crippled system.

>>
>>
>>
>>>

>>> No one wants to get back to the situation that existed in the 1980's when there was a proliferation of non-interoperable 8 bit encodings for each specific platform.
>>>

>>> I agree. Today, 14 languages, including English, French, German and Italian all share the same character space called ISO-8859-1.

>> In fact, ISO-8859-1 is not well suited for French (my native language): it lacks a few letters which were added to ISO-8859-15. However, I always use Unicode today, even for French-only texts.

>>

>> Jean, you are lucky because you use Latin letters. Latin letters are always bare individual letters. Sinhala is not so. It has all these other shaping complications and special rules are applied per Complex language.

>>

>> I think you could appreciate my dilemma. This is how I see it. Going outside ISO-8859-1 is lot of trouble.

>     Which ones?
>

> Like the Added Bonus characters, specifically, the ones with macrons and dots

But making an 8-bit Sinhalese font already is going outside ISO-8859-1.

>
>
>>     Should I enumerate them to you?
>     Why not?
>

> It becomes what we call in Singhala, hooðahooða madee = හෝදහෝද මඩේ -- getting in and out of the puddle, repeatedly washing your feet in it.

>
>
>
