On 17/07/12 02:43, Naena Guru wrote:
> Jean, sorry I am late. I used spare time as and when I got it.
>
> On Sun, Jul 8, 2012 at 10:20 PM, Jean-François Colson <[email protected]> wrote:
>
>     On 09/07/12 01:29, Naena Guru wrote:
>>     Jean-François,
>>
>> Let me approximate it in Romanized Singhala: 'jyó-frósvaa'. Just trying...
> I don’t know how that transcription should be pronounced but in IPA, Jean-François is /ʒɑ̃.fʁɑ̃.swa/.
>
> They came as rectangles (in XP).
That’s not surprising. Windows XP is an old, out-of-date system with a very limited default set of fonts. But nothing prevents you from downloading some additional free fonts.

> They showed correctly in your message inside Firefox running in Puppy Linux, but where I am replying, it shows a reversed-Euro-like character
That is surprising. Which font includes a reversed euro sign?

> in place of the a-umlaut.
I didn’t use any umlauts but two tildes.
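A minimal Python sketch of what that transcription contains, assuming only the standard unicodedata module (the names ipa and describe are illustrative). It lists each code point and shows that the nasal vowels are a base letter followed by a combining tilde, none of which fits in Latin-1.

```python
import unicodedata

# The IPA transcription, written with escapes so the code points are explicit.
ipa = "\u0292\u0251\u0303.f\u0281\u0251\u0303.swa"

def describe(text):
    """Return (code point, Unicode name) pairs for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

for code, name in describe(ipa):
    print(code, name)

# Characters above U+00FF cannot be stored in Latin-1.
outside_latin1 = [ch for ch in ipa if ord(ch) > 0xFF]

# Combining marks (general category Mn) attach to the preceding base letter.
combining = [ch for ch in ipa if unicodedata.category(ch) == "Mn"]
```

A font that lacks these glyphs, or software that does not stack combining marks, will show boxes or a stray symbol instead.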

> This again illustrates how hazardous it is for characters outside Latin-1.
It illustrates how hazardous it is to use such an old OS as Windows XP.

>
> I can only approximate the first letter as English j+y,
English j + y????? I don’t know that sound, either in English or in French.

/ʒ/ is the French j. It is not the English j plus something but rather the English j minus something. The English j is /dʒ/. That’s an affricate, i.e. roughly a sound which begins as a plosive and ends as a fricative. The French j is a fricative. Its nearest approximation in English is the z in azure or the s in leisure.

/ɑ̃/ is not an /ɑ/ with umlaut but an /ɑ/ with tilde. It is pronounced as the a in the English word “car” but with a nasal quality, i.e. some air passes through the nose.

Jean is a homophone of gens, which you can hear here: http://fr.wiktionary.org/wiki/gens#Prononciation The speaker has recorded “des gens”, so focus your attention on the second syllable.

/f/ is pronounced as in English.

/ʁ/ is the French r, but there are several varieties of r among the French dialects, so using the English r instead is not a big problem.

/s/ and /w/ are pronounced as in English.

/a/ is very similar to /ɑ/. It is the beginning of the diphthong in the English word “sky”.

> which is the same in Singhala. The rest is pretty close, I think.
>
>
>
>>
>>     Thank you for your interest. See inline responses.
>>
>> On Thu, Jul 5, 2012 at 7:35 AM, Jean-François Colson <[email protected]> wrote:
>>
>>         On 05/07/12 10:02, Naena Guru wrote:
>>>
>>>
>>> On Wed, Jul 4, 2012 at 11:33 PM, Philippe Verdy <[email protected]> wrote:
>>>
>>> Anyway, consider the solutions already proposed in Sinhalese
>>> Wikipedia. There are various solutions proposed, including several
>>> input methods supported there. But the purpose of these solutions is
>>> always to generate Sinhalese texts perfectly encoded with Unicode and
>>> nothing else.
>>>
>>> Thank you for the kind suggestion. The problem is Unicode Sinhala does not perfectly support Singhala!
>>         What’s wrong? Are there missing letters?
>>
>>     Many, many.
>>
>>>         The solution is for Sinhala not for Unicode!
>>         Or rather for Sinhala by Unicode.
>>
>>     Sure, if you want to do it with proper deliberation.
>>
>>
>>
>>> I am not saying Unicode has a bad intention but an ill-conceived product.
>>         What precisely is ill-conceived?
>>
>>     Anglo-centric thinking is what is wrong.
>     ?????
>
> Letters have no direct relation to speech -- very few. In Singhala (perhaps as in French too as someone said?) you write what you say. In Singhala, the exception is a clearly understood rule set about how to pronounce short 'a' -- whether muted or not.
>
> Therefore, the approach should have been to encode the vowels, diphthongs and consonants as base letters. I assigned the acute accent to the 'ng' sound and the umlaut to the guttural H in Sanskrit, but they could be assigned independent codepoints.
>
>
>
>>      Let me take you on the scenic route:
>>
>> The number of letters in Singhala is only theoretical. In the case of Singhala orthography, the actually used number depends on the Sanskrit vocabulary.
> Do you mean there are many conjunct consonants, sometimes with a separate glyph?
>
> Yes, many. There are three orthographies. Singhala does not have CCs at all. Sanskrit has a lot. Pali has touch letters in addition to what Sanskrit has. Modern Singhala is mixed Singhala and Sanskrit. With Unicode Sinhala, you need to know which ones join and provide the ZWJ and hope and pray that the font has the CC. Often they are absent.
So I guess your problem could be solved by providing new fonts with better support for conjunct consonants. What you did for 8-bit Sinhala, you could do for Unicode Sinhala too.

> Then SLS1134 gives wrong advice too.
Could you explain in detail what is wrong in their advice?

>
> In Devanagari, they’re made by typing two or more consonants separated by halants. Isn’t that possible with Sinhala?
>
> Yes. That is possible, which means the fonts need lookup tables that provide the CCs in the Private Use Area of the font.
Isn’t that what you did in your 8-bit Sinhala font?

>
> BTW, 'halant' means the last consonant in a word. 'hal' means consonant. What Unicode calls halant (and virama) is the sign that indicates a letter is a (free-standing) consonant. virama means the marker of a consonant at end of a sentence. (Sigh, cry)
> In Singhala, we call the sign 'hal kiriima' = the sign which makes it a hal and is not location specific.
>
>
>
>>     My test font has about 1500.
> IIRC there are only 191 code-points in Latin-1. Or 96? I’m not sure whether ASCII is part of Latin-1.
>
> These are the two that I use to look up letters of the Latin-1 repertoire:
> http://www.unicode.org/charts/PDF/U0000.pdf
> http://www.unicode.org/charts/PDF/U0080.pdf
But there are not “about 1500” code points there.
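For the 191-or-96 question, a quick check in Python (a sketch that relies only on the standard latin-1 codec and unicodedata; the variable names are illustrative). It counts the non-control characters among the 256 byte values of ISO-8859-1.

```python
import unicodedata

# Decode every byte value 0x00-0xFF as ISO-8859-1; each maps to U+0000-U+00FF.
chars = bytes(range(256)).decode("latin-1")

# Control characters have general category "Cc"; everything else is graphic.
graphic = [ch for ch in chars if unicodedata.category(ch) != "Cc"]
ascii_graphic = [ch for ch in graphic if ord(ch) < 0x80]
upper_graphic = [ch for ch in graphic if ord(ch) >= 0x80]

print(len(graphic), len(ascii_graphic), len(upper_graphic))  # 191 95 96
```

So both figures are right in a sense: 191 graphic characters in all, of which 95 are the ASCII ones and 96 sit in the upper half.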

>
>
>
>> I need much more. Pali orthography enforces touch-letter rule (see later). Modern Singhala, meaning 1st century onwards, is an admixture of Singhala and Sanskrit. Pali is not mixed into Singhala text (except to quote like a foreign language).
>>
>> Generally, the Unicode approach is to treat the consonants as base shapes. Then the vowel signs are added around them. The vowel signs have their own codepoints. We hit upon the first problem here because there are two possible codepoints for each single-mora vowel, double-mora vowel and each diphthong. We lose all hopes of traditional search, replace etc. It complicates collation too.
> In my first name, I usually type the ç as U+00E7 because I have a Ç key on my keyboard, but using the two code points U+0063 U+0327 (c + combining cedilla) would be correct too, and in many programs the search function won’t find the ç typed one way if I’m looking for the ç typed the other way.
>
> You are right!
Thanks
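The ç case can be sketched in a few lines of Python, assuming the search is done on normalized text (unicodedata.normalize is standard; same_text and find_normalized are illustrative helper names, not any editor's actual API).

```python
import unicodedata

precomposed = "Jean-Fran\u00E7ois"    # ç as the single code point U+00E7
decomposed = "Jean-Franc\u0327ois"    # c (U+0063) + combining cedilla (U+0327)

# A code-point-by-code-point comparison treats the two spellings as different.
naive_match = precomposed == decomposed

def same_text(a, b, form="NFC"):
    """Compare two strings after bringing both to the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

def find_normalized(haystack, needle, form="NFC"):
    """Search after normalizing both sides; returns an index into the normalized haystack."""
    return unicodedata.normalize(form, haystack).find(unicodedata.normalize(form, needle))

print(naive_match)                                   # False
print(same_text(precomposed, decomposed))            # True
print(precomposed.find("c\u0327"))                   # -1
print(find_normalized(precomposed, "c\u0327"))       # 9
```

A search function that skips the normalization step behaves like the naive comparison.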

> Notepad does not find ç. It is certainly a bug. I tried Geany text editor and Abiword in Linux. They both are okay. I think Notepad forgets those letters in the Latin 1 Supplement. This is not a big deal.
Not really a smaller one than improving the Sinhalese fonts if they miss some conjuncts.

> I remember that VOLT guys took a while to realize that Open Type 'simple' script includes those.
>
>
>
>>
>> Then they went up the gum tree of the notion that the Singhala consonant is actually a consonant with a vowel inside it -- an absurdity -- the Abugida theory. They then added two ligatures without normalizing them. Singhala has 15 ligatures in that category of ligatures. They included upadhmAnIya and left out jihvAmUlIya.
> Can’t those ligatures be typed as separate signs with a zero-width joiner? I’ve seen there’s one on Sinhalese keyboards.
>
> Yes, you can but only to get a few.
Could you describe in detail the missing conjuncts, which letters should generate them and what they should look like? Also, it would be very useful to send those descriptions to the designers of Sinhalese fonts.
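For reference, a sketch of how such a sequence looks at the code point level, assuming the usual convention of consonant + al-lakuna (U+0DCA) + ZWJ (U+200D) + consonant for a ligated conjunct, and ZWNJ (U+200C) in the same position to ask for no joining. The helper names are hypothetical, and whether a glyph actually appears depends entirely on the font.

```python
import unicodedata

TA = "\u0DAD"          # Sinhala letter ta
RA = "\u0DBB"          # Sinhala letter ra
AL_LAKUNA = "\u0DCA"   # the sign called hal kiriima in this thread
ZWJ = "\u200D"         # zero width joiner
ZWNJ = "\u200C"        # zero width non-joiner

def conjunct(first, second):
    """Request a ligated conjunct of two consonants (assumed convention)."""
    return first + AL_LAKUNA + ZWJ + second

def no_conjunct(first, second):
    """Keep the two consonants visually separate (assumed convention)."""
    return first + AL_LAKUNA + ZWNJ + second

for ch in conjunct(TA, RA):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

The text stays the same four code points whatever the font does with them, which is why missing conjuncts are a font issue rather than an encoding issue.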

> It hurts the orthography, Jean. People forget orthography when it is in disuse. It is better to make the CCs and let the user prevent their formation by means of ZWNJ where they do not want the joining.
Oh! I see. It would be possible to make input methods which work as you want and generate Sinhalese characters as they are used today.

It is not possible to change the behaviour of Sinhalese characters: that would mangle all presently existing texts.

> I implemented this in my font. It is easier for input thanh
>
>
>
>>
>>     That concludes that Unicode Sinhala is not grammar compliant.
>     What do you mean?
>
> That the rules of orthography are ignored. e.g., just two conjuncts (conjoints) out of a group of 15
15 or about 1500?

> in their class inside the standard and others missing.
You’re not alone. In the South of Belgium and the North of France, there are local languages which, in some orthographies, use the ring above on the vowels A and E. Å and å are available as precomposed characters. E̊ and e̊ are not and must be composed of E/e + U+030A (combining ring above). That’s not a big problem.
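The difference is easy to see with NFC normalization in Python (a small sketch using the standard unicodedata module; the variable names are illustrative, and the combining ring above is U+030A).

```python
import unicodedata

RING_ABOVE = "\u030A"  # COMBINING RING ABOVE

# A + ring has a precomposed form, so NFC folds it into one code point.
a_ring = unicodedata.normalize("NFC", "A" + RING_ABOVE)

# E + ring has no precomposed form, so NFC leaves the two code points alone.
e_ring = unicodedata.normalize("NFC", "E" + RING_ABOVE)

print([f"U+{ord(ch):04X}" for ch in a_ring])  # ['U+00C5']
print([f"U+{ord(ch):04X}" for ch in e_ring])  # ['U+0045', 'U+030A']
```

Both spellings are valid text; only one of them happens to have a single-code-point shortcut.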

I think the good question is “Why were those two CCs included as precomposed characters?”, not “Why were the other ones forgotten?”.

> There is no consistency in this code block.
Just forget the useless characters if they are a mistake of the past.

> This is why I say, just encode the vowels and consonants. You comply with the grammar only if you start with the phoneme set. That is what HK Sanskrit, IAST Sanskrit and PTS Pali did. They are backward compatible with older text due to this.
>
>
>
>> This is the first requirement. It is not Unicode compliant because it has canonicals that are not normalized.
> Can’t you describe in detail those missing “canonicals” to make a proposal?
>
> I can, but that is the wrong approach.
Why is it wrong?

>
> Unicode correctly advised not to introduce canonicals trivially.
What do you mean?

> Then they gave a deadline to register them. The whole idea is to differentiate the text of one language from the other with these national ligatures. Those languages that had Latin canonical forms are justified in declaring codepoints for them. Whereas, Singhala artificially made two 'canonicals'.
Do you mean there’s a historic way to write Sinhala and a second way, introduced by the persons who submitted the Sinhalese Unicode proposal years ago?
Is that what you call “canonicals”?

> Now they have to be normalized down to base codepoints.
>
> Besides, no change can ever happen because of the dog-in-the-manger effect.
I don’t understand. What is that strange effect?

> If you insist, the alternative is to define Singhala in a new code block without canonicals
Without canonicals????? Could you reformulate that please? I don’t understand. Which meaning of canonical (http://dictionary.reference.com/browse/canonical) are you using here?

> and the codepoints are purely for vowels, diphthongs and consonants. Next you make fonts to whatever depth of complexity that you want. OpenType allows this.
I’m not a font designer, but can’t that be done with the present Sinhalese encoding?

> Then those who care about orthography can make font templates for traditional orthographies. The user simply types as they speak and the font responds with the combinations intended by the font author.
Without changing anything, it would be possible to type as you speak with a well-designed input method.

> I tested it for nine years with various kinds of users. It is in my (proof-of-concept) font.
Is it possible to download your font?

> There are few instances where (pure) Singhala words that have two consonants together, such as, kan|ða කන්ද and aþ|þa අත්ත. There are also names which are like hyphenated ones that have three vowels together (e.g. සුමිත්‍ර‌ආරච්චි -> sumiþra‌|aaracci.). In those instances, the user can intervene ZWNJ. The Bar represents ZWNJ in the RS examples. Users know obvious Singhala words and names.
>
>> It has a jumble of Singhala letters and signs and duplicates the same phoneme.
>
>     Duplicates the same phoneme…
> Do you mean there are characters which are encoded twice? In this case, you can ignore one of them.
> Do you mean vowels and diphthongs can be written both as separate letters and as diacritics after a consonant? In this case, that works like in Indic scripts. I don’t see anything wrong in that.
>
> What is wrong with that is the struggle you would have to go through for text processing like search and replace and sorting.
?????

> It is very straightforward in RS:
> See the sample names getting sorted according to English and Singhala here:
> http://www.lovatasinhala.com/liyanna.php#sort
> Meanings of the buttons left to right:
> [erase] [show sample] [sort-Singhala] [sort-English]
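On sorting in general: language-aware order always needs a collation key, whatever the script. A rough Python sketch with French words rather than Sinhala ones; stripping accents through NFD is only a crude stand-in for a real collation algorithm, and strip_marks is an illustrative name.

```python
import unicodedata

def strip_marks(word):
    """Decompose the word and drop combining marks, leaving the base letters."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

words = ["z\u00E8bre", "\u00E9cole", "eau"]   # zèbre, école, eau

# Raw code point order puts é (U+00E9) after z (U+007A).
by_code_point = sorted(words)

# Sorting on base letters first, then on the original spelling as a tie-breaker.
by_base_letters = sorted(words, key=lambda w: (strip_marks(w), w))

print(by_code_point)     # ['eau', 'zèbre', 'école']
print(by_base_letters)   # ['eau', 'école', 'zèbre']
```

The same pattern, a key function derived from the language's alphabet, applies to any encoding of any script.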
>
>
>
>>     It is good only for the trash can.
>     Isn’t that an exaggeration?
>
> Kind of.
> Jean, it can only be redefined. First, if you tinker with it, you orphan the present text made with the flawed one. If you define it elsewhere, then you can write a routine to convert the flawed one to the new one. That is what I did in the case of romanized Singhala.
>
>
>
>>
>>     Here are the considerations for a successful encoding:
>>     A consonant is called a 'hal akSara'. Check the Sanskrit dictionary:
>> http://www.sanskrit-lexicon.uni-koeln.de/cgi-bin/tamil/recherche
>>
>> In the pre-printing tradition, adjoining consonants either had standard ligatures or they were written touching each other to indicate they are digraphs or trigraphs. The vowel signs surround these. When a consonant occurred at the end of a word, that was flagged by the halant sign. 'halant' means hal at end.
>>
>> With the advent of letterpress printing, touch letter technique became difficult or impossible to implement. So, now we have a new concept of 'antara' hal -- interior consonant. The modern orthography first honors the Sanskrit ligature rules and then drops the touch-letter rule. The ligatures are described in the following books:
>> A.M. Gunasekara - A Comprehensive Grammar of the Sinhalese Language (1891) - pp 16-18, Rev Theodore G. Perera - The Sinhala Language (1932) - pp 57-58.
>
OK. I’ll buy them, but not immediately: I’m currently unemployed and my budget is very limited.

>>
>> There is only one way that Singhala could be digitized to do justice to the continuation of its writing system and to smoothly support the past. That is to define the vowels, diphthongs, consonants and prenasals as individual codepoints. You can give a codepoint each for the anusvara (ng sound) and visarga (the guttural postfix on vowels)
>     That’s what has been done in Unicode.
>
> No, Jean, no. What it has as 'consonants' are not consonants.
What are they?

> It has two codepoints for each vowel...
Where is the problem?

> (Please don't make me repeat)
Sorry

>
>
>> as the earlier Sanskrit transliterations did or you could provide one codepoint each for the modified vowels.
>     That’s the Unicode dependent vowel signs.
>
> What I meant by modified vowels is the set of letters you get when you add the anusvara and visarga, five in each.
>
>
>
>>     I did this latter in romanized Singhala.
>     What exactly did you do?
>
> anusvara:RS [á í ú é ó]  (PTS Pali and HKS use M for anusvara)
> visarga: [ä ï ü ë ö] (HK-Sanskrit uses H and IAST dot below h)
Aren’t those characters supported yet?

>
>
>
>
>
>>
>>
>>> The fault is with Lankan technocrats that took the proposal as it was given and ever since prevented public participation. My solution is 'perfectly encoded with Unicode'.
>>         No. It’s an 8-bit character set independent from Unicode.
>>
>>     To think 8-bit is outside Unicode is wrong.
>     Could you explain it?
>
> I'll try.
>
> I am wrong technically because there is a leading zero byte added to each SBCS character to make it UCS-2.
Only if you use the UCS-2 or UTF-16 encodings, and the zero byte leads only in big-endian order (in little-endian order it comes second). But that’s not the problem.
My main concern is the name you are using.
I won’t deny your right to make a Sinhalese 8-bit font: that could be useful in some old applications which can’t handle Unicode text, such as the MS-DOS window on Windows. But when you say it is Unicode compliant, that’s not true because in Unicode, those code-points are used mostly for Latin letters. When you say it is compliant with ISO-8859-1, that’s not true because ISO-8859-1 is used mostly for Latin letters. Perhaps you could call it SiSCII (Sinhalese Standard Code for Information Interchange). ☺
Or perhaps SACII (Sinhalese Alternate Code for Information Interchange). ☺
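To illustrate the naming concern, here is a small Python sketch comparing the romanized spelling of kanda from earlier in this thread (without the bar) with its Unicode Sinhala spelling. It asks the Unicode character database what each character is; the function name scripts is illustrative, and taking the first word of the character name is only a rough proxy for the script.

```python
import unicodedata

romanized = "kan\u00F0a"                  # Latin-1 code points only
sinhala = "\u0D9A\u0DB1\u0DCA\u0DAF"      # the same word in the Sinhala block

def scripts(text):
    """Return the first word of each character's Unicode name as a rough script label."""
    return {unicodedata.name(ch).split()[0] for ch in text}

print(scripts(romanized))   # {'LATIN'}
print(scripts(sinhala))     # {'SINHALA'}
```

Whatever glyphs a font draws for them, the stored characters in the first string remain Latin letters as far as Unicode and ISO-8859-1 are concerned.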

> In the world of putting things into use, what is significant to me is,
> "16-bit characters, however, are not compatible with many current applications and protocols" -- RFC 2044 (UTF-8).
>
> My constant question is how is this working? Well, if you use ISO-8859-1, it works. But ISO-8859 is SBCS. But those characters are in the Unicode standard, and all characters are Unicode. SBCS have zeros added to make them all same size with others in UCS-2,
?????
SBCS are Single Byte Character Sets. There is no need to add zeros and SBCS are not used in UCS-2.

> and that makes sense, but it doesn't, because UCS-4 and UTF-16,
?????

> but I don't know about that, maybe Chinese -- pulling my hair, running up and down the stairs.
UCS-2, UCS-4, UTF-16, UTF-8, UTF-7 are encodings, i.e. ways to record Unicode code-points in memory, but the Unicode numbers remain the same and those encodings don’t require separate fonts.
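A short Python sketch of that distinction, using only the built-in codecs (the dictionary name encoded is illustrative). The same code point, U+00E9, is written out as different byte sequences depending on the encoding, and every one of them decodes back to the same character.

```python
ch = "\u00E9"  # é, code point U+00E9

encodings = ["latin-1", "utf-8", "utf-16-be", "utf-16-le", "utf-32-be"]
encoded = {name: ch.encode(name) for name in encodings}

for name, data in encoded.items():
    print(f"{name:10} {data.hex()}")
# latin-1    e9
# utf-8      c3a9
# utf-16-be  00e9
# utf-16-le  e900
# utf-32-be  000000e9

# Decoding with the matching codec always gives back the same code point.
round_trip = {name: data.decode(name) for name, data in encoded.items()}
```

Note where the zero byte lands in the two UTF-16 byte orders.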

>
> It occurs to me HK Sanskrit works. It is pure US-ASCII.
What exactly is HK-Sanskrit?

> Does IAST work?  No. Why? It has the Added Bonus characters.
http://en.wikipedia.org/wiki/IAST
But that’s a Latin transliteration. It was not meant to encode Sanskrit texts in Devanagari but rather to provide a readable version of the text to users of Latin alphabet languages who didn’t learn to read Devanagari.

> How about transliterating into iso-8859-1? That is the space between HK and IAST. It works. Romanizing Singhala into iso-8859-1 works!
And then…?

>
> I do not know how time proven applications made believing that a character is a one-byte data type still work,
Most of them appeared before Unicode and were made by people using SBCS to write their own languages.

> but they were written before 'wide-character' happened. They still work. Reading... MSDN discussions, blogs and so on. iso-8859-1 works in spite of Unicode.
As well as many other SBCS.

> Windows-1252 is the best -- the oasis.
No. CP1252 is the best for Windows users who only use the Western Latin alphabet.

There are many other SBCS which are not compatible with it.

And Windows CP-1252, ISO-8859-1 and ISO-8859-15 are three different character sets.
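A few byte values are enough to show the difference. This Python sketch uses the standard codecs named latin-1, iso8859-15 and cp1252; the choice of sample bytes and the name table are mine.

```python
samples = [0x80, 0xA4, 0xBD]
charsets = ["latin-1", "iso8859-15", "cp1252"]

# Decode the same single byte under each character set.
table = {
    byte: {cs: bytes([byte]).decode(cs) for cs in charsets}
    for byte in samples
}

for byte, row in table.items():
    cells = ", ".join(f"{cs}: U+{ord(ch):04X}" for cs, ch in row.items())
    print(f"0x{byte:02X} -> {cells}")
# 0x80 -> latin-1: U+0080, iso8859-15: U+0080, cp1252: U+20AC
# 0xA4 -> latin-1: U+00A4, iso8859-15: U+20AC, cp1252: U+00A4
# 0xBD -> latin-1: U+00BD, iso8859-15: U+0153, cp1252: U+00BD
```

The euro sign sits at 0x80 in CP-1252 and at 0xA4 in ISO-8859-15, and does not exist at all in ISO-8859-1; the French œ (U+0153) is at 0xBD in ISO-8859-15 only.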

>
> I hope you now understand.
>
>
>
>>     8-bit character set is the core of Unicode.
>     I’m very surprised, speechless.
>
> Okay, Jean. You win. I am speechless too wondering why people have to go through all the rigmarole of perfecting a Unicode code block to get to a thing called Plain Text and then continue to struggle making it work with all the sweat, blood and money to finally end up with a crippled system just to protect the reputation of a handful of academics and technocrats when transliteration is very simple and works instantly.
…

>
>> It is the best place to park any language because it is the most stable part of the Unicode character database.
> I wouldn’t like to type in Japanese with 8-bit fonts, changing the font for every new glyph. I tried it once on Windows 3.1… It was very time-consuming.
>
> You caught the slip! I did not make one qualification. When I write in Singhala, I make a point to say that this does not include CJKV. Didn't do it here. I accept my fault. (Actually, I have to fine my translator)
>
>
>
>>
>>
>>>
>>> Yes there may remain some issues with older OSes that have limited
>>> support for standard OpenType layout tables. But there's now no
>>> problem at all since Windows XP SP2. Windows 7 has the full support,
>>> and for those users that have still not upgraded from Windows XP,
>>> Windows 8 will be ready in next August with an upgrade cost of about
>>> US$ 40 in US (valid offer currently advertised for all users upgrading
>>> from XP or later), and certainly even less for users in India and Sri
>>> Lanka.
>>>
>>>         The above are not any of my complaints.
>>> Per Capita Income in Sri Lanka $2400. They are content with cell phones. The practical place for computers is the Internet Cafe. Linux is what the vast majority needs.
>>>
>>>
>>> And standard Unicode fonts with free licences are already available
>>> for all systems (not just Linux for which they were initially
>>> developed);
>>>
>>>         Yes, only 4 rickety ones. Who is going to buy them anyway?
>>         Why would you buy them if they’re free?
>>
>>     Brilliant!
>>
>>
>>
>>> Still Iskoola Pota made by Microsoft by copying a printed font is the best. You check the Plain Text by mixing Singhala and Latin in the Arial Unicode MS font to see how pretty Plain text looks. They spent $2 or 20 million for someone to come and teach them how to make fonts. (Search ICTA.lk). Staying friendly with them is profitable. World bank backs you up too.
>>> Sometime in the 1990s when I was in Lanka, I tried to select a PC for my printer brother. We wanted to buy Adobe, Quark Express etc. The store keeper gave a list and asked us to select the programs. Knowing that they are expensive, I asked him first to tell me how much they cost. He said that he will install anything we wanted for free! The same trip coming back, in Zurich, the guys tried to give me an illicit copy of Windows OS in appreciation for installing German and Italian (or French?) code pages on their computers.
>>>
>>>             there even exists solutions for older versions of iPhone
>>>             4. OR on Android smartphones and tablets.
>>>
>>> Mine works in them with no special solution. It works anywhere that supports Open Type -- no platform discrimination
>>         Is there any platform discrimination with Unicode Sinhala?
>>
>>     You mean Apple / Windows / Linux?
>     I mean what you meant one line above.
>
> Okay, whatever.
>
>
>
>> Not really, but Microsoft was ahead of others. They all just support the crippled system.
>>
>>
>>
>>>
>>> No one wants to get back to the situation that existed in the 1980's
>>> when there was a proliferation of non-interoperable 8 bit encodings
>>> for each specific platform.
>>>
>>> I agree. Today, 14 languages, including English, French, German and Italian all share the same character space called ISO-8859-1.
>> In fact, ISO-8859-1 is not well suited for French (my native language): it lacks a few letters which were added to ISO-8859-15. However, I always use Unicode today, even for French-only texts.
>>
>> Jean, you are lucky because you use Latin letters. Latin letters are always bare individual letters. Sinhala is not so. It has all these other shaping complications and special rules are applied per Complex language.
>>
>> I think you could appreciate my dilemma. This is how I see it. Going outside ISO-8859-1 is lot of trouble.
>     Which ones?
>
> Like the Added Bonus characters, specifically, the ones with macrons and dots
But making an 8-bit Sinhalese font already is going outside ISO-8859-1.

>
>
>>     Should I enumerate them to you?
>     Why not?
>
> It becomes what we call in Singhala, hooðahooða madee = හෝදහෝද මඩේ -- getting in and out of the puddle, repeatedly washing your feet in it.
>
>
>

