Re: abstract characters, semantics, meaningful transformations ... Was: Tibetan Paluta
Thank you, professor. You wrote exactly what one would expect from a professor. It is a wonderful display of your prowess in the subject. Doctors and lawyers use Latin for concealment and self-preservation. Greenspan used Greenspanish. Unicode masters use Unicodish.

Indic is the name Unicode assigned to South Asian writing systems that are associated with Sanskrit vyaakaraNa. This is a result of what the good professor explains by "Unicode is not the realm of everyone; it's the realm of people with a certain amount of linguistic knowledge and computer knowledge". What is that 'certain amount' and which deity decides it? How do we unfortunate nincompoops decode it? Decoding itself is beyond us, indeed.

South Asians, especially Indians who already seem to have too many gods to deal with, do not need, though they might be tempted, to add an image of the exalted Unicode god behind a colorful curtain to sing praise to, with an alms box marked M$ beside it to get favors each time the high priest scrubs off some of its 'hairy esoteric mess' while surreptitiously (or, ignorantly?) adding more.

Brahmins were able to make any declaration because they were privileged. Similarly, Unicode experts can make declarations like, 'very few forms of writing are direct transcriptions of speech' and hide behind the 'in case' adjective 'direct' to avoid giving actual data. Of course, they can boldly count Sinhala as one that is not a direct transcription of speech. Speech getting transcribed into writing is itself a Unicodish.

Hark! The professor declares. So, boys and girls, if you want to pass the test, memorize this, even if it is obviously false: Printing made no difference to the fact that English has a dozen vowels with five letters to write them. The thorn has little impact on the ambiguity of English writing. The problem with printing is that it fossilizes the written language, and our spellings have stayed the same while the pronunciations have changed.
And the dissociation of sound and writing sometimes helps English; even when two English speakers from different parts of the world would have trouble understanding each other, writing is usually not so impaired. It is printing, with the dictionary industry, that fossilized writing and, as a result, forced speech to comply. The 'certain' level of knowledge above is now revealed. Language, dialect, creole, migration, intermixing of different peoples, accent...; where do these stand? Find ye by the foregoing what the fossil 'ye' actually was and what caused it to get fossilized in this form.

On 5/2/2017 5:31 AM, David Starner wrote: On Mon, May 1, 2017 at 7:26 AM Naena Guru via Unicode <unicode@unicode.org> wrote: This whole attempt to make digitizing Indic script some esoteric, 'abstract', 'semantic representation' and so on seems to me to be an attempt to make Unicode the realm of some superhumans. Unicode is like writing. At its core, it is a hairy esoteric mess; mix these certain chemicals the right ways, and prepare a writing implement and writing surface in the right (non-trivial) ways, and then manipulate that implement carefully to make certain marks that have unclear delimitations between correct and incorrect. But in the end, as much of that is removed from the problem of the user as possible; in the case of a modern word-processing system, it's a matter of hitting the keys and then hitting print, in complete ignorance of all the silicon and printing magic going on between. Unicode is not the realm of everyone; it's the realm of people with a certain amount of linguistic knowledge and computer knowledge. There's only a problem if those people can't make it usable for the everyday programmer and therethrough to the average person. The purpose of writing is to represent speech. Meh.
The purpose of writing is to represent language, which may be unrelated to speech (like in the case of SignWriting and mathematics) or somewhat related to speech--very few forms of writing are direct transcriptions of speech. Even the closest tend to exchange a lot of intonation details for punctuation that reveals different information. English writing was massacred when printing was brought in from Europe. No, it wasn't. Printing made no difference to the fact that English has a dozen vowels with five letters to write them. The thorn has little impact on the ambiguity of English writing. The problem with printing is that it fossilizes the written language, and our spellings have stayed the same while the pronunciations have changed. And the dissociation of sound and writing sometimes helps English; even when two English speakers from different parts of the world would have trouble understanding each other, writing is usually not so impaired.
Re: abstract characters, semantics, meaningful transformations ... Was: Tibetan Paluta
A little humor is very good. sarasvaþi was a sweet girl, I am sure, so much so that when she died, I think, those who imagined her beyond the practical made her rise up, up and fly away. Now you watch what happens to Elizabeth when she dies. They narrowly failed making one such with Hillary Clinton, as she is suspected of having Parkinson's, which condition her daughter says has an anecdotal remedy with MaryJane. Hmmm... Who went to her daughter's house instead of to the doctor when they suddenly fell? As for Thoth, he is okay. Don't worry. Egyptian man => demi-god => god has not much of a consequence in the West-dominated culture of this day.

On 5/1/2017 8:55 PM, Richard Wordingham via Unicode wrote: On Mon, 1 May 2017 19:49:27 +0530 Naena Guru via Unicode <unicode@unicode.org> wrote: The purpose of writing is to represent speech. It is not some secret that demi-gods created... Sarasvati and Thoth would be offended at being called mere demi-gods. sound => letter: that is the basis for writing. "=>" is not a particularly phonetic notation. It took quite a while for letters to become the primary part of writing anywhere, and they are not a universal phenomenon. Richard.

Okay, Richard. You probably have knowledge of how writing evolved in the whole world. Tell us how it was in South Asia. Was it like I said, sound => letter? I claim only to know about English and Indic in this respect.
abstract characters, semantics, meaningful transformations ... Was: Tibetan Paluta
This whole attempt to make digitizing Indic script some esoteric, 'abstract', 'semantic representation' and so on seems to me to be an attempt to make Unicode the realm of some superhumans. The purpose of writing is to represent speech. It is not some secret that demi-gods created it, which we are trying to explain with 'modern' linguistic gymnastics. sound => letter: that is the basis for writing. English writing was massacred when printing was brought in from Europe. A similar thing is happening to Indic by all this mumbo-jumbo. I call out to NATIVE users of Indic to explain what apparently Europeans or Americans are discussing here.

On 5/1/2017 10:47 AM, Philippe Verdy wrote: 2017-04-29 21:21 GMT+02:00 Naena Guru via Unicode <unicode@unicode.org>: Just about the name paluta: In Sanskrit, the length of vowels is measured in maaþra (a cognate of the word 'meter'). It is the spoken length of a short vowel. In Latin it is termed mora. Usually, you have only single and double length vowels. A paluþa length is like when you call out to somebody from a distance. Pluta is a careless use of spelling. Virama and Halanta are two other terms loosely used. Anyway, Unicode is only about DISPLAYING a script: There's a shape here; let's find how to get it by assembling other shapes or by creating a code point for it. What is short, long or longer in speech is no concern for Unicode.

Wrong. Unicode is absolutely not about how to "display" any script (except symbols and notational symbols). Unicode does not encode glyphs. Unicode encodes "abstract characters" according to their semantics, in order to assign them properties allowing meaningful transformations of text and in order to allow performing searches (with collation algorithms).
What is important is their properties (something that ISO 10646 did not care about when it started the UCS as a separate project, ignoring how it would be used, focusing too much on apparent glyphs, and introducing a lot of "compatibility characters" that would not have been encoded otherwise, creating some havoc in logical processing). Anyway, Unicode makes some exceptions to the logical model only for roundtrip compatibility with other standards that used another, widely used encoding model, notably in Thai: these are the exceptions where there are "prepended" letters. There was some havoc also for some scripts in India because of roundtrip compatibility with an Indian standard (criticized by many users of Tamil and some other Southern Indic scripts, which don't directly follow the paradigm created for getting some limited transliteration with Devanagari): that initial desire was abandoned, but the legacy Indic encodings were imported as-is into Unicode.
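The character-properties model described above (abstract characters carrying names, categories and normalization behaviour, independent of any glyph) is easy to demonstrate from a scripting language. A minimal sketch using Python's standard unicodedata module; the sample characters are chosen for illustration:

```python
import unicodedata

# Each abstract character carries a name and a general category,
# regardless of how any font draws it:
ka = "\u0D9A"  # a Sinhala consonant letter
print(unicodedata.name(ka))       # SINHALA LETTER ALPAPRAANA KAYANNA
print(unicodedata.category(ka))   # Lo (Letter, other)

# "Compatibility characters" of the kind mentioned above decompose
# under compatibility normalization (NFKD), e.g. the fi ligature:
assert unicodedata.normalize("NFKD", "\ufb01") == "fi"

# Canonical equivalence enables the "meaningful transformations"
# applied before searching or collating:
assert unicodedata.normalize("NFC", "e\u0301") == "\u00e9"
```

This is why two visually identical strings can still compare equal (or unequal) in a well-defined way: comparisons run over normalized abstract characters, not rendered shapes.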
Re: Tibetan Paluta
Just about the name paluta: In Sanskrit, the length of vowels is measured in maaþra (a cognate of the word 'meter'). It is the spoken length of a short vowel. In Latin it is termed mora. Usually, you have only single and double length vowels. A paluþa length is like when you call out to somebody from a distance. Pluta is a careless use of spelling. Virama and Halanta are two other terms loosely used. Anyway, Unicode is only about DISPLAYING a script: There's a shape here; let's find how to get it by assembling other shapes or by creating a code point for it. What is short, long or longer in speech is no concern for Unicode.

On 4/27/2017 1:57 PM, Srinidhi A via Unicode wrote: The annotation of 0F85 ྅ TIBETAN MARK PALUTA says it is used for avagraha. However, it seems this character denotes pluta instead of avagraha. Pluta is used for indicating elongation of a vowel. A similar character with an identical glyph is encoded in Soyombo (11A9D) with the name pluta. These characters likely derive from the digit ३, as ३ is used in Devanagari for indicating pluta. Figure 2 of L2/16-016 shows the usage of TIBETAN MARK PALUTA for pluta. What is the correct spelling in the Tibetan language, Paluta or Pluta? Can Tibetan scholars clarify the usage of the above character? If 0F85 is used for pluta, are there any distinct characters denoting avagraha in the Tibetan script? Srinidhi A
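For readers following along, the two character identities under discussion can be checked directly against the Unicode Character Database. A small Python check; the Soyombo lookup is guarded because that block was only added in Unicode 10.0 and may be absent from older Pythons' character data:

```python
import unicodedata

# The Tibetan mark whose annotation is in question:
assert unicodedata.name("\u0f85") == "TIBETAN MARK PALUTA"

# The Soyombo character Srinidhi compares it with; fall back to a
# placeholder if this Python's UCD predates Unicode 10.0:
print(unicodedata.name("\U00011a9d", "(unassigned in this UCD)"))
```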
Re: Go romanize! Re: Counting Devanagari Aksharas
Quote from below: The word indeed means 'danger' (Pali/Sanskrit _antarāya_). The pronunciation is /ʔontʰalaːi/; the Tai languages that use(d) the Tai Tham script no longer have /r/. The older sequence /tr/ normally became /tʰ/ (except in Lao), but the spelling has not been updated - at least, not amongst the more literate. The script has a special symbol for the short vowel /o/, which it shares with the Lao script. This symbol is used in writing that word. Two ways I have seen it spelt, each with two orthographic syllables, are ᩋᩫ᩠ᨶᨲᩕᩣ᩠ᨿ on-trAy (the second syllable has two stacks) and ᩋᩫᨶ᩠ᨲᩕᩣ᩠ᨿ o-ntrAy. I have also seen a form closer to Pali, namely _antarAy_, written ᩋᨶ᩠ᨲᩁᩂ᩠ᨿ a-nta-rAy. However, I have seen nothing that shows that I won't encounter ᩋᩢᨶ᩠ᨲᩁᩣ᩠ᨿ a-nta-rAy with the first vowel written explicitly, or even ᩋᩢ᩠ᨶᨲᩁᩣ᩠ᨿ an-ta-rAy. How does your scheme distinguish such alternatives?

Response: Perhaps this word is derived from Sanskrit 'anþaraða' (Search: antarada at http://www.sanskrit-lexicon.uni-koeln.de/cgi-bin/tamil/recherche) Sinhala: anþaraaðaayakayi, anþaraava, anþaraavayi, anþraava, anþraavayi Use this font to read the above Sinhala words: http://smartfonts.net/ttf/aruna.ttf -=- svasþi siððham! -=-

On 4/25/2017 2:07 AM, Richard Wordingham via Unicode wrote: On Mon, 24 Apr 2017 20:53:12 +0530 Naena Guru via Unicode <unicode@unicode.org> wrote: Quote by Richard: Unless this implies a spelling reform for many languages, I'd like to see how this works for the Tai Tham script. I'm not happy with the Romanisation I use to work round hostile rendering engines. (My scheme is only documented in variable hack_ss02 in the last script blocks of http://wrdingam.co.uk/lanna/denderer_test.htm.) For example, there are several different ways of writing what one might naively record as "ontarAy". MY RESPONSE: Richard, I stuck to the two specifications (Unicode and Font) and Sanskrit grammar. The akSara has two aspects, its sound (zabða, phoneme) and its shape
(letter, ruupa). Reduce the writing system to its consonants, vowels etc. (zabða) and assign SBCS letters/codes to them (ruupa). SBCS provides the best technical facilities for any language. (This is why now more than 130 languages romanize despite Unicode). Use English letters for similar sounds in the native speech. Now, treat all combinations as ligatures. For example, the 'po' sound in Indic has the p consonant with a sign ahead plus a sign after.

In many Indic scripts, yes. In Devanagari, the vowel sign is normally a single element classified as following the consonant. In Thai, the vowel sign precedes the consonant. Tai Tham uses both a two-part sign and a preceding sign. The preceding sign is for Tai words and the two-part sign for Pali words, but loanwords from Pali into the Tai languages may retain the two-part sign.

For the font, there is no difference between the way it makes the combination 'ä', which has a sign above, and the Indic having two on either side.

For OpenType, there is. The first can be made by providing a simple table of where the mark goes relative to the base character, in this case the diaeresis. The second is painfully complicated, for the 'p' may have other marks attached to it, so doing it by relative positioning is painfully complicated and error-prone. This job is given to the rendering engine, which may introduce its own problems. AAT and Graphite offer the font maker the ability to move the 'sign ahead' from after the 'p' to before it.

Recall that long ago, Unicode stopped defining fixed ligatures and asked the font makers to define them in the PUA.

While the first is true enough, I believe the second is false. Not every glyph has to be mapped to by a single character. I don't do that for contextual forms or ligatures in my font.

Spelling and speech: There is indeed confusion about writing and reading in Hindi, as I have observed. Like in English and Tamil, Hindi tends to end words with a consonant.
So, there is this habit among the Hindi speakers of dropping the ending vowel, mostly 'a', from words that actually end with it. For example, the famous name Jayantha (miserable mine too, haha! = jayanþa as Romanized), is pronounced Jayanth by Hindi speakers. It is a Sanskrit word. Sanskrit and languages like Sinhala have vowel endings and are traditionally spoken as such.

This loss is also to be found in Further India. Thai, Lao and Khmer now require that such a word-final vowel be written explicitly if it is still pronounced.

Looking at the word you gave, ontarAy, it looks to me like an Anglicized form. If I am to make a guess, its ending is like in ontarAyi. Is it said something like, own-the-raa-yi? (danger?) If I am right, this is a good example of the decline of a writing system owing to bad, uncaring application of technology. We are in the Digital Age, and we need not compromise any more. In fact, we can fix errors and decadence introduced by past technologies.
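The rendering-engine burden mentioned in this exchange (vowel signs stored in logical order but drawn before or around the consonant) can be sketched as a logical-to-visual reorder pass. The example below uses the well-known case of Devanagari VOWEL SIGN I (U+093F), which is stored after its consonant but displayed before it; the function is a deliberately naive illustration, not how real shapers work:

```python
# Devanagari VOWEL SIGN I (U+093F) follows its consonant in memory
# but precedes it visually, so a renderer must reorder before
# positioning glyphs. Naive single-sign illustration:
SIGN_I = "\u093f"

def visual_order(logical: str) -> str:
    """Move each U+093F before its preceding base (illustrative only)."""
    out = []
    for ch in logical:
        if ch == SIGN_I and out:
            out.insert(len(out) - 1, ch)   # hop before the base letter
        else:
            out.append(ch)
    return "".join(out)

# KA + SIGN I is displayed as [SIGN I][KA]:
assert visual_order("\u0915\u093f") == "\u093f\u0915"
```

Two-part signs like the Tai Tham one Richard mentions are harder still: the shaper must split one character into glyphs on both sides of the stack, which is why he calls relative positioning error-prone.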
Go romanize! Re: Counting Devanagari Aksharas
Quote by Richard: Unless this implies a spelling reform for many languages, I'd like to see how this works for the Tai Tham script. I'm not happy with the Romanisation I use to work round hostile rendering engines. (My scheme is only documented in variable hack_ss02 in the last script blocks of http://wrdingam.co.uk/lanna/denderer_test.htm.) For example, there are several different ways of writing what one might naively record as "ontarAy".

MY RESPONSE: Richard, I stuck to the two specifications (Unicode and Font) and Sanskrit grammar. The akSara has two aspects, its sound (zabða, phoneme) and its shape (letter, ruupa). Reduce the writing system to its consonants, vowels etc. (zabða) and assign SBCS letters/codes to them (ruupa). SBCS provides the best technical facilities for any language. (This is why now more than 130 languages romanize despite Unicode). Use English letters for similar sounds in the native speech. Now, treat all combinations as ligatures. For example, the 'po' sound in Indic has the p consonant with a sign ahead plus a sign after. For the font, there is no difference between the way it makes the combination 'ä', which has a sign above, and the Indic having two on either side. Recall that long ago, Unicode stopped defining fixed ligatures and asked the font makers to define them in the PUA.

Spelling and speech: There is indeed confusion about writing and reading in Hindi, as I have observed. Like in English and Tamil, Hindi tends to end words with a consonant. So, there is this habit among the Hindi speakers of dropping the ending vowel, mostly 'a', from words that actually end with it. For example, the famous name Jayantha (miserable mine too, haha! = jayanþa as Romanized), is pronounced Jayanth by Hindi speakers. It is a Sanskrit word. Sanskrit and languages like Sinhala have vowel endings and are traditionally spoken as such. The dictionary is a commercial invention.
When Caxton brought lead types to England, the French-speaking, Latin-flaunting elites did not care about the poor natives. Earlier, the invading Romans had forced them to drop Fuþark and adopt the 22-letter Latin alphabet. So, they improvised. They struck a line across d and made ð, Eth; added a sign to 'a' and made æ (Æsc); and continued using Thorn (þ) by rounding the loop. Lead type printing hit English for the second time, ruining it as the spell-standardizing began. Dictionaries sold. THE POWERFUL CAN RUIN PEOPLE'S PROPERTY BECAUSE THEY CAN, IN ORDER TO MAKE MONEY. Unicode enthusiasts, take heed!

Looking at the word you gave, ontarAy, it looks to me like an Anglicized form. If I am to make a guess, its ending is like in ontarAyi. Is it said something like, own-the-raa-yi? (danger?) If I am right, this is a good example of the decline of a writing system owing to bad, uncaring application of technology. We are in the Digital Age, and we need not compromise any more. In fact, we can fix errors and decadence introduced by past technologies.

RICHARD: That sounds like a letter-assembly system.

MY RESPONSE: Nothing assembled there, my friend.

On 4/24/2017 12:38 PM, Richard Wordingham via Unicode wrote: On Mon, 24 Apr 2017 00:36:26 +0530 Naena Guru via Unicode <unicode@unicode.org> wrote: The Unicode approach to Sanskrit and all Indic is flawed. Indic should not be letter-assembly systems. Sanskrit vyaakaraNa (grammar) explains the phonemes as the atoms of speech. Each writing system then assigns a shape to the phonetically precise phoneme. The most technically and grammatically proper solution for Indic is first to ROMANIZE the group of writing systems at the level of phonemes. That is, assign romanized shapes to vowels, consonants, prenasals, post-vowel phonemes (anusvara and visarjaniiya with its allophones) etc. This approach is similar to how European languages picked up Latin, improvised the script and even use the Simples and Capitals repertoire.
Romanizing immediately makes typing easier and eliminates sometimes embarrassing ambiguity in Anglicizing -- you type phonetically on key layouts close to QWERTY. (Only four positions are different in the Romanized Sinhala layout). If we drop the capitalizing rules and utilize caps to indicate the 'other' forms of a common letter, we get an intuitively typed system for each language, and a readable one too. When this is done carefully, comparing the phoneme sets of the languages, we can reach a common set of Latin-derived SINGLE-BYTE letters completely covering all phonemes of all Indic.

Unless this implies a spelling reform for many languages, I'd like to see how this works for the Tai Tham script. I'm not happy with the Romanisation I use to work round hostile rendering engines. (My scheme is only documented in variable hack_ss02 in the last script blocks of http://wrdingam.co.uk/lanna/denderer_test.htm.) For example, there are several different ways of writing what one might naively record as "ontarAy".

Next, each native script can be obtained by making orthographic smart fonts that display the SBCS codes in the respective shapes of the native scripts.
Re: Counting Devanagari Aksharas
The Unicode approach to Sanskrit and all Indic is flawed. Indic should not be letter-assembly systems. Sanskrit vyaakaraNa (grammar) explains the phonemes as the atoms of speech. Each writing system then assigns a shape to the phonetically precise phoneme. The most technically and grammatically proper solution for Indic is first to ROMANIZE the group of writing systems at the level of phonemes. That is, assign romanized shapes to vowels, consonants, prenasals, post-vowel phonemes (anusvara and visarjaniiya with its allophones) etc. This approach is similar to how European languages picked up Latin, improvised the script and even use the Simples and Capitals repertoire. Romanizing immediately makes typing easier and eliminates sometimes embarrassing ambiguity in Anglicizing -- you type phonetically on key layouts close to QWERTY. (Only four positions are different in the Romanized Sinhala layout). If we drop the capitalizing rules and utilize caps to indicate the 'other' forms of a common letter, we get an intuitively typed system for each language, and a readable one too. When this is done carefully, comparing the phoneme sets of the languages, we can reach a common set of Latin-derived SINGLE-BYTE letters completely covering all phonemes of all Indic. Next, each native script can be obtained by making orthographic smart fonts that display the SBCS codes in the respective shapes of the native scripts. I have successfully romanized Sinhala and revived the full repertoire of Sinhala + Sanskrit orthography, losing nothing. Sinhala script is perhaps the most complex of all Indic because it is used to write both Sanskrit and Pali. See this: http://ahangama.com/ (It's all SBCS underneath). Test here: http://ahangama.com/edit.htm

On 4/20/2017 5:05 AM, Richard Wordingham via Unicode wrote: Is there consensus on how to count aksharas in the Devanagari script? The doubts I have relate to a visible halant in orthographic syllables other than the first.
For example, according to 'Devanagari VIP Team Issues Report' http://www.unicode.org/L2/L2011/11370-devanagari-vip-issues.pdf, a derived form from Nepali श्रीमान् should be written श्रीमान्को and not श्रीमान्को . Now, if the font used has a conjunct for SHRA, I would count the former as having 4 aksharas SH.RII, MAA, N, KO and the latter as having 3 aksharas SH.RII, MAA, N.KO. If the font leads to the use of a visible halant instead of the vattu conjunct SH.RA, as happens when I view this email, would there then be 5 and 4 aksharas respectively? A further complication is that the font chosen treats what looks like SH, RA as a conjunct; the vowel I appears to the left of SH when added after RA (श्रि). Richard.
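The two spellings in Richard's example look alike in many fonts because they differ only by an invisible control character. A sketch of the underlying code point difference, assuming the standard Devanagari convention that ZERO WIDTH NON-JOINER after the virama requests a visible halant instead of the conjunct (the exact sequence is assembled here for illustration):

```python
# श्रीमान्को with visible halant on NA vs. with the N.KO conjunct:
# the forms differ only by U+200C after the second virama.
SHA, RA, MA, NA, KA = "\u0936", "\u0930", "\u092e", "\u0928", "\u0915"
II, AA, O = "\u0940", "\u093e", "\u094b"   # vowel signs
VIRAMA, ZWNJ = "\u094d", "\u200c"

shriman_ko_halant = SHA + VIRAMA + RA + II + MA + AA + NA + VIRAMA + ZWNJ + KA + O
shriman_ko_conjunct = shriman_ko_halant.replace(ZWNJ, "")

# Identical text content apart from the joiner control:
assert shriman_ko_halant != shriman_ko_conjunct
assert len(shriman_ko_halant) == len(shriman_ko_conjunct) + 1
```

Whichever akshara count one adopts, it must be computed from sequences like these, not from the rendered glyphs, since the font may or may not honour the conjunct.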
Singhala script ill defined by OpenType standard
Here is the proof that the OpenType standard defined the Singhala script wrongly. Also find a BNF grammar that describes it. http://ahangama.com/unicode/index.htm Thanks. ___ Unicode mailing list Unicode@unicode.org http://unicode.org/mailman/listinfo/unicode
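Without endorsing the linked analysis, the general idea of describing a script's valid clusters with a grammar can be illustrated with a regular expression over Sinhala code points. The consonant and vowel-sign ranges and the cluster shape below are rough simplifications for illustration, not the BNF from the linked page:

```python
import re

# Rough cluster grammar: consonant (+ al-lakuna [+ ZWJ] consonant)*
# + optional vowel sign + optional final al-lakuna. Ranges simplified.
CONS = "[\u0D9A-\u0DC6]"        # Sinhala consonant letters (approx.)
VIRAMA = "\u0DCA"               # al-lakuna
VOWEL_SIGN = "[\u0DCF-\u0DDF]"  # dependent vowel signs (approx.)
CLUSTER = re.compile(f"{CONS}(?:{VIRAMA}\u200D?{CONS})*{VOWEL_SIGN}?{VIRAMA}?")

assert CLUSTER.fullmatch("\u0D9A")                    # bare consonant
assert CLUSTER.fullmatch("\u0D9A\u0DCF")              # consonant + vowel sign
assert CLUSTER.fullmatch("\u0DB1\u0DCA\u200D\u0DAD")  # ZWJ-joined conjunct
assert not CLUSTER.fullmatch("\u0DCF")                # sign with no base
```

A real grammar would have to cover independent vowels, the touching/ligated conjunct distinction, and repaya, which is where most of the disagreement in this thread lives.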
Re: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka)
Okay, Doug. Type this inside the yellow text box in the following page: kaaryyaalavala yanþra pañkþi http://www.lovatasinhala.com/puvaruva.php Please tell me what sequence of Unicode Sinhala codes would produce what the text box shows.

On Mon, Mar 17, 2014 at 7:56 PM, Doug Ewell d...@ewellic.org wrote: Naena Guru wrote: In the case of romanized Singhala, any processing that English accepts, it accepts too. For RS, you select a font to display it in the native script because if it is mixed with English, both are using the same character space, just as when English and French are mixed. But English and French actually *use* the same letters, or at any rate most of them. With your approach, it is not possible to write Sinhala in the Sinhala script mixed with English or French or anything else in the Latin script. In web pages you can resort to <span> styling tricks, but this doesn't work for plain text. This is what people mean when they suggest that your real goal is to abolish the Sinhala script and just write in Latin. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell
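The question above ("what sequence of Unicode Sinhala codes would produce...") can be answered mechanically for any string once it is in hand. A small helper that dumps code points; the sample word and its joiner placement are an assumption for illustration, since the page's actual output was not quoted:

```python
def codepoints(s: str) -> str:
    """List each code point of a string in U+XXXX notation."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

# One plausible encoding of yanþra with ZWJ-joined conjuncts; the
# sequence an actual keyboard produces may differ:
word = "\u0dba\u0db1\u0dca\u200d\u0dad\u0dca\u200d\u0dbb"
print(codepoints(word))
# -> U+0DBA U+0DB1 U+0DCA U+200D U+0DAD U+0DCA U+200D U+0DBB
```

Pasting the text box's output into such a helper settles this kind of dispute faster than arguing from rendered glyphs.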
Re: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka)
Making a keyboard is not hard. You can either edit an existing one or make one from scratch. I made the latest Romanized Singhala one from scratch. The earlier one was an edit of US-International. When you type a key on the physical keyboard, you generate what is called a scan-code of that key so that the keyboard driver knows which key was pressed. (During DOS days, we used to catch them to make menus.) Now, you assign one or a sequence of Unicode characters you want to generate for the keypress. Use Microsoft's keyboard layout creator for all versions of Windows from XP: http://msdn.microsoft.com/en-us/goglobal/bb964665.aspx Select the language carefully. I selected US-English for RS. That way, I can switch between the two keyboards quickly with Ctrl+Shift. You can change all these in the Control Panel. Here is the keymap I made for RS in Linux: http://ahangama.com/apiapi/singhala/linuxkb-s.php Just scroll down for the English part. (The lines starting with double slashes are comments and have no effect on the program.) The Macintosh key layout is easy too. The story with iOS and Android is different but not hard either.

On Sun, Mar 16, 2014 at 6:47 PM, Doug Ewell d...@ewellic.org wrote: Jean-François Colson jf at colson dot eu wrote: The idea here was “that characters not on an ordinary QWERTY keyboard could be entered _using_an_ordinary_QWERTY_keyboard._” Are there any dead keys on an _ordinary_ (i.e. not one using an international(ized) driver) QWERTY keyboard? Not on the standard vanilla U.S. keyboard. It has to be provided by the OS, via a driver, just as Compose key support has to be provided by the OS. The standard vanilla U.S. keyboard also doesn't provide the accented letters and other non-ASCII letters like ð that Naena Guru uses for his font hack. If a character is available by a dead key, isn't it on the keyboard? It depends on what you mean by "on the keyboard".
Thanks to John Cowan's delightful Moby Latin keyboard layout, I can type AltGr+\ followed by 7 to get the fraction ⅐ (one-seventh). That character is not on the keyboard in any sense other than what the driver provides. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell
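The driver layer being discussed (scan-code to Unicode mapping, plus dead-key state) can be modeled as a lookup table and a tiny state machine. Everything below is invented for illustration: the key names, the thorn/eth outputs, and the fraction table are in the spirit of the examples in this thread, not any actual layout:

```python
# Toy keyboard-driver model: (key, shift) -> Unicode output, with one
# dead key that combines with the following keystroke.
LAYOUT = {
    ("d", False): "ð", ("d", True): "Ð",   # eth, as in the font hack
    ("t", False): "þ", ("t", True): "Þ",   # thorn
}
FRACTION_DEAD = {"7": "\u2150", "9": "\u2151", "3": "\u2153"}  # ⅐ ⅑ ⅓

def press(state, key, shifted=False):
    """Return (new_state, emitted_text) for one keystroke."""
    if state == "fraction":                 # a dead key is pending
        return None, FRACTION_DEAD.get(key, key)
    if key == "ALTGR_BACKSLASH":            # arm the dead key, emit nothing
        return "fraction", ""
    return None, LAYOUT.get((key, shifted), key.upper() if shifted else key)

state, _ = press(None, "ALTGR_BACKSLASH")
state, ch = press(state, "7")
assert ch == "\u2150"    # ONE SEVENTH, as in the AltGr+\ 7 example
assert press(None, "d")[1] == "ð"
```

This is also why a dead key is "on the keyboard" only in the driver's sense: the physical key emits a scan-code; the character exists only in the mapping.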
Re: Romanized Singhala got great reception in Sri Lanka
You have a lot of stars, my friend. You ARE the Lion. Roar!! You have a lot of friends in Sri Lanka, indeed, and one thanked you highly for the service you did for the Language too. Good for you. However, he did not know the way to the Public Library or the nearest Buddhist temple or the Christian church that would have enlightened him on the Singhala writing system, so he could have helped you make a meaningful proposal rather than copying something that somebody else forwarded with a question. Did any one of them show you Rev. Fr. A. M. Gunasekara's book? No! Who signed Mettavihari purportedly for the Singhala user group?

On Mon, Mar 17, 2014 at 4:51 AM, Michael Everson ever...@evertype.com wrote: On 17 Mar 2014, at 02:48, Naena Guru naenag...@gmail.com wrote: You are talking about something you do not know. I am a Singhalese. That doesn't give you any special knowledge or privilege. I know many people from Sri Lanka, who work in the area of computing, and who work with the Sinhala characters as encoded in the UCS. And really. The Lion of Unicode? My stars. Michael Everson

On Sun, Mar 16, 2014 at 6:18 AM, Michael Everson ever...@evertype.com wrote: On 16 Mar 2014, at 04:12, Naena Guru naenag...@gmail.com wrote: Dual-script Singhala means romanized Singhala that can be displayed either in the Latin script or in the Singhala script using an Orthographic Smart Font... What a terrible, terrible idea. You are essentially promoting giving up writing Sinhala, in favour of a slightly-bigger-than-ASCII Latin font hack. Dual-script Singhala is the proper and complete solution on the computer for the Singhala script used to write the Singhala, Sanskrit and Pali languages. No, it isn't. It's a huge step backwards, unless you propose abolishing the Sinhala script entirely and just writing in Latin. The government ministries, media and people welcomed it with enthusiasm and relief that there is something practical for Singhala.
The response in the country was singularly positive, except for the person who filibustered the QA session of the presentation by speaking about the hard work done on Unicode Sinhala, clearly outside the subject matter of the presentation. That person understood the nature of data integrity. As does everyone who cares about the Universal Character Set. Michael Everson * http://www.evertype.com/
Re: Romanized Singhala got great reception in Sri Lanka
Romanized to Unicode? Romanizing is inside Unicode. English and all Western European languages also use Unicode. Romanized Singhala resides in the Latin-1 character set, that is, between U+0020 and U+00FF. Unicode Singhala resides in the range U+0D80 to U+0DFF. There is no difference between RS and those languages except that the users live in an island far away from those others. Is there some reason you want to convert romanized Singhala to Unicode Singhala, a terrible specification that is already corrupting the language? I spoke to serious users such as journalists and teachers just a few weeks back. It is unfortunate that you guys are still hanging on to it. Why?

The proof-of-concept font I made has glyph substitutions. That is how it can apply an orthography. Unicode Singhala is completely a botched work. It has vowels each with two codes, one for the stand-alone form and one for its sign. Each consonant is considered as having the embedded (intrinsic!) vowel a. It is not a consonant, people. Then it has two ligatures included as basic consonants. These do not have normalization rules: (1) because they are NOT canonical forms, as there was no precedent digital form of Singhala for backward compatibility; (2) because the block was submitted after Unicode closed receiving applications for normalizing canonical forms. How on earth can you make a sorting method for it? When you backspace, it destroys multiple keystrokes. Search and replace is not possible, at least the way we do it with English. Typing is a nightmare. There are special rules for making Unicode Singhala fonts. The keyboards have keys to type pieces of letters not in the code block.

As you see, this is a terrible mess and cannot be straightened, granted few people use it, and there'll be more. What other choice do they have except Anglicizing? In Singhala, they say, balu valigee uµa purukee ðaalaa hæðuvaþ nææ æðee ærennee (බලු වලිගේ උණ පුරුකේ දාලා හැදුවත් නෑ ඇදේ ඇරෙන්නේ - I inserted all joiners, but can't guarantee if vowel signs would pop out).
It means you cannot straighten a dog's tail even if you put it in a bamboo piece. You cannot fix Unicode Singhala and, sadly, it is bringing down the language with it.

On Sun, Mar 16, 2014 at 11:05 AM, William_J_G Overington wjgo_10...@btinternet.com wrote: So, everyone, can the Romanized Singhala system be used with a QWERTY keyboard to produce Unicode-encoded text, thereby producing a good combined system? Could this be achieved if a text-processing software package were produced that could automatically perform a character string to character string substitution (namely Romanized Singhala character string to Unicode character string) that would be applied before any OpenType glyph to glyph substitution? The character string to character string substitution rules could be stored in a text file, such as a UTF-16 text file saved from WordPad, that format being what WordPad describes as a Unicode Text Document file type. Could this be achieved? If so, text entry could use an ordinary QWERTY keyboard and yet the resulting text would be stored using the appropriate Unicode characters for the script and the font would use Unicode mappings. William Overington 16 March 2014
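Overington's proposed pre-OpenType substitution pass is essentially longest-match transliteration, which is straightforward to implement. A minimal sketch; the rule table below consists of invented stand-ins, not the actual Romanized Singhala scheme:

```python
# Longest-match transliteration: romanized input -> Unicode Sinhala.
# The rule table is illustrative only.
RULES = {
    "kaa": "\u0d9a\u0dcf",  # ka + aela-pilla (long-a vowel sign)
    "ka":  "\u0d9a",        # ka (inherent vowel)
    "k":   "\u0d9a\u0dca",  # ka + al-lakuna (pure consonant)
}

def transliterate(text: str) -> str:
    out, i = [], 0
    while i < len(text):
        for length in (3, 2, 1):          # try the longest rule first
            chunk = text[i:i + length]
            if len(chunk) == length and chunk in RULES:
                out.append(RULES[chunk])
                i += length
                break
        else:
            out.append(text[i])           # pass unmapped input through
            i += 1
    return "".join(out)

assert transliterate("kaaka") == "\u0d9a\u0dcf\u0d9a"
```

The longest-match ordering matters: without it, "kaa" would be consumed as "ka" + stray "a". Real schemes also need rules for independent vowels and conjunct joiners, which is where the ambiguity Doug raises comes in.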
Re: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka)
Marc, Yes, making keyboard layouts is not difficult. I believed that language tools were selected for each language manually at input time. I did not know that there is automatic switching of language tools, say, when you switch to a French keyboard from English. That shouldn't be difficult to make for RS, though. In the case of romanized Singhala, any processing that English accepts, it accepts too. For RS, you select a font to display it in the native script, because if it is mixed with English, both are using the same character space, just as when English and French are mixed. Spell checking and grammar checking for Unicode Singhala? There are no such things for it there. It is at the stage of struggling to input text: special programs, physical keyboards etc. I saw them. They have a special IT category of employees to input Unicode Sinhala. There are special places called typesetting kiosks in Lanka where you go to get your résumé and term paper printed.

On Mon, Mar 17, 2014 at 3:33 PM, Marc Durdin m...@keyman.com wrote: I disagree. Making a basic keyboard layout is not hard, just like making a font without OpenType support is not that hard. Making a keyboard layout that doesn’t force users to learn the nuances of the encoding of a script is more of a challenge, and making a high-quality keyboard layout that is consistent, easy to use, and efficient is anything but straightforward. Most keyboard layouts fail at one of these. The story for touch-device input is even more challenging. Not being constrained to a physical set of keys increases your flexibility. The big challenge is usually the size of the display on mobile-sized devices. Regarding keyboard design: · Scan make/break codes are not really relevant to Windows keyboards – Windows has an abstraction layer of ‘virtual key’ codes, for better or worse. · Selecting US-English for a non-English keyboard means that all language tools will break with your text.
Spell checking, grammar checking, automatic keyboard selection, autocorrect, font selection and more. That’s a big price to pay. Conversely, selecting Singhala for your Romanised non-Unicode encoding will break spell checking, grammar checking, automatic keyboard selection, autocorrect, font selection and more. Marc *From:* Unicode [mailto:unicode-boun...@unicode.org] *On Behalf Of *Naena Guru *Sent:* Tuesday, 18 March 2014 3:08 AM *To:* Doug Ewell *Cc:* UnicoDe List *Subject:* Re: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka) Making a keyboard is not hard. You can either edit an existing one or make one from scratch. I made the latest Romanized Singhala one from scratch. The earlier one was an edit of US-International. When you type a key on the physical keyboard, you generate what is called a scan code for that key, so that the keyboard driver knows which key was pressed. (During DOS days, we used to catch them to make menus.) Then you assign one or a sequence of Unicode characters you want to generate for the keypress. Use Microsoft's Keyboard Layout Creator for all versions of Windows from XP: http://msdn.microsoft.com/en-us/goglobal/bb964665.aspx Select the language carefully. I selected US-English for RS. That way, I can switch between the two keyboards quickly with Ctrl+Shift. You can change all these in the Control Panel. Here is the keymap I made for RS in Linux: http://ahangama.com/apiapi/singhala/linuxkb-s.php Just scroll down for the English part. (The lines starting with double slashes are comments and have no effect on the program.) The Macintosh key layout is easy too. The story with iOS and Android is different, but not hard either. On Sun, Mar 16, 2014 at 6:47 PM, Doug Ewell d...@ewellic.org wrote: Jean-François Colson jf at colson dot eu wrote: The idea here was “that characters not on an ordinary QWERTY keyboard could be entered _using_an_ordinary_QWERTY_keyboard._” Are there any dead keys on an _ordinary_ (i.e.
not one using an international(ized) driver) QWERTY keyboard? Not on the standard vanilla U.S. keyboard. It has to be provided by the OS, via a driver, just as Compose key support has to be provided by the OS. The standard vanilla U.S. keyboard also doesn't provide the accented letters and other non-ASCII letters like ð that Naena Guru uses for his font hack. If a character is available by a dead key, isn’t it on the keyboard? It depends on what you mean by on the keyboard. Thanks to John Cowan's delightful Moby Latin keyboard layout, I can type AltGr+\ followed by 7 to get the fraction ⅐ (one-seventh). That character is not on the keyboard in any sense other than what the driver provides. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell
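The two-keystroke lookup Doug describes can be sketched as a tiny table-driven function: the first keystroke produces no output but selects a table, and the next keystroke is looked up in it. The ⅐ example is Doug's; the table entries and function names are otherwise hypothetical illustrations, not any real driver's data.

```python
# Minimal sketch of a dead-key / two-keystroke table, as provided by an
# OS keyboard driver. Entries here are illustrative assumptions.
DEAD_KEYS = {
    "fraction": {"7": "\u2150", "9": "\u2151"},  # ⅐ (1/7), ⅑ (1/9)
    "acute":    {"e": "é", "a": "á"},
}

def type_sequence(dead: str, key: str) -> str:
    """Resolve a dead-key state plus a following keystroke to a character."""
    table = DEAD_KEYS.get(dead, {})
    return table.get(key, key)  # unmapped keys fall through unchanged

print(type_sequence("fraction", "7"))  # ⅐
```

As Doug notes, none of this lives "on the keyboard": the hardware only reports scan codes, and the driver supplies the mapping.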
Re: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka)
Doug, Making keyboard layouts for Unicode Singhala is hard not because of any fault of Unicode. It is the complexity of letter assembly. I used the Wijesekara keyboard on a 24in Olympia Singhala typewriter in the 1970s. It is radically different from US-English. I tried to make a phonetic one to relate it somewhat to the English keys. Still, you need many shifted keys to get common letters.

On Mon, Mar 17, 2014 at 11:38 AM, Doug Ewell d...@ewellic.org wrote: Naena Guru naenaguru at gmail dot com wrote: Making a keyboard [layout] is not hard. You can either edit an existing one or make one from scratch. I made the latest Romanized Singhala one from scratch. The earlier one was an edit of US-International. I've made a couple dozen of them myself, with MSKLC. When you type a key on the physical keyboard, you generate what is called a scan-code of that key so that the keyboard driver knows which key was pressed. (During DOS days, we used to catch them to make menus.) Now, you assign one or a sequence of Unicode characters you want to generate for the keypress. Precisely. As Marc Durdin said, you can create a keyboard layout just as easily for Unicode characters as for ASCII and Latin-1 characters. You can also assign a combination of characters to a single key. So it is not true that typing Unicode Sinhala requires you to learn a key map that is entirely different from the familiar English keyboard, while losing some marks and signs too. Unicode does not prescribe any key map. You can have whatever layout you like. As Marc also said, if you think there are marks and signs missing from Unicode, that is another matter. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell
Re: Romanized Singhala got great reception in Sri Lanka
Thank you, Ken. You analyzed it very nicely. The reason I said that the signs might pop out is that I have had complaints of that happening. I think this is because implementation of proper rendering is behind in some systems. On input, I tried to make a layout that is close to QWERTY, but failed because of the need for too many combination keys. Keyman uses the old Wijesekara typewriter keyboard. I saw a better one on the front page for Singhala but did not find it further inside. Marc would know, of course. Anyway, my complaint is that Unicode Singhala is incomplete and wrong, and that it has a deleterious effect on the language, one of the oldest in the world. What's aggravating is that they institutionalize errors as correct. Rev. Fr. Perera warned against this 80 years ago. I suppose I wouldn't have much to say if the 58 phonemes were used to replace the ones there. It will not happen.

On Mon, Mar 17, 2014 at 8:36 PM, Whistler, Ken ken.whist...@sap.com wrote: Well, I actually don’t see. I took a look at the Sinhala you inserted in this email. I cannot tell what you did at your input end (about “inserted all joiners”), but there are no actual joiners in the text itself. It displayed just fine in my email (including the correct conditional formatting of the –u vowel applied to the ra in pu*ru*kee), without me doing anything special (or installing any hacked font). Why? Because it was transmitted in plain Unicode. I cut and pasted that Unicode Sinhala string into a Word document, and it worked just fine. The boundaries for all the syllables were correctly detected. I saved it as a plain text UTF-8 file, and it worked just fine. I even then read the plain text UTF-8 file into a UTF-8 aware programming editor, and it worked just fine.
(In a programming editor, which doesn’t attempt complex script rendering, the vowels don’t apply to the consonants and no reordering is done, so the display isn’t correct, but each character is correctly preserved, and if I write it back out to a document and read it in Word or some other tool that has access to proper rendering, it is still fine.) And all that interoperability works, why? Because this is plain Unicode. So while I don’t doubt that people may be having serious issues with input methods for Sinhala, I tend to agree with Marc Durdin that you are confusing encoding with input methods. Yes, I know you know the difference, but it appears to me that the inescapable conclusion from your argumentation is that the highest priority for the design of an encoding system should be to make the design of input methods as simple as possible. And in my estimation, that is confusing encoding with input methods. The art of input methods is to hide encoding details from users, and instead to provide them with an abstraction that they find easy to use and which accords with their general understanding of the writing system they are using. If done correctly, then the details of the input method *also* recede into the background, and users then simply do what they want: write and edit text easily on their devices. --Ken P.S. Here is an octal dump of that text (after I inserted a closing parenthesis in the editor). Sinhala sequence highlighted. Plain Unicode in UTF-8, no fancy stuff, and works just fine. 
000000 EF BB BF 62 61 6C 75 20 76 61 6C 69 67 65 65 C2
000020 A0 75 C2 B5 61 20 70 75 72 75 6B 65 65 C2 A0 C3
000040 B0 61 61 6C 61 61 20 68 C3 A6 C3 B0 75 76 61 C3
000060 BE 20 6E C3 A6 C3 A6 20 C3 A6 C3 B0 65 65 20 C3
000100 A6 72 65 6E 6E 65 65 0D 0A 28 E0 B6 B6 E0 B6 BD
000120 E0 B7 94 20 E0 B7 80 E0 B6 BD E0 B7 92 E0 B6 9C
000140 E0 B7 9A 20 E0 B6 8B E0 B6 AB 20 E0 B6 B4 E0 B7
000160 94 E0 B6 BB E0 B7 94 E0 B6 9A E0 B7 9A 20 E0 B6
000200 AF E0 B7 8F E0 B6 BD E0 B7 8F 20 E0 B7 84 E0 B7
000220 90 E0 B6 AF E0 B7 94 E0 B7 80 E0 B6 AD E0 B7 8A
000240 20 E0 B6 B1 E0 B7 91 20 E0 B6 87 E0 B6 AF E0 B7
000260 9A 20 E0 B6 87 E0 B6 BB E0 B7 99 E0 B6 B1 E0 B7
000300 8A E0 B6 B1 E0 B7 9A 29 0D 0A 0D 0A
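Ken's round-trip claim is easy to reproduce. The sketch below takes the proverb from the earlier message, encodes it to UTF-8 and decodes it back, and checks that the string is preserved byte-for-byte, with no joiners or font tricks involved.

```python
# Ken's point as a round-trip check: plain Unicode Sinhala survives
# UTF-8 encoding and decoding unchanged.
proverb = "බලු වලිගේ උණ පුරුකේ දාලා හැදුවත් නෑ ඇදේ ඇරෙන්නේ"

encoded = proverb.encode("utf-8")   # Sinhala code points take 3 bytes each
decoded = encoded.decode("utf-8")

assert decoded == proverb           # lossless round trip
print(len(proverb), "code points,", len(encoded), "UTF-8 bytes")
```

Any tool in the chain that stores and forwards those bytes without mangling them will reproduce the text exactly, which is the interoperability Ken describes.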
Re: Romanized Singhala got great reception in Sri Lanka
of the Sanskrit hodiya. Rev. Fr. Theodore G. Perera's grammar book (1932) and Rev. Fr. A. M. Gunasekera's book (1891), which dug up sinking Singhala, fully describe the writing system. Like most other languages, including English before printing arrived in England, it is written phonetically. Singhala was first romanized in the 1860s by Rhys Davids, in what is called the PTS scheme, to print Pali (Magadhi) in the Latin script. This requires letters with bars (macrons) and dots not found in common fonts. This scheme is called PTS Pali. It is similar to IAST Sanskrit. It is impossible to type these on the regular keyboard. I freshly romanized Singhala by mapping its phonemes to the SAME area where 13 Western European languages mapped their alphabetic letters, within the following Unicode code charts: http://www.unicode.org/charts/PDF/U.pdf http://www.unicode.org/charts/PDF/U0080.pdf So, if that is creating a transcoding table, all Europeans did it and I do it too.

On Sun, Mar 16, 2014 at 12:36 AM, Philippe Verdy verd...@wanadoo.fr wrote: Don't you realize that what you are trying to create is completely off the topic of Unicode, as it is simply another new 8-bit encoding, similar to what ISCII does for supporting multiple Indic scripts with a common encoding/transcoding table? The ISCII standard has shown its limitations; it cannot be enough to support all scripts correctly and completely, and it has lots of unsolved ambiguities for tricky cases or historic orthographies, or newer orthographies, that the UCS encoding better supports due to its larger character set and more precise character properties and algorithms. You are in fact creating a transcoding table... Except that you are mixing the concepts, and the Unicode and ISO technical committees working on the UCS don't need to handle new 8-bit encodings. And you'll soon experience the same problems as with ISCII and all other legacy 8-bit encodings: very poor INTEROPERABILITY due to version tracking or complex contextual rules...
You may still want to promote it at some government or education institution, in order to promote it as a national standard, except that there's little chance it will ever happen when all countries in ISO have stopped working on standardization of new 8-bit encodings (only a few are maintained, but these are the most complex ones used in China and Japan). Well, in fact only Japan now seems to be actively updating its legacy JIS standard, and only with the focus of converging it to use the UCS and solving ambiguities or some technical problems (e.g. with emojis used by mobile phone operators). Even China stopped updating its national standard, by publishing a final mapping table to/from the full UCS (including for characters still not encoded in the UCS): this simplified the work because only one standard needs to be maintained instead of two. Note that as long as there is no national standard supporting your proposed encoding, there is no chance that the font standards will adopt it. You may still want to register your encoding in the IANA registry, but you'll need to pass the RFC validation. And there are lots of technical details missing in your proposal that would be needed to support it with a standard mapping in fonts. There is a better chance for you to promote it only as a transliteration scheme, or as an input method for a keyboard layout (both are also not in the scope of the Unicode and ISO/IEC 10646 standards, though they could be in the scope of the CLDR project, which is not by itself a standard but just a repository of data, supported by a few standards)... Think about it. 2014-03-16 5:12 GMT+01:00 Naena Guru naenag...@gmail.com: I made a presentation demonstrating Dual-script Singhala at the National Science Foundation of Sri Lanka. Most of the attendees were government employees and media representatives; a few private citizens came too.
Dual-script Singhala means romanized Singhala that can be displayed either in the Latin script or in the Singhala script using an Orthographic Smart Font. It is easy to input (phonetically) using a keyboard layout slightly altered from QWERTY. The font uses the Standard Ligature feature liga of the OpenType / OpenFont standard to display glyphs of Sanskrit ligatures as well as many Singhala letters. The font is supported across all OSs: Windows, Macintosh, Linux, iOS and Android. Dual-script Singhala is the proper and complete solution on the computer for the Singhala script, used to write the Singhala, Sanskrit and Pali languages. The same solution can be applied for all Indic languages. The government ministries, media and people welcomed it with enthusiasm and relief that there is something practical for Singhala. The response in the country was singularly positive, except for the person who filibustered the Q&A session of the presentation, speaking about the hard work done on Unicode Sinhala, clearly outside the subject matter of the presentation.
Romanized Singhala got great reception in Sri Lanka
I made a presentation demonstrating Dual-script Singhala at the National Science Foundation of Sri Lanka. Most of the attendees were government employees and media representatives; a few private citizens came too. Dual-script Singhala means romanized Singhala that can be displayed either in the Latin script or in the Singhala script using an Orthographic Smart Font. It is easy to input (phonetically) using a keyboard layout slightly altered from QWERTY. The font uses the Standard Ligature feature liga of the OpenType / OpenFont standard to display glyphs of Sanskrit ligatures as well as many Singhala letters. The font is supported across all OSs: Windows, Macintosh, Linux, iOS and Android. Dual-script Singhala is the proper and complete solution on the computer for the Singhala script, used to write the Singhala, Sanskrit and Pali languages. The same solution can be applied for all Indic languages. The government ministries, media and people welcomed it with enthusiasm and relief that there is something practical for Singhala. The response in the country was singularly positive, except for the person who filibustered the Q&A session of the presentation, speaking about the hard work done on Unicode Sinhala, clearly outside the subject matter of the presentation. The result of the survey passed around was 100% as below (translated from Singhala):
1. I believe that Dual-script Singhala is convenient to me as it is implemented similar to English - Yes
2. Today everyone uses Unicode Sinhala. It is easy and has no problems - No
3. The cost of Unicode Sinhala should be eliminated by switching to Dual-script Singhala - Yes
4. We should amend Pali text in the Tripitaka according to rulings of SLS1134 - No
5. Digitizing old books is a very important thing - Yes
6. We should focus on making this easy-to-use Dual-script Singhala method a standard - Yes
Please comment or send questions.
Re: interaction of Arabic ligatures with vowel marks
Please see this page (for IE, use v 2010 and up): http://lovatasinhala.com/ The font is almost all ligatures. If you copy and inspect the text, you'll notice that it is simple romanized Singhala. I am currently in Sri Lanka demonstrating this. The people at the president's office and one of the powerful ministers have seen it. They are elated that, after all, Singhala, the most complex of 'abugidas', is much like a Western European language and amazingly computer and user friendly. This is contrary to how it was portrayed to them by local academics and technocrats, causing the poor country unnecessary debt. The ideas of Abugida and Complex fade away if a font is made fully understanding Unicode's description of ligatures and how they are implemented by OpenType (now OpenFont). I believe that Arabic and Hebrew can follow this model so that typing the script is simplified for users without compromising orthography.

On Wed, Jun 12, 2013 at 8:39 AM, Stephan Stiller stephan.stil...@gmail.com wrote: Hi, How is the placement of vowel marks around ligatures handled in Arabic text? Does anyone have good pointers on this topic? My guess is that this does not come up often (just like the topic of pointing for handwritten Hebrew), as vowel marks are mostly not added in ordinary text. Nonetheless, any text making heavy use of ligatures will from time to time need to add vowel marks for a foreign name or as a reading aid, and (as many of us know) the Quran is traditionally printed with vowel marks. I'm also wondering how font designers normally handle this. I think there are analogous questions for various ligature-heavy abugidas, so there must be an existing body of knowledge. There should be better answers than "squeeze the vowels around the consonant clusters in whatever way seems most intuitive". Do traditional printing presses use extra metal types for such glyph clusters, or do they manually add and adjust the positioning of vowels?
Stephan
Re: Interoperability is getting better ... What does that mean?
Thank you for commenting, and Happy New Year. "CP-1252 is a perfectly legal web character set, and nobody is going to argue with you if you want to use it in legal ways. (I.e. writing Latin script in it, not Sinhala.) But ..." Okay, what is implied is that I am doing something illegal. Define what I am doing that is illegal, and cite the rule and its purpose of preventing what harm to whom. May I ask if the following two are Latin script, English or Singhala? 1. This is written in English. 2. mee laþingaþa síhalayi. For me, both are Latin script, and 1 is English and 2 is Singhala (it says, 'this is romanized Singhala'). The following are the *only* Singhala language web pages that pass HTML validation (challenge me): http://www.lovatasinhala.com/ They are in romanized Singhala. The statement "the death of most character sets makes everyone's systems smaller and faster" is *FALSE*. Compare the sizes of the following two files that are copies of a newspaper article. The top part in red has a few more words in romanized Singhala in the romanized Singhala file. Notice the size of each file: 1. http://ahangama.com/jc/uniSinDemo.htm size: 38,092 bytes 2. http://ahangama.com/jc/RSDemo.htm size: 18,922 bytes As the size of the page grows, the Unicode Sinhala version tends toward double the size of its romanized Singhala version. Unicode Sinhala characters become 50% larger when UTF-8 encoded for transmission. That is three times the size of the romanized Singhala file. So, the Unicode Sinhala file consumes 3 times the bandwidth needed to send the romanized Singhala file. "more likely to correctly show them the document instead of trash" Again, *demonstrably WRONG*: Unicode Sinhala is trash on a machine that does not have the fonts. It is trash also if the font used by the OS is improperly made, such as on the iPhone. It is generally trash because the SLS1134 standard corrupts at least one writing convention (the brandy issue).
On the other hand, romanized Singhala is always readable whether you have the font or not. It is not helpful to criticize Singhala related things without making a serious effort to understand the issues. The blind men thought different things about the elephant. If you mean that everyone should start using 16-bit Unicode characters, I have no objection to that. It would happen if and when all applications implement it. I cannot fight that even if I want to. But I do not see users of English doing anything different from what they are doing now, like my typing now, I think, using 8-bit characters. (I can verify that by copying it and pasting into a text editor.) I showed that Singhala can be romanized, and that all the problems of ill-conceived Unicode Indic can be eliminated by carefully studying the grammar of the language and romanizing. (I used the word 'transliterate' earlier, but the correct word is transcribe.) I did it for Singhala and made an OpenType font to show it perfectly in the traditional Singhala script. So far, there is one RS smartfont and six Unicode fonts, even after spending $20M on a foreign expert to tell how to make fonts, though it is right on the web in the same language the expert spoke in. My work irritates some, maybe because it is an affront to their belief that they know all and decide all. Some feel let down that they could not think of it earlier, and maybe write about a strange discovery like Abugida and write a book on the nonsense. Most of all, I think it is just a cultural block on this side of the globe. As for Lankan technocrats, their worry is that the purpose of ICTA would come unraveled. I went there in November, and it was revealed to me (by one of its employees) that its purpose is to provide a single point of contact for foreign vendors that can use local experts as their advocates.
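The file-size argument in this message can be checked on a small sample. The sketch below compares a fragment of the proverb quoted earlier in the thread, once as romanized text storable in cp1252 (one byte per character) and once as Unicode Sinhala in UTF-8 (three bytes per letter). The byte counts hold for these short strings only; whole-page ratios also depend on markup, which is ASCII in both cases.

```python
# Hedged check of the size claim: same text, two encodings.
romanized = "balu valigee uµa purukee"   # fits in cp1252, 1 byte/char
sinhala = "බලු වලිගේ උණ පුරුකේ"           # Sinhala block, 3 bytes/letter in UTF-8

rs_bytes = len(romanized.encode("cp1252"))  # 24 bytes
us_bytes = len(sinhala.encode("utf-8"))     # 51 bytes (16 letters x 3 + 3 spaces)

print(rs_bytes, us_bytes)
```

For this fragment the native-script UTF-8 text is a bit over twice the size of the cp1252 romanized text; the 3x figure in the message applies to pure Sinhala letter runs without spaces or markup.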
On Thu, Jan 3, 2013 at 12:56 AM, Leif Halvard Silli xn--mlform-...@xn--mlform-iua.no wrote: Asmus Freytag, Mon, 31 Dec 2012 06:44:44 -0800: On 12/31/2012 3:27 AM, Leif Halvard Silli wrote: Asmus Freytag, Sun, 30 Dec 2012 17:05:56 -0800: The Web archive for this very list, needs a fix as well … The way to formally request any action by the Unicode Consortium is via the contact form (found on the home page). Good idea. Done! Turned out to only be - it seems to me - an issue of mislabeling the monthly index pages as ISO-8859-1 instead of UTF-8. Whereas the very messages themselves are archived correctly. And thus I made the request that they properly label the index pages. Happy new year! -- leif h silli
Re: Interoperability is getting better ... What does that mean?
It used to be that during HTML 4 days, ISO 8859-1 was the default character set for pages that used SBCS (those that belong to Basic Latin and Latin-1 Supplement). At least that is what the Validator (http://validator.w3.org/) said. (By the way, Unicode is quietly suppressing the Basic Latin block by removing it from the Latin group at the top of the code block page (http://www.unicode.org/charts/) and hiding it under different names in the lower part of the page.) Now the validator complains, correctly, that some characters in those pages do not belong to ISO-8859-1 if you use bullet points, ellipses etc. It says they come from Windows-1252. That is true. If you declare these pages as UTF-8, then it throws off *all* Latin-1 characters and the web pages show the character-not-found glyph. Windows-1252 replaces the control codes of the Latin-1 page (the C1 range, 0x80-0x9F) with some common characters used by European languages and some punctuation marks. There is one main consideration in the mind of the web developer: make the file as small as possible. Try this: make a text file in Windows Notepad and save it in ANSI, Unicode and UTF-8 formats. The ANSI file (Windows-1252) will be the smallest. Why should people make their pages larger just to satisfy some people's idea of perfection? It reminds me of the Plain Text and language detection myths.

On Mon, Dec 31, 2012 at 8:44 AM, Asmus Freytag asm...@ix.netcom.com wrote: On 12/31/2012 3:27 AM, Leif Halvard Silli wrote: Asmus Freytag, Sun, 30 Dec 2012 17:05:56 -0800: The Web archive for this very list, needs a fix as well … The way to formally request any action by the Unicode Consortium is via the contact form (found on the home page). A./
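The Notepad experiment described above can be reproduced in code. This sketch assumes Notepad's historical behavior of writing a byte-order mark for its "Unicode" (UTF-16LE) and UTF-8 formats, and uses cp1252 for "ANSI"; the sample sentence is the one from earlier in the thread.

```python
# The Notepad experiment: one Latin-1-range sentence, three encodings.
text = "mee laþingaþa síhalayi."   # 23 characters, 3 of them non-ASCII

ansi  = text.encode("cp1252")                                     # 1 byte/char
utf16 = "\ufeff".encode("utf-16-le") + text.encode("utf-16-le")   # BOM + 2 bytes/char
utf8  = "\ufeff".encode("utf-8") + text.encode("utf-8")           # BOM + 1-2 bytes/char

print(len(ansi), len(utf16), len(utf8))  # 23 48 29
```

As the message says, the ANSI file is smallest; UTF-8 pays only for the BOM and the three non-ASCII letters here, while UTF-16 doubles everything.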
Re: Compiling a list of Semitic transliteration characters
[Sorry for this delayed response. I had it in the drafts box.] INDIC: Speculation is one thing, doing is another. I have *successfully* transliterated Singhala. Devanagari is perhaps less complicated. I haven't had the time and money to do it. The difference between the two is not even as much as between common Latin and Cyrillic. Statements like "Using Unicode is recommended in preference to any code page because it has better language support and is less ambiguous than any of the code pages." try to assert untruths that people tend to believe without concrete reasons. 'Better language support' and 'less ambiguous'? That statement is by Microsoft, right in the registration of Windows-1252, and it plainly contravenes Unicode: http://msdn.microsoft.com/en-US/goglobal/cc305145.aspx All languages in the developed countries of the West, including English, use Windows-1252! And all others who use double-byte etc. use UTF-8 to transport their codes across the web, because only single-byte codes are trustworthy. We used Base64 for the same end in the 90s. I agree that following ISCII, whatever it is, might be the problem. Even so, that is no excuse for not researching enough and fixing the problem. After all, the claim is that Unicode provides BETTER LANGUAGE SUPPORT and LESS AMBIGUITY -- both of them were happily presented with Unicode Sinhala. But what happened? Why do we see constant questions regarding Unicode Indic? I researched how Sanskrit and Pali were transliterated during the letterpress days and why they were so successful and without loss in mapping. If that were not so, we wouldn't have such a comprehensive Sanskrit dictionary as Monier-Williams available on the web. The transliteration of Sanskrit is perfect, whether ITRANS or Harvard-Kyoto. The same is true of PTS Pali. Half a century ago, I worked with Ven. Nyanaponika (the ultimate BuJu) when printing Pali, and I know the real technicalities letterpress printers faced.
I know why certain diacritics were chosen, and why some letters were chosen over others though the others seem more logical. I have demonstrated in the following page how romanized Singhala, IAST and HK match perfectly, and they are simple direct mappings (read the JavaScript): http://www.lovatasinhala.com/liyanna-e.php#sanskrit It is presumptuous to say "the rest of the post is irrelevant". It has the same attitude as the Microsoft statement I gave above: "We know what's best for you (and remember that we hold all the strings). So do as we say." *halanta* means *hal anta*, ending consonant (*hal* = consonant). *virAma* means closing, and in grammar it means the last consonant without a vowel following it. *mAtra* means a measure. It is an IE cognate of measure, meter etc. In grammar it means the spoken length of a short vowel, the same as mora in Latin. All three of these terms are used misleadingly in Unicode documentation. Instead of 'language support', it mangles the language. SEMITIC LANGUAGES: I checked the Arabic alphabet in Wikipedia. It is similar to the older Indic writing, Brahmi in particular, where there is no vowel sign, but the vowel is understood from the context (a, i etc.).

On Fri, Sep 7, 2012 at 1:16 PM, Richard Wordingham richard.wording...@ntlworld.com wrote: On Fri, 7 Sep 2012 11:43:59 -0500 Naena Guru naenag...@gmail.com wrote: Transliteration or Romanizing My first advice is not to embark on making solutions for languages that you do not know. Unicode ruined Indic and Singhala by making 'solutions' for them without doing any meaningful research and by ignoring well-known Sanskrit grammar and previous solutions for Indic. The problems, if any, are due to thinking that the Indians understood what they were doing when they developed ISCII. Such problems as there are appear to arise from a belief that all Indic scripts are like Devanagari.
I romanized Singhala, probably the most complex script among all Indic, and made an orthographic font that in turn shows the transliterated text in its native script. http://www.lovatasinhala.com Some reasons for romanizing: snip 3. Make the language accessible to those who are not familiar with the script The rest of the post is irrelevant. Transliterations from Semitic languages have been established for this reason, and possibly because of costs of making and setting type. One issue at hand is that there is not a *single* transliteration to hand, and certainly not a single pan-Semitic one. Therefore I strongly doubt that an 8-bit code would encompass everything that was needed. Richard.
Re: Compiling a list of Semitic transliteration characters
Transliteration or Romanizing: My first advice is not to embark on making solutions for languages that you do not know. Unicode ruined Indic and Singhala by making 'solutions' for them without doing any meaningful research and by ignoring well-known Sanskrit grammar and previous solutions for Indic. I romanized Singhala, probably the most complex script among all Indic, and made an orthographic font that in turn shows the transliterated text in its native script. http://www.lovatasinhala.com Some reasons for romanizing:
1. The current solution is hard to use and incomplete
2. A user-friendly method on the computer for native users of their language
3. Make the language accessible to those who are not familiar with the script
4. Help in linguistic studies, take advantage of text-to-voice technologies etc.
In order to romanize successfully, you need to select the best character set. Unicode is a very bad choice because its codes are at least two bytes. Do not be fooled by statements like 'use added bonus characters', which lure you into the crippled double-byte area. Then some might try to scare you by saying you are making rogue encodings. The only constructive suggestion I received was to be aware of Latin semantics, which too was trivial. In my opinion, the best character set is ISO-8859-1, which was modified by Microsoft into Windows-1252. The following table shows both of them merged. The area with the yellow background is the part that was modified by Microsoft. ISO-8859-1 had machine control commands in that area. http://www.lovatasinhala.com/eds/charsets.htm Notice that capitals and small letters are separately encoded. Capitals can be used to indicate modified or closely related sounds to the ones on the regular keys. The AltGr or Ctrl+Alt shifted state gives you more options. Remember that there are letters that are not seen often, such as þ, ð, æ, etc. The key positioning in relation to English sounds is more important than the fear of their unfamiliarity.
I used þ and ð in the regular positions of t and d, and moved t and d to AltGr-shifted positions. Users hardly notice it, because the t and d keys are what they used when Anglicizing anyway; those keys now give a more accurate interpretation. By the way, starting with Anglicizing and refining it is what would be easiest for users to adapt to. The keyboards that have ISO-8859-1 characters are US-International, US-Extended and the dead-key keyboards in Windows, Macintosh and Linux systems. All three OSs also have easy ways to add your own customized keyboard. In my experience, it is best to strip off all keys not used by the transliteration scheme from the customized keyboard, to avoid typos. Consider whether ZWJ, ZWNJ and the non-breaking space might be useful in the new keyboard. As for directionality, I'd keep it left-to-right, and if you make fonts like I did, the transformation might incorporate the direction switch. Good luck!
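The key swap described above can be sketched as a toy remap (hypothetical illustration, not the actual keyboard definition): þ and ð sit on the plain t and d keys, and plain t and d move to the AltGr layer.

```python
# Hypothetical sketch: þ and ð occupy the base (unshifted) t and d keys;
# plain t and d are reached via AltGr instead.
base_layer = {'t': 'þ', 'd': 'ð'}

def type_plain(text):
    # What you get when typing on the base layer.
    return ''.join(base_layer.get(c, c) for c in text)

print(type_plain('data'))  # ðaþa
```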
Re: Compiling a list of Semitic transliteration characters
Thank you, Philippe; so, what did you say? On Fri, Sep 7, 2012 at 8:58 AM, Philippe Verdy verd...@wanadoo.fr wrote: 2012/9/7 Leif Halvard Silli xn--mlform-...@xn--mlform-iua.no: The word Roman can also refer to Greek. So it is best to avoid that term. ;-) The Roman empire spoke a large set of languages (and wrote in various scripts) from Europe to Asia and Africa, even if Latin was used in Rome and written in the Latin script (but not only). But the conventional meaning of romanisation is that it is a transcription into the Latin script (independently of the target language). The concept of transliteration, rather than transcription, is in fact quite new in human history: the initial need was just to write how languages were pronounced, with more or less approximation, to match the way another language is written, read and pronounced (in the target phonology). So a transcription has always been lossy. But the real difference between transcription and transliteration lies in their roles: a transliteration attempts to preserve as much as possible of the source language's phonology and meaning, avoiding most ambiguities. So a transliteration occurs within the same language. A transliteration scheme is created when a language starts changing its standard script in some area. But even in that case it is extremely rare that the conversion will be lossless: there are frequent adaptations of the orthography, and there are historic orthographies in the original script (such as mute letters, or more frequently letters whose current phonology has changed considerably, so that the original orthography in the source script is already far away from the actually spoken language; or historic distinctions that are no longer heard, so the transliteration scheme represents those letters the same way: N-to-1 mappings are then frequent as well).
For some pairs of scripts, it is impossible to be 1-to-1, because the scripts work very differently: alphabets are not like abjads or aksharas, and not like ideographic scripts. So adaptation is unavoidable. When a language changes its standard script, there is also very frequently an orthographic reform in the new script, so even the rules of transliteration contain a lot of new exceptions to match the new orthography. When the change of script is motivated merely by easing the learning of the language by people better acquainted with another script, the transliteration rules will often be stricter. They will be much stricter if the change of script is motivated by technical reasons (but people are generally not well trained in how to make this conversion, so each will use his own transliteration scheme to approximate the language). For this reason, the distinction between lossy and lossless is not very relevant for separating a traditional transcription from a modern transliteration. My opinion is that the simplest conversions that try to avoid most ambiguities are just named transliterations, and they occur within the same language in the same region. Transcriptions are more traditional; instead of focusing on the source language, they try to best approximate the phonology of another language in its current common orthography. Different needs, different rules; but in both cases the rules are not followed exactly. None of them are lossless. But the distinction is there. There's no clearly defined separation line between transliteration schemes and transcription schemes, except in their intent to preserve a source language or to best approximate another one. So the standard conversions of Chinese from Han ideographs to Bopomofo or Latin (with the Pinyin standard) could be called transliterations even though there are evidently a lot of losses. The same goes for Romaji in Japanese.
And even for Korean, the standard conversion from the Hangul alphabet to Latin creates some ambiguities and is a bit lossy. Note also that a transcription can occur within the same script: when you adapt an orthography to use other letters than in the original orthography, this is not a transliteration. For example, when you transcribe French into an English context, you'll commonly convert ou into oo, or will disambiguate some s into z, or some c into k or s. The intent is to tell English native speakers how to read a word written in another language (e.g. you say that the French word paille should be read like the English pie; this is not a transliteration but a transcription). As well, when you convert the language into a phonetic alphabet like IPA, the process is definitely not a transliteration but a transcription, even if it occurs within the same Latin script (many people argue that IPA is not part of the Latin script, as it does not contain letters but symbols, it is monocameral, it cannot follow the common typographic rules, and in addition it borrows
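The point above that transcription is lossy can be shown with a toy sketch (my own illustration, not a real scheme): once French 'ou' is rewritten as 'oo' for English readers, the original spelling cannot be recovered, because native English 'oo' spellings collide with the output.

```python
# Toy transcription for English readers: French 'ou' reads like English 'oo'.
# The mapping is not invertible, so the conversion is lossy as described above.
def transcribe_fr(word):
    return word.replace('ou', 'oo')

print(transcribe_fr('coucou'))  # coocoo
print(transcribe_fr('roule'))   # roole
```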
Re: Sinhala naming conventions
Mahesh, thank you. I like this line of discussion better than the constant effort to condemn me for abstract crimes. I have not seen any standard whose conditions stand in the way of a proper implementation of Singhala on the computer. There, my challenge stands: 1. to show where I hurt Singhala by romanizing; 2. to show how romanized Singhala violates any standard, in what specific way; 3. to show that romanized Singhala is inferior to Unicode Singhala in some unique way that no other implementation of a language on the computer displays. SANSKRIT: I use Monier-Williams steadily because, of late, I have come to doubt the Lankan 'educated' class on the use of venerable Sanskrit. No one group of people owns Sanskrit. I see it as our (South Asian) reserve for clarity of expression. This is certainly true for Singhala. I consider Monier-Williams first an Indian and second of British descent -- an authority on Sanskrit: http://en.wikipedia.org/wiki/Monier_Monier-Williams The dictionary he compiled is maintained by Cologne University (http://en.wikipedia.org/wiki/Cologne_University_of_Applied_Sciences). It has been online for three or four years now: http://www.sanskrit-lexicon.uni-koeln.de/cgi-bin/tamil/recherche. Before that, I used a downloaded version. Two entries: *lipi* f. (accord. to L. also %{lipI}) smearing, anointing &c. (see %{-kara}); painting, drawing L.; writing, letters, alphabet, art or manner of writing Ka1v. Katha1s. [902,3]; anything written, manuscript, inscription, letter, document Naish. Lalit.; outward appearance (%{lipim-Ap}, with gen., 'to assume the appearance of'; %{citrAM@lipiM} %{nI}, 'to decorate beautifully') Vcar. *lipisaMkhyA* f. a number of written characters L. There you have it, as it were, straight from the horse's mouth. I feel safe to use lipisaMkhyaa for text. Please read my other responses inline: On Tue, Jul 10, 2012 at 1:52 PM, Mahesh T.
Pai paiva...@gmail.com wrote: Naena Guru said on Tue, Jul 10, 2012 at 10:31:10AM -0500: Another one I found was lipisaMkhyA, which means exactly what we term text. I am not sure of the context, but there is a tradition of using text to write numbers. (samkhya = numbers) Tamil Nadu where Indian linguists reside is closer to Colombo than Delhi I The late Sri P. V. Narasimha Rao, the former Prime Minister, knew 6 or 7 non-Indian languages; and IIRC, he used to give interviews in Spanish (or was it Portuguese?). He was not a resident of Tamil Nadu, AFAIK. Actually, I did not mean to put Southerners over Northerners (the latter of whom brought Sanskrit to the South! -- agasti during KulaSekara's time?). It is just that I see a lot of Tamil-like names mentioned. I think there is some language institute in Tamil Nadu that is heavily involved in linguistics. Or is knowledge of that many languages not sufficient to qualify him as a linguist? Why split hairs over 'linguist'? Indeed, he is a SaD-bhASA-parama-Izvara = (sandhi) SaDbhASA-paramezvara. It was like the top honorific given to a linguist. (PS: Ah. Well -- PV was referred to as a polyglot, not a linguist - you win.) Let's not get trapped by English semantics. The movers among the users of a branch of study can adjust connotations to their benefit, or so it seems in the IT field. Did you notice that lately the word script has been redacted and 'character set' is used instead? What's with that? Why do we generally understand 'transliteration' as shorthand for transcribing from another script to the Latin script? It is because the Latin script is recognized by most people, and it is the first script used in the communication technology of a given era. It was so for letterpress printing in the 1800s, and now the computer. The plain truth is that Latin-1, second to ASCII, is the best-settled coded character set on the computer. This is exactly why I romanized Singhala into Latin-1.
Verify that by going to the following web site and clicking on the link that says Latin Script: www.lovatasinhala.com Latin-1 is not the complete repertoire of codepoints for letters *derived from the proper Latin script*. It is the first part only. I'd say with confidence that 99.9% of well-established applications accept Latin-1 with no problem. However, *most* of those applications *do not* accept the accented Latin characters used by PTS Pali and IAST Sanskrit (the ones with the macron bar and dots). One of the detractors here called the Latin characters outside Latin-1 'Added Bonus'. I have used these letters in the LIYANNA page of the web site. There are separate conversion sections in it for Pali and Sanskrit. Each of them has edit boxes to enter the language in one form and get it in the other(s). Singhala: http://www.lovatasinhala.com/liyanna.php#unisin (rom. Sing - Unicode Sinhala) Pali: http://www.lovatasinhala.com/liyanna.php#pali (rom. Sing - PTS Pali) Sanskrit: http://www.lovatasinhala.com/liyanna.php#sanskrit (RS - HK - IAST) As a Singhalese
Re: Charset declaration in HTML (was: Romanized Singhala - Think about it again)
On Tue, Jul 10, 2012 at 11:58 PM, Leif Halvard Silli xn--mlform-...@xn--mlform-iua.no wrote: Naena Guru, Tue, 10 Jul 2012 01:40:19 -0500: HTML5 assumes UTF-8 as the character set if you do not declare one explicitly. My current pages are in HTML 4. There is in principle no difference between what HTML5 parsers assume and what HTML4 parsers assume: all of them default to the default encoding for the locale. I see. That is, for the transliteration, the locale should be Sinhala (Latin). Yes. I know that it is not official. I loathe the spelling Sinhala. Oh, well, you cannot have it all. Notepad forced me to save the file in UTF-8 format. I ran it through the W3C Validator. It passed the HTML5 test with the following warning: [image: Warning] Byte-Order Mark found in UTF-8 File. I assume that you used the validator at http://validator.w3.org. Yes, and it validated it. I was talking about the BOM in a different context. It showed up when I opened, in HTML-Kit, the file that was first created in Notepad and saved as UTF-8. HTML-Kit Tools asked me to specify the character set. It took it, but messed up the macron and dot letters anyway. What I was trying to emphasize was that it is hard for those people who try to make web pages in those 'character sets'. I have been making web pages since the 1990s and never had these problems, because they were written by hand in English. But if you instead use the most updated HTML5-compatible validators at http://www.validator.nu or http://validator.w3.org/nu/ then you will not get any warning just because your file uses the Byte-Order Mark. HTML5 explicitly allows you to use the BOM. Thanks. This too validated all seven pages as HTML5 (I upgraded from HTML 4). The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to cause problems for some text editors and older browsers. You may want to consider avoiding its use until it is better supported. Weasel words from the validator. The notion about older browsers is not very relevant.
How old are they? IE6 has no problems with the BOM, for instance. And that is probably one of the few somewhat relevant old browsers. As I said before, the BOM was no problem for me. As for editors: if your own editor has no problems with the BOM, then what? But I think Notepad can also save as UTF-8 but without the BOM - it should be possible to get an option for choosing when you save it. Otherwise you can use the free Notepad++. And many others. In VIM, you set or unset the BOM via the commands set bomb / set nobomb. Yes, yes. I've seen it before. I have Notepad++. It intimidated me the first time and I never used it, haha! -- Leif H Silli
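For readers who would rather drop the BOM programmatically than hunt for an editor option, a minimal sketch (Python assumed) of detecting and stripping the UTF-8 BOM discussed above:

```python
import codecs

# The UTF-8 BOM is the three bytes EF BB BF at the very start of the file.
data = codecs.BOM_UTF8 + 'hello'.encode('utf-8')

# Strip it if present; the decoded text is unchanged either way.
if data.startswith(codecs.BOM_UTF8):
    data = data[len(codecs.BOM_UTF8):]
print(data.decode('utf-8'))  # hello
```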
Re: Ewellic again (was: Re: Romanized Singhala - Think about it again)
My error. Sorry, Doug. On Sun, Jul 8, 2012 at 8:00 PM, Doug Ewell d...@ewellic.org wrote: The Unicode character database goes from zero to some very big number. There are no holes in it to define character sets for somebody's fancy. Well, Doug Ewell did one for Esperanto, expanding fuþorc. Ewellic is not futhorc. They are different scripts. From the Omniglot page on Ewellic (with *emphasis* added): The shape of Ewellic letters was *inspired by* the Runic and Cirth scripts, but shows greater (though still imperfect) regularity of form. -- Doug Ewell | Thornton, Colorado, USA http://www.ewellic.org | @DougEwell
Re: Charset declaration in HTML (was: Romanized Singhala - Think about it again)
Hey, Philippe, your input is much appreciated. So, in a nutshell, I don't have to worry. One of these days I need to crunch down (minify) the CSS and JavaScript pages. I left them readily readable so that techs like you could easily read them in place in any browser without having to pretty-print. The pages are not big by any standard and they download pretty fast. Your earlier point about WOFF is what I am going to try and tackle today (Sunday). In the meanwhile, thanks again. On Tue, Jul 10, 2012 at 11:32 PM, Philippe Verdy verd...@wanadoo.fr wrote: 2012/7/10 Naena Guru naenag...@gmail.com I wanted to see how hard it is to edit a page in Notepad. So I made a copy of my LIYANNA page and replaced the character entities I used for Unicode Sinhala, accented Pali and Sanskrit with their raw letters. Notepad forced me to save the file in UTF-8 format. I ran it through the W3C Validator. It passed the HTML5 test with the following warning: [image: Warning] Byte-Order Mark found in UTF-8 File. The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to cause problems for some text editors and older browsers. You may want to consider avoiding its use until it is better supported. The BOM is the first character of the file. There are myriad hoops that non-Latin users go through to do things that we routinely do. This problem I saw right at the inception. I already know why romanizing is so good. Don't you? You should probably ignore this non-critical warning for now; it is only for extremely strict compatibility with deprecated software that should have been updated long ago, for obvious security and performance reasons. Those old browsers are deprecating fast (due to the massive and fast spread of security attacks, and automatic security updates that close issues completely, instead of just preventive virus detection based on code behavior or code patterns, which will never be complete or fast enough to react to these extremely frequent attacks).
Older editors do not have the comfort that newer editors have. The memory usage of these newer editors is no longer a problem (notably for web developers, who have systems largely above what their average users have), and systems capable of running them have never been so cheap. In addition, memory and storage costs have decreased dramatically. We are more concerned about bandwidth usage, so your web editing platform should include an optimisation process and converters that will automatically use a compact representation (numeric character references, for example, can be sent by your server as raw UTF-8; in addition, the server can now support on-the-fly data compression over HTTP sessions; there also exist frontend proxies that will do that for you without requiring you to change the development/editing methods you use). Most text editors, even on Linux, can now successfully open UTF-8 files starting with a BOM without complaining, just as Notepad has done for a long time. And they allow you to change this edit mode before saving. Most text processors will silently discard the U+FEFF character (it should be safe to do that everywhere, given that U+FEFF should no longer be used for anything other than BOMs). [side note] But Notepad has long had another problem: it cannot successfully open a text file whose lines are terminated by LF only; it absolutely wants them converted to CR+LF sequences. This problem is much more severe than the use of a leading BOM. As well, Excel cannot successfully decode a UTF-8 encoded CSV file, though it can automatically recognize it if you use the import-data function instead.
This is inconsistent (also, it still does not allow specifying how to convert numbers using dots instead of commas when running on a non-English user locale; you need to use a search/replace function manually. It does not allow selecting the date format for CSV file imports, and search/replace operations are not trivial on date fields; no question is asked of the user, it only uses implicit defaults even when they are wrong, most of the time for actual CSV files). [/side note] But this has nothing to do with your problem of romanization or behavior with Latin. BOMs are only absent from old 8-bit character sets that are no longer recommended in any modern Internet protocols, and from 7-bit ASCII used only for internal technical data but not for any text intended to be read and translated. Only UTF-8 support is mandatory now. And that's fine. HTTP headers or URLs require a specific encoding, but webservers and design tools can take care of that. Everything else is optional and will require explicit metadata (the exceptions being UTF-16 and UTF-32, which are not well suited for interchange across heterogeneous networks and independent realms, but are used mostly for internal processes, for which you absolutely don't need any byte order change, so for which
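The Notepad line-ending problem from the side note above can be worked around before opening a file there. A small sketch (my own illustration):

```python
# Older Notepad shows LF-only files on one long line; convert bare LF
# to CR+LF. Normalising first avoids doubling any CR+LF already present.
def to_crlf(text):
    return text.replace('\r\n', '\n').replace('\n', '\r\n')

print(repr(to_crlf('line1\nline2\r\n')))  # 'line1\r\nline2\r\n'
```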
Re: Charset declaration in HTML (was: Romanized Singhala - Think about it again)
Thank you, Otto. Sorry for the delay in replying. I spent the entire Sunday replying to the Jaques twins. You are absolutely right about the choice between ISO-8859-1 and UTF-8. I shouldn't have said 'using ISO-8859-1 is advantageous over UTF-8'. It is efficient if your pages are written in a language that uses single-byte codepoints. When you mix in multi-byte codepoints, like you said, the ideal is to have them in their raw form. But in practice, this is not as easy as we think. Actually, the trade-off is not great for me because I use only a few non-SBCS characters. Each 2-byte character would end up as six bytes in a hex character entity. If you want to control the look of your web site, then you probably have to have expensive software to do it. As for poor me, I use CSS, JavaScript and HTML inside HTML-Kit. HTML5 assumes UTF-8 as the character set if you do not declare one explicitly. My current pages are in HTML 4. As I said, I use HTML-Kit (and Tools). If I have raw Unicode Sinhala in the HTML or JavaScript, it messes them up and gives you character-not-found for them on the web page. I must have character entities if I need the comfort of HTML-Kit. There are web sites that help you process your SBCS and multi-byte mixed text to make character entities for non-Latin-1 characters. I used them when making my only page that has them (Liyanna). Stop and think why there are such websites. (Search 'text to unicode'.) The world outside Latin-1 is a harsh one. If I want to have raw Unicode Sinhala, PTS Pali or IAST Sanskrit, I have to use Notepad instead of HTML-Kit. It is hard to code without color-coded text. I wanted to see how hard it is to edit a page in Notepad. So I made a copy of my LIYANNA page and replaced the character entities I used for Unicode Sinhala, accented Pali and Sanskrit with their raw letters. Notepad forced me to save the file in UTF-8 format. I ran it through the W3C Validator.
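What those 'text to unicode' helper sites do can be sketched in a few lines (my own illustration, not their code): every character outside Latin-1 becomes a hexadecimal numeric character reference.

```python
# Escape anything outside Latin-1 (code points >= 0x100) as an HTML
# hexadecimal numeric character reference, leaving Latin-1 text as-is.
def to_ncr(text):
    return ''.join(c if ord(c) < 0x100 else '&#x{:X};'.format(ord(c))
                   for c in text)

print(to_ncr('pāli'))  # p&#x101;li
```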
It passed the HTML5 test with the following warning: [image: Warning] Byte-Order Mark found in UTF-8 File. The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to cause problems for some text editors and older browsers. You may want to consider avoiding its use until it is better supported. The BOM is the first character of the file. There are myriad hoops that non-Latin users go through to do things that we routinely do. This problem I saw right at the inception. I already know why romanizing is so good. Don't you? UTF-8 encoding is this RFC: http://www.ietf.org/rfc/rfc2279.txt This is the table it gives on the way UTF-8 encoding works:

0000 0000 - 0000 007F   0xxxxxxx                              (ASCII)
0000 0080 - 0000 07FF   110xxxxx 10xxxxxx                     (Latin-1 plus higher)
0000 0800 - 0000 FFFF   1110xxxx 10xxxxxx 10xxxxxx            (includes Unicode Sinhala)
0001 0000 - 001F FFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0020 0000 - 03FF FFFF   111110xx 10xxxxxx ... 10xxxxxx
0400 0000 - 7FFF FFFF   1111110x 10xxxxxx ... 10xxxxxx

Observe that Latin 'a', which is two bytes in UCS-2, needs only one byte in UTF-8, while Unicode Sinhala ayanna goes from two bytes to three. Unicode Sinhala: 0D80 - 0DFF. a = Hex 61 = Bin 0110 0001 fits the one-byte template 0xxxxxxx, so UTF-8 keeps it as Hex 61. ayanna = Hex 0D85 = Bin 0000 1101 1000 0101 fits the three-byte template 1110xxxx 10xxxxxx 10xxxxxx, giving the UTF-8 encoding 1110 0000 1011 0110 1000 0101 = Hex E0 B6 85. Thanks for your input. It is appreciated. On Wed, Jul 4, 2012 at 2:25 PM, Otto Stolz otto.st...@uni-konstanz.de wrote: Hello Naena Guru, on 2012-07-04, you wrote: The purpose of declaring the character set as iso-8859-1 rather than utf-8 is to avoid doubling and trebling the size of the page by utf-8. I think, if you have characters outside iso-8859-1 and declare the page as such, you get character-not-found for those locations. (I may be wrong). You are wrong, indeed. If you declare your page as ISO-8859-1, every octet (aka byte) in your page will be understood as a Latin-1 character; hence you cannot have any other character in your page. So, your notion of “characters outside iso-8859-1” is completely meaningless.
If you declare your page as UTF-8, you can have any Unicode character (even PUA characters) in your page. Regardless of the charset declaration of your page, you can include both Numeric Character References and Character Entity References in your HTML source, cf., e.g., http://www.w3.org/TR/html401/charset.html#h-5.3 . These may refer to any Unicode character whatsoever. However, they will take considerably more storage space (and transmission bandwidth) than the UTF-8 encoded characters would take. Good luck, Otto Stolz
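The byte counts worked through above can be checked directly. A small sketch: ASCII 'a' stays one byte in UTF-8, Sinhala ayanna (U+0D85) takes three, and its numeric character reference would take eight.

```python
# ASCII stays single-byte in UTF-8; Sinhala ayanna (U+0D85) needs three
# bytes; its NCR '&#x0D85;' would need eight ASCII bytes.
a = 'a'.encode('utf-8')
ayanna = '\u0D85'.encode('utf-8')
print(len(a), a)            # 1 b'a'
print(len(ayanna), ayanna)  # 3 b'\xe0\xb6\x85'
```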
Re: Sinhala naming conventions
I did not see the original message here and may be off the subject. However, this seems to be about making computer-related words. Singhala is much more Indic than Sriramana thinks. (The oldest Brahmi was found in Lanka.) Its phoneme inventory is near Devanagari's, except that Devanagari has some Dravidian phonemes in addition and Singhala has the (OE) æ sound. Sanskrit is the common core. Though Tamil is Dravidian, they (and Germans) tend to be authoritative in Sanskrit. In this light, it makes eminent sense, if Sanskrit is preferred, that Indians compile the base names for terms and Lankans follow. Lankans, perhaps because the technocracy is a very small, guarded, closed group, tend not to be too deliberative, and stamp down public criticism to guard their mistakes. English is not strange to South Asian natives. I do not know why we need to hurry localizing where that locality is mostly academic (unless there is a commercial advantage for globalizing business, which could be damaging, as happened with Unicode Sinhala). The ideal is for the terms to evolve by allowing the public to participate. Let it take the natural course of time. Why should it be any different from how it evolves in America for English? In Lanka, there is no discussion and cooperation, only arrogant imposition. In our experience, we have seen Lankan bureaucrats make howlers with impunity. For example, mRdukaMga = 'soft parts' for software. (I am using HK-Sanskrit transcription.) Instead, I use 'anavya' (as an Indian friend suggested), a perfect fit for the idea of a program as I understand it as a programmer. Then antarjAla = inter-net: but there is a subtle though important difference between the connotations of net and network. We have intuitive new Singhala words like පාපැදිය paapædiya for bicycle, meaning what-you-pedal. There are words like තරු වැල þaru væla (the stars) or තරු කැල þaru kæla (those stars) where වැල and කැල suggest the entire collection and a given collection respectively.
Maybe they could be used when considering network. They published in SLS 1134 a rule that says brandy should be transcribed as බ්රැන්ඩි, which stands for the sound brunDi. Going by that erroneous rule, the Sanskrit words krUra and bahuzruta, correctly spelled as ක්රෑර and බහුශ්රැත, would now be pronounced as krææra and bahuzræþa (using OE), and the Pali brūhi (HK Sans. brUhi) would now be spoken as bææhi. Thanks to Monier-Williams, at this ripe old age I discovered 'lipi lekhana' to mean stationery rather than documents. No wonder some Indian linguists politely say that there is a Buddhist Sanskrit that misuses the language. I think they mean us. Another one I found was lipisaMkhyA, which means exactly what we term text. Tamil Nadu, where Indian linguists reside, is closer to Colombo than Delhi. I think we should not consider political boundaries when it comes to matters regarding language, and let India take the lead. (However, IMO, using English terms is best, but not like 'dongle' for flash drive as they do in Lanka.) On Mon, Jul 9, 2012 at 10:14 PM, Shriramana Sharma samj...@gmail.com wrote: Changing the subject line. On Tue, Jul 10, 2012 at 7:19 AM, Harshula harsh...@gmail.com wrote: 0D9A ක sinhala letter ka = SINHALA LETTER ALPAPRAANA KAYANNA Hi -- while I agree with Michael that it would be better to have had a uniform naming standard across all Indic scripts, which are perhaps more globally and country-neutrally termed Brahmic scripts (and Sinhala is certainly a Brahmic script), I think it is not totally inappropriate to use the native names, since the main target audience is indeed the native users. Unicode is a global standard, and it would not be inappropriate to acquiesce to native users' perceptions in cosmetic matters such as character names (since these would not result in technical problems). The South East Asian scripts, despite being Brahmic, are certainly not named after the Indic pattern. So why should Sinhala not use its native conventions as well?
The Indic naming pattern was largely a result of the GOI's desire to have uniform naming across *Indic* (!= Brahmic, right now) scripts to facilitate the production of cross-script pan-Indic software. If they had given the states free rein in naming stuff, we would have had quite confusing naming standards within the Indic scripts themselves, I'm sure. Many Tamilians would probably like to see TAMIL LETTER LLLA named ZHA, but the ZHA of native Tamil perception would not correspond to the ZHA of Bengali (or Assamese or whatever) perception, resulting in confusion for software makers and hence unsatisfactory software and discontent all around! One will certainly agree that there is (quite naturally) less interconnection between India and Sri Lanka than between the Indian states themselves. It is quite unlikely the GOI is going to invest money in producing Sinhala fonts/software. This being so, if the Sri Lankan Govt wished to have their native names in the global standard to facilitate
Re: Sinhala naming conventions
Michael is right. For instance, there is no such thing called AL in Singhala. It is baby babble, dropping the h from hal. 'hal' means consonant. I learned the sign as hal kiriima. AL has been given official status by indifferent technocrats. Recall that they said there are no Singhala numerals, when books published from the 1800s to 2001 showed them clearly. A Comprehensive Grammar of the Sinhalese Language by A.M. Gunasekara (1891), published by Asian Education Services - New Delhi and Madras, India. හෙළ හොඩියේ වතගොත - පැරණි පොත් සමාගම - 2001 (hela hodiyee vaþagoþa - pærani poþ samaagama - 2001) SLR 130.00 (about US$ 1.00) I got the first book through Alibris.com, and I think I emailed or called for the second one. This second one is all you need as a font maker, Michael. Publisher: Paerani Poth Samaagama, 198 Highlevel Road, Nugegoda, Sri Lanka. Phone 01-852911 Email: mob...@slt.net On Tue, Jul 10, 2012 at 2:29 AM, Michael Everson ever...@evertype.com wrote: On 10 Jul 2012, at 08:13, Shriramana Sharma wrote: On Tue, Jul 10, 2012 at 12:39 PM, Harshula harsh...@gmail.com wrote: I haven't heard any complaints from native Sinhala speakers about the existing descriptions: e.g. 0D9A ක SINHALA LETTER ALPAPRAANA KAYANNA = sinhala letter ka The existing names are fine for Sinhala speakers. Michael feels it would have been better if they had been named after the *Indian* Indic scripts. No, my view is that it would have been better if all the *Brahmic* scripts had the same naming conventions. Michael Everson * http://www.evertype.com/
Re: Sorting Pali in Tibetan Script
Sorry to disconcert you, sir. I just gave the Singhala order. Singhala is the script of Pali. The l and ḷ together make sense. It is not in Sanskrit anyway. But the aṃ, a, ā order is not the Singhala or Sanskrit order, not in any grammar book. Anusvara comes after the vowels and diphthongs. But you are determined to apply your own order. That's your choice. This I wrote generally for others to note. On Sat, Jul 7, 2012 at 8:41 PM, Richard Wordingham richard.wording...@ntlworld.com wrote: On Sat, 7 Jul 2012 17:43:41 -0500 Naena Guru naenag...@gmail.com wrote: This is the Pali sorting order in PTS Pali. The last letter is the retroflex L: a ā i ī u ū e o aṃ aaṃ iṃ iiṃ uṃ uuṃ eṃ oṃ k kh g gh ṅ c ch j jh ñ ṭ ṭh ḍ ḍh ṇ t th d dh n p ph b bh m y r l v s h ḷ Disconcertingly, no. ḷ follows l, sorted as a primary difference. Otherwise, the consonants do follow the normal Indic script order. Also, the first three vowels are aṃ, a, ā. This information wasn't what I needed, but thank you for the attempt to help. Richard.
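Sorting by an explicit alphabet with primary weights only, with ḷ as a full letter after l as Richard describes, can be sketched as a toy example (illustrative only, using a shortened alphabet; this is neither SLS 1134 nor the UCA):

```python
# Assign each letter a primary weight from an explicit (shortened) alphabet;
# retroflex ḷ is treated as a full letter sorting after l.
order = ['a', 'i', 'u', 'e', 'o', 'k', 'g', 't', 'n', 'p', 'm',
         'y', 'r', 'l', 'v', 's', 'h', 'ḷ']
weight = {c: i for i, c in enumerate(order)}

words = ['ḷa', 'la', 'ha']
print(sorted(words, key=lambda w: [weight[c] for c in w]))  # ['la', 'ha', 'ḷa']
```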
Re: Sorting Pali in Tibetan Script
Here is a more authoritative answer: 1. Pali is a subset of Sanskrit. 2. The Sanskrit sorting order is as Monier-Williams gives, and it matches the vyAkaraNa books. http://www.sanskrit-lexicon.uni-koeln.de/scans/MWScan/tamil/index.html You can place dark el wherever you want, because Sanskrit does not have it. Please respect the tradition. Thank you. On Sun, Jul 8, 2012 at 6:40 PM, Naena Guru naenag...@gmail.com wrote: Sorry to disconcert you, sir. I just gave the Singhala order. Singhala is the script of Pali. The l and ḷ together make sense. It is not in Sanskrit anyway. But the aṃ, a, ā order is not the Singhala or Sanskrit order, not in any grammar book. Anusvara comes after the vowels and diphthongs. But you are determined to apply your own order. That's your choice. This I wrote generally for others to note. On Sat, Jul 7, 2012 at 8:41 PM, Richard Wordingham richard.wording...@ntlworld.com wrote: On Sat, 7 Jul 2012 17:43:41 -0500 Naena Guru naenag...@gmail.com wrote: This is the Pali sorting order in PTS Pali. The last letter is the retroflex L: a ā i ī u ū e o aṃ aaṃ iṃ iiṃ uṃ uuṃ eṃ oṃ k kh g gh ṅ c ch j jh ñ ṭ ṭh ḍ ḍh ṇ t th d dh n p ph b bh m y r l v s h ḷ Disconcertingly, no. ḷ follows l, sorted as a primary difference. Otherwise, the consonants do follow the normal Indic script order. Also, the first three vowels are aṃ, a, ā. This information wasn't what I needed, but thank you for the attempt to help. Richard.
Re: Sorting Pali in Tibetan Script
This is the Pali sorting order in PTS Pali. The last letter is the retroflex L: a ā i ī u ū e o aṃ aaṃ iṃ iiṃ uṃ uuṃ eṃ oṃ k kh g gh ṅ c ch j jh ñ ṭ ṭh ḍ ḍh ṇ t th d dh n p ph b bh m y r l v s h ḷ On Sat, Jul 7, 2012 at 12:14 PM, Richard Wordingham richard.wording...@ntlworld.com wrote: Can someone please advise me as to the sorting of Pali as Pali in Tibetan script? I need a prompt response rather than a complete treatment. It is possible that I have misunderstood what I have been able to pull together. What I understand is the following: (a) The retroflex lateral ('LLA' in most Unicode encodings) is written U+0F63 TIBETAN LETTER LA, U+0F39 TIBETAN MARK TSA -PHRU, as at http://www.tipitaka.org/tibt/ . (b) For Pali, the retroflex lateral should be sorted as though a full letter, rather than as letter plus subscript. This is general international practice, embodied in scripts that have LLA encoded as an independent letter, such as Sinhalese (backed up by SLS 1134:2004) and Thai (many dictionaries). (c) The long vowel II sorts at a primary level between the short vowel I and the short vowel U - general practice in Indic scripts, and captured by ISO 14651 and the Default Unicode Collation Element Table (DUCET). Now if I am correct, this does have an interesting processing effect. The syllable LLII, in NFD, will be written U+0F63, U+0F71 TIBETAN VOWEL SIGN AA, U+0F72 TIBETAN VOWEL SIGN I, U+0F39, so to collate LLII on the basis of the constituent consonant and vowel requires the discontiguous contraction U+0F63, U+0F39 and then the contraction U+0F71, U+0F72 from the skipped characters. Version 6.1.0 of the Unicode Collation Algorithm requires the ability to do exactly this. However, it has been proposed that Version 6.2.0 *prohibit* this ability. The treatment of LL.HA could be interesting, but is not of urgent interest.
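Richard's NFD claim is easy to verify with Python's standard unicodedata module (a quick check, not part of the original thread):

```python
import unicodedata

# LLII typed as LA (U+0F63) + VOWEL SIGN II (U+0F73) + TSA -PHRU (U+0F39).
# U+0F73 decomposes canonically into U+0F71 + U+0F72, and canonical
# reordering keeps U+0F39 (ccc 216) after the vowel parts (ccc 129, 130).
llii = "\u0F63\u0F73\u0F39"
nfd = unicodedata.normalize("NFD", llii)
print(" ".join(f"U+{ord(c):04X}" for c in nfd))
# → U+0F63 U+0F71 U+0F72 U+0F39
```

So collating LLII on the consonant-plus-mark pair (U+0F63, U+0F39) really does require skipping the two intervening vowel signs, i.e. a discontiguous contraction.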
Doubts are cast on my analysis by the rules for Tibetan collation given, for example at http://developer.mimer.com/collations/tibetan/Chilton_slides.pdf , which states that U+0F71 is given a secondary weight and makes no mention of the long vowels, and certainly makes no mention of any LLA. If the description there is correct and complete, it seems that I should see a sort order LI (U+0F63, U+0F72), LLI (U+0F63, U+0F72, U+0F39), LII (U+0F63, U+0F71, U+0F72), LLII (U+0F63, U+0F71, U+0F72, U+0F39). Is this the correct order for sorting as Tibetan? The diacritics do seem to apply back-to-front. Richard.
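The order Richard deduces can be modeled with a toy two-level collation key. This is my own sketch of the idea, not the Mimer or DUCET tables: the base letter and the vowel I carry primary weights, while U+0F39 and U+0F71 carry only secondary weights, with U+0F39 weighted below U+0F71.

```python
# Toy two-level key (hypothetical weights, not the real tailoring):
# primaries for the base letter and vowel I; secondaries for the
# TSA -PHRU mark and the long-vowel sign, in that relative order.
PRIMARY = {"\u0F63": 10, "\u0F72": 20}       # LA, VOWEL SIGN I
SECONDARY = {"\u0F39": 1, "\u0F71": 2}       # TSA -PHRU, VOWEL SIGN AA

def tibetan_key(s):
    return ([PRIMARY[c] for c in s if c in PRIMARY],
            [SECONDARY[c] for c in s if c in SECONDARY])

LI   = "\u0F63\u0F72"
LLI  = "\u0F63\u0F72\u0F39"
LII  = "\u0F63\u0F71\u0F72"
LLII = "\u0F63\u0F71\u0F72\u0F39"
print(sorted([LLII, LII, LLI, LI], key=tibetan_key) == [LI, LLI, LII, LLII])
# → True
```

All four syllables tie at the primary level on (LA, I), and the secondary weights alone produce the LI, LLI, LII, LLII order in question.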
Re: Romanized Singhala - Think about it again
Thank you Goliath. On another subject, I think the script you dreamed of as a boy is very nearly fuþorc. Fuþorc is the (Old) English alphabet. Thank you. On Wed, Jul 4, 2012 at 1:54 PM, Doug Ewell d...@ewellic.org wrote: [removing cc list] Naena Guru wrote: On this 4th of July, let me quote James Madison: [quote from Madison irrelevant to character encoding principles snipped] I gave much thought to why many here at the Unicode mailing list reacted badly to my saying that the Unicode solution for Singhala is bad. Unicode encodes Latin characters in their own block, and Sinhala characters in their own block. Many of us disagree with a solution to encode Sinhala characters as though they were merely Latin characters with different shapes, and agree with the Unicode solution to encode them as separate characters. This is a technical matter. I see the problem. This is what confused Philippe too. This is primarily a transliteration. Transliterations go from one script to another. Not one Unicode code block (I said code page earlier with an old habit) to another. So, let's take the font issue out for the time being and concentrate on the transliteration. A transliteration scheme is a solution for a problem and has a technology platform it is made for. Older (predecessor of) IAST Sanskrit and PTS Pali were solutions made with letterpress printing in mind. They used dots and bars for accents because they could be improvised easily in the street-side printing presses. That was the 1800s. Suddenly with computers, accented letters became hard to get. HK Sanskrit made Sanskrit friendly for the computer by limiting it to ASCII. Now, after electronic communication became cleaner, we expanded the 7-bit set to the full-byte set. Now the iso-8859-1 set is available everywhere. Earlier I said the Plain Text idea is bad too. And many of us disagree with that rather vehemently as well, for many reasons. The responses came more as attacks on *my* solution than in defense of Unicode Singhala.
It's not personal unless you wish to make it personal. You came onto the Unicode mailing list, a place unsurprisingly filled with people who believe the Unicode model is a superior if not perfect character encoding model, and claimed that encoding Sinhala as if it were Latin (and requiring a special font to see the Sinhala glyphs) is a better model. Are you really surprised that some people here disagree with you? If you write to a Linux mailing list that Linux is terrible and Microsoft Windows is wonderful, you will see pushback there too. Here is a defense of Unicode Sinhala: it allows you, me, or anyone else to create, read, search, and sort plain text in Sinhala, optionally with any other script or combination of scripts in the same text, using any of a fairly wide variety of fonts, rendering engines, and applications. The purpose of designating naenaguru@gmail.com as a spammer is to prevent criticism. The list administrator, Sarasvati, can speak to this issue. Every mailing list, every single one, has rules concerning the conduct of posters. I note that your post made it to the list, though, so I'm not sure what you're on about. It is shameful that a standards organization belonging to corporations of repute resorts to censorship like bureaucrats and academics of little Lanka. Do not attempt to represent this as a David and Goliath battle between the big bad Unicode Consortium and poor little Sri Lanka or its citizens. This is a technical matter. I ask you to reconsider: As a way of explaining Romanized Singhala, I made some improvements to www.LovataSinhala.com. Mainly, it now has near the top of each page a link that says, ’switch the script’. That switches the base font of the body tag of the page between the Latin and Singhala typefaces. Please read the smaller page that pops up. The fundamental model is still one of representing Sinhala text using Latin characters, and relying on a font switch. It is still completely antithetical to the Unicode model. 
I also verified that I hadn’t left any Unicode characters outside ISO-8859-1 in the source code -- HTML, JavaScript or CSS. The purpose of declaring the character set as iso-8859-1 rather than utf-8 is to avoid doubling and trebling the size of the page by utf-8. I think, if you have characters outside iso-8859-1 and declare the page as such, you get Character-not-found for those locations. (I may be wrong). You didn't read what Philippe wrote. Representing Sinhala characters in UTF-8 takes *fewer* bytes, typically less than half, compared to using numeric character references like &#3523;&#3538;&#3458;&#3524;&#3517; &#3517;&#3538;&#3520;&#3539;&#3512;&#3495; &#3465;&#3524;&#3517;. Philippe Verdy obviously has spent a lot of time researching the web site and even went as far as to check the faults of the web service provider, Godaddy.com. He called my font a hack font without any proof of it. A font
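The byte-count comparison above is easy to check directly (a sketch using the same five code points as the first reference quoted; not taken from the actual page source):

```python
# A Sinhala code point costs 3 bytes in UTF-8, while a decimal numeric
# character reference such as "&#3523;" costs 7 ASCII bytes, so UTF-8
# is indeed less than half the size.
word = "".join(chr(n) for n in (3523, 3538, 3458, 3524, 3517))  # "සිංහල"
utf8_len = len(word.encode("utf-8"))
ncr_len = len("".join(f"&#{ord(c)};" for c in word))
print(utf8_len, ncr_len)
# → 15 35
```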
Re: Romanized Singhala - Think about it again
On Thu, Jul 5, 2012 at 6:51 AM, Philippe Verdy verd...@wanadoo.fr wrote: 2012/7/5 Naena Guru naenag...@gmail.com: On Wed, Jul 4, 2012 at 11:33 PM, Philippe Verdy verd...@wanadoo.fr wrote: Anyway, consider the solutions already proposed in Sinhalese Wikipedia. There are various solutions proposed, including several input methods supported there. But the purpose of these solutions is always to generate Sinhalese texts perfectly encoded with Unicode and nothing else. Thank you for the kind suggestion. The problem is Unicode Sinhala does not perfectly support Singhala! The solution is for Sinhala, not for Unicode! I am not saying Unicode has a bad intention but an ill-conceived product. The fault is with Lankan technocrats who took the proposal as it was given and ever since prevented public participation. My solution is 'perfectly encoded with Unicode'. Yes, there may remain some issues with older OSes that have limited support for standard OpenType layout tables. But there's now no problem at all since Windows XP SP2. Windows 7 has the full support, and for those users that have still not upgraded from Windows XP, Windows 8 will be ready next August with an upgrade cost of about US$40 in the US (valid offer currently advertised for all users upgrading from XP or later), and certainly even less for users in India and Sri Lanka. The above are not any of my complaints. Per capita income in Sri Lanka is $2400. They are content with cell phones. The practical place for computers is the Internet Cafe. Linux is what the vast majority needs. And standard Unicode fonts with free licences are already available for all systems (not just Linux for which they were initially developed); Yes, only 4 rickety ones. Who is going to buy them anyway? Still Iskoola Pota made by Microsoft by copying a printed font is the best. You check the Plain Text by mixing Singhala and Latin in the Arial Unicode MS font to see how pretty Plain text looks.
They spent $2 or 20 million for someone to come and teach them how to make fonts. (Search ICTA.lk). Staying friendly with them is profitable. The World Bank backs you up too. Sometime in the 1990s when I was in Lanka, I tried to select a PC for my printer brother. We wanted to buy Adobe, Quark Express etc. The storekeeper gave a list and asked us to select the programs. Knowing that they are expensive, I asked him first to tell me how much they cost. He said that he would install anything we wanted for free! The same trip coming back, in Zurich, the guys tried to give me an illicit copy of Windows OS in appreciation for installing German and Italian (or French?) code pages on their computers. There even exist solutions for older versions of iPhone 4, or on Android smartphones and tablets. Mine works in them with no special solution. It works anywhere that supports Open Type -- no platform discrimination. No one wants to get back to the situation that existed in the 1980's when there was a proliferation of non-interoperable 8 bit encodings for each specific platform. I agree. Today, 14 languages, including English, French, German and Italian, all share the same character space called ISO-8859-1. Romanized Singhala uses the same. So, what's the fuss about? The font? Consider that as the oft suggested IME. Haha! And your solution also does not work in multilingual contexts; If mine does not work in some multilingual context, neither do any of the 14 languages I mentioned above, including English and French. it does not work with many protocols or i18n libraries for applications. i18n is for multi-byte characters. Mine are single-byte characters. As you see, the safest place is SBCS. Or it requires specific constraints on web pages requiring complex styling everywhere to switch fonts. Did you see http://www.lovatasinhala.com? Maybe you are confusing Unicode Sinhala and romanized Singhala. Unicode Sinhala has a myriad such problems. That is why it should be abandoned!
Please look at the web site and say it more coherently, if I misunderstood you. You are once again confusing the Sinhalese language with the Sinhalese script. Maybe Latin-1 is a good and sufficient script for transcribing the language. But Unicode is not made for standardizing transliterations. The script is what is being encoded, the way it is. Even if this script is defective in some aspect for the language. As long as your transliteration scheme using Latin letter encodings is showing Latin letters, it will be fine. You are very kind. So now I have fulfilled your order by providing a link on the right side of the page to get rid of the Singhala font. But a font that represents Latin letters using Sinhalese glyphs is definitely broken. It will not work within multilingual contexts except when using many font switches
Re: Romanized Singhala - Think about it again
of the entire Latin repertoire. You don't have to tell that to me. I have traveled quite a bit in the IT world. Don't be surprised if it is more than what you've seen. (Did you forget that earlier you accused me of using characters outside ISO-8859-1 while claiming I am within it? That is because you saw IAST and PTS displayed. They use those wonderful letters, symbols and diacritics you are trying to tout.) Is there a problem with Asians using ISO-8859-1 code space even for transliteration? The bonus will be that you can still write the Sinhalese language with a romanisation like yours, Bonus? but there's no need to reinvent the Sinhalese script The Singhala script existed many, many years before the English and French adopted Latin. What I did was save it from the massacre going on with Unicode Sinhala. itself that your encoding is not even capable of completely supporting in all its aspects (your system only supports a reduced subset of the script). What is the basis for this nonsense? (Little birds whispering in the background. Watch out. They are laughing). My solution supports the entire script, Singhala, Pali and Sanskrit plus two rare allophones of Sanskrit as well. Tell me what it lacks and I will add it, haha! One time you said I assigned Unicode Sinhala characters to the 'hack' font. What I do is assign Latin characters to Singhala phonemes. That is called transliteration. There are no 'contextual versions' of the same Singhala letters like you said earlier. Ask your friends what they have more than mine in the Singhala script. Ask them why they included only two ligatures when there are 15 such. Ask them how many Singhala letters there are. Even the legacy ISCII system (used in India) is better, because it is supported by a published open standard, for which there's a clear and stable conversion from/to Unicode. My solution is supported by two standards: ISO-8859-1 and Open Type.
ISO-8859-1 is Basic Latin plus the Latin-1 Extension part of the Unicode standard. The bottom line is this: If Latin-1 is good enough for English and French, it is good enough for Singhala too. And if Open Type is good for English and French, it is good for Singhala too. 2012/7/5 Naena Guru naenag...@gmail.com: Philippe, My last message was partial. It went out by mistake. I'll try again. It takes very long for this old man. -- Forwarded message -- From: Naena Guru naenag...@gmail.com Date: Wed, Jul 4, 2012 at 10:32 PM Subject: Re: Romanized Singhala - Think about it again To: verd...@wanadoo.fr Hi, Philippe. Thanks for keeping engaged in the discussion. Too little time spent could lead to misunderstanding. On Wed, Jul 4, 2012 at 3:42 PM, Philippe Verdy verd...@wanadoo.fr wrote: 2012/7/4 Naena Guru naenag...@gmail.com: Philippe Verdy, obviously has spent a lot of time Not a lot of time... Sorry. researching the web site and even went as far as to check the faults of the web service provider, Godaddy.com. I did not even note that your hosting provider was that company. I just looked at the HTTP headers to look at the MIME type and charset declarations. Nothing else. I know that the browser tells it. It is not a big deal, WOFF is the compressed TTF, but TTF gets delivered. If and when GoDaddy fixes their problem, the pages get delivered faster. Or I can make that fix in a .htaccess file. No time! He called my font a hack font without any proof of it. It is really a hack. Your font assigns Sinhalese characters to Latin letters (or some punctuations) of ISO 8859-1. My font does not have anything to do with Singhalese characters if you mean Unicode characters. You are very confusing. A Character in this context is a datatype. In the 80s it was one byte in size and was used to signal that it should not be used in arithmetic. (We still did it to convert between Capitals and Simple forms.) In the Unicode character database, a character is a numerical position.
A Unicode Sinhala character is defined in Hex [0D80 - 0DFF]. Unicode Sinhala characters represent an incomplete hotchpotch of ideas of letters, ligatures and signs. I have none of that in the font. I say and know that Unicode Sinhala is a failure. It inhibits use of Singhala on the computer and the network. I do not concern myself with fixing it because it cannot be fixed. The only thing I did in relation to it is write an elaborate set of routines to *translate* (not map) between constructs of Unicode Sinhala characters and romanized Singhala. That is not in the font. The font has lookup tables. It also assigns contextual variants of the same abstract Sinhalese letters, to ISO 8859-1 codes, What contexts cause what variants? Looks like you are saying Singhala letters change. plus glyphs for some ligatures of multiple Sinhalese letters to ISO 8859-1 codes, plus it reorders these glyphs so that they no longer match
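The hex range stated in this message is the Unicode Sinhala block; it can be checked against the standard character database (a trivial sketch, independent of the dispute):

```python
import unicodedata

# The Sinhala block occupies U+0D80..U+0DFF.
def is_sinhala(ch: str) -> bool:
    return 0x0D80 <= ord(ch) <= 0x0DFF

print(unicodedata.name("\u0D9A"))
# → SINHALA LETTER ALPAPRAANA KAYANNA
print(is_sinhala("\u0D9A"), is_sinhala("k"))
# → True False
```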
Romanized Singhala - Think about it again
Pardon me for including a CC list. These are people who showed opinions for and against. On this 4th of July, let me quote James Madison: A zeal for different opinions concerning religion, concerning government, and many other points, as well of speculation as of practice; an attachment to different leaders ambitiously contending for pre-eminence and power; or to persons of other descriptions whose fortunes have been interesting to the human passions, have, in turn, divided mankind into parties, inflamed them with mutual animosity, and rendered them much more disposed to vex and oppress each other than to co-operate for their common good. I gave much thought to why many here at the Unicode mailing list reacted badly to my saying that the Unicode solution for Singhala is bad. Earlier I said the Plain Text idea is bad too. The responses came more as attacks on *my* solution than in defense of Unicode Singhala. The purpose of designating naenaguru@gmail.com as a spammer is to prevent criticism. It is shameful that a standards organization belonging to corporations of repute resorts to censorship like bureaucrats and academics of little Lanka. * I ask you to reconsider:* As a way of explaining Romanized Singhala, I made some improvements to www.LovataSinhala.com. Mainly, it now has near the top of each page a link that says, ’switch the script’. That switches the base font of the body tag of the page between the Latin and Singhala typefaces. *Please read the smaller page that pops up.* I also verified that I hadn’t left any Unicode characters outside ISO-8859-1 in the source code -- HTML, JavaScript or CSS. The purpose of declaring the character set as iso-8859-1 rather than utf-8 is to avoid doubling and trebling the size of the page by utf-8. I think, if you have characters outside iso-8859-1 and declare the page as such, you get Character-not-found for those locations. (I may be wrong).
Philippe Verdy obviously has spent a lot of time researching the web site and even went as far as to check the faults of the web service provider, Godaddy.com. He called my font a hack font without any proof of it. It has only characters relevant to romanized Singhala within the SBCS. Most of the work was in the PUA and look-up tables. I am reminded of Inspector Clouseau, who has many gadgets and in the end finds himself to be the culprit. I will still read and try those other things Philippe suggests, when I get time. What is important for me is to improve on orthography rules and add more Indic languages -- Devanagari and Tamil coming up. As for those who do not want to think rationally and think Unicode is a religion, I can only point to my dilemma: http://lovatasinhala.com/assayaa.htm Have a Happy Fourth of July!
Re: Romanized Singhala - Think about it again
Philippe, ask your friends why ordinary people Anglicize if Unicode Sinhala is so great. See just one of many community forums: http://elakiri.com I know you do not care about a language of 15 million people, but it matters to them. On Wed, Jul 4, 2012 at 10:46 PM, Philippe Verdy verd...@wanadoo.fr wrote: You are alone to think that. Users of the Sinhalese edition of Wikipedia do not need your hack or even webfonts to use the website. It only uses standard Unicode, with very common web browsers. And it works as is. For users that are not preequipped with the necessary fonts and browsers, Wikipedia indicates this very useful site: http://www.siyabas.lk/sinhala_how_to_install_in_english.html I have two guys here in the US who asked me to help get rid of Unicode Sinhala that I helped them install from that 'very useful site'. Copies of this message go to them. Actually, you do not need their special installation if you have Windows 7. Windows XP needs an update of Uniscribe, and Vista too. Their installation programs are faulty and interfere with your OS settings. This solves the problem at least for older versions of Windows or old distributions of Linux (now all popular distributions support Sinhalese). No web fonts are even necessary (WOFT works only in Windows but not in older versions of Windows with old versions of IE). You mean WEFT? Now TTF (OTF) are compressed into WOFF. I see that Microsoft is finally supporting it. (At least my font downloads, or maybe it picks up the font in my computer? Now I am confused) Everything is covered: working with TrueType and OpenType, adding an IME if needed. And then navigating on standard Sinhalese websites encoded with Unicode. Philippe, try making a web page with Unicode Sinhala. Note that for versions of Windows with IE versions older than IE6 there is no support, only because these older versions did not have the necessary minimum support for complex scripts.
The alternative is to use another browser such as Firefox which uses its own independent renderer that does not depend on Windows Uniscribe support. But these users are now extremely rare. Almost everyone now uses at least XP for Windows (Windows 95/98 are definitely dead), or uses a Mac, or a smartphone, or another browser (such as Firefox, Chrome, Opera). I agree. Nobody except you supports your tricks and hacks. You come really too late, trying to solve a problem that no longer exists, as it has long since been solved for Sinhalese. Mine is a comprehensive solution. It is a transliteration. Ask users who compared the two. Find ordinary Singhalese. They use Unicode Sinhala to read news web sites. The rest of the time they Anglicize or write in English. Everything is covered here too, buddy. Adobe apps since 2004, Apple since 2004, Mozilla since 2006, all other modern browsers since 2010. MS Office 2010. Abiword, gNumeric, Linux all the works. IE 8,9 partial. IE 10 full. So? 2012/7/5 Naena Guru naenag...@gmail.com: Hi, Philippe. Thanks for keeping engaged in the discussion. Too little time spent could lead to misunderstanding. On Wed, Jul 4, 2012 at 3:42 PM, Philippe Verdy verd...@wanadoo.fr wrote: 2012/7/4 Naena Guru naenag...@gmail.com: Philippe Verdy, obviously has spent a lot of time Not a lot of time... Sorry. researching the web site and even went as far as to check the faults of the web service provider, Godaddy.com. I did not even note that your hosting provider was that company. I just looked at the HTTP headers to look at the MIME type and charset declarations. Nothing else. I know that the browser tells it. It is not a big deal, WOFF is the compressed TTF, but TTF gets delivered. If and when GoDaddy fixes their problem, the pages get delivered faster. Or I can make that fix in a .htaccess file. No time! He called my font a hack font without any proof of it. It is really a hack.
Your font assigns Sinhalese characters to Latin letters (or some punctuations) of ISO 8859-1. My font does not have anything to do with Singhalese characters if you mean Unicode characters. You are very confusing. A Character in this context is a datatype. In the 80s it was one byte in size and was used to signal that it should not be used in arithmetic. (We still did it to convert between Capitals and Simple forms.) In the Unicode character database, a character is a numerical position. A Unicode Sinhala character is defined in Hex [0D80 - 0DFF]. Unicode Sinhala characters represent an incomplete hotchpotch of ideas of letters, ligatures and signs. I have none of that in the font. I say and know that Unicode Sinhala is a failure. It inhibits use of Singhala on the computer and the network. I do not concern myself with fixing it because it cannot be fixed. The only thing I did in relation to it is write an elaborate set of routines to *translate* (not map) between constructs of Unicode Sinhala
Re: Offlist: complex rendering
was through US-International and shapes were automatic. There is a very important function of the font other than the visual help for typing and reading. It also upholds and preserves dying orthographies. It should be developed into a true orthographic font. (Help and time?). Singhala has three orthographies. Sinhala words do not have joined letters. Sanskrit and Pali have a set of double and treble conjoint letters. In addition, Pali eliminates the hal sign within the words and allows only the halant. Pali needs its own fonts. Below are links to two files that show the first paragraph of this web page: http://www.divaina.com/2012/06/17/scholast.html Unicode Sinhala: http://ahangama.com/sing/DBS.htm (4 kB) Romanized Singhala: http://ahangama.com/sing/DSS.htm (1 kB) Compare the shape formation and the sizes of the files. How much bandwidth is taken for the Unicode Sinhala file to go as UTF-8? 6kB! Six times the romanized file. Beyond that, imagine how that Unicode Singhala page was made from scratch and how many more steps were needed to get there than the Latin-1 page. If you closely inspect the original page from Divaina, you see that they did not input Unicode Sinhala directly but used an intermediary step. (There are two stray English letters). These are things that matter for ordinary citizens, not university dons paid and venerated by those same poor citizens. I have only limited time. So please do not expect me to reply to the entire barrage of attack. (I think I am already banned as a spammer for repeating, or not to hear the unpleasant truth?). Thank you. PS: I added Donald Gaminitilake to the list as he had a third solution for Singhala and was one I watched pummeled by the crowd in Lanka. On Mon, Jun 18, 2012 at 2:09 AM, Ruvan Weerasinghe a...@ucsc.cmb.ac.lk wrote: Not that it is of much consequence, but even the website cited for 'Anglicizing' Sinhala (http://www.sinhalaya.com/) now has a majority of content in UNICODE!
I couldn't really find posts using images on this site - not that I tried hard to find them! 5 years ago, this argument may have had some attention from those not conversant with UNICODE, but today, with the multitude of blogs, wikis (Sinhala Wikipedia has 7k entries now, in case you didn't know), google search in Sinhala, only the mischievous could insist we need to go back. Incidentally, please visit the following link to see the hit counts for common Sinhala words in a google search in August/September 2009. Clicking any of the words will give you a google hit count for that word today. This should give a good idea of the spread of UNICODE Sinhala: http://ucsc.cmb.ac.lk/wiki/index.php/LTRL:Web_Statistics_(Online_Sinhala)/August_September (e.g. the word අනුරාධපුරය which had 2500 - 4000 hits that time, has over 32,000 today; the word ඉදිරිපත් which had 60k+ hits then has over 600k hits now) Regards. Ruvan Weerasinghe University of Colombo School of Computing Colombo 00700, Sri Lanka. Web:http://www.ucsc.lk Phone: +94112158953; Fax:+94112587239 -- *From: *Harshula harsh...@gmail.com *To: *Naena Guru naenag...@gmail.com, jc ahang...@gmail.com *Cc: *Tom Gewecke t...@bluesky.org, unicode@unicode.org, Tissa Dharmagunaratne tiss...@aol.com, Ranjith Ruberu antonyrub...@hotmail.com, Kusum Perera kusum...@gmail.com, Bertie Fernando bertieferna...@hotmail.com, Ruvan Weerasinghe a...@ucsc.cmb.ac.lk, Gihan Dias gi...@cse.mrt.ac.lk, Wasantha Deshapriya wasan...@icta.lk *Sent: *Monday, June 18, 2012 10:07:14 AM *Subject: *Re: Offlist: complex rendering Hi JC, You have been making the same allegations for more than half a decade. Now you have moved on to a new forum, the Unicode Consortium. The reality is that all the professionals and academics that work in computational linguistics, Sinhala localization, etc in Sri Lanka are on-board with Unicode Sinhala. We are seeing research and applications developed on top of Unicode Sinhala.
IIRC, back then you were unable to demonstrate the shortcomings of Unicode Sinhala that your scheme solved. If you have complaints about operating systems that do not implement Unicode Sinhala correctly, please contact the specific company. cya, # On Fri, 2012-06-15 at 01:45 -0500, Naena Guru wrote: Tom, Thank you for taking an interest in this matter. You said, Mapping multiple scripts to Latin-1 codepoints is contrary to the most basic principles of Unicode and represents a backwards technology leap of 20 years or more. Well, do you otherwise agree that the transliteration is good? It can be typed easily, and certainly not like the Unicode Indic transliteration that is only good for Aliens to discover some day. Unicode has a principle about shapes assigned to characters. It is the opposite of what you said. At the time I started this project Unicode version 2 specifically said that it does
Re: Offlist: complex rendering
Tom, Thank you for taking an interest in this matter. You said, Mapping multiple scripts to Latin-1 codepoints is contrary to the most basic principles of Unicode and represents a backwards technology leap of 20 years or more. Well, do you otherwise agree that the transliteration is good? It can be typed easily, and certainly not like the Unicode Indic transliteration that is only good for Aliens to discover some day. Unicode has a principle about shapes assigned to characters. It is the opposite of what you said. At the time I started this project Unicode version 2 specifically said that it does not define shapes. That is the reason I tried it. Think of it as a help for the person that types. I tested it on real people. They are unaware that the underlying codes are that of Latin. They are surprised and elated. So, if you are so averse to changing the shapes of Latin-1, what would you say about Fraktur and Gaelic that the standard specifically said are based on Latin-1 but have different shapes? You said, It doesn't seem realistic to me that it could ever see acceptance, and I'm a bit surprised that you continue to devote your talents to promoting it. Is there some reason you consider it to be promising nonetheless? (Thank you for calling me talented. I am not). It depends on whose acceptance you are talking about. You'll understand if you are a Singhalese, Tom. The leap 20 years back is what we need. Unicode parked us in a cul de sac. BTW, I haven't even started to promote it. I want the IT community to say this works, as it really does. Think why people Anglicize in this very popular web site: www.sinhalaya.com There are many such. (try elakiri.com) You will see some Unicode Sinhala, but most posts are written using hack fonts and made into graphics to post. The Lankan government is so worried that they have launched a program to teach English to everyone perhaps seeing the demise of Singhala due to digital creep. (Wisdom of politicians!). 
Also look at the web site of the IT agency of the government: http://www.icta.lk/ How much prominence did they give the language of the 70%? The bureaucrats are giving themselves medals. (See the pictures). They are making laws forcing the government employees to use Unicode Singhala, because they are reluctant. It's a Third World country. The literacy rate is 90% plus, not a little India. But the people are docile. They depend on the government to tell them what to do. The bureaucracy in return depends on the West to tell them what is right. The technocrats call themselves යුනිකේත (Love UNI!) Yes, Tom, I do have a very good reason. I know it because I am a Singhalese. It is *practical* and being accepted and commended by everyone that I showed it to. If English, German, Spanish, Icelandic, Danish etc. use Latin-1, and if Singhala *can* perfectly map to Latin-1, why shouldn't it? That is called transliteration. Recall that English fully romanized about year 600. Singhala is a minority language that is scheduled to be executed, and Unicode is unwittingly the reason. Brahmi probably is Old Singhala. The oldest Brahmi was found in Shree Langkaa (Sri Lanka) 2-3 centuries before it was seen in India. Some say Singhalese founded the Mayans. (What a chauvinist!). So, let's give it a boost before World Ends. I need the support of Unicode, which is like World Government for Laangkans. This is what I want Unicode to judge: - Is the transliteration practical? - Do I have a round trip conversion with precious Unicode Sinhala? Help us, Tom. This message is getting too long. I can list pros and cons of Dual-script Singhala and Unicode Sinhala to convince any techie why we should forget Unicode Sinhala. Let me end with a quote from SICP http://mitpress.mit.edu/sicp/full-text/book/book.html Educators, generals, dieticians, psychologists, and parents program. Armies, students, and some societies are programmed.
An assault on large problems employs a succession of programs, most of which spring into existence en route. These programs are rife with issues that appear to be particular to the problem at hand. To appreciate programming as an intellectual activity in its own right you must turn to computer programming; you must read and write computer programs -- many of them. It doesn't matter much what the programs are about or what applications they serve. What does matter is how well they perform and how smoothly they fit with other programs in the creation of still greater programs. *The programmer must seek both perfection of part and adequacy of collection.* Do we want to be programmed or be programmers? Is the collection adequate? Best regards, JC On Thu, Jun 14, 2012 at 8:08 AM, Tom Gewecke t...@bluesky.org wrote: naenaguru wrote: Map sounds to QWERTY extended key layouts adding non-English letters - Result: strict, rule based alphabet extending from ASCII to Latin-1 - Mapping multiple scripts to Latin-1 codepoints is contrary to the most basic
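The round-trip question raised above can be made concrete with a small sketch. This is only an illustration under assumed mappings: the two-entry table (Sinhala kayanna to 'k', ttayanna to 'T') is hypothetical and is not the actual Dual-script Singhala scheme; a real transliteration covers the whole repertoire, including vowel signs and the virama.

```python
# Sketch of a round-trip check between Unicode Sinhala and a Latin
# romanization. The two-entry mapping below is hypothetical; a real
# scheme maps the full Sinhala repertoire.

TO_LATIN = {
    "\u0D9A": "k",   # U+0D9A SINHALA LETTER ALPAPRAANA KAYANNA -> 'k' (assumed)
    "\u0DA7": "T",   # U+0DA7 SINHALA LETTER ALPAPRAANA TTAYANNA -> 'T' (assumed)
}
TO_SINHALA = {latin: sinh for sinh, latin in TO_LATIN.items()}

def romanize(text):
    """Map each Sinhala character to its Latin code, pass others through."""
    return "".join(TO_LATIN.get(ch, ch) for ch in text)

def deromanize(text):
    """Inverse mapping: Latin codes back to Sinhala characters."""
    return "".join(TO_SINHALA.get(ch, ch) for ch in text)

def round_trips(text):
    """True if transliterating there and back loses nothing."""
    return deromanize(romanize(text)) == text
```

A one-to-one table like this is what makes the round trip lossless; any many-to-one mapping would break the `round_trips` property.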
Re: complex rendering (was: Re: Mandombe)
Extremely interesting! Ordinary people need practical solutions. Thank you for trying.

I made the first smartfont for Singhala in 2004. (Perhaps the first ever for any written language -- a little bragging there.) Only SIL.org WorldPad could show it then. Now, starting with Windows Notepad, Office 2010, AbiWord and all the browsers *except IE* render the complex letters perfectly. (Er, Google Chrome, nearly there): http://www.lovatasinhala.com (hand coded) and http://www.ahangama.com/ (WordPress blog).

I think Anderson should try OpenType and MS VOLT like I did. Like him, I am no typographer. It is the programming that is most important in getting out the orthography. It took me nine months, day and night, to get it to where it is now. Redesigning glyphs or making other font faces is trivial, now that the rendering rules are defined. My approach was as follows (it applies to Indic, and perhaps Arabic and Hebrew too):
- Start by observing Anglicizing
- Map sounds to QWERTY extended key layouts, adding non-English letters
- Result: a strict, rule-based alphabet extending from ASCII to Latin-1
- Font: program all shapes, base as well as conjoint, into the PUA
- Program look-up tables according to the orthography rules in the grammar of the language.

On Mon, Jun 11, 2012 at 5:27 PM, vanis...@boil.afraid.org wrote: From: Szelp, A. Sz. a.sz.szelp_at_gmail.com On Mon, Jun 11, 2012 at 10:58 AM, Stephan Stiller sstiller_at_stanford.edu wrote: This is interesting only if the encodable elements would be different - remember, Unicode is not a font standard. +1; rendering can be so much more complex than encoding. I'd really like to see a successful renderer for Nastaliq, (vertical) Mongolian, or Duployan. (What *are* the hardest writing systems to render?) Vertical Mongolian does not seem to be harder to render _conceptually_ than, let's say, simple Arabic. It's more the architectural limitations of rendering engines that seem to limit its availability, and the intermixing with horizontal text.
For Nastaliq, Thomas Milo's DecoType is miraculous: it's hard, but given the good job they did, obviously not impossible. — Well, I don't know about Duployan. /Sz I guess this is my invitation to chime in. I'm close to releasing a beta of a Graphite engine for (Chinook repertoire) Duployan, using a PUA encoding. By the release of 6.3/7.0, we should have a working implementation of Unicode Duployan/shorthand rendering for Graphite enabled applications. Like a Nastaliq implementation, it's convoluted and involved, but not impossible. It will not, however, be nearly as beautiful as DecoType; I'm not a designer at heart, and a Duployan implementation as stunning as Milo's Nastaliq will require the skills of people several orders of magnitude more talented than I. -Van Anderson
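The last two steps of the approach above (shapes in the PUA plus look-up tables) can be caricatured in a few lines. This is a toy model only, not OpenType or MS VOLT: the glyph names and the ligature table are invented, and a real GSUB lookup operates on glyph IDs after cmap mapping, with much richer context rules.

```python
# Toy model of a smartfont substitution pass: a lookup table maps
# character pairs to single (conjunct) glyph names, and a ZWNJ between
# the two characters blocks the substitution -- loosely how an OpenType
# GSUB ligature lookup behaves. Glyph names are made up for illustration.

ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER

LIGATURES = {
    ("s", "h"): "sh_conjunct",   # hypothetical conjunct glyph
    ("t", "h"): "th_conjunct",   # hypothetical conjunct glyph
}

def shape(text):
    """Return the glyph sequence for `text` after ligature substitution."""
    glyphs, i = [], 0
    while i < len(text):
        pair = tuple(text[i:i + 2])
        if pair in LIGATURES:
            glyphs.append(LIGATURES[pair])
            i += 2
        elif text[i] == ZWNJ:
            i += 1               # invisible; its presence broke the pair match
        else:
            glyphs.append(text[i])
            i += 1
    return glyphs
```

Typing `sh` yields the conjunct glyph, while `s` + ZWNJ + `h` keeps the two letters separate, which is the usual role of ZWNJ in shaping.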
Re: Unicode, SMS and year 2012
Hi Cristian,

This is a bit of a deviation from the issues you raise, but it relates to the subject in a different way. The SMS char set does not seem to follow Unicode. How I see Unicode is as a set of character groups: 7-bit, 8-bit (which extends and replaces 7-bit), 16-bit, and CJKV, which uses some sort of 16-bit pairing. As Unicode says, they are just numeric codes assigned to letters or whatever other ideas. It is the task of the devices to decide what they are and show them.

You say that there are only two character sets in GSM: 7-bit, which is a reassignment of codes to a selection of Latin letter shapes, and 16-bit for the rest. It appears as if they decided that a certain set of letters is common to some preferred markets, and that it is efficient to reassign the established Unicode characters to this newly selected set of letter shapes. Had they simply used the 8-bit ISO-8859-1 set, the number of characters per SMS would be limited to 140 instead of 160. (Is that why Twitter limits the number of characters to 140?) Of course, that would not have covered some users whose letters are 16-bit characters under Unicode.

I made a comprehensive transliteration for the Singhala script (Singhala + Sanskrit + Pali). It shows perfectly when 'dressed' with a smartfont. The following are two web sites that illustrate this solution (every character is ISO-8859-1, except for the occasional ZWNJ, which actually should be the 8-bit NBH that somebody decided to leave undefined. Use any browser except IE; IE does not understand OpenType):
http://www.lovatasinhala.com (hand coded)
http://www.ahangama.com/ (WordPress blog)

All Indic languages could be transliterated this way. It makes Indic similar to Latin-based European languages, with intuitive typing and orthographic results, which Unicode Sinhala can't do. It takes about half the bandwidth to transmit compared to the double-byte set.
I just noticed that transliterated Singhala would not be fully covered by SMS 7-bit, because some Unicode 8-bit characters are not in this set.

Looking at my iPhone, I see that the International icon brings up key-layout plus font pairs. I think what they should do is separate fonts and key layouts. This way, the user could select the key layout for input and whatever font they want to use to show it.

The next thing I am going to say made many readers here very angry, but may I say it again? The idea of the Last Resort Font that makes basic editors Plain Text is a ploy to brag that the computer can show all the world's languages, most of which you cannot read anyway. The text runs of foreign languages should show as a series of Glyph Not Found characters or the specific hint glyph of a language. The user of a foreign language would know where to download fonts for their native language. In the small market of Singhala, no font is present that goes typographically well with Arial Unicode. There is no incentive or money to make beautiful fonts for a minority language like Singhala. The plain-text result for Singhala is ugly. The OS makers unnecessarily made hodge-podge Last Resort Fonts.

I hope both the mobile device industry and the PC side separate fonts and characters and allow users to decide the default font sets on their devices. This is eminently rational because the rendering of the font happens locally, whereas the characters travel across the network. This will also help those like me who understand that their language is better served by a transliteration solution than by a convoluted double-byte solution that discourages the natives from using their script. Actually, this is causing bilingual Singhalese to abandon their native language. The government is placing special emphasis on English, as Singhala is terribly difficult to use in the modern setting. This is a grave problem for a society with a near-100% literacy rate and just a few million people.
On Fri, Apr 27, 2012 at 3:06 AM, Cristian Secară or...@secarica.ro wrote: A few years ago there was a discussion here about Unicode and SMS (Subject: Unicode, SMS, PDA/cellphones). Then and now the situation is the same, i.e. an SMS text message that uses characters from the GSM character set can include 160 characters per message (a stream of 7 bits × 160), whereas a message that uses anything else can include only 70 characters per message (a stream of UCS-2 16 bits × 70). Although my language (Romanian) was and is affected by this discrepancy, I was skeptical then about the possibility of improving anything in the area, mostly because at that time both the PC and mobile markets suffered from other language problems that were critical for me (like missing glyphs in fonts, or improper keyboard implementations). Things have evolved and now the perspectives are much better. Regarding SMS, at that time Richard Wordingham pointed out that SCSU might be a proper solution for SMS encoding [when it comes to non-GSM characters]. Recently I studied as many aspects as I could of SMS standardization, in a step that I started approx a
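The arithmetic behind the 160/140/70 figures in this exchange is simply the fixed 140-octet SMS user-data field divided by the code-unit size of each encoding; a sketch:

```python
# Capacity of a single SMS segment: GSM 7-bit packs 160 septets into
# 140 octets; UCS-2 fits 70 16-bit units in the same 140 octets; an
# 8-bit encoding (e.g. ISO-8859-1, as proposed above) would cap at 140.

def chars_per_segment(encoding):
    """Maximum characters in one SMS segment for a given encoding."""
    payload_bits = 140 * 8          # SMS user data field: 140 octets
    unit_bits = {"gsm7": 7, "8bit": 8, "ucs2": 16}[encoding]
    return payload_bits // unit_bits

print(chars_per_segment("gsm7"))    # 160
print(chars_per_segment("8bit"))    # 140
print(chars_per_segment("ucs2"))    # 70
```

This is also why a single non-GSM character in a message drops the whole segment to 70 characters: the encoding applies to the entire segment, not per character.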
Re: combining: half, double, triple et cetera ad infinitum
Unicode was created for a commercial reason, particularly for the benefit of its directors. The idea of Plain Text is not anything practical but was used as a means of attracting supporters, who for the most part hadn't had any experience with computers. The following line is Unicode text: මේ අකුරු ලියා ඇත්තේ යුනිකෝඩ් අකුරෙනි. I bet most of you see it as a row of character-not-found glyphs. Some would see it in the non-Latin script, but separated into the meaningless components that go to make up letters. So, the headache you are talking about within Latin becomes a tumor that drives you insane once outside the SBCS.

On Mon, Nov 14, 2011 at 6:37 AM, QSJN 4 UKR qsjn4...@gmail.com wrote: Why did the Unicode Consortium think that a combination of one base character and a few combining characters is possible, but a combination of a few base characters with one combining character is not? E.g. U+0483 (the titlo) has to cover a number -- the whole number, not one digit! What do we have it for, if we can't use it the right way? Half combining marks are a partial answer... And what if we are dealing with something other than a simple horizontal line? What if we need the same tilde above a number, or a titlo above an abbreviation? Is text with numbers and abbreviations not plain text? An idea (a bad one, I think, but there are many other stupid things in Unicode): add some new combining characters with new ccc values = word-above, word-below, word-over (ZWS works as a word separator if needed).
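For what it is worth, Unicode does already have two devices that bear on QSJN's question of one mark spanning two bases: double-width combining marks and paired half marks. A small illustration (whether they render as one continuous mark depends entirely on font support):

```python
# Two existing Unicode approaches to placing one mark over two base
# characters: a double-width mark (U+0360 COMBINING DOUBLE TILDE spans
# the preceding and following base) and paired half marks (U+FE22 /
# U+FE23, the left and right halves of a double tilde).
import unicodedata

double = "a\u0360b"          # one tilde drawn across 'a' and 'b'
halves = "a\uFE22b\uFE23"    # left half on 'a', right half on 'b'

print(unicodedata.name("\u0360"))   # COMBINING DOUBLE TILDE
print(unicodedata.name("\uFE22"))   # COMBINING DOUBLE TILDE LEFT HALF
print(unicodedata.name("\uFE23"))   # COMBINING DOUBLE TILDE RIGHT HALF
```

Neither device scales to a mark over an arbitrary-length run (a titlo over a whole number), which is precisely the gap the quoted message complains about.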
Re: Subtitles in Indic/Arabic/other scripts requiring CTL
Shriramana,

The question you raise relates to the problem of font rendering. According to the OpenType standard, each script (Latin, Tamil etc.) has its own rules for how letters are constructed and displayed. For instance, when you write 'ke' in Tamil, the kavanna is preceded by the kombu. That is a Tamil-specific rule. The same rule is applied separately to Devanagari, Sinhala etc., because each script has its own code page. All this makes each script have its own set of font-rendering rules. If some application does not know the Tamil font-rendering rules, then you won't get the result you want.

Getting back to the example I started with: the kombu is a sign not tied directly to a vowel, and the kavanna is a base letter. Why on earth did the kombu get a codepoint equal in status to a letter? Initially, the Indic problem was interpreted taking the Latin alphabetic writing system as the basis. We simply define the alphabet and string the letters (shapes) together to make the text. When they tried to encode Indic, they were completely thrown off track by the way vyaakarana books showed the hodiya in the native script. They should have looked at transliteration schemes such as HK-Sanskrit. The hodiya is the phoneme chart built around Sanskrit. In it, the hal letters, or consonants, were displayed as the shapes of ka, ga etc. Originally in Indic, both the pure consonant and the letter with the implicit 'a' had the same form. The text in the vyaakarana book itself, if anyone cared to read it, treated hal as consonants and not as some mysterious thing that carried an 'a' that had to be smoked out with a virama. Originally people did not use the virama or halant (or pulli in Tamil) inside words. Halant and virama mean: this mark indicates that the last letter of the word is a consonant. In Singhala, the Tamil word 'uppu' would be written u+p-pu, where p and pu *touched*, indicating that the p is a pure consonant.
When you write megam, the ending m would have a hal kiriimee lakuna (the pulli, the virama). If it did not have the pulli, then it would be megama. I am not sure how Tamil dealt with this, but by searching for 'Tamil palm leaf book' you can find many pictures of old manuscripts that you could inspect to see how Tamil had it.

In planning the Unicode code page, we look at Tamil and decide: okay, 'ke' needs a kombu and a kavanna. Let's then give a codepoint to the kombu as well. Now write k followed by e, and we get kavanna followed by kombu -- the wrong order. Therefore, we write a special rule for Tamil saying that if you follow k with e, replace the k already on screen with e and k, in that transposed order, and some more reasoning and complications get the result. Unicode is a disaster for our languages. You cannot sort, backspace-delete, search and replace and so on as we do with English -- no telephone books, at least for Singhala, my native language.

Every application knows, or is supposed to know, the Latin font-rendering rules. You test your program first with Latin, right? The latest rules for font making are in the OpenType standard published in 2003 by Microsoft. Nearly all new versions of common programs understand the Latin script rules in the OpenType standard, except Internet Explorer. The solution is to first transliterate Tamil into the Latin script and to write a smartfont. Then nearly all recent versions of programs will let you show Tamil correctly, except Internet Explorer. Internet Explorer does not understand the OpenType standard. (Microsoft wrote the standard.)

People in the West, and those who came to the West, do not know the finer points of our languages. It is our onus to get our own solution, Shriramana, and it is not hard. I did this for Singhala. See the following web site. It is all romanized Sinhala in the background, shown in the complex native script. Do not use Internet Explorer.
If you use Firefox, you may have to click on a second page to see the native Singhala smartfont dress the Latin script. This obviates the need for a separate code page for Singhala: http://www.lovatasinhala.com/liyanna.php On Fri, Nov 11, 2011 at 8:27 AM, Shriramana Sharma samj...@gmail.comwrote: My father (Dr T Vasudevan, cc-ed here) is working on a spoken (actually video) tutorial project using Indian languages as the instruction medium. He would like to add Indic language subtitles in (obviously) Indic scripts. For now, Tamil, both as the language he is working on (as it is our mother tongue) as well as the Indic script which is simplest in terms of CTL. However it seems current video players (at least two OSS ones -- MPlayer and VideoLAN that they are using in that project) do not support CTL. In speaking about this with another friend of mine he was wondering whether Arabic people have worked on getting Arabic subtitles, as it would also involve CTL. Can anyone shed light on CTL in subtitles? Perhaps Arabic or hopefully Indic? -- Shriramana Sharma
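The kombu reordering described in this message (memory or "logical" order is consonant + vowel sign; display or "visual" order puts the sign before the consonant) can be sketched as a tiny logical-to-visual pass. This is a deliberate simplification: real shaping engines such as Uniscribe or HarfBuzz do this at the glyph level, together with many other contextual rules.

```python
# Logical-to-visual reordering for Tamil's left-side vowel signs.
# In memory, 'ke' is <U+0B95 TAMIL LETTER KA, U+0BC6 TAMIL VOWEL SIGN E>;
# for display, the renderer moves the vowel sign before the consonant.

TAMIL_PREFIXED_SIGNS = {"\u0BC6", "\u0BC7", "\u0BC8"}  # signs e, ee, ai

def to_visual(text):
    """Swap each left-side vowel sign in front of its base consonant."""
    out = list(text)
    for i in range(1, len(out)):
        if out[i] in TAMIL_PREFIXED_SIGNS:
            out[i - 1], out[i] = out[i], out[i - 1]
    return "".join(out)
```

The point of the thread stands either way: if the player's text stack lacks this reordering step (and the analogous Arabic joining logic), subtitles in such scripts come out visibly wrong.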
Re: Purpose of plain text (WAS: Re: combining: half, double, triple et cetera ad infinitum)
If you use the same underlying text code, you could still differentiate languages by fonts. It is a different way of looking at the problem. This is true in the case of transliterated Indic. I haven't given thought to how that could be implemented with BIDI. When you compare the benefits Indic would obtain through transliteration against having their own code sets, you see that living in the SBCS has clear advantages over the DBCS. I have done it, tested it and proven it. In simple terms, it reduces your dependency.

As for making money, some people are better placed to make money and produce things for the main reason of making money. Others might spend effort and their own money knowing they have no good chance of making money, but their work benefits other people or helps extricate them from a bad situation they have fallen into. Making solutions should not be solely for getting rich. We only need to look around to see what such selfish action has done. According to marketing strategy, if you can create a need, you can craft the solution. Look at TV commercials. Thank you.

On Mon, Nov 14, 2011 at 10:18 AM, CE Whitehead cewcat...@hotmail.com wrote: Hi. From: Naena Guru naenaguru_at_gmail.com Date: Mon, 14 Nov 2011 09:30:40 -0600 Unicode was created for a commercial reason, particularly for the benefit of its directors. I expect that it benefits more than just its directors (I don't expect anyone to act completely pro bono; people have to get something out of what they do; sorry for commenting on this). The idea of Plain Text is not anything practical but was used as a means of attracting supporters, who for the most part hadn't had any experience with computers. I do believe that plain text is used in input forms, and the bidi algorithm at least becomes quite important here. (Someone correct me if this is in error.) . . . Best, --C. E. Whitehead cewcat...@hotmail.com
Re: Purpose of plain text (WAS: Re: combining: half, double, triple et cetera ad infinitum)
If it came out as if money-making were Unicode's only goal, that is not what I meant to say. Nothing can be such. You sell something for the buyer's benefit, right? I apologize if you feel hurt over it. However, it is probably the main objective. Who works for nothing, except odd crazies like me? Many voluntary contributions benefit the owners of Unicode. Unicode is open for people like me to speak at. In contrast, the agency in Sri Lanka (ICTA) is closed to the public. Their inspiration is the communist system. That forces me to work on my own. My situation is like helping someone who bought a lemon thinking it is a great Cadillac. I say that the bus others are traveling in is good for him too. When, years back, I asked why ligatures formed inside Notepad and not inside Word, I got the clear reply that it was owing to a business decision.

Let me try to say clearly what I want to say:
1. Unicode came up with the idea of one codepoint for one letter of any language.
2. The justification was that in one text stream you could show all the different languages. At least that is what I understood.
3. Point 2 above is not practical and does not work even now, after so many years.
4. That Indic code pages do not work so well for text processing is not the fault of Unicode but that of the user groups.
5. However, technology arrived in those countries too late for actual users, not bureaucrats, to understand the mistakes.
6. Therefore, I say that there was an undue push by Unicode to complete the standard, by issuing ultimatums for registering ligatures etc.

Having said all that, all is not so bad. I say: transliterate to Latin and make smartfonts. It is a proven success. You said, "This thread presumes that display is, by orders of magnitude, the most important aspect of text processing." That is perfectly met by using smartfonts. As for "every other operation that could be performed on text is secondary", that is beautifully met with fonts too.
I do not understand what you meant by "jury-rigged to accommodate visual display order". Did you mean using unexpected shapes for Latin codes? If you meant that, how do you justify earlier versions of the Unicode standard specifically explaining that codepoints do not represent shapes, and that Fraktur and Gaelic could very well use Latin as their underlying codes? I think the ability to use text in the computer in the way you expect text to behave is very important. For instance, if you have shape representations mapped to code clusters, scanned text could be digitized more accurately. May peace be with you.

On Mon, Nov 14, 2011 at 2:51 PM, Doug Ewell d...@ewellic.org wrote: This thread presumes that display is, by orders of magnitude, the most important aspect of text processing, and that every other operation that could be performed on text is secondary. Different writing systems all represented as font changes, using the same character codes. Backing text streams jury-rigged to accommodate visual display order. This is where the real insanity lies, for anyone who needs to do more with text than display it. The continued accusations that Unicode has been, and is only, a money-making endeavor for someone are too reprehensible and too far removed from reality to merit a response. -- Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14 www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell
Re: Continue: Glaring mistake in the code list for South Asian Script / Naena Guru
On Thu, Nov 10, 2011 at 3:14 AM, delex r del...@indiatimes.com wrote: Naena Guru wrote: I have not read the entire thread of this conversation. It looks as if the debate has reached a level of acrimony. It has been available since the 1st week of September 2011. Since 8th September 2011, to be exact. You may go through it and the various responses and counter-responses. Thanks for your interest. “Acrimony”... probably not. I would rather call it “sincere criticism”. Good attitude was abandoned because Indians could not come to a consensus. As a result, the rest of the Indian languages, including Tamil and Singhala, were not even taken up for consideration. I am an Indian. Are you too? Singhalese. Unicode had a different idea. The driving principle of Unicode was the Plain Text idea. “Every letter of every language on earth would have its own proud code point uniquely its own. [Imagine a time capsule when we go down in flames Yes, I accept it was a good and proud idea of the present generation of mankind. Unicode will be there to tell about the great human civilization]!” Yes, perhaps this is the reason I am so eagerly writing: so that we (Unicode) record and tell the right and correct things to future generations. If, after an Armageddon, no computer is left in this world, there is a chance some written documents may survive so that they can start from scratch. “They should know that ‘Assamese’ language is not written in ‘Bengali’ script.” It is actually the other way round. That is, just replace “Assamese” with “Bengali” and “Bengali” with “Assamese” in the above quotation. No idea what you are talking about. Now, Unicode wanted the help of ISO to get to the authorities of those countries where government bureaucracies, not businesses, mattered. Our authorities actually recognized the uniqueness of the “Assamese” script, with unique and extra characters.
Check the ISCII (Indian standard codes for Information……) documentation. Not my subject. ..If you know how Third World governments work, you know what I mean. Lots of leg-pulling, among other things. Yes, yes. The problem is that the Unicode scheme has divided the world and categorized scripts as computer-friendly and barbarian. Well, I am trying to tell Unicode that “Assamese” is not a barbarian language without a script. Good luck, but see this: http://www.lovatasinhala.com/index.php That is complex and also simple Latin text. You do not need to struggle with Unicode if you simply transliterate and make a font. They, and the officials at home, all get indignant when you point out their errors. This method has the least dependency, and it stays with the most favored languages, though some might not like the lowly company.
Re: combining: half, double, triple et cetera ad infinitum
Thank you, Asmus. I made many here very angry. They have put in a lot toward bringing up the standard, and therefore it is understandable. Evidence in these things can never be proven; it only makes people madder. Besides, I worded it wrongly, giving the impression that that is the only motive. That is not so. Many volunteers sweated through. On the other hand, no company would send people to work at Unicode if it did not have an economic interest. I should not have even said anything here, as I know that there is an alternative approach that does not hurt Unicode and, hopefully, its fans. It is the combination of the most viable and stable portion of Unicode, which is Latin-1 or the SBCS, and the OpenType standard. A standards-compliant and elegant solution. See it here: www.lovatasinhala.com/ Do not use IE. If you use Firefox, sometimes you need to pick another page to see the complex script.

On Mon, Nov 14, 2011 at 2:29 PM, Asmus Freytag asm...@ix.netcom.com wrote: On 11/14/2011 7:30 AM, Naena Guru wrote: Unicode was created for a commercial reason, particularly for the benefit of its directors. This statement, not backed up by evidence, indicates a rather rudimentary understanding of the forces that were behind the creation of the universal character set. Coming as it does without details, it doesn't add much to the discussion. I can understand that some people are frustrated, when, over two decades after the basic design was hashed out, the implementation is still not seamless. The reason for that has to be sought in the inherent complexity of writing systems. Even apparently very simple writing systems can have surprising complexity when you try to support all areas of use and in high quality typography. Unicode is not just a character set, it provides a common framework for organizing and formalizing much knowledge about writing systems. As a result, there is now far more information and more accessible information about writing systems than when Unicode was started.
Had all this information been available 20 to 25 years ago, there's a fair chance that the design of a universal character set would have differed in some aspects from what we now know as Unicode. But even with the benefit of such hindsight, the sheer complexity of writing systems remains. This complexity means that there will most likely never be any implementation that supports all writing systems to their fullest (highest quality). Every practical implementation will subset somewhere, and that means there's no guarantee that any two implementations will faithfully interoperate. It might be that a different design would have made some implementations easier, but I strongly suspect that the limitations I described here are fundamental, so that the expectation would be that a different design would have merely made other tasks more difficult while making certain ones easier. A./
Re: combining: half, double, triple et cetera ad infinitum
For you to be able to read Unicode Sinhala, two things must happen. First, as for all text, the font has to be present. Systems sold outside Sri Lanka do not have a Sinhala Unicode font, which is understandable because the community is very small. Then there are cases where, even if you have the font, the characters do not combine. You all may have seen the Sinhala font, but how do I know if you could actually read it? Here is a screen shot someone sent me two days back asking me to fix it. I could not help the person, and that is tragic. https://docs.google.com/leaf?id=0B0OoSILYvguZZDMzYzE1MDUtMjk2OS00YWI3LTk3NGItMDAyYzY2MmYwNWEyhl=en_US The first characters you see in this screen shot are ශ්රී. These are two letters that should combine to form this letter: ශ්රී. The combination does not happen anywhere, and it is gibberish for the reader. As far as I can understand, he has Uniscribe installed through 'files for complex scripts'.

On Mon, Nov 14, 2011 at 6:00 PM, Tom Gewecke t...@bluesky.org wrote: On Nov 14, 2011, at 3:39 PM, Naena Guru wrote: I made many here very angry. Not into anger myself, but you certainly lost my respect when you insisted we could not read Unicode Sinhala text on our machines. Question remains.
Re: Re: Continue: Glaring mistake in the code list for South Asian Script//Reply to Kent Karlsson
I have not read the entire thread of this conversation. It looks as if the debate has reached a level of acrimony. We need to inspect the background of this entire controversy to come to a rational understanding.

In the nineties, there was a tug-of-war between ISO and Unicode on how to digitize scripts like Indic. The ISO-8859 committee wanted to share the code points from U+0080 to U+00FF, the Latin-1 Supplement, among the Western Europeans and anyone else who wanted to define their codes. ISO-8859-12 (Devanagari) was abandoned because Indians could not come to a consensus. As a result, the rest of the Indian languages, including Tamil and Singhala, were not even taken up for consideration. Unicode had a different idea. The driving principle of Unicode was the Plain Text idea. “Every letter of every language on earth would have its own proud codepoint uniquely its own. [Imagine a time capsule when we go down in flames. Unicode will be there to tell about the great human civilization]!” During a meeting with the Europeans held in America, Unicode's idea won. (A juicier version of this story was on the Unicode web site but has since been removed.)

Now, Unicode wanted the help of ISO to get to the authorities of those countries where government bureaucracies, not businesses, mattered. What motivations did Americans, Europeans and South Asians have in this entire endeavor? Americans wanted to expand business to the outside world (the good word is Globalizing). The Europeans had their own domineering commercial control over the former colonies, yet with a sentimental affection. In South Asia, where the literacy rate was very low (except in Sri Lanka), only a minority of the English-educated elite had any exposure to computers. The Indic region was still struggling with the typewriter. There was no public awareness of this thing happening in America called Unicode.
If at all, it was some piece of news that some esoteric thing was going on, where Americans with their great scientific prowess were going to give them something grand -- a white elephant, perhaps. At least in Sri Lanka, the motivation was the World Bank's offer of loans. They got a European who cannot read Sinhala to sign for the standard. Other than the promise of money and jobs, they had no idea where they were going. If you know how Third World governments work, you know what I mean.

With twelve years of making the Arial font etc., we arrived at the conclusion that one default font with all the letters of the world is not practical, and kind of ridiculous. We now have SBCS, DBCS etc. to spread several scripts across, and, at least in Sri Lanka, bureaucrats as surrogates to give excuses and reassurances to the public. ("Computers have gone 16-bit"; "The only standard is Unicode; the others are hack jobs".) They are contemplating making laws to force Unicode Sinhala on people! All mature software is written for the SBCS. There is no incentive for those whose programs work so well to hazard re-writing them to accommodate the DBCS. Compiler makers are encouraging programmers to globalize and write their programs for Unicode, but who is going to pay to overhaul 30- to 40-year-old programs that have stabilized and reached near perfection? Besides, software piracy was the order of the day outside America.

Presently, the Indian and Lankan general public has arrived at a point where they are able to use their languages on the computer. People want to communicate using their computers in their native languages. Unicode Indic is very hard to use. It is nothing like English. It requires so many new things, like word-processing programs, physical keyboards and such. Unicode came way before they were ready, and they have now become victims. It is long past the Korean debacle. Now Unicode is set in concrete.
The problem is that the Unicode scheme has divided the world and categorized scripts as computer-friendly and barbarian. I do not know about CJKV, but Indic would have been much better off had they made their standards within the SBCS. I tested this for Sinhala and it is a great success. See it at the following link (please do not use Internet Explorer, because it does not support full-font downloading or OpenType font rendering): http://www.lovatasinhala.com/

Sinhala is one of, if not *the*, most complex of the Indic scripts and has two major orthographies, Mixed Sinhala and Pali. I studied the division of the vyaakarana (grammar) of Sinhala / Sanskrit writing and made a comprehensive transliteration onto ISO-8859-1 with no loss. And then I made a smartfont to dress the Latin encodings in the native script. People use this system not knowing that the underlying code is 'English'. It is a refinement of Anglicizing. You type the way you speak, and the orthographic font shows it magically in its full complexity. It has been in existence since 2005, despite the government bureaucrats disparaging it. Sinhala includes Sanskrit at the core of its phoneme chart, or hodiya. It also covers Pali. You use this system just like
How is NBH (U0083) Implemented?
The Unicode character NBH (No Break Here, U+0083) is understood as a hidden character that is used to keep two adjoining visible characters from being separated in operations such as word wrapping. It seems to be similar to ZWNJ (Zero Width Non-Joiner, U+200C) in that it can prevent the automatic formation of a ligature as programmed in a font. However, it seems to me that an NBH evokes a question mark (?). Is this an oversight by implementers, or am I making wrong assumptions? There is also the NBSP (No-Break Space, U+00A0), which I think has to be mapped to the space glyph in fonts, and which glues two letters together with a space. If you do not want a space between two letters and also want to prevent glyph substitutions from happening, then NBH seems to be the correct character to use. NBH is more appropriate for use within ISO-8859-1 characters than ZWNJ, because the latter is double-byte. Programs that handle the SBCS well ought to be afforded the use of NBH, as it is an SBCS character. Or am I completely mistaken here? Thank you.
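A minimal sketch of the word-wrapping behavior described above, under my reading of NBH: in this toy model, breaks are allowed only after ordinary spaces or hyphens, a U+0083 immediately after such a character suppresses the break, and U+00A0 never offers a break opportunity at all. The break rules here are deliberately naive; real behavior is specified by Unicode's line-breaking algorithm (UAX #14), which treats these characters in considerably more detail.

```python
# Toy line-break model honoring NBH (U+0083) and NBSP (U+00A0).
# Break opportunities exist after a space or hyphen, unless an NBH
# immediately follows it; NBSP is simply not a break character.

NBH, NBSP = "\u0083", "\u00A0"

def break_opportunities(text):
    """Indices at which a line break is permitted in this toy model."""
    ops = []
    for i, ch in enumerate(text):
        if ch in " -":                       # NBSP deliberately excluded
            if i + 1 < len(text) and text[i + 1] == NBH:
                continue                     # NBH forbids a break here
            ops.append(i + 1)
    return ops

print(break_opportunities("ab-cd"))          # [3]  -> may break after '-'
print(break_opportunities("ab-\u0083cd"))    # []   -> NBH glues the break shut
print(break_opportunities("a\u00A0b"))       # []   -> NBSP never breaks
```

Note that U+0083 is a C1 control character; the question-mark glyph the message describes is the typical fallback for an unmapped control code in a font, rather than a defined rendering.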