Re: Romanized Singhala - Think about it again
On 17/07/12 02:43, Naena Guru wrote: Jean, sorry I am late. I used spare time as and when I got it. On Sun, Jul 8, 2012 at 10:20 PM, Jean-François Colson j...@colson.eu wrote: On 09/07/12 01:29, Naena Guru wrote: Jean-François, Let me approximate it in Romanized Singhala: 'jyó-frósvaa'. Just trying... I don’t know how that transcription should be pronounced, but in IPA, Jean-François is /ʒɑ̃.fʁɑ̃.swa/. They came as rectangles (in XP). That’s not surprising. Windows XP is an old, out-of-date system with, by default, a very limited set of fonts. But nobody prevents you from downloading some additional free fonts. They showed correctly in your message inside Firefox running in Puppy Linux, but where I am replying, it shows a reversed-euro-like character. That is surprising. Which font includes a reversed euro sign? in place of the a-umlaut. I didn’t use any umlauts but two tildes. This again illustrates how hazardous it is for characters outside Latin-1. It illustrates how hazardous it is to use such an old OS as Windows XP. I can only approximate the first letter as English j+y. English j + y? I don’t know that, neither in English nor in French. /ʒ/ is the French j. It is not the English j plus something but rather the English j minus something. The English j is /dʒ/. That’s an affricate, i.e. roughly a sound which begins as a plosive and ends as a fricative. The French j is a fricative. Its nearest approximation in English is the z in azure or the s in leisure. /ɑ̃/ is not an /ɑ/ with umlaut but an /ɑ/ with tilde. It is pronounced like the a in the English word “car” but with a nasal quality, i.e. some air passes through the nose. Jean is a homophone of gens, which you can hear here: http://fr.wiktionary.org/wiki/gens#Prononciation The speaker has recorded “des gens”, so focus your attention on the second syllable. /f/ is pronounced as in English. 
/ʁ/ is the French r, but there are several varieties of r among the French dialects, so using the English r instead is not a big problem. /s/ and /w/ are pronounced as in English. /a/ is very similar to /ɑ/. It is the beginning of the diphthong in the English word “sky”. which is the same in Singhala. The rest is pretty close, I think. Thank you for your interest. See inline responses. On Thu, Jul 5, 2012 at 7:35 AM, Jean-François Colson j...@colson.eu wrote: On 05/07/12 10:02, Naena Guru wrote: On Wed, Jul 4, 2012 at 11:33 PM, Philippe Verdy verd...@wanadoo.fr wrote: Anyway, consider the solutions already proposed in Sinhalese Wikipedia. There are various solutions proposed, including several input methods supported there. But the purpose of these solutions is always to generate Sinhalese texts perfectly encoded with Unicode and nothing else. Thank you for the kind suggestion. The problem is Unicode Sinhala does not perfectly support Singhala! What’s wrong? Are there missing letters? Many, many. The solution is for Sinhala, not for Unicode! Or rather for Sinhala by Unicode. Sure, if you want to do it with proper deliberation. I am not saying Unicode has a bad intention but an ill-conceived product. What precisely is ill-conceived? Anglo-centric thinking is what is wrong. ? Letters have no direct relation to speech -- very few. In Singhala (perhaps as in French too, as someone said?) you write what you say. In Singhala, the exception is a clearly understood set of rules about how to pronounce the short 'a' -- whether muted or not. Therefore, the approach should have been to encode the vowels, diphthongs and consonants as base letters. I assigned the acute accent to the 'ng' sound and the umlaut to the guttural H in Sanskrit, but they could be assigned independent codepoints. Let me take you on the scenic route: The number of letters in Singhala is only theoretical. In the case of Singhala orthography, the actually used number depends on the Sanskrit vocabulary. 
Do you mean there are many conjunct consonants, sometimes with a separate glyph? Yes, many. There are three orthographies. Singhala does not have CCs at all. Sanskrit has a lot. Pali has touching letters in addition to what Sanskrit has. Modern Singhala is a mix of Singhala and Sanskrit. With Unicode Sinhala, you need to know which ones join, provide the ZWJ, and hope and pray that the font has the CC. Often they are absent. So I guess your problem could be solved by providing new fonts with a better support of conjunct consonants. What you did for 8-bit Sinhala, you could do for Unicode Sinhala too. Then SLS 1134 gives wrong advice too. Could you explain in detail what is wrong in their advice? In Devanagari, they’re made by typing two or more consonants separated by halants. Isn’t that possible with Sinhala?
Re: Romanized Singhala - Think about it again
On 17/07/12 02:43, Naena Guru wrote: Just see the daily questions and the dedicated section for Indic at Unicode.org, and think why ordinary people Anglicize instead of using Unicode Sinhala (e.g. elakiri.com). Some also use the Sinhalese script. I’ve sometimes seen people type Arabic with Latin letters in a French library, because the computers they used only had French keyboards and they didn’t know an Arabic keyboard well enough to touch type in Arabic with Arabic letters. That's right. Everyone is familiar with the good old QWERTY keyboard. The Singhalese have developed their own Anglicizing convention. The Tamils do it too, but their Anglicizing is different from the one the Singhalese use. They are a little more respectful of their language and try to Anglicize more precisely. I used the Singhala typewriter in the late 60s. The gayanna was where you get the period on QWERTY. It is entirely different from the layout of the English one, with dead keys for parts of letters. This is what Unicode Sinhala inherited. It would be many times easier if Singhala followed the English layout closely. I made one for Unicode. The best I could get still needed three-finger keys. Besides, even after you enter ZWJ you may not get the desired conjoints, because the font does not have them. Typing and encoding are two different matters. If present Sinhalese fonts don’t do the job, you can improve them. You can develop a hundred keyboard layouts and input methods to type the same text in a hundred different ways. Aren’t there any keyboards with the Sinhalese letters drawn on the keytops? If there aren’t, and you think the present Sinhalese keyboard layout doesn’t fit the QWERTY layout well enough, feel free to design a new layout and distribute drivers for the main operating systems. It's a colossal failure! Really? Of course, I don't have to repeat. You have read what I said. I have. People Anglicize rather than using Unicode Sinhala. What do you mean? 
If they transliterate, that’s not really Anglicization. You get a glimpse of the light. Anglicizing is trying to use English writing conventions to write Singhala. Anglicizing is not a complete mapping; transliteration is. Singhala has 58 phonemes, including 10 digraphs used for Sanskrit and Pali (aspirates). The English alphabet is not enough even for English. It has discarded þorn, eð, æsc etc., so it has digraphs. Then the capitalizing convention makes its set of letters even fewer. To be fair, the Lankan technocrats did not have a clue when they were asked to approve the standard. I know that problem. The same occurred for French with Latin-1. That’s why some French letters are missing in Latin-1. Tell me about it. Latin-1 (ISO-8859-1) lacks the French letter Œ/œ and the capital Ÿ. Œ is used in a number of common words such as cœur (heart), œil (eye), œsophage (oesophagus), Œdipe (Oedipus), œuf (egg), etc. Ÿ is used in a few toponyms such as L’Haÿ-les-Roses, a commune near Paris, which can be capitalized as L’HAŸ-LES-ROSES. It also lacks the apostrophe ’. Those characters were added in Latin-9 (ISO-8859-15): Œ = 0xBC, œ = 0xBD, Ÿ = 0xBE; and in CP1252: Œ = 0x8C, œ = 0x9C, Ÿ = 0x9F, ’ = 0x92. Of course, AFAICT, they were part of the first release of Unicode. It is first come, first served. It is. Isn't language, and therefore the writing, a (if not the) major part of a culture? You’re right. It was a time when there was (perhaps even now) a typist in the corner of the office of the bureaucrat. The big guys do not know touch-typing even now. Proof: a university professor wrote me a harangue using cyber-sex orthography (no capitals) accusing me of working for Americans. I had suggested that Unicode is a conspiracy to confuse us. (That is a bit over the top -- no such motive -- nevertheless the effect is the same.) Romanized Singhala uses the same. So, what's the fuss about? The font? The fact that your encoding won’t be supported on many computers worldwide. 
Jean, for the umpteenth time, I am not encoding anything. It is a transliteration. It is using a different script (Latin) than what you use traditionally (Singhala): not සිංහල අකුරු, but 'síhala akuru'. OK. Do you display Sinhala with Latin letters? If you do, that’s not a problem. If you display it with Sinhalese letters, you’ll need to change the font whenever you want to write in another script. Just imagine a Sinhala/English dictionary. How many font changes would you make for such a book? That’s a big step backwards. අ - a, එ - e, ක් - k, අං - á, ඤ් - ç, ශ් - z, etc. About the font that unnerves you: think of 3D cinema. If you wear the 3D glasses, you see clearly. The font is for the user's benefit. The web masters can give the option I gave on my site to keep happy those who dislike (warning: I must select mild adjectives to honor sensitivities of some)
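The sample pairs quoted above suggest a simple lookup-table transliteration. Here is a minimal Python sketch using only the mappings shown in the message; the full romanization scheme is not given in the thread, so the table and the `transliterate` helper are illustrative assumptions, not the author's actual implementation:

```python
# Sketch of a Sinhala-to-Latin transliteration lookup, built only from the
# sample pairs quoted above. Two-codepoint clusters (consonant + al-lakuna,
# vowel + anusvara) must be matched before single codepoints.
SAMPLE_MAP = {
    "\u0D85": "a",             # අ  ayanna
    "\u0D91": "e",             # එ  eyanna
    "\u0D9A\u0DCA": "k",       # ක් kayanna + al-lakuna (virama)
    "\u0D85\u0D82": "\u00E1",  # අං ayanna + anusvara -> á
}

def transliterate(text: str) -> str:
    """Greedy longest-match transliteration over the sample table."""
    out = []
    i = 0
    while i < len(text):
        for length in (2, 1):          # try clusters first, then singles
            chunk = text[i:i + length]
            if chunk in SAMPLE_MAP:
                out.append(SAMPLE_MAP[chunk])
                i += length
                break
        else:
            out.append(text[i])        # pass through anything unmapped
            i += 1
    return "".join(out)
```

Because the table maps ක් (with virama) to a bare 'k', the greedy two-codepoint pass is what keeps the virama from leaking through unmapped.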
Re: Romanized Singhala - Think about it again
Let's stop this nightmare. The solution that uses a font hack overriding the semantics of Latin letters will never work as it should. The separation of code points is necessary, even if only to show a URL containing Sinhalese letters in the domain-name part (and without altering the semantics of the dot, slash and colon separators). It would be unacceptable to have the http:// prefix isolated with a separate font from the rest of the URL just to be read correctly. Unacceptable also because it would alter the internals of international standards that are widely used. Unacceptable because Sinhalese domain names would remain separated from the proposed romanizations. That user really has a complete misunderstanding of the standard and severely lacks basic knowledge of the concepts. He should first read the definitions to see that what is in the standard is definitely not what he supposes from just looking at a simple basic chart (which is mostly informative and has very little use for technical implementations). Reading the standard up to Chapter 3 (conformance requirements) is absolutely necessary for him. He won't make any progress in understanding his own problems before reading it, while constantly criticizing what he has never read for not understanding it... He should also read the introduction of the OpenType specifications, which also use their own definitions (something he is mixing up as well). He must absolutely first understand the character model and the separation between what is Unicode, what is an abstract character, a glyph, an encoding form, and the binary serialization of an encoding into a stream of bytes, plus other concepts used by common protocols and languages, such as transport syntaxes and alternate representations using things like character entities (in SGML, XML, HTML), numerical escapes (e.g. in C/C++, PHP, Java, Ruby...), or string expressions using built-in/standard functions in Basic...
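On the domain-name point above: Sinhalese letters reach DNS through Punycode (RFC 3492) as part of IDNA, independently of any font, because the codepoints themselves are what get encoded. A small sketch using Python's built-in punycode codec; note this is only the bootstring step, and a full IDNA pipeline would also apply nameprep and prepend the "xn--" ACE prefix, which is omitted here:

```python
# Sinhalese labels in domain names are carried by Punycode (RFC 3492),
# independent of any font hack: the codepoints themselves are encoded
# into an ASCII-compatible form and back, losslessly.
label = "සිංහල"  # "Sinhala" written in the Sinhala script

ace = label.encode("punycode")      # ASCII-compatible bytes
roundtrip = ace.decode("punycode")  # back to the original codepoints

assert roundtrip == label
# A real IDNA label would carry the ACE prefix, e.g. "xn--" + the bytes above.
```

The round trip shows the semantics live in the code points; rendering the decoded label is then the font's job, which is Verdy's separation of encoding from presentation.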
Re: Charset declaration in HTML (was: Romanized Singhala - Think about it again)
On Tue, Jul 10, 2012 at 11:58 PM, Leif Halvard Silli xn--mlform-...@xn--mlform-iua.no wrote: Naena Guru, Tue, 10 Jul 2012 01:40:19 -0500: HTML5 assumes UTF-8 as the character set if you do not declare one explicitly. My current pages are in HTML 4. There is in principle no difference between what HTML5 parsers assume and what HTML4 parsers assume: all of them default to the default encoding for the locale. I see. That is, for the transliteration, the locale should be Sinhala (Latin). Yes. I know that it is not official. I loathe the spelling Sinhala. Oh, well, you cannot have it all. Notepad forced me to save the file in UTF-8 format. I ran it through the W3C Validator. It passed the HTML5 test with the following warning: Byte-Order Mark found in UTF-8 File. I assume that you used the validator at http://validator.w3.org. Yes, and it validated it. I was talking about the BOM in a different context. It showed up when I opened in HTML-Kit a file that was first created in Notepad and saved as UTF-8. HTML-Kit Tools asked me to specify the character set. It took it, but messed up the macron and dot letters anyway. What I was trying to emphasize was the fact that it is hard for those people who try to make web pages in those 'character sets'. I have been making web pages since the 1990s and never had these problems, because they were written by hand in English. But if you instead use the most updated HTML5-compatible validators at http://www.validator.nu or http://validator.w3.org/nu/ then you will not get any warning just because your file uses the Byte-Order Mark. HTML5 explicitly allows you to use the BOM. Thanks. This too validated all seven pages as HTML5 (I upgraded from HTML 4). The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to cause problems for some text editors and older browsers. You may want to consider avoiding its use until it is better supported. Weasel words from the validator. The notion about older browsers is not very relevant. 
How old are they? IE6 has no problems with the BOM, for instance. And that is probably one of the few somewhat relevant old browsers. As I said before, the BOM was no problem for me. As for editors: if your own editor has no problems with the BOM, then what? But I think Notepad can also save as UTF-8 but without the BOM -- it should be possible to get an option for choosing when you save it. Else you can use the free Notepad++. And many others. In VIM, you set or unset the BOM via the commands set bomb / set nobomb. Yes, yes. I've seen it before. I have Notepad++. It intimidated me the first time and I never used it, haha! -- Leif H Silli
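The BOM behaviour argued over in this exchange can be demonstrated directly. A short Python sketch: the 'utf-8-sig' codec writes a leading BOM the way Notepad does and strips it again on reading, while decoding the same bytes as plain 'utf-8' leaves the BOM in the text as U+FEFF:

```python
# "utf-8-sig" emits the EF BB BF signature (as Notepad does) and strips it
# on decode; plain "utf-8" decoding keeps it as a leading U+FEFF character.
import codecs

data = "síhala".encode("utf-8-sig")       # BOM + UTF-8 bytes
assert data.startswith(codecs.BOM_UTF8)   # the three bytes EF BB BF

assert data.decode("utf-8-sig") == "síhala"    # BOM stripped on reading
assert data.decode("utf-8") == "\ufeffsíhala"  # BOM survives as U+FEFF
```

This is why a BOM-intolerant consumer sees a stray U+FEFF at the start of a Notepad-authored file: it decoded with plain UTF-8 instead of stripping the signature.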
Re: Ewellic again (was: Re: Romanized Singhala - Think about it again)
My error. Sorry, Doug. On Sun, Jul 8, 2012 at 8:00 PM, Doug Ewell d...@ewellic.org wrote: Unicode character database goes from zero to some very big number. There are no holes in it to define character sets for somebody's fancy. Well, Doug Ewell did one for Esperanto expanding fuþorc. Ewellic is not futhorc. They are different scripts. From the Omniglot page on Ewellic (with *emphasis* added): The shape of Ewellic letters was *inspired by* the Runic and Cirth scripts, but shows greater (though still imperfect) regularity of form. -- Doug Ewell | Thornton, Colorado, USA http://www.ewellic.org | @DougEwell
Re: Charset declaration in HTML (was: Romanized Singhala - Think about it again)
Hey, Philippe, Your input is much appreciated. So, in a nutshell, I don't have to worry. One of these days I need to crunch down (minify) the CSS and JavaScript pages. I left them readily readable so that techs like you could easily read them in place in any browser without having to pretty print. The pages are not big by any standard and they download pretty fast. Your earlier point about WOFF is what I am going to try and tackle today (Sunday). In the meanwhile, thanks again. On Tue, Jul 10, 2012 at 11:32 PM, Philippe Verdy verd...@wanadoo.fr wrote: 2012/7/10 Naena Guru naenag...@gmail.com I wanted to see how hard it is to edit a page in Notepad. So I made a copy of my LIYANNA page and replaced the character entities I used for Unicode Sinhala, accented Pali and Sanskrit with their raw letters. Notepad forced me to save the file in UTF-8 format. I ran it through the W3C Validator. It passed the HTML5 test with the following warning: Byte-Order Mark found in UTF-8 File. The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to cause problems for some text editors and older browsers. You may want to consider avoiding its use until it is better supported. The BOM is the first character of the file. There are myriad hoops that non-Latin users go through to do things that we routinely do. This problem I saw right at the inception. I already know why romanizing is so good. Don't you? You should probably ignore this non-critical warning now; it is only for extremely strict compatibility with deprecated software that should have been updated long ago for obvious security and performance reasons. Those old browsers are deprecating fast, due to the massive and fast spread of security attacks and automatic security updates that close issues completely (instead of just preventive virus detection based on code behavior or code patterns, which will never be complete and fast enough to react to these extremely frequent attacks). 
Older editors do not have the comfort that newer editors have. The memory usage of these newer editors is no longer a problem (notably for web developers, who have systems largely above what their average users have), and systems capable of running them have never been so cheap. In addition, memory and storage costs have dramatically decreased. We are more concerned about bandwidth usage, so your web editing platform should include an optimisation process and converters that will automatically use a compact representation (numeric character references, for example, can be sent by your server as raw UTF-8; in addition, the server can now support on-the-fly data compression over the HTTP sessions; there also exist frontend proxies that will do that for you without requiring you to change the development/editing methods you use). Most text editors, even in Linux, can now successfully open UTF-8 files starting with a BOM without complaining, just as Notepad has done for a long time. And they allow you to change this edit mode before saving. Most text processors will silently discard the U+FEFF character (it should be safe to do that everywhere, given that U+FEFF should no longer be used for anything else than BOMs). [side note] But Notepad has long had another problem: it cannot successfully open a text file whose lines are terminated by LF only; it absolutely wants them to be converted to CR+LF sequences. This problem is much more severe than the use of a leading BOM. As well, Excel cannot successfully decode a UTF-8 encoded CSV file, but it can automatically recognize it if you instead use the data-import function. 
This is inconsistent (also, it still does not allow specifying how to convert numbers using dots instead of commas when running on a non-English user locale -- you need to manually use a search/replace function; it does not allow selecting the date format for CSV file imports, so search/replace operations on date fields are not trivial; no question is asked to the user, it only uses implicit defaults even when they are wrong, most of the time for actual cases of CSV files). [/side note] But it has nothing to do with your problem of romanization or behavior with Latin. BOMs are only absent from old 8-bit character sets that are no longer recommended in any modern Internet protocols, and from 7-bit ASCII, used only for internal technical data but not for any text intended to be read and translated. Only UTF-8 support is mandatory now. And that's fine. HTTP headers or URLs require a specific encoding, but web servers and design tools can take care of that. Everything else is optional and will require explicit metadata (the exceptions being UTF-16 and UTF-32, which are not well suited for interchanges across heterogeneous networks and independent realms, but are used mostly for internal processes, for which you absolutely don't need any byte order change, and so for which you don't even need any BOM).
Re: Charset declaration in HTML (was: Romanized Singhala - Think about it again)
Philippe Verdy, Wed, 11 Jul 2012 07:36:56 +0200: 2012/7/11 Leif Halvard Silli: In VIM, you set or unset the BOM via the commands set bomb / set nobomb. Should these commands specify whether your computer will explode when saving the file? :'o Probably signals the weird fear that some have for 'da BOM'. set bom / set nobom. Sorry, could not resist. Those commands, without the -d, are unknown in VIM. It would have been too simple without the -d. ;-) -- Leif H Silli
Re: Charset declaration in HTML (was: Romanized Singhala - Think about it again)
Leif Halvard Silli wrote: As for editors: If your own editor have no problems with the BOM, then what? But I think Notepad can also save as UTF-8 but without the BOM - there should be possible to get an option for choosing when you save it. Perhaps there should be such an option in Notepad, but there isn't. The decision to have Notepad always write the signature to UTF-8 files, and always rely on it to read them, has been documented to death. The bottom line is, there are zillions of editors available for Windows, many of them free, and people who want to create or modify UTF-8 files which will be consumed by a process that is intolerant of the signature should not use Notepad. That goes for HTML (pre-5) pages, Unix shell scripts, and others. -- Doug Ewell | Thornton, Colorado, USA http://www.ewellic.org | @DougEwell
Re: Charset declaration in HTML (was: Romanized Singhala - Think about it again)
Thank you, Otto. Sorry for the delay in replying. I spent the entire Sunday replying to the Jaques twins. You are absolutely right about the choice between ISO-8859-1 and UTF-8. I shouldn't have said 'using ISO-8859-1 is advantageous over UTF-8'. It is efficient if your pages are written in a language that uses single-byte codepoints. When you mix multi-byte codepoints, like you said, the ideal is to have them in their raw form. But in practice, this is not as easy as we think. Actually, the trade-off is not great for me because I use only a few non-SBCS characters. Each 2-byte character would end up as six bytes in a hex character entity. If you want to control the look of your web site, then you probably have to have expensive software to do it. As for poor me, I use CSS, JavaScript and HTML inside HTML-Kit. HTML5 assumes UTF-8 as the character set if you do not declare one explicitly. My current pages are in HTML 4. As I said, I use HTML-Kit (and Tools). If I have raw Unicode Sinhala in the HTML or JavaScript, it messes them up and gives you character-not-found for them on the web page. I must have character entities if I need the comfort of HTML-Kit. There are web sites that help you process your SBCS and multi-byte mixed text to make character entities for non-Latin-1 characters. I used them when making my only page that has them (Liyanna). Stop and think why there are such websites. (Search "text to unicode".) The world outside Latin-1 is a harsh one. If I want to have raw Unicode Sinhala, PTS Pali or IAST Sanskrit, I have to use Notepad instead of HTML-Kit. It is hard to code without color-coded text. I wanted to see how hard it is to edit a page in Notepad. So I made a copy of my LIYANNA page and replaced the character entities I used for Unicode Sinhala, accented Pali and Sanskrit with their raw letters. Notepad forced me to save the file in UTF-8 format. I ran it through the W3C Validator. 
It passed the HTML5 test with the following warning: Byte-Order Mark found in UTF-8 File. The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to cause problems for some text editors and older browsers. You may want to consider avoiding its use until it is better supported. The BOM is the first character of the file. There are myriad hoops that non-Latin users go through to do things that we routinely do. This problem I saw right at the inception. I already know why romanizing is so good. Don't you? UTF-8 encoding is this RFC: http://www.ietf.org/rfc/rfc2279.txt This is the table it gives on the way UTF-8 encoding works:

0000 0000 - 0000 007F   0xxxxxxx                                     (ASCII)
0000 0080 - 0000 07FF   110xxxxx 10xxxxxx                            (Latin-1 plus higher)
0000 0800 - 0000 FFFF   1110xxxx 10xxxxxx 10xxxxxx                   (includes Unicode Sinhala)
0001 0000 - 001F FFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0020 0000 - 03FF FFFF   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0400 0000 - 7FFF FFFF   1111110x 10xxxxxx ... 10xxxxxx

Observe that Latin 'a' stays a single byte under UTF-8 (the ASCII range is encoded unchanged), while Unicode Sinhala ayanna goes from two bytes in UCS-2 to three. Unicode Sinhala: 0D80 - 0DFF.

a = Hex 61 = Bin 0110 0001 -- falls in the ASCII range, so its UTF-8 encoding is the same single byte, Hex 61.
ayanna = Hex 0D85 = Bin 0000 1101 1000 0101 -- UTF-8 template: 1110xxxx 10xxxxxx 10xxxxxx -- UTF-8 encoding: 1110 0000 1011 0110 1000 0101 = Hex E0 B6 85.

Thanks for your input. It is appreciated. On Wed, Jul 4, 2012 at 2:25 PM, Otto Stolz otto.st...@uni-konstanz.de wrote: Hello Naena Guru, on 2012-07-04, you wrote: The purpose of declaring the character set as iso-8859-1 rather than utf-8 is to avoid doubling and trebling the size of the page by utf-8. I think, if you have characters outside iso-8859-1 and declare the page as such, you get Character-not-found for those locations. (I may be wrong.) You are wrong, indeed. If you declare your page as ISO-8859-1, every octet (aka byte) in your page will be understood as a Latin-1 character; hence you cannot have any other character in your page. So, your notion of “characters outside iso-8859-1” is completely meaningless. 
If you declare your page as UTF-8, you can have any Unicode character (even PUA characters) in your page. Regardless of the charset declaration of your page, you can include both Numeric Character References and Character Entity References in your HTML source, cf., e.g., http://www.w3.org/TR/html401/charset.html#h-5.3 . These may refer to any Unicode character whatsoever. However, they will take considerably more storage space (and transmission bandwidth) than the UTF-8 encoded characters would take. Good luck, Otto Stolz
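The byte counts discussed above are easy to verify. A small Python check: ASCII 'a' stays one byte in UTF-8, the Sinhala letter ayanna (U+0D85) takes three, and a numeric character reference for the same letter costs more than the raw UTF-8 bytes, which is Otto's point about NCRs and bandwidth:

```python
# ASCII stays one byte in UTF-8, while a letter in the U+0800..U+FFFF
# range (such as Sinhala ayanna) takes three bytes.
a = "a"            # U+0061
ayanna = "\u0D85"  # Sinhala letter ayanna

assert a.encode("utf-8") == b"\x61"               # single byte, same as ASCII
assert ayanna.encode("utf-8") == b"\xe0\xb6\x85"  # three bytes: E0 B6 85

# A numeric character reference for the same letter costs more than the
# three raw UTF-8 bytes:
ncr = "&#x{:04X};".format(ord(ayanna))
assert ncr == "&#x0D85;"   # eight ASCII bytes vs. three raw UTF-8 bytes
```

So raw UTF-8 text, not entities, is the compact representation for non-Latin pages, matching Otto's remark about storage and transmission cost.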
Re: Romanized Singhala - Think about it again
On Mon, 09 Jul 2012 05:20:45 +0200 Jean-François Colson j...@colson.eu wrote: On 09/07/12 01:29, Naena Guru wrote: The number of letters in Singhala is only theoretical. In the case of Singhala orthography, the actually used number depends on the Sanskrit vocabulary. Do you mean there are many conjunct consonants, sometimes with a separate glyph? In Devanagari, they’re made by typing two or more consonants separated by halants. Isn’t that possible with Sinhala? No, SLS 1134 (2004) keeps it simple by making these viramas visible, i.e. real halants, making the associated consonants the last in the akshara. For the ordinary conjuncts, including raphe, it prescribes VIRAMA, ZWJ. ZWJ, VIRAMA is used to make consonants touch. SLS 1134 spares users some of the complexity by requiring the commonest subscript and superscript consonants to be on the keyboard. (This may well be useless for X, unless X has had its keyboard mapping extended to allow the combinations as single keystrokes.) Richard.
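The SLS 1134 sequences Richard describes can be written out as plain codepoint strings. A sketch following the orderings as stated in the message (VIRAMA then ZWJ for ordinary conjuncts, ZWJ then VIRAMA for touching consonants); ka and vayanna are chosen here only as example consonants, and whether either sequence actually ligates on screen depends on the font, which is the complaint elsewhere in this thread:

```python
# SLS 1134 sequences, per the message above.
# U+0D9A ka, U+0DC0 vayanna, U+0DCA al-lakuna (the Sinhala virama), U+200D ZWJ.
KA, VA = "\u0D9A", "\u0DC0"
VIRAMA, ZWJ = "\u0DCA", "\u200D"

# Ordinary conjunct: consonant + VIRAMA + ZWJ + consonant
conjunct_kva = KA + VIRAMA + ZWJ + VA

# Touching form: consonant + ZWJ + VIRAMA + consonant
touching_kva = KA + ZWJ + VIRAMA + VA

# Visible halant (no joining requested): consonant + VIRAMA + consonant
plain_kva = KA + VIRAMA + VA
```

The three strings are distinct at the codepoint level, so the choice of rendering (conjunct glyph, touching letters, or a visible al-lakuna) is carried in the text itself rather than left to the font alone.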
Re: Charset declaration in HTML (was: Romanized Singhala - Think about it again)
2012/7/10 Naena Guru naenag...@gmail.com I wanted to see how hard it is to edit a page in Notepad. So I made a copy of my LIYANNA page and replaced the character entities I used for Unicode Sinhala, accented Pali and Sanskrit with their raw letters. Notepad forced me to save the file in UTF-8 format. I ran it through the W3C Validator. It passed the HTML5 test with the following warning: Byte-Order Mark found in UTF-8 File. The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to cause problems for some text editors and older browsers. You may want to consider avoiding its use until it is better supported. The BOM is the first character of the file. There are myriad hoops that non-Latin users go through to do things that we routinely do. This problem I saw right at the inception. I already know why romanizing is so good. Don't you? You should probably ignore this non-critical warning now; it is only for extremely strict compatibility with deprecated software that should have been updated long ago for obvious security and performance reasons. Those old browsers are deprecating fast, due to the massive and fast spread of security attacks and automatic security updates that close issues completely (instead of just preventive virus detection based on code behavior or code patterns, which will never be complete and fast enough to react to these extremely frequent attacks). Older editors do not have the comfort that newer editors have. The memory usage of these newer editors is no longer a problem (notably for web developers, who have systems largely above what their average users have), and systems capable of running them have never been so cheap. In addition, memory and storage costs have dramatically decreased. 
We are more concerned about bandwidth usage, so your web editing platform should include an optimisation process and converters that will automatically use a compact representation (numeric character references, for example, can be sent by your server as raw UTF-8; in addition, the server can now support on-the-fly data compression over the HTTP sessions; there also exist frontend proxies that will do that for you without requiring you to change the development/editing methods you use). Most text editors, even in Linux, can now successfully open UTF-8 files starting with a BOM without complaining, just as Notepad has done for a long time. And they allow you to change this edit mode before saving. Most text processors will silently discard the U+FEFF character (it should be safe to do that everywhere, given that U+FEFF should no longer be used for anything else than BOMs). [side note] But Notepad has long had another problem: it cannot successfully open a text file whose lines are terminated by LF only; it absolutely wants them to be converted to CR+LF sequences. This problem is much more severe than the use of a leading BOM. As well, Excel cannot successfully decode a UTF-8 encoded CSV file, but it can automatically recognize it if you instead use the data-import function. This is inconsistent (also, it still does not allow specifying how to convert numbers using dots instead of commas when running on a non-English user locale -- you need to manually use a search/replace function; it does not allow selecting the date format for CSV file imports, so search/replace operations on date fields are not trivial; no question is asked to the user, it only uses implicit defaults even when they are wrong, most of the time for actual cases of CSV files). [/side note] But it has nothing to do with your problem of romanization or behavior with Latin. 
BOMs are only absent from old 8-bit character sets that are no longer recommended in any modern Internet protocols, and from 7-bit ASCII, used only for internal technical data but not for any text intended to be read and translated. Only UTF-8 support is mandatory now. And that's fine. HTTP headers or URLs require a specific encoding, but web servers and design tools can take care of that. Everything else is optional and will require explicit metadata (the exceptions being UTF-16 and UTF-32, which are not well suited for interchanges across heterogeneous networks and independent realms, but are used mostly for internal processes, for which you absolutely don't need any byte order change, and so for which you don't even need any BOM: if there is one, you can safely discard it from the input strings, adjusting the length and offset positions in the source if that source is randomly seekable; you don't need to adjust these lengths and/or positions if the source is a serial input stream which is not seekable in the backward direction, or randomly seekable in the forward direction in a fast, direct manner without reading all intermediate positions.)
Re: Charset declaration in HTML (was: Romanized Singhala - Think about it again)
Naena Guru, Tue, 10 Jul 2012 01:40:19 -0500: HTML5 assumes UTF-8 as the character set if you do not declare one explicitly. My current pages are in HTML 4. There is in principle no difference between what HTML5 parsers assume and what HTML4 parsers assume: all of them default to the default encoding for the locale. Notepad forced me to save the file in UTF-8 format. I ran it through the W3C Validator. It passed the HTML5 test with the following warning: Byte-Order Mark found in UTF-8 File. I assume that you used the validator at http://validator.w3.org. But if you instead use the most updated HTML5-compatible validators at http://www.validator.nu or http://validator.w3.org/nu/ then you will not get any warning just because your file uses the Byte-Order Mark. HTML5 explicitly allows you to use the BOM. The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to cause problems for some text editors and older browsers. You may want to consider avoiding its use until it is better supported. Weasel words from the validator. The notion about older browsers is not very relevant. How old are they? IE6 has no problems with the BOM, for instance. And that is probably one of the few somewhat relevant old browsers. As for editors: if your own editor has no problems with the BOM, then what? But I think Notepad can also save as UTF-8 but without the BOM -- it should be possible to get an option for choosing when you save it. Else you can use the free Notepad++. And many others. In VIM, you set or unset the BOM via the commands set bomb / set nobomb -- Leif H Silli
Re: Charset declaration in HTML (was: Romanized Singhala - Think about it again)
2012/7/11 Leif Halvard Silli xn--mlform-...@xn--mlform-iua.no: it. Else you can use the free Notepad++. And many others. In VIM, you set or unset the BOM via the commands "set bomb" and "set nobomb". Do these commands specify whether your computer will explode when saving the file? :'o set bom set nobom Sorry, could not resist.
Ewellic again (was: Re: Romanized Singhala - Think about it again)
Unicode character database goes from zero to some very big number. There are no holes in it to define character sets for somebody's fancy. Well, Doug Ewell did one for Esperanto expanding fuþorc. Ewellic is not futhorc. They are different scripts. From the Omniglot page on Ewellic (with *emphasis* added): The shape of Ewellic letters was *inspired by* the Runic and Cirth scripts, but shows greater (though still imperfect) regularity of form. -- Doug Ewell | Thornton, Colorado, USA http://www.ewellic.org | @DougEwell
Re: Romanized Singhala - Think about it again
2012/7/9 Naena Guru naenag...@gmail.com: Using Latin letters for a transliteration of Sinhala is not a hack, but making fonts said to be Latin-1 with Sinhalese letters instead of the Latin letters is a hack. Your hack is a hack, simply because you've absolutely not understood anything about what Unicode is. And you are always confusing concepts. It's true that Unicode and ISO/IEC 10646 need to use a terminology that may not be understood the way you mean it or use it. That's why they include definitions of these terms. Don't interpret the terminology in a way different from what is defined. Well, you can characterize the smartfont solution any way you like. The problem for you is that it works! No, it does not work, because you seem to assume that we can always select the font. In most cases we cannot. So letters are encoded and given unique code points, but the font to render them is determined externally by the renderer. In most cases users won't want to have to guess which font to use, notably when these fonts are also not available on their platform. There's a huge life of text outside HTML and rich text formats for documents. You absolutely want to ignore it. The UCS is there to allow exactly the separation between the presentation (fonts for example) and the semantics of the encoded texts. The UCS is also designed to avoid the dependency between languages. Only the scripts are encoded (see the description of what is defined as abstract characters). An encoding is not just a collection of bits in fixed-width numbers. Otherwise we would only see numbers on screen. The code points in the UCS are given semantics via character properties. - The representative glyph seen in the Charts is only a very tiny part of these properties, and in fact the least used of all of them. They are only useful for producing visual charts. - What is more important is how each distinctive code behaves within various mappings to support various algorithms.
Including the possibility to switch fonts transparently without breaking the text completely (for example displaying a Greek Theta when a Latin Z with acute was encoded, or even a Latin X when a Latin R was encoded). The encoding is what allows words and orthographies to be recognized, still independently of the font styles and other optional typographic effects (because all scripts are made of an almost infinite number of possible styles, that users will still read as part of the script while still also recognizing the orthography used and the language). Unicode and ISO/IEC 10646 do not encode glyphs directly in the UCS. They do not encode orthographies, they do not encode languages. What is encoded is a set of correlated properties. One of these properties is a numeric property named code point, which is also independent of the final binary encoding (it could be one of the standard UTFs or even a legacy 8-bit encoding with a mapping to/from the UCS)! Sorry for this Kindergarten lesson, but you should understand the role of the font. A font is a support application at the User Interface level. Yes. But Unicode does not really care which font you will use, provided that it maps glyphs coherently in such a way that Sinhalese letters will not be rendered instead of the intended Latin letters EVEN if a Sinhalese font has been selected. When text moves between applications and between computers, it travels as numeric codes representing the text in the form of digital bytes. The computer can't tell French from Singhala. Not relevant to our discussions in this Unicode mailing list. We don't care about that and SHOULD not even care about it. Unicode supports a wide range of possible binary encodings. They don't change however the code point assignments, which are the central point from which all other properties are mapped in all applications, including for rendering (but not limited to it).
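The separation described here -- code points carrying standard properties regardless of which font a renderer eventually selects -- can be observed directly with Python's `unicodedata` module (a minimal sketch):

```python
import unicodedata

# A code point's standard properties (name, general category) come from the
# Unicode Character Database itself, independent of any font or rendering.
for ch in ("a", "\u0D85"):  # LATIN SMALL LETTER A, SINHALA LETTER AYANNA
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} ({unicodedata.category(ch)})")
```

Whatever typeface displays these characters, the underlying properties -- and hence searching, sorting, and case mappings -- stay the same.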
Oh, thank you for the generosity of allowing me use of the entire Latin repertoire. You don't have to tell that to me. We need to tell it again to you because you absolutely want to restrict the repertoire to an 8-bit subset, when you ALSO contradictorily say that you want to support thousands of aksharas. Unicode supports millions of characters and tens of millions of glyphs (possibly more) using a 21-bit encoding space (actually less than 20 bits if we leave aside the PUA, which is also supported separately but with an extremely free encoding with almost no standard properties). This space is still representable with various encodings (some are part of the Unicode and ISO/IEC 10646 standards, some are supported in the references, and there are tons of others, including many legacy 7-bit or 8-bit SBCS encodings, from ISO or from proprietary platforms, or from national standards not part of ISO, e.g. those developed in China PR such as GB18030, or in India such as ISCII, plus many older standards that have since been deprecated and are no longer recommended). But ISO
Re: Romanized Singhala - Think about it again
Thank you Goliath. On another subject, I think the script you dreamed of as a boy is very nearly fuþorc. fuþorc is the (Old) English alphabet. Thank you. On Wed, Jul 4, 2012 at 1:54 PM, Doug Ewell d...@ewellic.org wrote: [removing cc list] Naena Guru wrote: On this 4th of July, let me quote James Madison: [quote from Madison irrelevant to character encoding principles snipped] I gave much thought to why many here at the Unicode mailing list reacted badly to my saying that the Unicode solution for Singhala is bad. Unicode encodes Latin characters in their own block, and Sinhala characters in their own block. Many of us disagree with a solution to encode Sinhala characters as though they were merely Latin characters with different shapes, and agree with the Unicode solution to encode them as separate characters. This is a technical matter. I see the problem. This is what confused Philippe too. This is primarily a transliteration. Transliterations go from one script to another, not from one Unicode code block (I said code page earlier out of an old habit) to another. So, let's take the font issue out for the time being and concentrate on the transliteration. A transliteration scheme is a solution for a problem and has a technology platform it is made for. Older (predecessors of) IAST Sanskrit and PTS Pali were solutions made with letterpress printing in mind. They used dots and bars for accents because they could be improvised easily in the street-side printing presses. That was the 1800s. Suddenly, with computers, accented letters became hard to get. HK Sanskrit made Sanskrit friendly for the computer by limiting it to ASCII. Now, after electronic communication became cleaner, we expanded the 7-bit set to the full-byte set. Now the iso-8859-1 set is available everywhere. Earlier I said the Plain Text idea is bad too. And many of us disagree with that rather vehemently as well, for many reasons. The responses came as attacks on *my* solution rather than in defense of Unicode Singhala.
It's not personal unless you wish to make it personal. You came onto the Unicode mailing list, a place unsurprisingly filled with people who believe the Unicode model is a superior if not perfect character encoding model, and claimed that encoding Sinhala as if it were Latin (and requiring a special font to see the Sinhala glyphs) is a better model. Are you really surprised that some people here disagree with you? If you write to a Linux mailing list that Linux is terrible and Microsoft Windows is wonderful, you will see pushback there too. Here is a defense of Unicode Sinhala: it allows you, me, or anyone else to create, read, search, and sort plain text in Sinhala, optionally with any other script or combination of scripts in the same text, using any of a fairly wide variety of fonts, rendering engines, and applications. The purpose of designating naenaguru@gmail.com as a spammer is to prevent criticism. The list administrator, Sarasvati, can speak to this issue. Every mailing list, every single one, has rules concerning the conduct of posters. I note that your post made it to the list, though, so I'm not sure what you're on about. It is shameful that a standards organization belonging to corporations of repute resorts to censorship like bureaucrats and academics of little Lanka. Do not attempt to represent this as a David and Goliath battle between the big bad Unicode Consortium and poor little Sri Lanka or its citizens. This is a technical matter. I ask you to reconsider: As a way of explaining Romanized Singhala, I made some improvements to www.LovataSinhala.com. Mainly, it now has near the top of each page a link that says, ’switch the script’. That switches the base font of the body tag of the page between the Latin and Singhala typefaces. Please read the smaller page that pops up. The fundamental model is still one of representing Sinhala text using Latin characters, and relying on a font switch. It is still completely antithetical to the Unicode model. 
I also verified that I hadn't left any Unicode characters outside ISO-8859-1 in the source code -- HTML, JavaScript or CSS. The purpose of declaring the character set as iso-8859-1 rather than utf-8 is to avoid doubling and trebling the size of the page by utf-8. I think, if you have characters outside iso-8859-1 and declare the page as such, you get Character-not-found for those locations. (I may be wrong). You didn't read what Philippe wrote. Representing Sinhala characters in UTF-8 takes *fewer* bytes, typically less than half, compared to using numeric character references like &#3523;&#3538;&#3458;&#3524;&#3517; &#3517;&#3538;&#3520;&#3539;&#3512;&#3495; &#3465;&#3524;&#3517;. Philippe Verdy obviously has spent a lot of time researching the web site and even went as far as to check the faults of the web service provider, Godaddy.com. He called my font a hack font without any proof of it. A font
Re: Romanized Singhala - Think about it again
On Thu, Jul 5, 2012 at 6:51 AM, Philippe Verdy verd...@wanadoo.fr wrote: 2012/7/5 Naena Guru naenag...@gmail.com: On Wed, Jul 4, 2012 at 11:33 PM, Philippe Verdy verd...@wanadoo.fr wrote: Anyway, consider the solutions already proposed in Sinhalese Wikipedia. There are various solutions proposed, including several input methods supported there. But the purpose of these solutions is always to generate Sinhalese texts perfectly encoded with Unicode and nothing else. Thank you for the kind suggestion. The problem is Unicode Sinhala does not perfectly support Singhala! The solution is for Sinhala not for Unicode! I am not saying Unicode has a bad intention but an ill-conceived product. The fault is with Lankan technocrats who took the proposal as it was given and ever since prevented public participation. My solution is 'perfectly encoded with Unicode'. Yes, there may remain some issues with older OSes that have limited support for standard OpenType layout tables. But there's now no problem at all since Windows XP SP2. Windows 7 has the full support, and for those users that have still not upgraded from Windows XP, Windows 8 will be ready next August with an upgrade cost of about US$40 in the US (a valid offer currently advertised for all users upgrading from XP or later), and certainly even less for users in India and Sri Lanka. The above are not any of my complaints. Per capita income in Sri Lanka is $2400. They are content with cell phones. The practical place for computers is the Internet cafe. Linux is what the vast majority needs. And standard Unicode fonts with free licences are already available for all systems (not just Linux, for which they were initially developed); Yes, only 4 rickety ones. Who is going to buy them anyway? Still, Iskoola Pota, made by Microsoft by copying a printed font, is the best. You can check the Plain Text idea by mixing Singhala and Latin in the Arial Unicode MS font to see how pretty plain text looks.
They spent $2 or 20 million for someone to come and teach them how to make fonts. (Search ICTA.lk). Staying friendly with them is profitable. The World Bank backs you up too. Sometime in the 1990s when I was in Lanka, I tried to select a PC for my printer brother. We wanted to buy Adobe, QuarkXPress etc. The storekeeper gave a list and asked us to select the programs. Knowing that they are expensive, I asked him first to tell me how much they cost. He said that he would install anything we wanted for free! On the same trip coming back, in Zurich, the guys tried to give me an illicit copy of the Windows OS in appreciation for installing German and Italian (or French?) code pages on their computers. There even exist solutions for older versions of the iPhone 4, or on Android smartphones and tablets. Mine works on them with no special solution. It works anywhere that supports OpenType -- no platform discrimination. No one wants to get back to the situation that existed in the 1980's when there was a proliferation of non-interoperable 8-bit encodings for each specific platform. I agree. Today, 14 languages, including English, French, German and Italian, all share the same character space called ISO-8859-1. Romanized Singhala uses the same. So, what's the fuss about? The font? Consider that as the oft-suggested IME. Haha! And your solution also does not work in multilingual contexts; If mine does not work in some multilingual context, then neither do any of the 14 languages I mentioned above, including English and French. it does not work with many protocols or i18n libraries for applications. i18n is for multi-byte characters. Mine are single-byte characters. As you see, the safest place is SBCS. Or it requires specific constraints on web pages requiring complex styling everywhere to switch fonts. Did you see http://www.lovatasinhala.com? Maybe you are confusing Unicode Sinhala and romanized Singhala. Unicode Sinhala has a myriad such problems. That is why it should be abandoned!
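The crux of the "14 languages share ISO-8859-1" argument is that the sharing applies only to Latin letters: the Sinhala script's own code points lie outside any single-byte Latin code page. This is easy to check (a minimal sketch):

```python
# The Sinhala script occupies U+0D80..U+0DFF, far outside Latin-1's
# U+0000..U+00FF range, so only a transliteration -- not the script
# itself -- can live in an SBCS like ISO-8859-1.
word = "\u0DC3\u0DD2\u0D82\u0DC4\u0DBD"  # "Sinhala" written in Sinhala script

print(len(word.encode("utf-8")))  # 15: each Sinhala letter takes 3 bytes

try:
    word.encode("iso-8859-1")
except UnicodeEncodeError:
    print("not representable in ISO-8859-1")
```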
Please look at the web site and say it more coherently, if I misunderstood you. You are once again confusing the Sinhalese language with the Sinhalese script. Maybe Latin-1 is a good and sufficient script for transcribing the language. But Unicode is not made for standardizing transliterations. The script is what is being encoded, the way it is, even if this script is defective in some aspects for the language. As long as your transliteration scheme using Latin letter encodings is showing Latin letters, it will be fine. You are very kind. So now I have fulfilled your order by providing a link on the right side of the page to get rid of the Singhala font. But a font that represents Latin letters using Sinhalese glyphs is definitely broken. It will not work within multilingual contexts except when using many font switches in
Influence of Futhorc on Ewellic (was: Re: Romanized Singhala - Think about it again)
Naena Guru wrote: I think the script you dreamed of as a boy is very nearly fuþorc. fuþorc is the (Old) English alphabet. From the Omniglot page on Ewellic: The shape of Ewellic letters was inspired by the Runic and Cirth scripts, but shows greater (though still imperfect) regularity of form. -- Doug Ewell | Thornton, Colorado, USA http://www.ewellic.org | @DougEwell
Re: Romanized Singhala - Think about it again
of the entire Latin repertoire. You don't have to tell that to me. I have traveled quite a bit in the IT world. Don't be surprised if it is more than what you've seen. (Did you forget that earlier you accused me of using characters outside ISO-8859-1 while claiming I am within it? That is because you saw IAST and PTS displayed. They use those wonderful letters, symbols and diacritics you are trying to tout.) Is there a problem with Asians using the ISO-8859-1 code space even for transliteration? The bonus will be that you can still write the Sinhalese language with a romanisation like yours, Bonus? but there's no need to reinvent the Sinhalese script The Singhala script existed many, many years before the English and French adopted Latin. What I did was save it from the massacre going on with Unicode Sinhala. itself, which your encoding is not even capable of completely supporting in all its aspects (your system only supports a reduced subset of the script). What is the basis for this nonsense? (Little birds whispering in the background. Watch out. They are laughing.) My solution supports the entire script: Singhala, Pali and Sanskrit, plus two rare allophones of Sanskrit as well. Tell me what it lacks and I will add it, haha! One time you said I assigned Unicode Sinhala characters to the 'hack' font. What I do is assign Latin characters to Singhala phonemes. That is called transliteration. There are no 'contextual versions' of the same Singhala letters like you said earlier. Ask your friends what they have more than mine in the Singhala script. Ask them why they included only two ligatures when there are 15 such. Ask them how many Singhala letters there are. Even the legacy ISCII system (used in India) is better, because it is supported by a published open standard, for which there's a clear and stable conversion from/to Unicode. My solution is supported by two standards: ISO-8859-1 and OpenType.
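The "assign Latin characters to Singhala phonemes" idea can be illustrated with a toy sketch. The mapping below is hypothetical, invented here only to show the shape of such a code-point-to-Latin translation; it is not the actual romanized-Singhala scheme discussed in the thread, which is never fully specified in these messages:

```python
# HYPOTHETICAL toy transliteration table (not the author's real scheme):
# each Sinhala code point maps to a Latin string.
SINHALA_TO_LATIN = {
    "\u0D85": "a",   # SINHALA LETTER AYANNA
    "\u0DBD": "la",  # SINHALA LETTER DANTAJA LAYANNA
    "\u0DC3": "sa",  # SINHALA LETTER DANTAJA SAYANNA
}

def romanize(text: str) -> str:
    """Replace known Sinhala letters with Latin strings; pass others through."""
    return "".join(SINHALA_TO_LATIN.get(ch, ch) for ch in text)

print(romanize("\u0DC3\u0DBD"))  # sala
```

A real scheme would of course need the full letter inventory, vowel signs, and the contextual rules the thread argues about.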
ISO-8859-1 is the Basic Latin plus Latin-1 Supplement part of the Unicode standard. Bottom line is this: if Latin-1 is good enough for English and French, it is good enough for Singhala too. And if OpenType is good for English and French, it is good for Singhala too. 2012/7/5 Naena Guru naenag...@gmail.com: Philippe, My last message was partial. It went out by mistake. I'll try again. It takes very long for this old man. -- Forwarded message -- From: Naena Guru naenag...@gmail.com Date: Wed, Jul 4, 2012 at 10:32 PM Subject: Re: Romanized Singhala - Think about it again To: verd...@wanadoo.fr Hi, Philippe. Thanks for keeping engaged in the discussion. Too little time spent could lead to misunderstanding. On Wed, Jul 4, 2012 at 3:42 PM, Philippe Verdy verd...@wanadoo.fr wrote: 2012/7/4 Naena Guru naenag...@gmail.com: Philippe Verdy, obviously has spent a lot of time Not a lot of time... Sorry. researching the web site and even went as far as to check the faults of the web service provider, Godaddy.com. I did not even note that your hosting provider was that company. I just looked at the HTTP headers to check the MIME type and charset declarations. Nothing else. I know that the browser tells it. It is not a big deal; WOFF is the compressed TTF, but TTF gets delivered. If and when GoDaddy fixes their problem, the pages will get delivered faster. Or I can make that fix in a .htaccess file. No time! He called my font a hack font without any proof of it. It is really a hack. Your font assigns Sinhalese characters to Latin letters (or some punctuation) of ISO 8859-1. My font does not have anything to do with Sinhalese characters, if you mean Unicode characters. You are very confusing. A character in this context is a datatype. In the 80s it was one byte in size and was used as a signal not to use it in arithmetic. (We still did it to convert between capital and simple forms.) In the Unicode character database, a character is a numerical position.
A Unicode Sinhala character is defined in hex [0D80 - 0DFF]. Unicode Sinhala characters represent an incomplete hotchpotch of ideas of letters, ligatures and signs. I have none of that in the font. I say and know that Unicode Sinhala is a failure. It inhibits use of Singhala on the computer and the network. I do not concern myself with fixing it because it cannot be fixed. The only thing I did in relation to it is to write an elaborate set of routines to *translate* (not map) between constructs of Unicode Sinhala characters and romanized Singhala. That is not in the font. The font has lookup tables. It also assigns contextual variants of the same abstract Sinhalese letters to ISO 8859-1 codes, What contexts cause what variants? Looks like you are saying Singhala letters cha plus glyphs for some ligatures of multiple Sinhalese letters to ISO 8859-1 codes, plus it reorders these glyphs so that they no longer match
Re: Romanized Singhala - Think about it again
2012/7/5 Naena Guru naenag...@gmail.com: The above are not any of my complaints. Per capita income in Sri Lanka is $2400. They are content with cell phones. The practical place for computers is the Internet cafe. Linux is what the vast majority needs. And Linux fully supports the standard Unicode encoding of the Sinhalese script. Maybe there are still some missing letters to encode, but then it's not too late to encode them. Propose them, formalize them. Help disambiguate the various cases. But ask yourself why the Sinhalese Wikipedia works and is usable in Linux too... There already exist free OpenType fonts for Sinhalese that are using the standard Unicode/ISO/IEC 10646 assignments. Did you say you can't read the Sinhalese Wikipedia on your Linux machines?
Re: Romanized Singhala - Think about it again
Le 05/07/12 10:02, Naena Guru a écrit : On Wed, Jul 4, 2012 at 11:33 PM, Philippe Verdy verd...@wanadoo.fr wrote: Anyway, consider the solutions already proposed in Sinhalese Wikipedia. There are various solutions proposed, including several input methods supported there. But the purpose of these solutions is always to generate Sinhalese texts perfectly encoded with Unicode and nothing else. Thank you for the kind suggestion. The problem is Unicode Sinhala does not perfectly support Singhala! What's wrong? Are there missing letters? The solution is for Sinhala not for Unicode! Or rather for Sinhala by Unicode. I am not saying Unicode has a bad intention but an ill-conceived product. What precisely is ill-conceived? The fault is with Lankan technocrats who took the proposal as it was given and ever since prevented public participation. My solution is 'perfectly encoded with Unicode'. No. It's an 8-bit character set independent from Unicode. Yes, there may remain some issues with older OSes that have limited support for standard OpenType layout tables. But there's now no problem at all since Windows XP SP2. Windows 7 has the full support, and for those users that have still not upgraded from Windows XP, Windows 8 will be ready next August with an upgrade cost of about US$40 in the US (a valid offer currently advertised for all users upgrading from XP or later), and certainly even less for users in India and Sri Lanka. The above are not any of my complaints. Per capita income in Sri Lanka is $2400. They are content with cell phones. The practical place for computers is the Internet cafe. Linux is what the vast majority needs. And standard Unicode fonts with free licences are already available for all systems (not just Linux, for which they were initially developed); Yes, only 4 rickety ones. Who is going to buy them anyway? Why would you buy them if they're free? Still, Iskoola Pota, made by Microsoft by copying a printed font, is the best.
You can check the Plain Text idea by mixing Singhala and Latin in the Arial Unicode MS font to see how pretty plain text looks. They spent $2 or 20 million for someone to come and teach them how to make fonts. (Search ICTA.lk). Staying friendly with them is profitable. The World Bank backs you up too. Sometime in the 1990s when I was in Lanka, I tried to select a PC for my printer brother. We wanted to buy Adobe, QuarkXPress etc. The storekeeper gave a list and asked us to select the programs. Knowing that they are expensive, I asked him first to tell me how much they cost. He said that he would install anything we wanted for free! On the same trip coming back, in Zurich, the guys tried to give me an illicit copy of the Windows OS in appreciation for installing German and Italian (or French?) code pages on their computers. There even exist solutions for older versions of the iPhone 4, or on Android smartphones and tablets. Mine works on them with no special solution. It works anywhere that supports OpenType -- no platform discrimination. Is there any platform discrimination with Unicode Sinhala? No one wants to get back to the situation that existed in the 1980's when there was a proliferation of non-interoperable 8-bit encodings for each specific platform. I agree. Today, 14 languages, including English, French, German and Italian, all share the same character space called ISO-8859-1. In fact, ISO-8859-1 is not well suited for French (my native language): it lacks a few letters which were added to ISO-8859-15. However, I always use Unicode today, even for French-only texts. Romanized Singhala uses the same. So, what's the fuss about? The font? The problem is that only your transliteration scheme, with Latin letters, is supported by ISO-8859-1, not the Sinhalese letters themselves. Consider that as the oft-suggested IME. Haha!
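Jean-François's remark that ISO-8859-1 lacks a few French letters later added to ISO-8859-15 is easy to demonstrate with the ligature œ (U+0153), used in words like "cœur" (a minimal sketch):

```python
# "œ" (U+0153) is absent from ISO-8859-1 but was added to ISO-8859-15
# (Latin-9) at byte position 0xBD.
word = "c\u0153ur"  # French "cœur"

try:
    word.encode("iso-8859-1")
except UnicodeEncodeError:
    print("œ is not in Latin-1")

print(word.encode("iso-8859-15"))  # b'c\xbdur'
```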
And your solution also does not work in multilingual contexts; If mine does not work in some multilingual context, then neither do any of the 14 languages I mentioned above, including English and French. They do, because they use Latin letters, not Sinhalese letters. it does not work with many protocols or i18n libraries for applications. i18n is for multi-byte characters. Mine are single-byte characters. OK. Do it as you want, but it won't be Unicode compliant. As you see, the safest place is SBCS. I don't see. Why is it safer? Or it requires specific constraints on web pages requiring complex styling everywhere to switch fonts. Did you see http://www.lovatasinhala.com? Maybe you are confusing Unicode Sinhala and romanized Singhala. Unicode Sinhala has a myriad such problems. Which problems? That is why it should be abandoned! Why wouldn't you try to solve the problems, whatever they could be, instead of proposing an entirely
Re: Romanized Singhala - Think about it again
Naena Guru wrote: I know you do not care about a language of 15 million people, but it matters to them. These kinds of straw man arguments are rude and counter-productive. Such a characterization is highly unlikely to be true for anyone on this list, and you've just ensured that few of them will pay any more attention to you. - John Burger MITRE
Re: Romanized Singhala - Think about it again
Seems to me that Naena Guru is demonstrating the truth of two adages: a) A fanatic is a person who redoubles his efforts when he loses sight of his goal; and b) Every movement starts with a fanatic, but for the movement to succeed, the fanatic must be removed from the movement. Peter Ingerman On 2012-07-05 09:46, John D Burger wrote: Naena Guru wrote: I know you do not care about a language of 15 million people, but it matters to them. These kinds of straw man arguments are rude and counter-productive. Such a characterization is highly unlikely to be true for anyone on this list, and you've just ensured that few of them will pay any more attention to you. - John Burger MITRE
Romanized Singhala - Think about it again
Pardon me for including a CC list. These are people who showed for and against opinions. On this 4th of July, let me quote James Madison: A zeal for different opinions concerning religion, concerning government, and many other points, as well of speculation as of practice; an attachment to different leaders ambitiously contending for pre-eminence and power; or to persons of other descriptions whose fortunes have been interesting to the human passions, have, in turn, divided mankind into parties, inflamed them with mutual animosity, and rendered them much more disposed to vex and oppress each other than to co-operate for their common good. I gave much thought to why many here at the Unicode mailing list reacted badly to my saying that the Unicode solution for Singhala is bad. Earlier I said the Plain Text idea is bad too. The responses came as attacks on *my* solution rather than in defense of Unicode Singhala. The purpose of designating naenaguru@gmail.com as a spammer is to prevent criticism. It is shameful that a standards organization belonging to corporations of repute resorts to censorship like bureaucrats and academics of little Lanka. *I ask you to reconsider:* As a way of explaining Romanized Singhala, I made some improvements to www.LovataSinhala.com. Mainly, it now has near the top of each page a link that says, 'switch the script'. That switches the base font of the body tag of the page between the Latin and Singhala typefaces. *Please read the smaller page that pops up.* I also verified that I hadn't left any Unicode characters outside ISO-8859-1 in the source code -- HTML, JavaScript or CSS. The purpose of declaring the character set as iso-8859-1 rather than utf-8 is to avoid doubling and trebling the size of the page by utf-8. I think, if you have characters outside iso-8859-1 and declare the page as such, you get Character-not-found for those locations. (I may be wrong).
Philippe Verdy obviously has spent a lot of time researching the web site and even went as far as to check the faults of the web service provider, Godaddy.com. He called my font a hack font without any proof of it. It has only characters relevant to romanized Singhala within the SBCS. Most of the work was in the PUA and lookup tables. I am reminded of Inspector Clouseau, who has many gadgets and in the end finds himself to be the culprit. I will still read and try those other things Philippe suggests, when I get time. What is important for me is to improve the orthography rules and add more Indic languages -- Devanagari and Tamil coming up. As for those who do not want to think rationally and think Unicode is a religion, I can only point to my dilemma: http://lovatasinhala.com/assayaa.htm Have a Happy Fourth of July!
Re: Romanized Singhala - Think about it again
[removing cc list] Naena Guru wrote: On this 4th of July, let me quote James Madison: [quote from Madison irrelevant to character encoding principles snipped] I gave much thought to why many here at the Unicode mailing list reacted badly to my saying that the Unicode solution for Singhala is bad. Unicode encodes Latin characters in their own block, and Sinhala characters in their own block. Many of us disagree with a solution to encode Sinhala characters as though they were merely Latin characters with different shapes, and agree with the Unicode solution to encode them as separate characters. This is a technical matter. Earlier I said the Plain Text idea is bad too. And many of us disagree with that rather vehemently as well, for many reasons. The responses came as attacks on *my* solution rather than in defense of Unicode Singhala. It's not personal unless you wish to make it personal. You came onto the Unicode mailing list, a place unsurprisingly filled with people who believe the Unicode model is a superior if not perfect character encoding model, and claimed that encoding Sinhala as if it were Latin (and requiring a special font to see the Sinhala glyphs) is a better model. Are you really surprised that some people here disagree with you? If you write to a Linux mailing list that Linux is terrible and Microsoft Windows is wonderful, you will see pushback there too. Here is a defense of Unicode Sinhala: it allows you, me, or anyone else to create, read, search, and sort plain text in Sinhala, optionally with any other script or combination of scripts in the same text, using any of a fairly wide variety of fonts, rendering engines, and applications. The purpose of designating naenaguru@gmail.com as a spammer is to prevent criticism. The list administrator, Sarasvati, can speak to this issue. Every mailing list, every single one, has rules concerning the conduct of posters. I note that your post made it to the list, though, so I'm not sure what you're on about.
It is shameful that a standards organization belonging to corporations of repute resorts to censorship like bureaucrats and academics of little Lanka. Do not attempt to represent this as a David and Goliath battle between the big bad Unicode Consortium and poor little Sri Lanka or its citizens. This is a technical matter. I ask you to reconsider: As a way of explaining Romanized Singhala, I made some improvements to www.LovataSinhala.com. Mainly, it now has near the top of each page a link that says, ‘switch the script’. That switches the base font of the body tag of the page between the Latin and Singhala typefaces. Please read the smaller page that pops up. The fundamental model is still one of representing Sinhala text using Latin characters, and relying on a font switch. It is still completely antithetical to the Unicode model. I also verified that I hadn’t left any Unicode characters outside ISO-8859-1 in the source code -- HTML, JavaScript or CSS. The purpose of declaring the character set as iso-8859-1 rather than utf-8 is to avoid doubling or trebling the size of the page under utf-8. I think, if you have characters outside iso-8859-1 and declare the page as such, you get Character-not-found for those locations. (I may be wrong). You didn't read what Philippe wrote. Representing Sinhala characters in UTF-8 takes *fewer* bytes, typically less than half, compared to using numeric character references like &#3523;&#3538;&#3458;&#3524;&#3517; &#3517;&#3538;&#3520;&#3539;&#3512;&#3495; &#3465;&#3524;&#3517;. Philippe Verdy obviously has spent a lot of time researching the web site and even went as far as to check the faults of the web service provider, Godaddy.com. He called my font a hack font without any proof of it. A font that places glyphs for one character in the code space defined for a fundamentally different character is generally referred to as a hack (or hacked) font.
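The byte-count claim about UTF-8 versus numeric character references is easy to verify. A minimal Python sketch; the sample word and its code points are taken from the numeric references quoted in the thread:

```python
# Compare storage for the Sinhala word "සිංහල" as raw UTF-8 versus
# decimal numeric character references (the &#3523;... style quoted
# above). Each Sinhala code point (block U+0D80-U+0DFF) takes 3 bytes
# in UTF-8, but 7 ASCII bytes as a four-digit decimal reference.
word = "\u0dc3\u0dd2\u0d82\u0dc4\u0dbd"   # සිංහල

utf8 = word.encode("utf-8")
ncrs = word.encode("ascii", errors="xmlcharrefreplace")

print(len(utf8))             # 15 bytes: 5 characters x 3 bytes
print(len(ncrs))             # 35 bytes: 5 references x 7 bytes
print(ncrs.decode("ascii"))  # &#3523;&#3538;&#3458;&#3524;&#3517;
```

UTF-8 here is well under half the size, matching the "typically less than half" figure stated above.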
A Latin-only font that placed a glyph looking like 'B' in the space reserved for 'A' would also be a hacked font. As for those who do not want to think rationally and think Unicode is a religion, I can only point to my dilemma: http://lovatasinhala.com/assayaa.htm You need to stop making this religion accusation. This is a technical matter. This is the last attempt I will make to help show YOU where the water is. -- Doug Ewell | Thornton, Colorado, USA http://www.ewellic.org | @DougEwell
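The "hacked font" objection is concrete, not rhetorical. A small sketch of what every font-unaware tool sees in such a page; the byte values are arbitrary examples, not taken from any particular hacked font:

```python
import unicodedata

# A hacked-font page stores ordinary Latin-1 bytes; only one special
# font paints them as other glyphs. These byte values are hypothetical,
# chosen only for illustration.
stored = bytes([0x6B, 0x61, 0xE6])
text = stored.decode("latin-1")

# Searching, sorting and spell-checking act on code points, not glyphs:
for c in text:
    print(unicodedata.name(c))
# LATIN SMALL LETTER K
# LATIN SMALL LETTER A
# LATIN SMALL LETTER AE
```

Whatever the special font displays, to every other program this text *is* Latin, which is exactly the B-glyph-in-the-A-slot point above.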
Charset declaration in HTML (was: Romanized Singhala - Think about it again)
Hello Naena Guru, on 2012-07-04, you wrote: The purpose of declaring the character set as iso-8859-1 rather than utf-8 is to avoid doubling or trebling the size of the page under utf-8. I think, if you have characters outside iso-8859-1 and declare the page as such, you get Character-not-found for those locations. (I may be wrong). You are wrong, indeed. If you declare your page as ISO-8859-1, every octet (aka byte) in your page will be understood as a Latin-1 character; hence you cannot have any other character in your page. So, your notion of “characters outside iso-8859-1” is completely meaningless. If you declare your page as UTF-8, you can have any Unicode character (even PUA characters) in your page. Regardless of the charset declaration of your page, you can include both Numeric Character References and Character Entity References in your HTML source, cf., e.g., http://www.w3.org/TR/html401/charset.html#h-5.3. These may refer to any Unicode character whatsoever. However, they will take considerably more storage space (and transmission bandwidth) than the UTF-8 encoded characters would take. Good luck, Otto Stolz
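The two cases described above can be shown side by side. A minimal sketch of two hypothetical pages; the Sinhala word is the one whose numeric references are quoted earlier in the thread:

```html
<!-- Page declared as ISO-8859-1: characters outside Latin-1 can
     appear only as character references, 7 bytes per character. -->
<!DOCTYPE html>
<html>
<head><meta http-equiv="Content-Type"
            content="text/html; charset=iso-8859-1"></head>
<body>&#3523;&#3538;&#3458;&#3524;&#3517;</body>
</html>

<!-- Page declared as UTF-8: the same characters can be stored
     directly, 3 bytes per character. -->
<!DOCTYPE html>
<html>
<head><meta charset="utf-8"></head>
<body>සිංහල</body>
</html>
```

Note that `<meta charset="utf-8">` is the HTML5 shorthand; HTML 4.01, current when this thread was written, uses the longer http-equiv form for both charsets.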
Re: Romanized Singhala - Think about it again
2012/7/4 Naena Guru naenag...@gmail.com: Philippe Verdy obviously has spent a lot of time Not a lot of time... Sorry. researching the web site and even went as far as to check the faults of the web service provider, Godaddy.com. I did not even note that your hosting provider was that company. I just looked at the HTTP headers to look at the MIME type and charset declarations. Nothing else. He called my font a hack font without any proof of it. It is really a hack. Your font assigns Sinhalese characters to Latin letters (or some punctuation marks) of ISO 8859-1. It also assigns contextual variants of the same abstract Sinhalese letters to ISO 8859-1 codes, plus glyphs for some ligatures of multiple Sinhalese letters to ISO 8859-1 codes, plus it reorders these glyphs so that they no longer match the Sinhalese logical order. Yes, this font is a hack because it pretends to be ISO 8859-1 when it is not. It is a specific distinct encoding which is neither ISO 8859-1 nor Unicode, but something that exists in NO existing standard. It has only characters relevant to romanized Singhala within the SBCS. Most of the work was in the PUA and Look-up Tables. I am reminded of Inspector Clouseau, who has many gadgets and in the end finds that he himself is the culprit. And you have invented an Inspector Guru gadget for your private use on your site, instead of developing a TRUE separate encoding that you SHOULD NOT name ISO 8859-1. Try to do that, but be aware that the ISO registry of 8-bit encodings is now frozen. You'll have to convince the IANA registry to register your new encoding. For now it is registered nowhere. This is a purely local creation for your site. I will still read and try those other things Philippe suggests, when I get time. What is important for me is to improve on orthography rules and add more Indic languages -- Devanagari and Tamil coming up. As for those who do not want to think rationally and think Unicode is a religion, No.
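The distinction drawn above, between a private re-use of the Latin-1 code space and real Sinhala code points, is mechanically checkable. A small sketch, assuming only the block boundaries from the Unicode code charts:

```python
# Genuine Unicode Sinhala text uses code points in the Sinhala block,
# U+0D80..U+0DFF. Text prepared for a Latin-1 "hacked" font never
# enters that range, so the two are trivially distinguishable.
def is_unicode_sinhala(text: str) -> bool:
    """True if every non-space character lies in the Sinhala block."""
    return all(0x0D80 <= ord(c) <= 0x0DFF or c.isspace() for c in text)

print(is_unicode_sinhala("\u0dc3\u0dd2\u0d82\u0dc4\u0dbd"))  # True
print(is_unicode_sinhala("sinhala"))                          # False
```

The second call returns False no matter which glyphs a special font would paint for those Latin letters.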
Unicode is a technical solution for a long-standing problem: interoperability of standards using open technologies. Given that you do not want to even develop your own encoding as a registered open standard compatible with a lot of applications (remember that all new web standards MUST now support Unicode in at least one of its standard UTFs, you're just losing time here) I can only point to my dilemma: http://lovatasinhala.com/assayaa.htm Have a Happy Fourth of July! Next time don't cite me personally, trying to convince others that I have supported or said something I did not write myself. You have interpreted my words at your convenience, but I don't want to be associated nominatively and publicly with your personal interpretations. Even if I also have my own opinions, I don't want to cite anyone else's opinions without just quoting his own sentences (provided that these sentences were public or that I was authorized by him to quote his sentences in other contexts). Stop this abuse of personalities. Thanks.
Re: Romanized Singhala - Think about it again
Philippe, ask your friends why ordinary people Anglicize if Unicode Sinhala is so great. See just one of many community forums: http://elakiri.com I know you do not care about a language of 15 million people, but it matters to them. On Wed, Jul 4, 2012 at 10:46 PM, Philippe Verdy verd...@wanadoo.fr wrote: You are alone to think that. Users of the Sinhalese edition of Wikipedia do not need your hack or even webfonts to use the website. It only uses standard Unicode, with very common web browsers. And it works as is. For users that are not pre-equipped with the necessary fonts and browsers, Wikipedia indicates this very useful site: http://www.siyabas.lk/sinhala_how_to_install_in_english.html I have two guys here in the US who asked me to help get rid of the Unicode Sinhala that I helped them install from that 'very useful site'. Copies of this message go to them. Actually, you do not need their special installation if you have Windows 7. Windows XP needs an update of Uniscribe, and so does Vista. Their installation programs are faulty and interfere with your OS settings. This solves the problem at least for older versions of Windows or old distributions of Linux (now all popular distributions support Sinhalese). No web fonts are even necessary (WOFT works only in Windows but not in older versions of Windows with old versions of IE). You mean WEFT? Now TTF (OTF) are compressed into WOFF. I see that Microsoft is finally supporting it. (At least my font downloads, or maybe it picks up the font in my computer? Now I am confused.) Everything is covered: working with TrueType and OpenType, adding an IME if needed. And then navigating on standard Sinhalese websites encoded with Unicode. Philippe, try making a web page with Unicode Sinhala. Note that for versions of Windows with browser versions older than IE6 there is no support, only because these older versions did not have the necessary minimum support for complex scripts.
The alternative is to use another browser such as Firefox, which uses its own independent renderer that does not depend on Windows Uniscribe support. But these users are now extremely rare. Almost everyone now uses at least XP for Windows (Windows 95/98 are definitely dead), or uses a Mac, or a smartphone, or another browser (such as Firefox, Chrome, Opera). I agree. Nobody except you supports your tricks and hacks. You really come too late, trying to solve a problem that no longer exists, as it was solved long ago for Sinhalese. Mine is a comprehensive solution. It is a transliteration. Ask users who compared the two. Find ordinary Singhalese. They use Unicode Sinhala to read news web sites. The rest of the time they Anglicize or write in English. Everything is covered here too, buddy. Adobe apps since 2004, Apple since 2004, Mozilla since 2006, all other modern browsers since 2010. MS Office 2010. Abiword, gNumeric, Linux all the works. IE 8, 9 partial. IE 10 full. So? 2012/7/5 Naena Guru naenag...@gmail.com: Hi, Philippe. Thanks for keeping engaged in the discussion. Too little time spent could lead to misunderstanding. On Wed, Jul 4, 2012 at 3:42 PM, Philippe Verdy verd...@wanadoo.fr wrote: 2012/7/4 Naena Guru naenag...@gmail.com: Philippe Verdy obviously has spent a lot of time Not a lot of time... Sorry. researching the web site and even went as far as to check the faults of the web service provider, Godaddy.com. I did not even note that your hosting provider was that company. I just looked at the HTTP headers to look at the MIME type and charset declarations. Nothing else. I know that the browser tells it. It is not a big deal; WOFF is the compressed TTF, but TTF gets delivered. If and when GoDaddy fixes their problem, the pages get delivered faster. Or I can make that fix in a .htaccess file. No time! He called my font a hack font without any proof of it. It is really a hack.
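The .htaccess fix mentioned above would only be a few lines. A sketch, assuming an Apache host that honours AddType in per-directory overrides, with the MIME types commonly registered around the time of this thread:

```apacheconf
# Serve web-font files with explicit MIME types so browsers accept
# them instead of falling back to the host's (wrong) default.
AddType application/font-woff .woff
AddType application/x-font-ttf .ttf

# Optional: let browsers and proxies cache the fonts for a week.
<IfModule mod_expires.c>
    ExpiresActive On
    ExpiresByType application/font-woff "access plus 7 days"
    ExpiresByType application/x-font-ttf "access plus 7 days"
</IfModule>
```

Whether this works depends on the host allowing these directives in .htaccess; some shared hosts disable per-directory overrides.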
Your font assigns Sinhalese characters to Latin letters (or some punctuation marks) of ISO 8859-1. My font does not have anything to do with Singhalese characters, if you mean Unicode characters. You are very confusing. A character in this context is a datatype. In the 80s it was one byte in size and was used to signal that it was not to be used in arithmetic. (We still did it to convert between capital and simple forms.) In the Unicode character database, a character is a numerical position. A Unicode Sinhala character is defined in hex [0D80 - 0DFF]. Unicode Sinhala characters represent an incomplete hotchpotch of ideas of letters, ligatures and signs. I have none of that in the font. I say and know that Unicode Sinhala is a failure. It inhibits use of Singhala on the computer and the network. I do not concern myself with fixing it because it cannot be fixed. The only thing I did in relation to it is to write an elaborate set of routines to *translate* (not map) between constructs of Unicode Sinhala
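The "translate, not map" distinction can be illustrated: vowel signs in an Indic script modify the preceding syllable, so a faithful romanization is a contextual rewrite, not a one-to-one substitution of code points. A toy sketch with a hypothetical four-entry mapping, not the actual lovatasinhala.com scheme:

```python
# Toy translation from Unicode Sinhala to a romanization. The point:
# vowel signs act on the *previous* syllable, so this is a contextual
# rewrite, not a character-for-character mapping. The romanization
# values here are hypothetical, chosen only for illustration.
ROMAN = {
    0x0D85: "a",    # ayanna, independent vowel a
    0x0D9A: "ka",   # kayanna, consonant with inherent a
    0x0DD2: "i",    # vowel sign i: replaces the inherent a
    0x0DCA: "",     # al-lakuna (virama): deletes the inherent a
}

def romanize(text: str) -> str:
    out = []
    for c in text:
        if ord(c) in (0x0DD2, 0x0DCA) and out and out[-1].endswith("a"):
            out[-1] = out[-1][:-1]          # strip the inherent vowel
        out.append(ROMAN.get(ord(c), c))
    return "".join(out)

print(romanize("\u0d9a\u0dd2"))   # ki  (kayanna + i-sign)
print(romanize("\u0d9a\u0dca"))   # k   (kayanna + virama)
print(romanize("\u0d85"))         # a
```

A real scheme would need many more entries and rules (conjuncts, the muted short-a rules mentioned earlier in the thread), but the contextual structure is the same.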