Re: Romanized Singhala - Think about it again

2012-07-17 Thread Jean-François Colson

On 17/07/12 02:43, Naena Guru wrote:
 Jean, sorry I am late. I used spare time as and when I got it.

 On Sun, Jul 8, 2012 at 10:20 PM, Jean-François Colson j...@colson.eu 
wrote:


 On 09/07/12 01:29, Naena Guru wrote:
 Jean-François,

 Let me approximate it in Romanized Singhala: 'jyó-frósvaa'. Just 
trying...
 I don’t know how that transcription should be pronounced but in 
IPA, Jean-François is /ʒɑ̃.fʁɑ̃.swa/.


 They came as rectangles (in XP).
That’s not surprising. Windows XP is an old, out-of-date system with, by 
default, a very limited set of fonts. But nobody prevents you from 
downloading some additional free fonts.


 They showed correctly in your message inside Firefox running in Puppy 
Linux, but where I am replying, it shows a reversed-Euro-like character

That is surprising. Which font includes a reversed euro sign?

 in place of the a-umlaut.
I didn’t use any umlauts but two tildes.

 This again illustrates how hazardous things are for characters outside 
Latin-1.

It illustrates how hazardous it is to use such an old OS as Windows XP.


 I can only approximate the first letter as English j+y,
English j + y? I don’t know that sound, either in English or in French.

/ʒ/ is the French j. It is not the English j plus something but rather 
the English j minus something.
The English j is /dʒ/. That’s an affricate, i.e. roughly a sound which 
begins as a plosive and evolves to end as a fricative.
The French j is a fricative. Its nearest approximation in English is the 
z in azure or the s in leisure.


/ɑ̃/ is not an /ɑ/ with umlaut but an /ɑ/ with tilde. It is pronounced as 
the a in the English word “car” but with a nasal quality, i.e. some air 
passes through the nose.


Jean is a homophone of gens, which you can hear here: 
http://fr.wiktionary.org/wiki/gens#Prononciation
The speaker has recorded “des gens”, so focus your attention on the 
second syllable.


/f/ is pronounced like in English.

/ʁ/ is the French r, but there are several varieties of r among the 
French dialects, so using the English r instead is not a big problem.


/s/ and /w/ are pronounced as in English.

/a/ is very similar to /ɑ/. It is the beginning of the diphthong in the 
English word “sky”.


 which is same on Singhala. The rest is pretty close, I think.




 Thank you for your interest. See inline responses.

 On Thu, Jul 5, 2012 at 7:35 AM, Jean-François Colson 
j...@colson.eu wrote:


 On 05/07/12 10:02, Naena Guru wrote:


 On Wed, Jul 4, 2012 at 11:33 PM, Philippe Verdy 
verd...@wanadoo.fr wrote:


 Anyway, consider the solutions already proposed in 
Sinhalese
 Wikipedia. There are various solutions proposed, 
including several
 input methods supported there. But the purpose of these 
solutions is
 always to generate  Sinhalese texts perfectly encoded 
with Unicode and

 nothing else.

 Thank you for the kind suggestion. The problem is Unicode 
Sinhala does not perfectly support Singhala!

 What’s wrong? Are there missing letters?

 Many, many.

 The solution is for Sinhala not for Unicode!
 Or rather for Sinhala by Unicode.

 Sure, if you want to do it with proper deliberation.



 I am not saying Unicode has a bad intention but an 
ill-conceived product.

 What precisely is ill-conceived?

 Anglo-centric thinking is what is wrong.
 ?

 Letters have no direct relation to speech -- very few do. In Singhala 
(perhaps as in French too, as someone said?) you write what you say. In 
Singhala, the exception is a clearly understood rule set about how to 
pronounce the short 'a' -- whether muted or not.


 Therefore, the approach should have been to encode the vowels, 
diphthongs and consonants as base letters. I assigned the acute accent 
to the 'ng' sound and the umlaut to the guttural H in Sanskrit, but they 
could be assigned independent codepoints.




  Let me take you on the scenic route:

 Number of letters in Singhala is only theoretical. In the case 
of Singhala orthography, the actually used number depends on the 
Sanskrit vocabulary.
 Do you mean there are many conjunct consonants, sometimes with a 
separate glyph?


 Yes, many. There are three orthographies. Singhala does not have CCs 
at all. Sanskrit has a lot. Pali has touch letters in addition to what 
Sanskrit has. Modern Singhala is mixed Singhala and Sanskrit. With 
Unicode Sinhala, you need to know which ones join and provide the ZWJ 
and hope and pray that the font has the CC. Often they are absent.
So I guess your problem could be solved by providing new fonts with 
better support for conjunct consonants. What you did for 8-bit Sinhala, 
you could do for Unicode Sinhala too.


 Then SLS1134 gives wrong advice too.
Could you explain in detail what is wrong with their advice?


 In Devanagari, they’re made by typing two or more consonants 
separated by halants. Isn’t that possible with Sinhala?

Re: Romanized Singhala - Think about it again

2012-07-17 Thread Jean-François Colson

On 17/07/12 02:43, Naena Guru wrote:
 Just see the daily questions and dedicated section for Indic at 
Unicode.org, and think why ordinary people Anglicize instead of using 
Unicode Sinhala. (e.g. elakiri.com).

 Some also use the Sinhalese script.
 I’ve sometimes seen people type Arabic with Latin letters in a 
French library, because the computers they used only had French 
keyboards and they didn’t know the Arabic keyboard layout well enough to 
touch type in Arabic with Arabic letters.


 That's right. Everyone is familiar with the good old QWERTY keyboard. 
The Singhalese have developed their own Anglicizing convention. The 
Tamils do it too, but their Anglicizing is different from the one the 
Singhalese use. They are a little more respectful of their language and 
try to Anglicize more precisely.


 I used the Singhala typewriter in late 60s. The gayanna was where you 
get period on QWERTY. It is entirely different from the layout of the 
English one with dead keys for parts of letters. This is what Unicode 
Sinhala inherited. It is many fold easier if Singhala follows closely 
with English layout. I made one for Unicode. The best I could get still 
needed three-finger keys. Besides, even after you enter ZWJ and do not 
get the desired conjoints because the font does not have them.

Typing and encoding are two different matters.
If present Sinhalese fonts don’t do the job, you can improve them.

You can develop a hundred keyboard layouts and input methods to type the 
same text in a hundred different ways.


Aren’t there any keyboards with the Sinhalese letters drawn on the keytops?

If there aren’t and you think the present Sinhalese keyboard layout 
doesn’t fit the QWERTY layout well enough, feel free to design a new 
layout and distribute drivers for the main operating systems.






 It's a colossal failure!
 Really?

 Of course, I don't have to repeat. You have read what I said.
I have.



 People Anglicize rather than use Unicode Sinhala.
 What do you mean? If they transliterate, that’s not really 
anglicization.


 You get a glimpse of the light. Anglicizing is trying to use English 
writing conventions to write Singhala. Anglicizing is not a complete 
mapping, transliteration is. Singhala has 58 phonemes including 10 
digraphs used for Sanskrit and Pali (aspirates). The English alphabet is 
not enough even for English. It has discarded þorn, eð, æsc, etc., so it 
has digraphs. Then, because of the capitalizing convention, its set of 
distinct letters is even smaller.




 To be fair, the Lankan technocrats did not have a clue when they 
were asked to approve the standard.
 I know that problem. The same occurred for French with Latin-1. 
That’s why some French letters are missing in Latin-1.


 Tell me about it.
Latin-1 (ISO-8859-1) lacks the French letter Œ/œ and the capital Ÿ.
Œ is used in a number of common words such as cœur (heart), œil (eye), 
œsophage (oesophagus), Œdipe (Oedipus), œuf (egg), etc.
Ÿ is used in a few toponyms such as L’Haÿ-les-Roses, a commune near 
Paris, which can be capitalized as L’HAŸ-LES-ROSES.

It also lacks the apostrophe ’.

Those characters were added in Latin-9 (ISO-8859-15)
 Œ = 0xBC, œ = 0xBD, Ÿ = 0xBE
and in CP1252
 Œ = 0x8C, œ = 0x9C, Ÿ = 0x9F, ’ = 0x92

Of course, AFAICT, they were part of the first release of Unicode.
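Those mappings are easy to check with Python's built-in codecs (a minimal
sketch; the byte values are the ones quoted above):

    # Where Œ, œ, Ÿ and ’ land in Latin-1, Latin-9 and CP1252.
    for ch in "ŒœŸ’":
        for codec in ("iso-8859-1", "iso-8859-15", "cp1252"):
            try:
                print(ch, codec, hex(ch.encode(codec)[0]))
            except UnicodeEncodeError:
                print(ch, codec, "not representable")

Latin-1 rejects all four, Latin-9 takes the first three (0xBC, 0xBD, 0xBE),
and CP1252 takes all four (0x8C, 0x9C, 0x9F, 0x92).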

 It is first come, first serve.
It is.

 Isn't language, and therefore, the writing a (if not the) major part 
of a culture?

You’re right.



 It was a time when there was (perhaps even now) a typist in the 
corner of the office of the bureaucrat. The big guys do not know 
touch-typing even now. Proof: a university professor wrote me a harangue 
using cyber-sex orthography (no capitals) accusing me of working for 
Americans. I had suggested that Unicode is a conspiracy to confuse us. 
(That is a bit over the top, there is no such motive; nevertheless the 
effect is the same.)




 Romanized Singhala uses the same. So, what's the fuss 
about? The font?


 The fact that your encoding won’t be supported on many computers 
worldwide.


 Jean, for the umpteenth time, I am not encoding anything. It is a 
transliteration. It is using a different script (Latin) than what you 
use traditionally (Singhala):

 Not සිංහල අකුරු, but 'síhala akuru'.
OK. Do you display Sinhala with Latin letters?
If you do, that’s not a problem.
If you display it with Sinhalese letters, you’ll need to change the font 
whenever you want to write in another script.
Just imagine a Sinhala/English dictionary. How many font changes would 
you make for such a book?

That’s a big step backwards.

 අ - a
 එ - e
 ක් - k
 අං - á
 ඤ් - ç
 ශ් - z
 etc. . .
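For what it's worth, here is a minimal Python sketch of that direction of
mapping, using only the six pairs listed above (the real scheme is said to
cover all 58 phonemes, so this is just an illustration, not the actual table):

    # Hypothetical subset of the Romanized Singhala mapping: only the six
    # pairs quoted in the message above.
    PAIRS = {
        "අ": "a",
        "එ": "e",
        "ක්": "k",   # ka + al-lakuna (virama)
        "අං": "á",   # a + anusvaraya
        "ඤ්": "ç",
        "ශ්": "z",
    }

    def romanize(text):
        """Greedy longest-match transliteration over the known pairs."""
        out, i = [], 0
        keys = sorted(PAIRS, key=len, reverse=True)
        while i < len(text):
            for k in keys:
                if text.startswith(k, i):
                    out.append(PAIRS[k])
                    i += len(k)
                    break
            else:
                out.append(text[i])  # pass anything unknown through unchanged
                i += 1
        return "".join(out)

    print(romanize("අඤ්"))  # -> aç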

 About the font that unnerves you:
 Think of 3D cinema. If you wear the 3D glasses, you see clearly. The 
font is for the user's benefit. The web masters can give the option I 
gave on my site to keep happy those who dislike (warning: I must select 
mild adjectives to honor sensitivities of some) 

Re: Romanized Singhala - Think about it again

2012-07-17 Thread Philippe Verdy
Let's stop this nightmare. The solution that uses a font hack that
overrides the semantics of Latin letters will never work as it should.
The separation of code points is necessary, even if this is just to
show a URL containing Sinhalese letters in the domain name part (and
without altering the semantics of the dot, slash and colon separators).
It would be unacceptable to have the "http://" prefix isolated with a
separate font from the rest of the URL just so it can be read correctly.
Unacceptable also because it would alter the internals of
international standards that are widely used. Unacceptable because
Sinhalese domain names would remain separated from the proposed
romanizations.

That user really has a complete misunderstanding of the standard and
severely lacks basic knowledge of the concepts. He should first read the
definitions to see that what is in the standard is definitely not what
he supposes by just looking at a simple basic chart (which is mostly
informative and has very little use for technical implementations).

Reading the standard up to Chapter 3 (conformance requirements)
is absolutely necessary for him. He won't make any progress in
understanding his own problems before reading it, while constantly
criticizing what he has never read for not understanding it...

He should also read the introduction of the OpenType specifications,
which also use their own definitions (something he is mixing up as
well). He must absolutely first understand the character model and the
separation between what is Unicode, what is an abstract character, a
glyph, an encoding form, and the binary serialisation of an encoding
into a stream of bytes, plus other concepts used by common protocols
and languages, such as transport syntaxes and alternate representations
using things like character entities (in SGML, XML, HTML), numerical
escapes (e.g. in C/C++, PHP, Java, Ruby...) or string expressions using
built-in/standard functions in Basic...



Re: Charset declaration in HTML (was: Romanized Singhala - Think about it again)

2012-07-15 Thread Naena Guru
On Tue, Jul 10, 2012 at 11:58 PM, Leif Halvard Silli 
xn--mlform-...@xn--mlform-iua.no wrote:

 Naena Guru, Tue, 10 Jul 2012 01:40:19 -0500:

  HTML5 assumes UTF-8 as the character set if you do not declare one
  explicitly. My current pages are in HTML 4.

 There is in principle no difference between what HTML5-parsers assume
 and what HTML4-parsers assume: All of them default to the default
 encoding for the locale.

I see. That is, for the transliteration, the locale should be Sinhala
(Latin). Yes. I know that it is not official. I loathe the spelling
Sinhala. Oh, well, you cannot have it all.


  Notepad forced
  me to save the file in UTF-8 format. I ran it through W3C Validator. It
  passed HTML5 test with the following warning:
 
  [image: Warning] Byte-Order Mark found in UTF-8 File.

 I assume that you used the validator at


 http://validator.w3.org.

Yes, and it validated it. I was talking about the BOM in a different context.
It showed up when I opened, in HTML-Kit, the file that was first created in
Notepad and saved as UTF-8. HTML-Kit Tools asked me to specify the
character set. It took it, but messed up the macron and dot letters anyway.
What I was trying to emphasize was the fact that it is hard for those
people who try to make web pages in those 'character sets'. I have been
making web pages since the 1990s and never had these problems because they were
written by hand in English.



 But if you instead use the most updated HTML5-compatible validators at

 http://www.validator.nu
 or  http://validator.w3.org/nu/



 then you will not get any warning just because your file uses the
 Byte-Order Mark. HTML5 explicitly allows you to use the BOM.

Thanks. This one too validated all seven pages as HTML5 (I upgraded from HTML
4).




  The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to
 cause
  problems for some text editors and older browsers. You may want to
 consider
  avoiding its use until it is better supported.

 Weasel words from the validator. The notion about older browsers is
 not very relevant. How old are they? IE6 has no problems with the BOM,
 for instance. And that is probably one of the few, somewhat relevant,
 old browsers.

As I said before, the BOM was no problem for me.


 As for editors: If your own editor has no problems with the BOM, then
 what? But I think Notepad can also save as UTF-8 without the BOM -
 it should be possible to get an option for choosing when you save
 it. Otherwise you can use the free Notepad++. And many others. In VIM, you
 set or unset the BOM via the commands

 set bomb
 set nobomb

Yes, yes. I've seen it before. I have Notepad++. It intimidated me the
first time and I never used it, haha!

 --
 Leif H Silli



Re: Ewellic again (was: Re: Romanized Singhala - Think about it again)

2012-07-15 Thread Naena Guru
My error. Sorry, Doug.

On Sun, Jul 8, 2012 at 8:00 PM, Doug Ewell d...@ewellic.org wrote:

 The Unicode character database goes from zero to some very big number. There
 are no holes in it to define character sets for somebody's fancy. Well,
 Doug Ewell did one for Esperanto expanding fuþorc.


 Ewellic is not futhorc. They are different scripts.

 From the Omniglot page on Ewellic (with *emphasis* added):
 The shape of Ewellic letters was *inspired by* the Runic and Cirth
 scripts, but shows greater (though still imperfect) regularity of form.

 --
 Doug Ewell | Thornton, Colorado, USA
 http://www.ewellic.org | @DougEwell ­





Re: Charset declaration in HTML (was: Romanized Singhala - Think about it again)

2012-07-15 Thread Naena Guru
Hey, Philippe,

Your input is much appreciated. So, in a nutshell, I don't have to worry.
One of these days I need to crunch down (minify) the CSS and JavaScript
pages. I left them readily readable so that techs like you could easily
read them in place in any browser without having to pretty print. The pages
are not big by any standard and they download pretty fast. Your earlier
point about WOFF is what I am going to try and tackle today (Sunday).

In the meanwhile, thanks again.

On Tue, Jul 10, 2012 at 11:32 PM, Philippe Verdy verd...@wanadoo.fr wrote:

 2012/7/10 Naena Guru naenag...@gmail.com

 I wanted to see how hard it is to edit a page in Notepad. So I made a
 copy of my LIYANNA page and replaced the character entities I used for
 Unicode Sinhala, accented Pali and Sanskrit with their raw letters. Notepad
 forced me to save the file in UTF-8 format. I ran it through W3C Validator.
 It passed HTML5 test with the following warning:

 [image: Warning] Byte-Order Mark found in UTF-8 File.

 The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to
 cause problems for some text editors and older browsers. You may want to
 consider avoiding its use until it is better supported.

 The BOM is the first character of the file. There are myriad hoops that
 non-Latin users go through to do things that we routinely do. This problem
 I saw right at the inception. I already know why romanizing is so good.
 Don't you?


 You should probably ignore this non-critical warning now; it is only for
 extremely strict compatibility with deprecated software that should have
 been updated long ago for obvious security and performance reasons.

 Those old browsers are being deprecated fast (due to the massive and fast
 spread of security attacks, and automatic security updates that close issues
 completely, instead of just relying on preventive virus detection based on
 code behavior or code patterns, which will never be complete and fast enough
 to react to these extremely frequent attacks).

 Older editors do not have the comfort that newer editors have. The memory
 usage of these newer editors is no longer a problem (notably for web
 developers, who have systems largely above what their average users have),
 and systems capable of running them have never been so cheap. In addition,
 memory and storage costs have dramatically decreased.

 We are more concerned about bandwidth usage, so your web editing
 platform should include an optimisation process and converters that will
 automatically use a compact representation (numeric character references,
 for example, can be sent by your server as raw UTF-8; in addition the server
 can now support on-the-fly data compression over the HTTP sessions; there
 also exist frontend proxies that will do that for you without requiring
 you to change the development/editing methods you use).

 Most text editors, even in Linux, can now successfully open UTF-8 files
 starting with a BOM without complaining, just like Notepad has done for a
 long time. And they allow you to change this edit mode before saving.

 Most text processors will silently discard the U+FEFF character (it should
 be safe to do that everywhere, given that U+FEFF should no longer be used
 for anything other than BOMs).

 [side note]
   But Notepad has had another problem for a long time: it cannot successfully
 open a text file whose lines are terminated by LF only; it absolutely wants
 them to be converted to CR+LF sequences. This problem is much more severe
 than the use of a leading BOM.
   As well, Excel cannot successfully decode a UTF-8 encoded CSV file.
 But it can automatically recognize it if you use the import data
 function instead. This is inconsistent (also, it still does not allow
 specifying how to convert numbers using dots instead of commas; when running
 it on a non-English user locale, you need to manually use a search/replace
 function; it does not allow selecting the date format for CSV file imports,
 so search/replace operations on date fields are not trivial; no
 question is asked to the user, it only uses implicit defaults even when
 they are wrong, most of the time for actual cases of CSV files).
 [/side note]

 But it has nothing to do with your problem of romanization or behavior
 with Latin. BOMs are only absent from old 8-bit character sets that are no
 longer recommended in any modern Internet protocols, and from 7-bit ASCII,
 used only for internal technical data but not for any text intended to be
 read and translated.

 Only UTF-8 support is mandatory now. And that's fine. HTTP headers or URLs
 require a specific encoding, but webservers and designing tools can take
 care of that.

 Everything else is optional and will require explicit metadata (the
 exceptions being UTF-16 and UTF-32, which are not well suited for
 interchanges across heterogeneous networks and independent realms, but are
 used mostly for internal processes, for which you absolutely don't need any
 byte order change, so for which 

Re: Charset declaration in HTML (was: Romanized Singhala - Think about it again)

2012-07-11 Thread Leif Halvard Silli
Philippe Verdy, Wed, 11 Jul 2012 07:36:56 +0200:
 2012/7/11 Leif Halvard Silli:
 In VIM, you set or unset the BOM via the commands
 
 set bomb
 set nobomb
 
 Should these commands specify whether your computer will explode when saving
 the file?
 
 :'o

Probably signals the weird fear that some have for 'da BOM'. 
 
 set bom
 set nobom
 
 Sorry, could not resist.

Those commands, without the b, are unknown in VIM. It would have been 
too simple without the b. ;-)
-- 
leif h silli



Re: Charset declaration in HTML (was: Romanized Singhala - Think about it again)

2012-07-11 Thread Doug Ewell

Leif Halvard Silli wrote:


As for editors: If your own editor has no problems with the BOM, then
what? But I think Notepad can also save as UTF-8 without the BOM -
it should be possible to get an option for choosing when you save
it.


Perhaps there should be such an option in Notepad, but there isn't. The 
decision to have Notepad always write the signature to UTF-8 files, and 
always rely on it to read them, has been documented to death.


The bottom line is, there are zillions of editors available for Windows, 
many of them free, and people who want to create or modify UTF-8 files 
which will be consumed by a process that is intolerant of the signature 
should not use Notepad. That goes for HTML (pre-5) pages, Unix shell 
scripts, and others.


--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell ­ 





Re: Charset declaration in HTML (was: Romanized Singhala - Think about it again)

2012-07-10 Thread Naena Guru
Thank you Otto.

Sorry for the delay in replying. I spent the entire Sunday replying to the
Jacques twins.

You are absolutely right about the choice between ISO-8859-1 and UTF-8. I
shouldn't have said 'using ISO-8859-1 is advantageous over UTF-8'. It is
efficient if your pages are written in a language that uses single-byte
code points. When you mix in multi-byte code points, like you said, the
ideal is to have them in their raw form. But in practice, this is not as
easy as we think.

Actually, the trade-off is not great for me because I use only a few
non-SBCS characters. Each 2-byte character would end up as six bytes in a
hex character entity. If you want to control the look of your web site, then
you probably have to have expensive software to do it. As for poor me, I use
CSS, JavaScript and HTML inside HTML-Kit.

HTML5 assumes UTF-8 as the character set if you do not declare one
explicitly. My current pages are in HTML 4.

As I said, I use HTML-Kit (and Tools). If I have raw Unicode Sinhala in the
HTML or JavaScript, it messes them up and gives you character-not-found for
them on the web page. I must have character entities if I need the comfort
of HTML-Kit. There are web sites that help you process your SBCS and
multi-byte mixed text to make character entities for non-Latin-1
characters. I used them when making my only page that has them (Liyanna).
Stop and think why there are such websites. (Search "text to unicode".) The
world outside Latin-1 is a harsh one.

If I want to have raw Unicode Sinhala, PTS Pali or IAST Sanskrit, I have to
use Notepad instead of HTML-Kit. It is hard to code without color-coded
text.

I wanted to see how hard it is to edit a page in Notepad. So I made a copy
of my LIYANNA page and replaced the character entities I used for Unicode
Sinhala, accented Pali and Sanskrit with their raw letters. Notepad forced
me to save the file in UTF-8 format. I ran it through W3C Validator. It
passed HTML5 test with the following warning:

[image: Warning] Byte-Order Mark found in UTF-8 File.

The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to cause
problems for some text editors and older browsers. You may want to consider
avoiding its use until it is better supported.

The BOM is the first character of the file. There are myriad hoops that
non-Latin users go through to do things that we routinely do. This problem
I saw right at the inception. I already know why romanizing is so good.
Don't you?

UTF-8 encoding is this RFC:
http://www.ietf.org/rfc/rfc2279.txt

This is the table it gives on the way UTF-8 encoding works:

 0000 0000 - 0000 007F   0xxxxxxx                                       ASCII
 0000 0080 - 0000 07FF   110xxxxx 10xxxxxx                              === Latin-1 plus higher
 0000 0800 - 0000 FFFF   1110xxxx 10xxxxxx 10xxxxxx                     == Unicode Sinhala
 0001 0000 - 001F FFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
 0020 0000 - 03FF FFFF   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
 0400 0000 - 7FFF FFFF   1111110x 10xxxxxx ... 10xxxxxx

Observe that Latin 'a' transforms from UCS-2 to two coded bytes with UTF-8
and Unicode Sinhala ayanna goes from two to three.
Unicode Sinhala: 0D80 - 0DFF

a = Hex 61 = Bin 0110 0001 ->
UTF-8 Template: 110xxxxx 10xxxxxx
UTF-8 Encoding: 11000001 10100001 = Hex C1 A1

ayanna = Hex 0D85 = Bin 0000 1101 1000 0101 ->
UTF-8 Template: 1110xxxx 10xxxxxx 10xxxxxx
UTF-8 Encoding: 11100000 10110110 10000101 = Hex E0 B6 85
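For what it's worth, Python's built-in UTF-8 encoder can be used to check
those byte sequences (a minimal sketch; note that a conforming encoder keeps
ASCII as a single byte, so it never produces the two-byte C1 A1 form, which
is an overlong encoding):

    # Check the worked examples above with Python's UTF-8 encoder.
    for ch in ("a", "\u0d85"):  # LATIN SMALL LETTER A, SINHALA LETTER AYANNA
        print("U+%04X -> %s" % (ord(ch), ch.encode("utf-8").hex(" ").upper()))

    # Prints:
    #   U+0061 -> 61
    #   U+0D85 -> E0 B6 85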

Thanks for your input. It is appreciated.


On Wed, Jul 4, 2012 at 2:25 PM, Otto Stolz otto.st...@uni-konstanz.de wrote:

 Hello Naena Guru,

 on 2012-07-04, you wrote:

 The purpose of
 declaring the character set as iso-8859-1 than utf-8 is to avoid doubling
 and trebling the size of the page by utf-8. I think, if you have
 characters
 outside iso-8859-1 and declare the page as such, you get
 Character-not-found for those locations. (I may be wrong).


 You are wrong, indeed.

 If you declare your page as ISO-8859-1, every octet
 (aka byte) in your page will be understood as a Latin-1
 character; hence you cannot have any other character
 in your page. So, your notion of “characters outside
 iso-8859-1” is completely meaningless.

 If you declare your page as UTF-8, you can have
 any Unicode character (even PUA characters) in
 your page.

 Regardless of the charset declaration of your page,
 you can include both Numeric Character References
 and Character Entity References in your HTML source,
 cf., e.g., 
 cf., e.g., http://www.w3.org/TR/html401/charset.html#h-5.3 .
 These may refer to any Unicode character, whatsoever.
 However, they will take considerably more storage space
 (and transmission bandwidth) than the UTF-8 encoded
 characters would take.
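 That size difference is easy to measure (a minimal sketch; the Sinhala
 word below is just an arbitrary example):

     # Size of one word stored as raw UTF-8 versus as decimal numeric
     # character references (&#...;) in the HTML source.
     word = "සිංහල"

     utf8_size = len(word.encode("utf-8"))
     ncr_size = len("".join("&#%d;" % ord(c) for c in word))

     print(utf8_size, "bytes as raw UTF-8")                    # 15
     print(ncr_size, "bytes (ASCII) as character references")  # 35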

 Good luck,
   Otto Stolz





Re: Romanized Singhala - Think about it again

2012-07-10 Thread Richard Wordingham
On Mon, 09 Jul 2012 05:20:45 +0200
Jean-François Colson j...@colson.eu wrote:

 On 09/07/12 01:29, Naena Guru wrote:

  Number of letters in Singhala is only theoretical. In the case of 
  Singhala orthography, the actually used number depends on the
  Sanskrit vocabulary.

 Do you mean there are many conjunct consonants, sometimes with a 
 separate glyph?
 In Devanagari, they’re made by typing two or more consonants
 separated by halants. Isn’t that possible with Sinhala?

No, SLS 1134 (2004) keeps it simple by making these viramas visible,
i.e. real halants, making the associated consonants the last in
the akshara. For the ordinary conjuncts, including the repha, it prescribes
VIRAMA, ZWJ. ZWJ, VIRAMA is used to make consonants touch.

SLS 1134 spares users some of the complexity by requiring the
commonest subscript and superscript consonants to be on the keyboard.
(This may well be useless for X, unless X has had its keyboard
mapping extended to allow the combinations as single keystrokes.)
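A minimal Python sketch of the two sequences described above (the two
consonants are arbitrary examples; whether a conjunct, a touching pair or a
plain halant form actually appears depends on the font and renderer):

    # SLS 1134 style sequences, using KAYANNA and VAYANNA as examples.
    KA, VA = "\u0d9a", "\u0dc0"
    VIRAMA = "\u0dca"   # SINHALA SIGN AL-LAKUNA
    ZWJ = "\u200d"      # ZERO WIDTH JOINER

    conjunct = KA + VIRAMA + ZWJ + VA   # VIRAMA, ZWJ -> ordinary conjunct
    touching = KA + ZWJ + VIRAMA + VA   # ZWJ, VIRAMA -> touching consonants
    explicit = KA + VIRAMA + VA         # bare virama -> visible halant

    for label, s in (("conjunct", conjunct), ("touching", touching),
                     ("explicit", explicit)):
        print(label, " ".join("U+%04X" % ord(c) for c in s))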

Richard.




Re: Charset declaration in HTML (was: Romanized Singhala - Think about it again)

2012-07-10 Thread Philippe Verdy
2012/7/10 Naena Guru naenag...@gmail.com

 I wanted to see how hard it is to edit a page in Notepad. So I made a copy
 of my LIYANNA page and replaced the character entities I used for Unicode
 Sinhala, accented Pali and Sanskrit with their raw letters. Notepad forced
 me to save the file in UTF-8 format. I ran it through W3C Validator. It
 passed HTML5 test with the following warning:

 [image: Warning] Byte-Order Mark found in UTF-8 File.

 The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to cause
 problems for some text editors and older browsers. You may want to consider
 avoiding its use until it is better supported.

 The BOM is the first character of the file. There are myriad hoops that
 non-Latin users go through to do things that we routinely do. This problem
 I saw right at the inception. I already know why romanizing is so good.
 Don't you?


You should probably ignore this non-critical warning now; it is only for
extremely strict compatibility with deprecated software that should have
been updated long ago for obvious security and performance reasons.

Those old browsers are being deprecated fast (due to the massive and fast
spread of security attacks, and automatic security updates that close issues
completely, instead of just relying on preventive virus detection based on
code behavior or code patterns, which will never be complete and fast enough
to react to these extremely frequent attacks).

Older editors do not have the comfort that newer editors have. The memory
usage of these newer editors is no longer a problem (notably for web
developers, who have systems largely above what their average users have),
and systems capable of running them have never been so cheap. In addition,
memory and storage costs have dramatically decreased.

We are more concerned about bandwidth usage, so your web editing
platform should include an optimisation process and converters that will
automatically use a compact representation (numeric character references,
for example, can be sent by your server as raw UTF-8; in addition the server
can now support on-the-fly data compression over the HTTP sessions; there
also exist frontend proxies that will do that for you without requiring
you to change the development/editing methods you use).

Most text editors, even in Linux, can now successfully open UTF-8 files
starting with a BOM without complaining, just like Notepad has done for a
long time. And they allow you to change this edit mode before saving.

Most text processors will silently discard the U+FEFF character (it should
be safe to do that everywhere, given that U+FEFF should no longer be used
for anything other than BOMs).
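In practice that stripping is often already built in; Python's "utf-8-sig"
codec, for instance, reads a leading BOM if present and drops it (a minimal
sketch; "page.html" is a placeholder path):

    # Read a UTF-8 file and silently drop a leading BOM, if any.
    with open("page.html", encoding="utf-8-sig") as f:
        text = f.read()

    assert not text.startswith("\ufeff")  # the signature, if any, is gone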

[side note]
  But Notepad has had another problem for a long time: it cannot successfully
open a text file whose lines are terminated by LF only; it absolutely wants
them to be converted to CR+LF sequences. This problem is much more severe
than the use of a leading BOM.
  As well, Excel cannot successfully decode a UTF-8 encoded CSV file. But
it can automatically recognize it if you use the import data
function instead. This is inconsistent (also, it still does not allow
specifying how to convert numbers using dots instead of commas; when running
it on a non-English user locale, you need to manually use a search/replace
function; it does not allow selecting the date format for CSV file imports,
so search/replace operations on date fields are not trivial; no
question is asked to the user, it only uses implicit defaults even when
they are wrong, most of the time for actual cases of CSV files).
[/side note]

But it has nothing to do with your problem of romanization or behavior with
Latin. BOMs are only absent from old 8-bit character sets that are no
longer recommended in any modern Internet protocols, and from 7-bit ASCII,
used only for internal technical data but not for any text intended to be
read and translated.

Only UTF-8 support is mandatory now. And that's fine. HTTP headers or URLs
require a specific encoding, but webservers and designing tools can take
care of that.

Everything else is optional and will require explicit metadata (the
exceptions being UTF-16 and UTF-32, which are not well suited for
interchanges across heterogeneous networks and independent realms, but are
used mostly for internal processes, for which you absolutely don't need any
byte order change, so for which you don't even need any BOM: if there's one,
you can safely discard it from the input strings, adjusting the length and
offset positions in the source if that source is randomly seekable; you
don't need to adjust these lengths and/or positions if the source is a
serial input stream which is not seekable in the backward direction or
randomly seekable in the forward direction in a fast, direct manner without
reading all intermediate positions).


Re: Charset declaration in HTML (was: Romanized Singhala - Think about it again)

2012-07-10 Thread Leif Halvard Silli
Naena Guru, Tue, 10 Jul 2012 01:40:19 -0500:

 HTML5 assumes UTF-8 as the character set if you do not declare one
 explicitly. My current pages are in HTML 4.

There is in principle no difference between what HTML5-parsers assume 
and what HTML4-parsers assume: All of them default to the default 
encoding for the locale.

 Notepad forced
 me to save the file in UTF-8 format. I ran it through W3C Validator. It
 passed HTML5 test with the following warning:
 
 [image: Warning] Byte-Order Mark found in UTF-8 File.

I assume that you used the validator at

http://validator.w3.org. 

But if you instead use the most updated HTML5-compatible validators at 

http://www.validator.nu 
or  http://validator.w3.org/nu/

then you will not get any warning just because your file uses the 
Byte-Order Mark. HTML5 explicitly allows you to use the BOM.

 The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to cause
 problems for some text editors and older browsers. You may want to consider
 avoiding its use until it is better supported.

Weasel words from the validator. The notion about older browsers is 
not very relevant. How old are they? IE6 has no problems with the BOM, 
for instance. And that is probably one of the few, somewhat relevant, 
old browsers. 

As for editors: If your own editor has no problems with the BOM, then 
what? But I think Notepad can also save as UTF-8 without the BOM - 
it should be possible to get an option for choosing when you save 
it. Otherwise you can use the free Notepad++. And many others. In VIM, you 
set or unset the BOM via the commands

set bomb
set nobomb
-- 
Leif H Silli



Re: Charset declaration in HTML (was: Romanized Singhala - Think about it again)

2012-07-10 Thread Philippe Verdy
2012/7/11 Leif Halvard Silli xn--mlform-...@xn--mlform-iua.no:
 it. Else you can use the free Notepad++. And many others. In VIM, you
 set or unset the BOM via the commands

 set bomb
 set nobomb

Should these commands specify whether your computer will explode when saving
the file?

:'o

set bom
set nobom

Sorry, could not resist.



Ewellic again (was: Re: Romanized Singhala - Think about it again)

2012-07-08 Thread Doug Ewell
The Unicode character database goes from zero to some very big number. 
There are no holes in it to define character sets for somebody's 
fancy. Well, Doug Ewell did one for Esperanto expanding fuþorc.


Ewellic is not futhorc. They are different scripts.


From the Omniglot page on Ewellic (with *emphasis* added):
The shape of Ewellic letters was *inspired by* the Runic and Cirth 
scripts, but shows greater (though still imperfect) regularity of form.


--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell ­




Re: Romanized Singhala - Think about it again

2012-07-08 Thread Philippe Verdy
2012/7/9 Naena Guru naenag...@gmail.com:
 Using Latin letters for a transliteration of Sinhala is not a hack, but
 making fonts said to be Latin-1 with Sinhalese letters instead of the Latin
 letters is a hack.

Your hack is a hack, simply because you've absolutely not understood
anything about what Unicode is. And you are always confusing concepts.
It's true that Unicode and ISO/IEC 10646 need to use a terminology
that may not be understood the way you mean it or use it. That's why
they include definitions of these terms. Don't interpret the
terminology in a way different from what is defined.

 Well, you can characterize the smartfont solution anyway you like. The
 problem for you is that it works!

No, it does not work, because you seem to assume that we can always
select the font. In most cases we cannot. Letters are encoded and
given unique code points, but the font used to render them is determined
externally by the renderer. In most cases users won't want to have to
guess which font to use, notably when these fonts are also not
available on their platform.

There's a huge body of text outside HTML and rich text formats for
documents. You absolutely want to ignore it. The UCS is there to allow
exactly the separation between the presentation (fonts, for example)
and the semantics of the encoded texts.

The UCS is also designed to avoid dependencies between languages.
Only the scripts are encoded (see the description of what is defined as
abstract characters).

An encoding is not just a collection of bits in fixed-width numbers.
Otherwise we would only see numbers on screen. The code points in the
UCS are given semantics via character properties.

- The representative glyph seen in the charts is only a very tiny part
of these properties, and in fact the least used of all of them. They
are only useful for producing visual charts.
- What is more important is how each distinctive code behaves
within various mappings to support various algorithms, including the
possibility to switch fonts transparently without breaking the text
completely (for example, displaying a Greek theta when a Latin Z with
acute was encoded, or even a Latin X when a Latin R was encoded). The
encoding is what allows words and orthographies to be recognized,
still independently of the font styles and other optional typographic
effects (because all scripts are made of an almost infinite number of
possible styles, which users will still read as part of the script
while still also recognizing the orthography used and the language).
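That separation is easy to see with Python's standard unicodedata module (a
minimal sketch; the two letters are arbitrary examples): the name and
category travel with the code point, no matter which font later draws it.

    import unicodedata

    for ch in ("a", "\u0d85"):  # LATIN SMALL LETTER A, SINHALA LETTER AYANNA
        print("U+%04X" % ord(ch),
              unicodedata.name(ch),      # formal character name
              unicodedata.category(ch))  # general category (Ll / Lo: letters)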

Unicode and ISO/IEC 10646 do not encode glyphs directly in the UCS.
They do not encode orthographies, and they do not encode languages. What
is encoded is a set of correlated properties. One of these properties
is a numeric property named "code point", which is also independent of
the final binary encoding (it could be one of the standard UTFs or
even a legacy 8-bit encoding with a mapping to/from the UCS)!

 Sorry for this Kindergarten lesson, but you should understand the role of
 the font. A font is a support application at the User Interface level.

Yes. But Unicode does not really care about which font you will use,
provided that fonts map glyphs coherently, in such a way that Sinhalese
letters will not be rendered instead of the intended Latin letters
EVEN if a Sinhalese font has been selected.

 When text moves between
 applications and between computers, they travel as numeric codes
 representing the text in the form of digital bytes. The computer can't say
 French from Singhala.

Not relevant to our discussions on this Unicode mailing list. We
don't care about that and SHOULD not even care about it. Unicode supports
a wide range of possible binary encodings. However, they don't change
the code point assignments, which are the central point from which all
other properties are mapped in all applications, including for
rendering (but not limited to it).

 Oh, thank you for the generosity of allowing me use of the entire Latin
 repertoire. You don't have to tell that to me.

We need to tell it to you again because you absolutely want to
restrict the repertoire to an 8-bit subset, while you ALSO
contradictorily say that you want to support thousands of aksharas.

Unicode supports millions of characters and tens of millions of glyphs
(possibly more) using a 21-bit encoding space (actually less than 20 bits
if we leave aside the PUA, which is also supported separately but with
an extremely free encoding with almost no standard properties). This
space is still representable with various encodings (some are part of
the Unicode and ISO/IEC 10646 standards, some are supported in the
references, and there are tons of others, including many legacy 7-bit or
8-bit SBCS encodings, from ISO or from proprietary platforms, or from
national standards not part of ISO, e.g. those developed in China PR
such as GB18030, or in India such as ISCII, plus many older standards
that have since been deprecated and are no longer recommended).

But ISO 

Re: Romanized Singhala - Think about it again

2012-07-07 Thread Naena Guru
Thank you Goliath.

On another subject, I think the script you dreamed of as a boy is very
nearly fuþorc. Fuþorc is the (Old) English alphabet.

Thank you.

On Wed, Jul 4, 2012 at 1:54 PM, Doug Ewell d...@ewellic.org wrote:

 [removing cc list]


 Naena Guru wrote:

  On this 4th of July, let me quote James Madison:


 [quote from Madison irrelevant to character encoding principles snipped]


  I gave much thought to why many here at the Unicode mailing list reacted
 badly to my saying that Unicode solution for Singhala is bad.


 Unicode encodes Latin characters in their own block, and Sinhala
 characters in their own block. Many of us disagree with a solution to
 encode Sinhala characters as though they were merely Latin characters with
 different shapes, and agree with the Unicode solution to encode them as
 separate characters. This is a technical matter.

 I see the problem. This is what confused Philippe too. This is primarily a
transliteration. Transliterations go from one script to another. Not one
Unicode code block (I said code page earlier out of old habit) to another.
So, let's take the font issue out for the time being and concentrate on the
transliteration.

A transliteration scheme is a solution for a problem and has a technology
platform it is made for. The older (predecessor of IAST) Sanskrit and PTS
Pali schemes were solutions made with letterpress printing in mind. They used
dots and bars for accents because they could be improvised easily in the
street-side printing presses. That was the 1800s. Suddenly, with computers,
accented letters became hard to get. HK Sanskrit made Sanskrit friendly for
the computer by limiting it to ASCII. Now, after electronic communication
became cleaner, we expanded the 7-bit set to the full-byte set. Now the
ISO-8859-1 set is available everywhere.




  Earlier I said the Plain Text idea is bad too.


 And many of us disagree with that rather vehemently as well, for many
 reasons.


  The responses came as attacks on *my* solution than in defense of Unicode
 Singhala.


 It's not personal unless you wish to make it personal. You came onto the
 Unicode mailing list, a place unsurprisingly filled with people who believe
 the Unicode model is a superior if not perfect character encoding model,
 and claimed that encoding Sinhala as if it were Latin (and requiring a
 special font to see the Sinhala glyphs) is a better model. Are you really
 surprised that some people here disagree with you? If you write to a Linux
 mailing list that Linux is terrible and Microsoft Windows is wonderful, you
 will see pushback there too.

 Here is a defense of Unicode Sinhala: it allows you, me, or anyone else to
 create, read, search, and sort plain text in Sinhala, optionally with any
 other script or combination of scripts in the same text, using any of a
 fairly wide variety of fonts, rendering engines, and applications.


  The purpose of designating naenaguru@‌‌gmail.com as a spammer is to
 prevent criticism.


 The list administrator, Sarasvati, can speak to this issue. Every mailing
 list, every single one, has rules concerning the conduct of posters. I note
 that your post made it to the list, though, so I'm not sure what you're on
 about.


  It is shameful that a standards organization belonging to corporations of
 repute resorts to censorship like bureaucrats and academics of little Lanka.


 Do not attempt to represent this as a David and Goliath battle between the
 big bad Unicode Consortium and poor little Sri Lanka or its citizens. This
 is a technical matter.


  I ask you to reconsider:
 As a way of explaining Romanized Singhala, I made some improvements to
 www.LovataSinhala.com. Mainly, it now has near the top of each page a
 link that says, ’switch the script’. That switches the base font of the
 body tag of the page between the Latin and Singhala typefaces. Please read
 the smaller page that pops up.


 The fundamental model is still one of representing Sinhala text using
 Latin characters, and relying on a font switch. It is still completely
 antithetical to the Unicode model.


  I also verified that I hadn’t left any Unicode characters outside
 ISO-8859-1 in the source code -- HTML, JavaScript or CSS. The purpose of
 declaring the character set as iso-8859-1 than utf-8 is to avoid doubling
 and trebling the size of the page by utf-8. I think, if you have characters
 outside iso-8859-1 and declare the page as such, you get
 Character-not-found for those locations. (I may be wrong).


 You didn't read what Philippe wrote. Representing Sinhala characters in
 UTF-8 takes *fewer* bytes, typically less than half, compared to using
 numeric character references like &#3523;&#3538;&#3458;&#3524;&#3517;
 &#3517;&#3538;&#3520;&#3539;&#3512;&#3495; &#3465;&#3524;&#3517;.


  Philippe Verdy, obviously has spent a lot of time researching the web
 site and even went as far as to check the faults of the web service
 provider, Godaddy.com. He called my font a hack font without any proof of
 it.


 A font 

Re: Romanized Singhala - Think about it again

2012-07-07 Thread Naena Guru
On Thu, Jul 5, 2012 at 6:51 AM, Philippe Verdy verd...@wanadoo.fr wrote:

 2012/7/5 Naena Guru naenag...@gmail.com:
 
 
  On Wed, Jul 4, 2012 at 11:33 PM, Philippe Verdy verd...@wanadoo.fr
 wrote:
 
  Anyway, consider the solutions already proposed in Sinhalese
  Wikipedia. There are various solutions proposed, including several
  input methods supported there. But the purpose of these solutions is
  always to generate  Sinhalese texts perfectly encoded with Unicode and
  nothing else.
 
  Thank you for the kind suggestion. The problem is Unicode Sinhala does
 not
  perfectly support Singhala! The solution is for Sinhala not for Unicode!
 I
  am not saying Unicode has a bad intention but an ill-conceived product.
 The
  fault is with Lankan technocrats that took the proposal as it was given
 and
  ever since prevented public participation. My solution is 'perfectly
 encoded
  with Unicode'.
 
 
  Yes, there may remain some issues with older OSes that have limited
  support for standard OpenType layout tables. But there's now no
  problem at all since Windows XP SP2. Windows 7 has the full support,
  and for those users that have still not upgraded from Windows XP,
  Windows 8 will be ready in next August with an upgrade cost of about
  US$ 40 in US (valid offer currently advertized for all users upgrading
  from XP or later), and certainly even less for users in India and Sri
  Lanka.
 
  The above are not any of my complaints.
  Per Capita Income in Sri Lanka $2400. They are content with cell phones.
 The
  practical place for computers is the Internet Cafe. Linux is what the
 vast
  majority  needs.
 
 
  And standard Unicode fonts with free licences are already available
  for all systems (not just Linux for which they were initially
  developed);
 
  Yes, only 4 rickety ones. Who is going to buy them anyway? Still Iskoola
  Pota made by Microsoft by copying a printed font is the best. You check
 the
  Plain Text by mixing Singhala and Latin in the Arial Unicode MS font to
 see
  how pretty Plain text looks. They spent $2 or 20 million for someone to
 come
  and teach them how to make fonts. (Search ICTA.lk). Staying friendly with
  them is profitable. World bank backs you up too.
  Sometime in 1990s when I was in Lanka, I tried to select a PC for my
 printer
  brother. We wanted to buy Adobe, Quark Express etc. The store keeper
 gave a
  list and asked us to select the programs. Knowing that they are
 expensive, I
  asked him first to tell me how much they cost. He said that he will
 install
  anything we wanted for free! The same trip coming back, in Zurich, the
 guys
  tried to give me an illicit copy of Windows OS in appreciation for
  installing German and Italian (or French?) code pages on their computers.
 
  there even exists solutions for older versions of iPhone
  4. OR on Android smartphones and tablets.
 
  Mine works in them with no special solution. It works anywhere that
 supports
  Open Type -- no platform discrimination
 
 
  No one wants to get back to the situation that existed in the 1980's
  when there was a proliferation of non-interoperable 8 bit encodings
  for each specific platform.
 
  I agree. Today, 14 languages, including English, French, German and
 Italian
  all share the same character space called ISO-8859-1. Romanized Singhala
  uses the same. So, what's the fuss about? The font? Consider that as the
 oft
  suggested IME. Haha!
 
 
  And your solution also does not work in multilingual contexts;
 
  If mine does not work in some multilingual context, none of the 14
 languages
  I mentioned above including English and French don't either.
 
  it does
  not work with many protocols or i18n libraries for applications.
 
  i18n is for multi-byte characters. Mine are single-byte characters. As
 you
  see, the safest place is SBCS.
 
  Or it
  requires specific constraints on web pages requiring complex styling
  everywhere to switch fonts.
 
  Did you see http://www.lovatasinhala.com? May be you are confusing
 Unicode
  Sinhala and romanized Singhala. Unicode Sinhala has a myriad such
 problems.
  That is why it should be abandoned! Please look at the web site and say
 it
  more coherently, if I misunderstood you.

 You are once again confusing the Sinhalese language with the Sinhalese
 script. Maybe Latin-1 is a good and sufficient script for
 transcribing the language. But Unicode is not made for standardizing
 transliterations. The script is what is being encoded, the way it is,
 even if this script is defective in some aspects for the language. As
 long as your transliteration scheme using Latin letter encodings is
 showing Latin letters, it will be fine.

You are very kind. So now I have fulfilled your order by providing a link
on the right side of the page to get rid of the Singhala font.


 But a font that represents Latin letters using Sinhalese glyphs is
 definitely broken. It will not work within multilingual contexts
 except when using many font switches in 

Influence of Futhorc on Ewellic (was: Re: Romanized Singhala - Think about it again)

2012-07-07 Thread Doug Ewell

Naena Guru wrote:

I think the script you dreamed of as a boy is very nearly fuþorc. 
Fuþorc is the (Old) English alphabet.



From the Omniglot page on Ewellic:
The shape of Ewellic letters was inspired by the Runic and Cirth 
scripts, but shows greater (though still imperfect) regularity of form.


--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell ­




Re: Romanized Singhala - Think about it again

2012-07-05 Thread Naena Guru
 of the entire Latin
repertoire. You don't have to tell that to me. I have traveled quite a bit
in the IT world. Don't be surprised if it is more than what you've seen.
(Did you forget that earlier you accused me of using characters outside
ISO-8859-1 while claiming I am within it? That is because you saw IAST and
PTS displayed. They use those wonderful letters, symbols and diacritics you
are trying to tout. Is there a problem with Asians using ISO-8859-1 code
space even for transliteration?)


 The bonus will be that you can still write the Sinhalese
 language with a romanisation like yours,

Bonus?

but there's no need to
 reinvent the Sinhalese script

The Singhala script existed many, many years before the English and
French adopted Latin. What I did was save it from the massacre going on
with Unicode Sinhala.

itself that your encoding is not even
 capable of completely supporting in all its aspects (your system only
 supports a reduced subset of the script).

What is the basis for this nonsense? (Little birds whispering in the
background. Watch out. They are laughing.)
My solution supports the entire script, Singhala, Pali and Sanskrit, plus
two rare allophones of Sanskrit as well. Tell me what it lacks and I will
add it, haha! One time you said I assigned Unicode Sinhala characters to
the 'hack' font. What I do is assign Latin characters to Singhala
phonemes. That is called transliteration. There are no 'contextual
versions' of the same Singhala letters like you said earlier.

Ask your friends what they have more than mine in the Singhala script. Ask
them why they included only two ligatures when there are 15 such. Ask them
how many Singhala letters there are.


 Even the legacy ISCII system (used in India) is better, because it is
 supported by a published open standard, for which there's a clear and
 stable conversion from/to Unicode.

My solution is supported by two standards: ISO-8859-1 and Open Type.
ISO-8859-1 is Basic Latin plus Latin-1 Extension part of Unicode standard.

Bottom line is this: If Latin-1 is good enough for English and French, it
is good enough for Singhala too. And if Open Type is good for English and
French, it is good for Singhala too.


 2012/7/5 Naena Guru naenag...@gmail.com:
  Philippe,
 
  My last message was partial. It went out by mistake. I'll try again. It
  takes very long for this old man.
 
 
  -- Forwarded message --
  From: Naena Guru naenag...@gmail.com
  Date: Wed, Jul 4, 2012 at 10:32 PM
  Subject: Re: Romanized Singhala - Think about it again
  To: verd...@wanadoo.fr
 
 
  Hi, Philippe. Thanks for keeping engaged in the discussion. Too little
 time
  spent could lead to misunderstanding.
 
 
  On Wed, Jul 4, 2012 at 3:42 PM, Philippe Verdy verd...@wanadoo.fr
 wrote:
 
  2012/7/4 Naena Guru naenag...@gmail.com:
   Philippe Verdy, obviously has spent a lot of time
 
  Not a lot of time... Sorry.
 
   researching the web site
   and even went as far as to check the faults of the web service
 provider,
   Godaddy.com.
 
  I did not even note that your hosting provider was that company. I
  just looked at the HTTP headers to look at the MIME type and charset
  declarations. Nothing else.
 
  I know that the browser tells it. It is not a big deal, WOFF is the
  compressed TTF, but TTF gets delivered. If and when GoDaddy fixes their
  problem, the pages get delivered faster. Or I can make that fix in a
  .htaccess file. No time!
 
 
   He called my font a hack font without any proof of it.
 
  It is really a hack. Your font assigns Sinhalese characters to Latin
  letters (or some punctuations) of ISO 8859-1.
 
  My font does not have anything to do with Singhalese characters if you
 mean
  Unicode characters. You are very confusing.
  A Character in this context is a datatype. In the 80s it was one byte in
  size and used to signal not to use in arithmetic. (We still did it to
  convert between Capitals and Simple forms.) In the Unicode character
  database, a character is a numerical position. A Unicode Sinhala
 character
  is defined in Hex [0D80 - 0DFF]. Unicode Sinhala characters represent an
  incomplete hotchpotch of ideas of letters, ligatures and signs. I have
 none
  of that in the font.
 
  I say and know that Unicode Sinhala is a failure. It inhibits use of
  Singhala on the computer and the network. I do not concern me with
 fixing it
  because it cannot be fixed. Only thing I did in relation to it is to
 write
  an elaborate set of routines to *translate* (not map) between constructs
 of
  Unicode Sinhala characters and romanized Singhala. That is not in the
 font.
  The font has lookup tables.
 
  It also assigns
  contextual variants of the same abstract Sinhalese letters, to ISO
  8859-1 codes,
 
  What contexts cause what variants? Looks like you are saying Singhala
  letters cha
 
  plus glyphs for some ligatures of multiple Sinhalese
  letters to ISO 8859-1 codes, plus it reorders these glyphs so that
  they no longer match

Re: Romanized Singhala - Think about it again

2012-07-05 Thread Philippe Verdy
2012/7/5 Naena Guru naenag...@gmail.com:
 The above are not any of my complaints.
 Per Capita Income in Sri Lanka $2400. They are content with cell phones. The
 practical place for computers is the Internet Cafe. Linux is what the vast
 majority  needs.

And Linux fully supports the standard Unicode encoding of the
Sinhalese script. Maybe there are still some missing letters to
encode, but then it's not too late to encode them. Propose them,
formalize them. Help disambiguate the various cases.

But ask yourself why the Sinhalese Wikipedia works and is usable in Linux
too... There already exist free OpenType fonts for Sinhalese that use the
standard Unicode/ISO/IEC 10646 assignments.

Did you say you can't read the Sinhalese Wikipedia on your Linux machines?



Re: Romanized Singhala - Think about it again

2012-07-05 Thread Jean-François Colson

Le 05/07/12 10:02, Naena Guru a écrit :



On Wed, Jul 4, 2012 at 11:33 PM, Philippe Verdy verd...@wanadoo.fr wrote:


Anyway, consider the solutions already proposed in Sinhalese
Wikipedia. There are various solutions proposed, including several
input methods supported there. But the purpose of these solutions is
always to generate  Sinhalese texts perfectly encoded with Unicode and
nothing else.

Thank you for the kind suggestion. The problem is Unicode Sinhala does 
not perfectly support Singhala!

What's wrong? Are there missing letters?


*The solution is for Sinhala not for Unicode!*

Or rather for Sinhala by Unicode.

I am not saying Unicode has a bad intention but an ill-conceived 
product.

What precisely is ill-conceived?

The fault is with the Lankan technocrats who took the proposal as it was 
given and have ever since prevented public participation. My solution is 
'perfectly encoded with Unicode'.

No. It's an 8-bit character set independent from Unicode.



Yes, there may remain some issues with older OSes that have limited
support for standard OpenType layout tables. But there's now no
problem at all since Windows XP SP2. Windows 7 has the full support,
and for those users that have still not upgraded from Windows XP,
Windows 8 will be ready next August with an upgrade cost of about
US$ 40 in the US (valid offer currently advertised for all users upgrading
from XP or later), and certainly even less for users in India and Sri
Lanka.

The above are not any of my complaints.
Per capita income in Sri Lanka is $2400. They are content with cell 
phones. The practical place for computers is the Internet Cafe. Linux 
is what the vast majority needs.



And standard Unicode fonts with free licences are already available
for all systems (not just Linux for which they were initially
developed); 


Yes, only 4 rickety ones. Who is going to buy them anyway?

Why would you buy them if they're free?

 Still, Iskoola Pota, made by Microsoft by copying a printed font, is the 
best. Check the plain text by mixing Singhala and Latin in the Arial 
Unicode MS font to see how pretty plain text looks. They spent $2 or 20 
million for someone to come and teach them how to make fonts (search 
ICTA.lk). Staying friendly with them is profitable. The World Bank backs 
you up too.
Sometime in 1990s when I was in Lanka, I tried to select a PC for my 
printer brother. We wanted to buy Adobe, Quark Express etc. The store 
keeper gave a list and asked us to select the programs. Knowing that 
they are expensive, I asked him first to tell me how much they cost. 
He said that he would install anything we wanted for free! On the same 
trip coming back, in Zurich, the guys tried to give me an illicit copy 
of Windows OS in appreciation for installing German and Italian (or 
French?) code pages on their computers.


there even exist solutions for older versions of iPhone
4, or on Android smartphones and tablets.

Mine works in them with no special solution. It works anywhere that 
supports Open Type -- no platform discrimination.

Is there any platform discrimination with Unicode Sinhala?



No one wants to get back to the situation that existed in the 1980s
when there was a proliferation of non-interoperable 8-bit encodings
for each specific platform.

I agree. Today, 14 languages, including English, French, German and 
Italian all share the same character space called ISO-8859-1.
In fact, ISO-8859-1 is not well suited for French (my native language): 
it lacks a few letters which were added to ISO-8859-15. However, I 
always use Unicode today, even for French-only texts.



Romanized Singhala uses the same. So, what's the fuss about? The font?
The problem is that only your transliteration scheme, with Latin 
letters, is supported by ISO-8859-1, not the Sinhalese letters themselves.



Consider that as the oft-suggested IME. Haha!


And your solution also does not work in multilingual contexts;

If mine does not work in some multilingual context, then neither do any of 
the 14 languages I mentioned above, including English and French.

They do because they use Latin letters, not Sinhalese letters.



it does
not work with many protocols or i18n libraries for applications. 


i18n is for multi-byte characters. Mine are single-byte characters.

OK. Do it as you want, but it won't be Unicode compliant.


As you see, the safest place is SBCS.

I don't see. Why is it safer?



Or it
requires specific constraints on web pages requiring complex styling
everywhere to switch fonts. 

Did you see http://www.lovatasinhala.com/? Maybe you are confusing Unicode 
Sinhala and romanized Singhala. Unicode Sinhala has a myriad of such 
problems.

Which problems?


That is why it should be abandoned!
Why wouldn't you try to solve the problems, whatever they could be, 
instead of proposing an entirely 

Re: Romanized Singhala - Think about it again

2012-07-05 Thread John D Burger
Naena Guru wrote:

 I know you do not care about a language of 15 million people, but it 
 matters to them.

These kinds of straw man arguments are rude and counter-productive.  Such a 
characterization is highly unlikely to be true for anyone on this list, and 
you've just ensured that few of them will pay any more attention to you.

- John Burger
  MITRE



Re: Romanized Singhala - Think about it again

2012-07-05 Thread Peter Zilahy Ingerman, PhD

Seems to me that Naena Guru is demonstrating the truth of two adages:

a) A fanatic is a person who redoubles his efforts when he loses sight 
of his goal; and


b) Every movement starts with a fanatic, but for the movement to 
succeed, the fanatic must be removed from the movement.


Peter Ingerman

On 2012-07-05 09:46, John D Burger wrote:

Naena Guru wrote:


I know you do not care about a language of 15 million people, but it matters 
to them.

These kinds of straw man arguments are rude and counter-productive.  Such a 
characterization is highly unlikely to be true for anyone on this list, and 
you've just ensured that few of them will pay any more attention to you.

- John Burger
   MITRE








Romanized Singhala - Think about it again

2012-07-04 Thread Naena Guru
Pardon me for including a CC list. These are people who expressed opinions
for and against.

On this 4th of July, let me quote James Madison:
A zeal for different opinions concerning religion, concerning government,
and many other points, as well of speculation as of practice; an attachment
to different leaders ambitiously contending for pre-eminence and power; or
to persons of other descriptions whose fortunes have been interesting to
the human passions, have, in turn, divided mankind into parties, inflamed
them with mutual animosity, and rendered them much more disposed to vex and
oppress each other than to co-operate for their common good.

I gave much thought to why many here at the Unicode mailing list reacted
badly to my saying that the Unicode solution for Singhala is bad. Earlier I
said the Plain Text idea is bad too. The responses came as attacks on *my*
solution rather than in defense of Unicode Singhala. The purpose of designating
naenaguru@gmail.com as a spammer is to prevent criticism. It is shameful
that a standards organization belonging to corporations of repute resorts
to censorship like bureaucrats and academics of little Lanka.
*I ask you to reconsider:*
As a way of explaining Romanized Singhala, I made some improvements to
www.LovataSinhala.com. Mainly, it now has
near the top of each page a link that says, 'switch the script'. That
switches the base font of the body tag of the page between the Latin and
Singhala typefaces. *Please read the smaller page that pops up.*

I also verified that I hadn’t left any Unicode characters outside
ISO-8859-1 in the source code -- HTML, JavaScript or CSS. The purpose of
declaring the character set as iso-8859-1 rather than utf-8 is to avoid doubling
and trebling the size of the page with utf-8. I think, if you have characters
outside iso-8859-1 and declare the page as such, you get
Character-not-found for those locations. (I may be wrong).

Philippe Verdy, obviously has spent a lot of time researching the web site
and even went as far as to check the faults of the web service provider,
Godaddy.com. He called my font a hack font without any proof of it. It has
only characters relevant to romanized Singhala within the SBCS. Most of the
work was in the PUA and Look-up Tables. I am reminded of Inspector Clouseau
that has many gadgets and in the end finds himself as the culprit.

I will still read and try those other things Philippe suggests, when I get
time. What is important for me is to improve on orthography rules and add
more Indic languages -- Devanagari and Tamil coming up.

As for those who do not want to think rationally and think Unicode is a
religion, I can only point to my dilemma:
http://lovatasinhala.com/assayaa.htm

Have a Happy Fourth of July!


Re: Romanized Singhala - Think about it again

2012-07-04 Thread Doug Ewell

[removing cc list]

Naena Guru wrote:


On this 4th of July, let me quote James Madison:


[quote from Madison irrelevant to character encoding principles snipped]

I gave much thought to why many here at the Unicode mailing list 
reacted badly to my saying that the Unicode solution for Singhala is bad.


Unicode encodes Latin characters in their own block, and Sinhala 
characters in their own block. Many of us disagree with a solution to 
encode Sinhala characters as though they were merely Latin characters 
with different shapes, and agree with the Unicode solution to encode 
them as separate characters. This is a technical matter.



Earlier I said the Plain Text idea is bad too.


And many of us disagree with that rather vehemently as well, for many 
reasons.


The responses came as attacks on *my* solution rather than in defense of 
Unicode Singhala.


It's not personal unless you wish to make it personal. You came onto the 
Unicode mailing list, a place unsurprisingly filled with people who 
believe the Unicode model is a superior if not perfect character 
encoding model, and claimed that encoding Sinhala as if it were Latin 
(and requiring a special font to see the Sinhala glyphs) is a better 
model. Are you really surprised that some people here disagree with you? 
If you write to a Linux mailing list that Linux is terrible and 
Microsoft Windows is wonderful, you will see pushback there too.


Here is a defense of Unicode Sinhala: it allows you, me, or anyone else 
to create, read, search, and sort plain text in Sinhala, optionally with 
any other script or combination of scripts in the same text, using any 
of a fairly wide variety of fonts, rendering engines, and applications.
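
A minimal Python sketch of that claim (standard library only; the sample
strings are made-up Sinhala letter sequences): searching and sorting
operate on the encoded characters themselves, with no particular font
installed.

    # Hypothetical sample words built from Sinhala block code points.
    words = ["\u0D85\u0DB8", "\u0D9A\u0DBB", "\u0D85\u0D9A"]
    print(sorted(words))                      # naive code-point order; a full
                                              # collation would use CLDR/ICU
    print(any("\u0D9A" in w for w in words))  # plain-text search finds the letter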


The purpose of designating naenaguru@gmail.com as a spammer is to 
prevent criticism.


The list administrator, Sarasvati, can speak to this issue. Every 
mailing list, every single one, has rules concerning the conduct of 
posters. I note that your post made it to the list, though, so I'm not 
sure what you're on about.


It is shameful that a standards organization belonging to corporations 
of repute resorts to censorship like bureaucrats and academics of 
little Lanka.


Do not attempt to represent this as a David and Goliath battle between 
the big bad Unicode Consortium and poor little Sri Lanka or its 
citizens. This is a technical matter.



I ask you to reconsider:
As a way of explaining Romanized Singhala, I made some improvements to 
www.LovataSinhala.com. Mainly, it now has near the top of each page a 
link that says, 'switch the script'. That switches the base font of 
the body tag of the page between the Latin and Singhala typefaces. 
Please read the smaller page that pops up.


The fundamental model is still one of representing Sinhala text using 
Latin characters, and relying on a font switch. It is still completely 
antithetical to the Unicode model.


I also verified that I hadn’t left any Unicode characters outside 
ISO-8859-1 in the source code -- HTML, JavaScript or CSS. The purpose 
of declaring the character set as iso-8859-1 rather than utf-8 is to avoid 
doubling and trebling the size of the page with utf-8. I think, if you 
have characters outside iso-8859-1 and declare the page as such, you 
get Character-not-found for those locations. (I may be wrong).


You didn't read what Philippe wrote. Representing Sinhala characters in 
UTF-8 takes *fewer* bytes, typically less than half, compared to using 
numeric character references like &#3523;&#3538;&#3458;&#3524;&#3517; 
&#3517;&#3538;&#3520;&#3539;&#3512;&#3495; &#3465;&#3524;&#3517;.
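
A minimal Python sketch of that comparison (standard library only), using
the five code points of the first word quoted above (3523, 3538, 3458,
3524, 3517):

    word = "".join(chr(n) for n in (3523, 3538, 3458, 3524, 3517))
    ncr = "".join("&#%d;" % ord(c) for c in word)  # numeric character references
    print(len(word.encode("utf-8")))   # 15 bytes: 3 bytes per code point in UTF-8
    print(len(ncr))                    # 35 ASCII bytes for the same five characters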


Philippe Verdy, obviously has spent a lot of time researching the web 
site and even went as far as to check the faults of the web service 
provider, Godaddy.com. He called my font a hack font without any proof 
of it.


A font that places glyphs for one character in the code space defined 
for a fundamentally different character is generally referred to as a 
hack (or hacked) font. A Latin-only font that placed a glyph looking 
like 'B' in the space reserved for 'A' would also be a hacked font.
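
A minimal Python sketch of that point, with a purely hypothetical glyph
mapping: whatever a hacked font paints on screen, the stored text is still
Latin code points, so tools that look for the character the reader sees
will not find it.

    stored = "kaema"            # hypothetical Latin letters a hacked font
                                # might display as Sinhala glyphs
    print("\u0DC3" in stored)   # False: no Sinhala code point is present
    print(stored.upper())       # "KAEMA": text tools act on the Latin letters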


As for those who do not want to think rationally and think Unicode is 
a religion, I can only point to my dilemma:

http://lovatasinhala.com/assayaa.htm


You need to stop making this religion accusation. This is a technical 
matter.


This is the last attempt I will make to help show YOU where the water 
is.


--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell




Charset declaration in HTML (was: Romanized Singhala - Think about it again)

2012-07-04 Thread Otto Stolz

Hello Naena Guru,

on 2012-07-04, you wrote:

The purpose of
declaring the character set as iso-8859-1 rather than utf-8 is to avoid doubling
and trebling the size of the page with utf-8. I think, if you have characters
outside iso-8859-1 and declare the page as such, you get
Character-not-found for those locations. (I may be wrong).


You are wrong, indeed.

If you declare your page as ISO-8859-1, every octet
(aka byte) in your page will be understood as a Latin-1
character; hence you cannot have any other character
in your page. So, your notion of “characters outside
iso-8859-1” is completely meaningless.

If you declare your page as UTF-8, you can have
any Unicode character (even PUA characters) in
your page.

Regardless of the charset declaration of your page,
you can include both Numeric Character References
and Character Entity References in your HTML source,
cf., e.g., http://www.w3.org/TR/html401/charset.html#h-5.3.
These may refer to any Unicode character, whatsoever.
However, they will take considerably more storage space
(and transmission bandwidth) than the UTF-8 encoded
characters would take.
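
A minimal Python sketch of those two points (standard library only): a
numeric character reference is itself plain ASCII, so it fits in a page
declared as ISO-8859-1, but it expands to one Unicode character and costs
more bytes than the UTF-8 encoding of that character.

    import html
    ref = "&#3523;"                         # ASCII-only, valid inside a Latin-1 page
    print(html.unescape(ref))               # the single Sinhala character U+0DC3
    print(len(ref))                         # 7 bytes as a reference
    print(len(chr(3523).encode("utf-8")))   # 3 bytes in UTF-8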

Good luck,
  Otto Stolz





Re: Romanized Singhala - Think about it again

2012-07-04 Thread Philippe Verdy
2012/7/4 Naena Guru naenag...@gmail.com:
 Philippe Verdy, obviously has spent a lot of time

Not a lot of time... Sorry.

 researching the web site
 and even went as far as to check the faults of the web service provider,
 Godaddy.com.

I did not even note that your hosting provider was that company. I
just looked at the HTTP headers to look at the MIME type and charset
declarations. Nothing else.

 He called my font a hack font without any proof of it.

It is really a hack. Your font assigns Sinhalese characters to Latin
letters (or some punctuations) of ISO 8859-1. It also assigns
contextual variants of the same abstract Sinhalese letters, to ISO
8859-1 codes, plus glyphs for some ligatures of multiple Sinhalese
letters to ISO 8859-1 codes, plus it reorders these glyphs so that
they no longer match the Sinhalese logical order.

Yes this font is a hack because it pretends to be ISO 8859-1 when it
is not. It is a specific, distinct encoding which is neither ISO 8859-1
nor Unicode, but something that exists in NO existing
standard.

 It has
 only characters relevant to romanized Singhala within the SBCS. Most of the
 work was in the PUA and Look-up Tables. I am reminded of Inspector Clouseau
 that has many gadgets and in the end finds himself as the culprit.

And you have invented an Inspector Guru gadget for your private use on
your site, instead of developing a TRUE separate encoding that you
SHOULD NOT name ISO 8859-1. Try to do that, but be aware that the
ISO registry of 8-bit encodings is now frozen. You'll have to convince
the IANA registry to register your new encoding. For now it is
registered nowhere. This is a purely local creation for your site.

 I will still read and try those other things Philippe suggests, when I get
 time. What is important for me is to improve on orthography rules and add
 more Indic languages -- Devanagari and Tamil coming up.

 As for those who do not want to think rationally and think Unicode is a
 religion,

No. Unicode is a technical solution for a long-standing problem:
interoperability of standards using open technologies. Given that you
do not want to even develop your own encoding as a registered open
standard compatible with a lot of applications (remember that all new
web standards MUST now support Unicode in at least one of its standard
UTFs), you're just losing time here.

 I can only point to my dilemma:
 http://lovatasinhala.com/assayaa.htm

 Have a Happy Fourth of July!

Next time don't cite me personally trying to convince others that I
have supported or said something I did not write myself. You have
interpreted my words at your convenience, but I don't want to be
associated nominatively and publicly with your personal
interpretations. Even if I also have my own opinions, I don't want to
cite anyone else's opinions without just quoting his own sentences
(provided that these sentences were public or that I was authorized by
him to quote his sentences in other contexts).

Stop this abuse of personalities. Thanks.



Re: Romanized Singhala - Think about it again

2012-07-04 Thread Naena Guru
Philippe, ask your friends why ordinary people Anglicize if Unicode Sinhala
is so great. See just one of many community forums: http://elakiri.com

I know you do not care about a language of 15 million people, but it
matters to them.

On Wed, Jul 4, 2012 at 10:46 PM, Philippe Verdy verd...@wanadoo.fr wrote:

 You are alone to think that. Users of the Sinhalese edition of
 Wikipedia do not need your hack or even webfonts to use the website.
 It only uses standard Unicode, with very common web browsers. And it
 works as is.
 For users that are not pre-equipped with the necessary fonts and
 browsers, Wikipedia indicates this very useful site:
 http://www.siyabas.lk/sinhala_how_to_install_in_english.html

I have two guys here in the US who asked me to help get rid of the Unicode
Sinhala that I helped them install from that 'very useful site'. Copies of
this message go to them. Actually, you do not need their special
installation if you have Windows 7. Windows XP needs an update of Uniscribe,
and Vista too. Their installation programs are faulty and interfere with
your OS settings.



 This solves the problem at least for older version of Windows or old
 distributions of Linux (now all popular distributions support
 Sinhalese). No web fonts are even necessary (WOFT works only in
 Windows but not in older versions of Windows with old versions of IE).

You mean WEFT? Now TTF (OTF) are compressed into WOFF. I see that Microsoft
is finally supporting it. (At least my font downloads, or maybe it picks up
the font on my computer? Now I am confused.)


 Everything is covered : working with TrueType and OpenType, adding an
 IME if needed. And then navigating on standard Sinhalese websites
 encoded with Unicode.


Philippe, try making a web page with Unicode Sinhala.


 Note that for versions of Windows with browsers older than IE6 there is
 no support only because these older versions did not have the
 necessary minimum support for complex scripts. The alternative is to
 use another browser such as Firefox, which uses its own independent
 renderer that does not depend on Windows Uniscribe support. But these
 users are now extremely rare. Almost everyone now uses at least XP for
 Windows (Windows 95/98 are definitely dead), or uses a Mac, or a
 smartphone, or another browser (such as Firefox, Chrome, Opera).

I agree.


 Nobody except you supports your tricks and hacks. You really come too
 late, trying to solve a problem that no longer exists, as it has long
 since been solved for Sinhalese.

Mine is a comprehensive solution. It is a transliteration. Ask users that
compared the two. Find ordinary Singhalese. They use Unicode Sinhala to
read news web sites. The rest of the time they Anglicize or write in
English.

Everything is covered here too, buddy. Adobe apps since 2004, Apple since
2004, Mozilla since 2006, all other modern browsers since 2010. MS Office
2010. Abiword, gNumeric, Linux, all the works. IE 8, 9 partial. IE 10 full.
So?


 2012/7/5 Naena Guru naenag...@gmail.com:
  Hi, Philippe. Thanks for keeping engaged in the discussion. Too little
 time
  spent could lead to misunderstanding.
 
 
  On Wed, Jul 4, 2012 at 3:42 PM, Philippe Verdy verd...@wanadoo.fr
 wrote:
 
  2012/7/4 Naena Guru naenag...@gmail.com:
   Philippe Verdy, obviously has spent a lot of time
 
  Not a lot of time... Sorry.
 
   researching the web site
   and even went as far as to check the faults of the web service
 provider,
   Godaddy.com.
 
  I did not even note that your hosting provider was that company. I
  just looked at the HTTP headers to look at the MIME type and charset
  declarations. Nothing else.
 
  I know that the browser tells it. It is not a big deal, WOFF is the
  compressed TTF, but TTF gets delivered. If and when GoDaddy fixes their
  problem, the pages get delivered faster. Or I can make that fix in a
  .htaccess file. No time!
 
 
   He called my font a hack font without any proof of it.
 
  It is really a hack. Your font assigns Sinhalese characters to Latin
  letters (or some punctuations) of ISO 8859-1.
 
  My font does not have anything to do with Singhalese characters if you
 mean
  Unicode characters. You are very confusing.
  A Character in this context is a datatype. In the 80s it was one byte in
  size and was flagged as not to be used in arithmetic. (We still did it to
  convert between Capitals and Simple forms.) In the Unicode character
  database, a character is a numerical position. A Unicode Sinhala
 character
  is defined in Hex [0D80 - 0DFF]. Unicode Sinhala characters represent an
  incomplete hotchpotch of ideas of letters, ligatures and signs. I have
 none
  of that in the font.
 
  I say and know that Unicode Sinhala is a failure. It inhibits use of
   Singhala on the computer and the network. I do not concern myself with
  fixing it because it cannot be fixed. The only thing I did in relation to it is to
 write
  an elaborate set of routines to *translate* (not map) between constructs
 of
  Unicode Sinhala