Re: abstract characters, semantics, meaningful transformations ... Was: Tibetan Paluta

2017-05-02 Thread Naena Guru via Unicode
Thank you, professor. You wrote exactly what one would expect from a 
professor. It is a wonderful display of your prowess in the subject.


Doctors and lawyers use Latin for concealment and self-preservation. 
Greenspan used Greenspanish. Unicode masters use Unicodish.


Indic is the name Unicode assigned to South Asian writing systems that 
are associated with Sanskrit vyaakaraNa. This is a result of what the 
good professor explains by "Unicode is not the realm of everyone; it's 
the realm of people with a certain amount of linguistic knowledge and 
computer knowledge". What is that 'certain amount' and which deity 
decides it? How do we unfortunate nincompoops decode it? Decoding itself 
is beyond us, indeed.


South Asians, especially Indians who already seem to have too many gods 
to deal with, do not need (though they might be tempted) to add an image 
of the exalted Unicode god behind a colorful curtain to sing praises to, 
with an alms box marked M$ beside it for favors, each time the high 
priest scrubs off some of its 'hairy esoteric mess' while 
surreptitiously (or ignorantly?) adding more.


Brahmins were able to make any declaration because they were privileged. 
Similarly, Unicode experts can make declarations like 'very few forms 
of writing are direct transcriptions of speech' and hide behind the 
just-in-case adjective 'direct' to avoid giving actual data. Of course, they 
can boldly count Sinhala as one that is not a direct transcription of 
speech. 'Speech getting transcribed into writing' is itself Unicodish.


Hark! The professor declares. So, boys and girls, if you want to pass 
the test, memorize this, even if it is obviously false:
Printing made no difference to the fact that English has a dozen vowels 
with five letters to write them. The thorn has little impact on the 
ambiguity of English writing. The problem with printing is that it 
fossilizes the written language, and our spellings have stayed the same 
while the pronunciations have changed. And the dissociation of sound and 
writing sometimes helps English; even when two English speakers from 
different parts of the world would have trouble understanding each 
other, writing is usually not so impaired.


It is printing, together with the dictionary industry, that fossilized 
writing and, as a result, forced speech to comply. The 'certain' level of 
knowledge above is now revealed. Language, dialect, creole, migration, 
intermixing of different peoples, accent...: where do these stand? Find ye 
by the foregoing what the fossil 'ye' actually was and what caused it to 
get fossilized in this form.



On 5/2/2017 5:31 AM, David Starner wrote:
On Mon, May 1, 2017 at 7:26 AM Naena Guru via Unicode 
<unicode@unicode.org> wrote:


This whole attempt to make digitizing Indic script some esoteric,
'abstract', 'semantic representation' and so on seems to me an
attempt to make Unicode the realm of some superhumans.

Unicode is like writing. At its core, it is a hairy esoteric mess; mix 
these certain chemicals the right ways, and prepare a writing 
implement and writing surface in the right (non-trivial) ways, and 
then manipulate that implement carefully to make certain marks that 
have unclear delimitations between correct and incorrect. But in the 
end, as much of that is removed from the problem of the user as 
possible; in the case of a modern word-processing system, it's a matter 
of hitting the keys and then hitting print, in complete ignorance of 
all the silicon and printing magic going on between.


Unicode is not the realm of everyone; it's the realm of people with a 
certain amount of linguistic knowledge and computer knowledge. There's 
only a problem if those people can't make it usable for the everyday 
programmer and therethrough to the average person.


The purpose of writing is to represent speech.

Meh. The purpose of writing is to represent language, which may be 
unrelated to speech (like in the case of SignWriting and mathematics) 
or somewhat related to speech--very few forms of writing are direct 
transcriptions of speech. Even the closest tend to exchange a lot of 
intonation details for punctuation that reveals different information.


English writing was massacred when printing was brought in from
Europe.

No, it wasn't. Printing made no difference to the fact that English 
has a dozen vowels with five letters to write them. The thorn has 
little impact on the ambiguity of English writing. The problem with 
printing is that it fossilizes the written language, and our spellings 
have stayed the same while the pronunciations have changed. And the 
dissociation of sound and writing sometimes helps English; even when 
two English speakers from different parts of the world would have 
trouble understanding each other, writing is usually not so impaired.




Re: abstract characters, semantics, meaningful transformations ... Was: Tibetan Paluta

2017-05-01 Thread Naena Guru via Unicode

A little humor is very good.

sarasvaþi was a sweet girl, I am sure, so much so that when she died, I 
think, those who imagined her beyond the practical made her rise up, up 
and fly away. Now you watch what happens to Elizabeth when she dies. They 
narrowly failed at making one such of Hillary Clinton, as she is suspected 
of having Parkinson's, a condition her daughter says has an anecdotal 
remedy in MaryJane. Hmmm... Who went to her daughter's house instead of to 
the doctor when they suddenly fell?


As for Thoth, he is okay. Don't worry. Egyptian man => demi-god => god 
has not much of a consequence in the West-dominated culture of this day.



On 5/1/2017 8:55 PM, Richard Wordingham via Unicode wrote:

On Mon, 1 May 2017 19:49:27 +0530
Naena Guru via Unicode <unicode@unicode.org> wrote:


The purpose of writing is to represent speech. It is not some secret
that demi-gods created 

Sarasvati and Thoth would be offended at being called mere demi-gods.


sound => letter that is the basis for writing.

"=>" is not a particularly phonetic notation.  It took quite a while
for letters to become the primary part of writing anywhere, and they
are not a universal phenomenon.

Richard.
Okay, Richard. You probably have knowledge of how writing evolved in 
the whole world. Tell us how it was in South Asia. Was it like I said, 
sound => letter? I presume to know only about English and Indic in this 
respect.




abstract characters, semantics, meaningful transformations ... Was: Tibetan Paluta

2017-05-01 Thread Naena Guru via Unicode
This whole attempt to make digitizing Indic script some esoteric, 
'abstract', 'semantic representation' and so on seems to me an 
attempt to make Unicode the realm of some superhumans.


The purpose of writing is to represent speech. It is not some secret 
that demi-gods created, which we are trying to explain with 'modern' 
linguistic gymnastics. sound => letter: that is the basis for writing. 
English writing was massacred when printing was brought in from Europe. 
A similar thing is happening to Indic through all this mumbo-jumbo.


I call out to NATIVE users of Indic to explain what apparently Europeans 
or Americans are discussing here.



On 5/1/2017 10:47 AM, Philippe Verdy wrote:



2017-04-29 21:21 GMT+02:00 Naena Guru via Unicode 
<unicode@unicode.org>:


Just about the name paluta:
In Sanskrit, the length of vowels is measured in maaþra (a
cognate of the word 'meter'). It is the spoken length of a short
vowel. In Latin it is termed mora. Usually, you have only single
and double length vowels. A paluþa length is like when you call
out to somebody from a distance. Pluta is a careless spelling.
Virama and Halanta are two other terms loosely used.

Anyway, Unicode is only about DISPLAYING a script: There's a shape
here; Let's find how to get it by assembling other shapes or by
creating a code point for it. What is short, long or longer in
speech is no concern for Unicode.


Wrong. Unicode is absolutely not about how to "display" any script 
(except symbols and notational symbols). Unicode does not encode 
glyphs. Unicode encodes "abstract characters" according to their 
semantics, in order to assign them properties allowing meaningful 
transformations of text and in order to allow performing searches 
(with collation algorithms). What is important is their properties 
(something that ISO 10646 did not care about when it started the UCS in a 
separate project, ignoring how it would be used, focusing too much on 
apparent glyphs, and introducing lots of "compatibility characters" 
that would not have been encoded otherwise, creating some havoc in 
logical processing).


Anyway, Unicode makes some exceptions to the logical model only for 
roundtrip compatibility with other standards that used another 
encoding model widely used, notably in Thai: these are the exceptions 
where there are "prepended" letters. There was some havoc also for 
some scripts in India because of roundtrip compatibility with an Indian 
standard (criticized by many users of Tamil and some other Southern 
Indic scripts that don't directly follow the paradigm created for 
getting some limited transliteration with Devanagari: that initial 
desire was abandoned, but the legacy Indic encodings in India were 
imported as-is into Unicode).




Re: Tibetan Paluta

2017-04-29 Thread Naena Guru via Unicode

Just about the name paluta:
In Sanskrit, the length of vowels is measured in maaþra (a cognate of 
the word 'meter'). It is the spoken length of a short vowel. In Latin it 
is termed mora. Usually, you have only single and double length vowels. 
A paluþa length is like when you call out to somebody from a distance. 
Pluta is a careless spelling. Virama and Halanta are two other terms 
loosely used.


Anyway, Unicode is only about DISPLAYING a script: There's a shape here; 
Let's find how to get it by assembling other shapes or by creating a 
code point for it. What is short, long or longer in speech is no concern 
for Unicode.



On 4/27/2017 1:57 PM, Srinidhi A via Unicode wrote:
The annotation of 0F85 ྅ TIBETAN MARK PALUTA says it is used for 
avagraha. However, it seems this character denotes pluta instead of 
avagraha. Pluta is used for indicating elongation of a vowel.
A similar character with an identical glyph is encoded in Soyombo (11A9D) 
with the name pluta. These characters likely derive from the digit ३, as 
३ is used in Devanagari for indicating pluta.


Figure 2 of L2/16-016 shows the usage of TIBETAN MARK PALUTA for pluta.
What is the correct spelling in the Tibetan language, Paluta or Pluta?
Can Tibetan scholars clarify the usage of the above character?
If 0F85 is used for pluta, are there any distinct characters denoting 
avagraha in the Tibetan script?


Srinidhi A







Re: Go romanize! Re: Counting Devanagari Aksharas

2017-04-25 Thread Naena Guru via Unicode

Quote from below:

The word indeed means 'danger' (Pali/Sanskrit _antarāya_).  The
pronunciation is /ʔontʰalaːi/; the Tai languages that use(d) the Tai
Tham script no longer have /r/.  The older sequence /tr/ normally
became /tʰ/ (except in Lao), but the spelling has not been updated - at
least, not amongst the more literate.  The script has a special symbol
for the short vowel /o/, which it shares with the Lao script.  This
symbol is used in writing that word.  Two ways I have seen it spelt,
each with two orthographic syllables, are ᩋᩫ᩠ᨶᨲᩕᩣ᩠ᨿ on-trAy (the second
syllable has two stacks) and ᩋᩫᨶ᩠ᨲᩕᩣ᩠ᨿ o-ntrAy.  I have also seen a
form closer to Pali, namely _antarAy_, written ᩋᨶ᩠ᨲᩁᩂ᩠ᨿ a-nta-rAy.
However, I have seen nothing that shows that I won't encounter
ᩋᩢᨶ᩠ᨲᩁᩣ᩠ᨿ a-nta-rAy with the first vowel written explicitly, or even
ᩋᩢ᩠ᨶᨲᩁᩣ᩠ᨿ an-ta-rAy. How does your scheme distinguish such alternatives?

Response:
Perhaps this word is derived from Sanskrit 'anþaraða'
(Search: antarada at 
http://www.sanskrit-lexicon.uni-koeln.de/cgi-bin/tamil/recherche)

Sinhala: anþaraaðaayakayi, anþaraava, anþaraavayi, anþraava, anþraavayi.
Use this font to read the above Sinhala words: http://smartfonts.net/ttf/aruna.ttf


-=- svasþi siððham! -=-


On 4/25/2017 2:07 AM, Richard Wordingham via Unicode wrote:


On Mon, 24 Apr 2017 20:53:12 +0530
Naena Guru via Unicode<unicode@unicode.org>  wrote:


Quote by Richard:
Unless this implies a spelling reform for many languages, I'd like to
see how this works for the Tai Tham script.  I'm not happy with the
Romanisation I use to work round hostile rendering engines.  (My
scheme is only documented in variable hack_ss02 in the last script
blocks of http://wrdingam.co.uk/lanna/denderer_test.htm.)  For
example, there are several different ways of writing what one might
naively record as "ontarAy".

MY RESPONSE:
Richard, I stuck to the two specifications (Unicode and Font) and
Sanskrit grammar. The akSara has two aspects, its sound (zabða,
phoneme) and its shape. (letter, ruupa). Reduce the writing system to
its consonants, vowels etc. (zabða) and assign SBCS letters/codes to
them (ruupa). SBCS provides the best technical facilities for any
language. (This is why now more than 130 languages romanize despite
Unicode). Use English letters for similar sounds in the native
speech. Now, treat all combinations as ligatures. For example, 'po'
sound in Indic has the p consonant with a sign ahead plus a sign
after.

In many Indic scripts, yes.  In Devanagari, the vowel sign is normally
a single element classified as following the consonant.  In Thai, the
vowel sign precedes the consonant.  Tai Tham uses both a two-part sign
and a preceding sign.  The preceding sign is for Tai words and the
two-part sign for Pali words, but loanwords from Pali into the Tai
languages may retain the two part sign.


For the font, there is no difference between the way it makes
the combination 'ä', which has a sign above, and the Indic case, which
has two signs on either side.

For OpenType, there is.  The first can be made by providing a
simple table of where the diaeresis goes relative to the base
character, in this case the 'a'.  The second is painfully
complicated, for the 'p' may have other marks attached to it, so doing
it by relative positioning is painfully complicated and error-prone.
This job is given to the rendering engine, which may introduce its own
problems.

AAT and Graphite offer the font maker the ability to move the 'sign
ahead' from after the 'p' to before it.
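
A minimal sketch (Python, with hypothetical part names) of the reordering
such an engine or font rule performs; real shaping works on glyph IDs, not
characters:

LEFT_PART, RIGHT_PART = "<e-part>", "<aa-part>"

def to_visual_order(consonant, vowel):
    # In memory the vowel logically follows the consonant ('p' + 'o');
    # on screen the left piece of a two-part sign must precede it.
    if vowel == "o":
        return [LEFT_PART, consonant, RIGHT_PART]
    return [consonant, vowel]

print(to_visual_order("p", "o"))   # ['<e-part>', 'p', '<aa-part>']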


Recall that long ago, Unicode stopped defining fixed
ligatures and asked the font makers to define them in the PUA.

While the first is true enough, I believe the second is false.  Not
every glyph has to be mapped to by a single character.  I don't do that
for contextual forms or ligatures in my font.


Spelling and speech:
There is indeed a confusion about writing and reading in Hindi, as I
have observed. Like in English and Tamil, Hindi tends to end words
with a consonant. So, there is this habit among the Hindi speakers to
drop the ending vowel, mostly 'a' from words that actually end with
it. For example, the famous name Jayantha (miserable mine too, haha!
= jayanþa as Romanized) is pronounced Jayanth by Hindi speakers. It
is a Sanskrit word. Sanskrit and languages like Sinhala have vowel
endings and are traditionally spoken as such.

This loss is also to be found in Further India.  Thai, Lao and Khmer
now require that such a word-final vowel be written explicitly if it is
still pronounced.


Looking at the word you gave, ontarAy, it looks to me like an
Anglicized form. If I am to make a guess, its ending is like in
ontarAyi. Is it said something like own-the-raa-yi? (danger?) If I
am right, this is a good example of the decline of a writing system owing
to bad, uncaring application of technology. We are in the Digital
Age, and we need not compromise any more. In fact, we can fix errors
and decadence introduced by past technologies.

Go romanize! Re: Counting Devanagari Aksharas

2017-04-24 Thread Naena Guru via Unicode

Quote by Richard:
Unless this implies a spelling reform for many languages, I'd like to
see how this works for the Tai Tham script.  I'm not happy with the
Romanisation I use to work round hostile rendering engines.  (My
scheme is only documented in variable hack_ss02 in the last script
blocks of http://wrdingam.co.uk/lanna/denderer_test.htm.)  For example,
there are several different ways of writing what one might naively
record as "ontarAy".

MY RESPONSE:
Richard, I stuck to the two specifications (Unicode and Font) and Sanskrit 
grammar. The akSara has two aspects, its sound (zabða, phoneme) and its shape 
(letter, ruupa). Reduce the writing system to its consonants, vowels etc. 
(zabða) and assign SBCS letters/codes to them (ruupa). SBCS provides the best 
technical facilities for any language. (This is why more than 130 languages 
now romanize despite Unicode.) Use English letters for similar sounds in the 
native speech. Now, treat all combinations as ligatures. For example, the 'po' 
sound in Indic has the p consonant with a sign ahead plus a sign after. For 
the font, there is no difference between the way it makes the combination 'ä', 
which has a sign above, and the Indic case, which has two signs on either 
side. Recall that long ago, Unicode stopped defining fixed ligatures and asked 
the font makers to define them in the PUA.
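
A minimal sketch of the 'treat all combinations as ligatures' idea: a
longest-match substitution table, the same shape of rule an OpenType liga
lookup applies. The romanized inputs and glyph names here are hypothetical
stand-ins, not the actual Sinhala tables:

# Toy longest-match ligature substitution in Python.
LIGATURES = {
    "po": "glyph.po",   # consonant + two-part vowel drawn as one unit
    "ka": "glyph.ka",
    "p": "glyph.p",
    "o": "glyph.o",
}

def shape(text):
    out, i = [], 0
    while i < len(text):
        for n in (2, 1):            # try the longest candidate first
            chunk = text[i:i + n]
            if chunk in LIGATURES:
                out.append(LIGATURES[chunk])
                i += n
                break
        else:
            out.append(text[i])     # pass through anything unmapped
            i += 1
    return out

print(shape("poka"))   # ['glyph.po', 'glyph.ka']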

Spelling and speech:
There is indeed a confusion about writing and reading in Hindi, as I have 
observed. Like in English and Tamil, Hindi tends to end words with a consonant. 
So, there is this habit among the Hindi speakers to drop the ending vowel, 
mostly 'a' from words that actually end with it. For example, the famous name 
Jayantha (miserable mine too, haha! = jayanþa as Romanized) is pronounced 
Jayanth by Hindi speakers. It is a Sanskrit word. Sanskrit and languages like 
Sinhala have vowel endings and are traditionally spoken as such.

The dictionary is a commercial invention. When Caxton brought lead types to 
England, the French-speaking, Latin-flaunting elites did not care about the 
poor natives. Earlier, invading Romans had forced them to drop Fuþark and 
adopt the 22-letter Latin alphabet. So, they improvised: struck a line across 
d and made ð (Eth); added a sign to 'a' and made æ (Æsc); and continued using 
Thorn (þ) by rounding the loop. Lead-type printing hit English for the second 
time, ruining it as spelling standardization began. Dictionaries sold. THE 
POWERFUL CAN RUIN PEOPLE'S PROPERTY BECAUSE THEY CAN, IN ORDER TO MAKE MONEY. 
Unicode enthusiasts, take heed!

Looking at the word you gave, ontarAy, it looks to me like an Anglicized form. 
If I am to make a guess, its ending is like in ontarAyi. Is it said something 
like own-the-raa-yi? (danger?) If I am right, this is a good example of the 
decline of a writing system owing to bad, uncaring application of technology. 
We are in the Digital Age, and we need not compromise any more. In fact, we 
can fix errors and decadence introduced by past technologies.


RICHARD:
That sounds like a letter-assembly system.

MY RESPONSE:
Nothing assembled there, my friend.



On 4/24/2017 12:38 PM, Richard Wordingham via Unicode wrote:

On Mon, 24 Apr 2017 00:36:26 +0530
Naena Guru via Unicode <unicode@unicode.org> wrote:


The Unicode approach to Sanskrit and all Indic is flawed. Indic
scripts should not be letter-assembly systems.

Sanskrit vyaakaraNa (grammar) explains the phonemes as the atoms of
the speech. Each writing system then assigns a shape to the
phonetically precise phoneme.

The most technically and grammatically proper solution for Indic is
first to ROMANIZE the group of writing systems at the level of
phonemes. That is, assign romanized shapes to vowels, consonants,
prenasals, post-vowel phonemes (anusvara and visarjaniiya with its
allophones) etc. This approach is similar to how European languages
picked up Latin, improvised the script and even use the Simples and
Capitals repertoire. Romanizing immediately makes typing easier and
eliminates sometimes embarrassing ambiguity in Anglicizing -- you
type phonetically on key layouts close to QWERTY. (Only four
positions are different in Romanized Sinhala layout).

If we drop the capitalizing rules and utilize caps to indicate the
'other' forms of a common letter, we get an intuitively typed system
for each language, and readable too. When this is done carefully,
comparing phoneme sets of the languages, we can reach a common set of
Latin-derived SINGLE-BYTE letters completely covering all phonemes of
all Indic.

Unless this implies a spelling reform for many languages, I'd like to
see how this works for the Tai Tham script.  I'm not happy with the
Romanisation I use to work round hostile rendering engines.  (My
scheme is only documented in variable hack_ss02 in the last script
blocks of http://wrdingam.co.uk/lanna/denderer_test.htm.)  For example,
there are several different ways of writing what one might naively
record as "ontarAy".


Next, each native script can be obtained by making orthographic smart
fonts that display the SBCS codes in the respective shapes of the native
scripts.

Re: Counting Devanagari Aksharas

2017-04-23 Thread Naena Guru via Unicode
The Unicode approach to Sanskrit and all Indic is flawed. Indic scripts 
should not be letter-assembly systems.


Sanskrit vyaakaraNa (grammar) explains the phonemes as the atoms of the 
speech. Each writing system then assigns a shape to the phonetically 
precise phoneme.


The most technically and grammatically proper solution for Indic is 
first to ROMANIZE the group of writing systems at the level of phonemes. 
That is, assign romanized shapes to vowels, consonants, prenasals, 
post-vowel phonemes (anusvara and visarjaniiya with its allophones) etc. 
This approach is similar to how European languages picked up Latin, 
improvised the script and even use the Simples and Capitals repertoire. 
Romanizing immediately makes typing easier and eliminates sometimes 
embarrassing ambiguity in Anglicizing -- you type phonetically on key 
layouts close to QWERTY. (Only four positions are different in Romanized 
Sinhala layout).


If we drop the capitalizing rules and utilize caps to indicate the 
'other' forms of a common letter, we get an intuitively typed system for 
each language, and readable too. When this is done carefully, comparing 
phoneme sets of the languages, we can reach a common set of 
Latin-derived SINGLE-BYTE letters completely covering all phonemes of 
all Indic.


Next, each native script can be obtained by making orthographic smart 
fonts that display the SBCS codes in the respective shapes of the native 
scripts.


I have successfully romanized Sinhala and revived the full repertoire of 
Sinhala + Sanskrit orthography, losing nothing. The Sinhala script is perhaps 
the most complex of all Indic because it is used to write both Sanskrit 
and Pali.


See this: http://ahangama.com/ (It's all SBCS underneath).
Test here: http://ahangama.com/edit.htm


On 4/20/2017 5:05 AM, Richard Wordingham via Unicode wrote:

Is there consensus on how to count aksharas in the Devanagari script?
The doubts I have relate to a visible halant in orthographic syllables
other than the first.

For example, according to 'Devanagari VIP Team Issues Report'
http://www.unicode.org/L2/L2011/11370-devanagari-vip-issues.pdf, a
derived form from Nepali श्रीमान् should be written श्रीमान्‌को and not
श्रीमान्को.  Now, if the font used has a conjunct for
SHRA, I would count the former as having 4 aksharas SH.RII, MAA, N, KO
and the latter as having 3 aksharas SH.RII, MAA, N.KO.

If the font leads to the use of a visible halant instead of the vattu
conjunct SH.RA, as happens when I view this email, would there then be
5 and 4 aksharas respectively?  A further complication is that the font
chosen treats what looks like SH, RA as a conjunct; the vowel I appears
to the left of SH when added after RA (श्रि).

Richard.
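
To make the counting rule concrete, here is a naive Python sketch (an
illustration, not an official algorithm). It counts from the encoded text,
assuming the font forms the conjuncts: virama + consonant continues an
akshara, while virama + ZWNJ is a visible halant that ends it, which
reproduces the 4-versus-3 counts above:

import unicodedata

VIRAMA, ZWNJ = "\u094d", "\u200c"

def count_aksharas(text):
    count, join_next = 0, False
    for i, ch in enumerate(text):
        if ch == ZWNJ:
            join_next = False   # virama + ZWNJ: visible halant, break here
            continue
        if ch == VIRAMA or unicodedata.category(ch) in ("Mn", "Mc"):
            continue            # marks and the halant extend the akshara
        if not join_next:       # a base letter starts a new akshara...
            count += 1
        join_next = text[i + 1:i + 2] == VIRAMA   # ...unless conjoined onward
    return count

print(count_aksharas("श्रीमान्\u200cको"))   # 4: SH.RII, MAA, N, KO
print(count_aksharas("श्रीमान्को"))         # 3: SH.RII, MAA, N.KO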





Singhala script ill defined by OpenType standard

2014-04-03 Thread Naena Guru
Here is the proof that the OpenType standard defined the Singhala script
wrongly. Also find a BNF grammar that describes it.
http://ahangama.com/unicode/index.htm

Thanks.


Re: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka)

2014-03-18 Thread Naena Guru
Okay, Doug.

Type this inside the yellow text box in the following page:
kaaryyaalavala yanþra pañkþi
http://www.lovatasinhala.com/puvaruva.php

Please tell me what sequence of Unicode Sinhala codes would produce what
the text box shows.



On Mon, Mar 17, 2014 at 7:56 PM, Doug Ewell d...@ewellic.org wrote:

 Naena Guru wrote:

  In the case of romanized Singhala, any processing that English
 accepts, it accepts too. For RS, you select a font to display it in
 the native script because if it is mixed with English, both are using
 the same character space, just as when English and French are mixed.


 But English and French actually *use* the same letters, or at any rate
 most of them. With your approach, it is not possible to write Sinhala in
 the Sinhala script mixed with English or French or anything else in the
 Latin script. In web pages you can resort to span style=Latin tricks,
 but this doesn't work for plain text.

 This is what people mean when they suggest that your real goal is to
 abolish the Sinhala script and just write in Latin.


 --
 Doug Ewell | Thornton, CO, USA
 http://ewellic.org | @DougEwell ­




Re: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka)

2014-03-17 Thread Naena Guru
Making a keyboard is not hard. You can either edit an existing one or make
one from scratch. I made the latest Romanized Singhala one from scratch.
The earlier one was an edit of US-International.

When you type a key on the physical keyboard, you generate what is called a
scan-code of that key so that the keyboard driver knows which key was
pressed. (During DOS days, we used to catch them to make menus.) Now, you
assign one or a sequence of Unicode characters you want to generate for the
keypress.

Use Microsoft's keyboard layout creator for all versions of Windows from XP:
http://msdn.microsoft.com/en-us/goglobal/bb964665.aspx

Select the language carefully. I selected US-English for RS. That way, I
can switch between the two keyboards quickly with Ctrl+Shift. You can
change all these in the Control Panel.

Here is the keymap I made for RS in Linux:
http://ahangama.com/apiapi/singhala/linuxkb-s.php
Just scroll down for the English part. (The lines starting with double
slashes are comments and have no effect on the program)

The Macintosh key layout is easy too.

The story with iOS and Android is different, but not hard either.
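
A minimal Python sketch of that idea; the key names and outputs are
hypothetical, and real layouts are built with tools like MSKLC rather than
code like this:

# The core of any layout: a table from (key, shift state) to one
# character or a sequence of Unicode characters.
LAYOUT = {
    ("KEY_T", False): "þ",        # one key, one character
    ("KEY_T", True):  "Þ",        # shift state selects another entry
    ("KEY_N", True):  "n\u0303",  # one key may emit a whole sequence
}

def translate(keystrokes):
    return "".join(LAYOUT.get(k, "") for k in keystrokes)

print(translate([("KEY_T", False), ("KEY_N", True)]))   # þñ (decomposed)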



On Sun, Mar 16, 2014 at 6:47 PM, Doug Ewell d...@ewellic.org wrote:

 Jean-François Colson jf at colson dot eu wrote:

  The idea here was “that characters not on an ordinary QWERTY keyboard
 could be entered _using_an_ordinary_QWERTY_keyboard._” Are there any
 dead keys on an _ordinary_ (i.e. not one using an international(ized)
 driver) QWERTY keyboard?


 Not on the standard vanilla U.S. keyboard. It has to be provided by the
 OS, via a driver, just as Compose key support has to be provided by the OS.

 The standard vanilla U.S. keyboard also doesn't provide the accented
 letters and other non-ASCII letters like ð that Naena Guru uses for his
 font hack.

  If a character is available by a dead key, isn’t it on the keyboard ?


 It depends on what you mean by on the keyboard. Thanks to John Cowan's
 delightful Moby Latin keyboard layout, I can type AltGr+\ followed by 7 to
 get the fraction ⅐ (one-seventh). That character is not on the keyboard
 in any sense other than what the driver provides.

 --
 Doug Ewell | Thornton, CO, USA
 http://ewellic.org | @DougEwell ­


Re: Romanized Singhala got great reception in Sri Lanka

2014-03-17 Thread Naena Guru
You have a lot of stars, my friend. You ARE the Lion. Roar!!

You have a lot of friends in Sri Lanka, indeed, and one thanked you highly
for the service you did for the language too. Good for you. However, he did
not know the way to the Public Library, or to the nearest Buddhist temple or
Christian church, that would have enlightened him on the Singhala writing
system, so that he could help you make a meaningful proposal rather than
copying something that somebody else forwarded with a question. Did any one
of them show you Rev. Fr. A. M. Gunasekara's book? No!

Who signed Mettavihari purportedly for the Singhala user group?

On Mon, Mar 17, 2014 at 4:51 AM, Michael Everson ever...@evertype.com wrote:

 On 17 Mar 2014, at 02:48, Naena Guru naenag...@gmail.com wrote:

  You are talking about something you do not know. I am a Singhalese.

 That doesn't give you any special knowledge or privilege. I know many
 people from Sri Lanka, who work in the area of computing, and who work with
 the Sinhala characters as encoded in the UCS.

 And really. The Lion of Unicode?

 My stars.

 Michael Everson

  On Sun, Mar 16, 2014 at 6:18 AM, Michael Everson ever...@evertype.com
 wrote:
 
  On 16 Mar 2014, at 04:12, Naena Guru naenag...@gmail.com wrote:
 
  Dual-script Singhala means romanized Singhala that can be displayed
 either in the Latin script or in the Singhala script using an Orthographic
 Smart Font...
 
  What a terrible, terrible idea. You are essentially promoting giving up
 writing Sinhala, in favour of a slightly-bigger-than-ASCII Latin font hack.
 
  Dual-script Singhala is the proper and complete solution on the
 computer for the Singhala script used to write Singhala, Sanskrit and Pali
 languages.
 
  No, it isn't. It's a huge step backwards, unless you propose abolishing
 the Sinhala script entirely and just writing in Latin.
 
  The government ministries, media and people welcomed it with enthusiasm
 and relief that there is something practical for Singhala. The response in
 the country was singularly positive, except for the person that
 filibustered the QA session of the presentation that spoke about the hard
 work done on Unicode Sinhala, clearly outside the subject matter of the
 presentation.
 
  That person understood the nature of data integrity. As does everyone
 who cares about the Universal Character Set.

 Michael Everson * http://www.evertype.com/




Re: Romanized Singhala got great reception in Sri Lanka

2014-03-17 Thread Naena Guru
Romanized to Unicode? Romanizing is inside Unicode. English and all Western
European languages also use Unicode.

Romanized Singhala resides in the Latin-1 character set, that is, between
U+0020 and U+00FF. Unicode Singhala resides in the range U+0D80 to U+0DFF.

There is no difference between RS and those languages except that the users
live on an island far away from those others. Is there some reason you want
to convert romanized Singhala to Unicode Singhala, a terrible specification
that is already corrupting the language? I spoke to serious users such as
journalists and teachers just a few weeks back. It is unfortunate that you
guys are still hanging on to it. Why?

The proof-of-concept font I made has glyph substitutions. That is how it
can apply an orthography.

Unicode Singhala is a completely botched work. It has vowels each with two
codes, one for the stand-alone vowel and one for its sign. Each consonant is
considered as having the embedded (intrinsic!) vowel a. It is not a
consonant, people. Then it has two ligatures included as basic consonants.
These do not have normalizing rules: (1) because they are NOT canonical
forms, as there was no precedent digital form of Singhala for backward
compatibility; (2) it was submitted after Unicode closed receiving
applications for normalizing canonical forms. How on earth can you make a
sorting method for it?

When you backspace, it destroys multiple keystrokes. Search and replace is
not possible, at least the way we do it with English. Typing is a nightmare.

There are special rules for making Unicode Singhala fonts. The keyboards
have keys to type pieces of letters not in the code block.

As you see, this is a terrible mess and cannot be straightened, granted a few
people use it, and there'll be more. What other choice do they have except
Anglicizing? In Singhala, they say, balu valigee uµa purukee ðaalaa
hæðuvaþ nææ æðee ærennee (බලු වලිගේ උණ පුරුකේ දාලා හැදුවත් නෑ ඇදේ ඇරෙන්නේ
- I inserted all joiners, but can't guarantee if vowel signs would pop
out). It means you cannot straighten a dog's tail even if you put it in a
bamboo piece. You cannot fix Unicode Singhala and sadly, it is bringing
down the language with it.


On Sun, Mar 16, 2014 at 11:05 AM, William_J_G Overington 
wjgo_10...@btinternet.com wrote:

  So, everyone, can the Romanized Singhala system be used with a QWERTY
 keyboard to produce Unicode-encoded text, thereby producing a good combined
 system?

 Could this be achieved if a text-processing software package were produced
 that could automatically perform a character string to character string
 substitution (namely Romanized Singhala character string to Unicode
 character string) that would be applied before any OpenType glyph to glyph
 substitution?

 The character string to character string substitution rules could be
 stored in a text file, such as a UTF-16 text file saved from WordPad, that
 format being what WordPad describes as a Unicode Text Document file type.

 Could this be achieved?

 If so, text entry could use an ordinary QWERTY keyboard and yet the
 resulting text would be stored using the appropriate Unicode characters for
 the script and the font would use Unicode mappings.

 William Overington

 16 March 2014
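
What William describes could be prototyped in a few lines. A minimal Python
sketch, under the assumption that the rules file holds one
romanized-to-Unicode pair per tab-separated line (the file format and the
sample rules below are illustrative only):

def load_rules(path):
    # WordPad's "Unicode Text Document" is UTF-16.
    rules = {}
    with open(path, encoding="utf-16") as f:
        for line in f:
            if "\t" in line:
                src, dst = line.rstrip("\n").split("\t", 1)
                rules[src] = dst
    return rules

def substitute(text, rules):
    # Apply longer keys first so e.g. 'aa' wins over 'a'.
    for src in sorted(rules, key=len, reverse=True):
        text = text.replace(src, rules[src])
    return text

# Hypothetical rules standing in for a real romanized-to-Sinhala table:
print(substitute("kaa", {"aa": "\u0dcf", "k": "\u0d9a"}))   # කා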



Re: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka)

2014-03-17 Thread Naena Guru
Marc,

Yes, making keyboard layouts is not difficult.

I believed that language tools are selected for each language manually when
inputting. I did not know that there is automatic switching of language
tools, say, when you switch to the French keyboard from English. That
shouldn't be difficult to make for RS, though.

In the case of romanized Singhala, any processing that English accepts, it
accepts too. For RS, you select a font to display it in the native script
because if it is mixed with English, both are using the same character
space, just as when English and French are mixed.

Spell checking, grammar checking for Unicode Singhala? There are no such
things for it there. It is at the stage of struggling to input text:
special programs, physical keyboards etc. I saw them. They have a special
IT category of employees to input Unicode Sinhala. They have special places
called typesetting kiosks in Lanka where you go to get your résumé and term
paper printed.




On Mon, Mar 17, 2014 at 3:33 PM, Marc Durdin m...@keyman.com wrote:

  I disagree.  Making a basic keyboard layout is not hard, just like
 making a font without OpenType support is not that hard.   Making a
 keyboard layout that doesn’t force users to learn the nuances of the
 encoding of a script is more of a challenge, and making a high quality
 keyboard layout that is consistent, easy to use, and efficient is anything
 but straightforward.  Most keyboard layouts fail at one of these.



 The story for touch device input is even more challenging.  Not being
 constrained to a physical set of keys increases your flexibility.  The big
 challenge is usually the size of the display on mobile-sized devices.



 Regarding keyboard design:

 - Scan make/break codes are not really relevant to Windows keyboards –
 Windows has an abstraction layer of ‘virtual key’ codes, for better or worse.

 - Selecting US-English for a non-English keyboard means that all language
 tools will break with your text: spell checking, grammar checking, automatic
 keyboard selection, autocorrect, font selection and more.  That’s a big
 price to pay.  Conversely, selecting Singhala for your Romanised non-Unicode
 encoding will break spell checking, grammar checking, automatic keyboard
 selection, autocorrect, font selection and more.



 Marc




Re: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka)

2014-03-17 Thread Naena Guru
Doug,

Making keyboard layouts for Unicode Singhala is hard not because of any
fault of Unicode; it is the complexity of letter assembly. I used the
Wijesekara layout on a 24-inch Olympia Singhala typewriter in the 1970s. It
is radically different from US-English.

I tried to make a phonetic one to kind of relate to the English keys.
Still, you need many shifted keys to get common letters.

On Mon, Mar 17, 2014 at 11:38 AM, Doug Ewell d...@ewellic.org wrote:

 Naena Guru naenaguru at gmail dot com wrote:

  Making a keyboard [layout] is not hard. You can either edit an
  existing one or make one from scratch. I made the latest Romanized
  Singhala one from scratch. The earlier one was an edit of US-
  International.

 I've made a couple dozen of them myself, with MSKLC.

  When you type a key on the physical keyboard, you generate what is
  called a scan-code of that key so that the keyboard driver knows which
  key was pressed. (During DOS days, we used to catch them to make
  menus.) Now, you assign one or a sequence of Unicode characters you
  want to generate for the keypress.

 Precisely. As Marc Durdin said, you can create a keyboard layout just as
 easily for Unicode characters as for ASCII and Latin-1 characters. You
 can also assign a combination of characters to a single key.

 So it is not true that typing Unicode Sinhala requires you to learn a
 key map that is entirely different from the familiar English keyboard,
 while losing some marks and signs too. Unicode does not prescribe any
 key map. You can have whatever layout you like.

 As Marc also said, if you think there are marks and signs missing from
 Unicode, that is another matter.

 --
 Doug Ewell | Thornton, CO, USA
 http://ewellic.org | @DougEwell




Re: Romanized Singhala got great reception in Sri Lanka

2014-03-17 Thread Naena Guru
Thank you, Ken.

You analyzed it very nicely. The reason I said that the signs might pop out
is that I have had complaints of that happening. I think this is because the
implementation of proper rendering is behind in some systems.

On input, I tried to make a layout that is close to QWERTY, but failed
because of the need for too many combination keys. Keyman uses the old
typewriter keyboard, Wijesekara. I saw a better one on the front page for
Singhala but did not find it further inside. Marc would know, of course.

Anyway, my complaint is that Unicode Singhala is incomplete and wrong and
that it has a deleterious effect on the language, one of the oldest in the
world. What's aggravating is that they institutionalize errors as correct.
Rev. Fr. Perera warned against this 80 years ago. I suppose I wouldn't have
much to say if the 58 phonemes were used to replace the ones there. It will
not happen.


On Mon, Mar 17, 2014 at 8:36 PM, Whistler, Ken ken.whist...@sap.com wrote:

Well, I actually don’t see. I took a look at the Sinhala you inserted in
this email. I cannot tell what you did at your input end (about “inserted
all joiners”), but there are no actual joiners in the text itself. It
displayed just fine in my email (including the correct conditional
formatting of the –u vowel applied to the ra in pu*ru*kee), without me doing
anything special (or installing any hacked font). Why? Because it was
transmitted in plain Unicode.

I cut and pasted that Unicode Sinhala string into a Word document, and it
worked just fine. The boundaries for all the syllables were correctly
detected.

I saved it as a plain text UTF-8 file, and it worked just fine. I even then
read the plain text UTF-8 file into a UTF-8 aware programming editor, and it
worked just fine. (In a programming editor, which doesn’t attempt complex
script rendering, the vowels don’t apply to the consonants and no reordering
is done, so the display isn’t correct, but each character is correctly
preserved, and if I write it back out to a document and read it in Word or
some other tool that has access to proper rendering, it is still fine.) And
all that interoperability works, why? Because this is plain Unicode.

So while I don’t doubt that people may be having serious issues with input
methods for Sinhala, I tend to agree with Marc Durdin that you are confusing
encoding with input methods. Yes, I know you know the difference, but it
appears to me that the inescapable conclusion from your argumentation is
that the highest priority for the design of an encoding system should be to
make the design of input methods as simple as possible. And in my
estimation, that is confusing encoding with input methods.

The art of input methods is to hide encoding details from users, and instead
to provide them with an abstraction that they find easy to use and which
accords with their general understanding of the writing system they are
using. If done correctly, then the details of the input method *also* recede
into the background, and users then simply do what they want: write and edit
text easily on their devices.

--Ken

P.S. Here is an octal dump of that text (after I inserted a closing
parenthesis in the editor). Sinhala sequence highlighted. Plain Unicode in
UTF-8, no fancy stuff, and works just fine.

0000000  EF BB BF 62 61 6C 75 20 76 61 6C 69 67 65 65 C2
0000020  A0 75 C2 B5 61 20 70 75 72 75 6B 65 65 C2 A0 C3
0000040  B0 61 61 6C 61 61 20 68 C3 A6 C3 B0 75 76 61 C3
0000060  BE 20 6E C3 A6 C3 A6 20 C3 A6 C3 B0 65 65 20 C3
0000100  A6 72 65 6E 6E 65 65 0D 0A 28 E0 B6 B6 E0 B6 BD
0000120  E0 B7 94 20 E0 B7 80 E0 B6 BD E0 B7 92 E0 B6 9C
0000140  E0 B7 9A 20 E0 B6 8B E0 B6 AB 20 E0 B6 B4 E0 B7
0000160  94 E0 B6 BB E0 B7 94 E0 B6 9A E0 B7 9A 20 E0 B6
0000200  AF E0 B7 8F E0 B6 BD E0 B7 8F 20 E0 B7 84 E0 B7
0000220  90 E0 B6 AF E0 B7 94 E0 B7 80 E0 B6 AD E0 B7 8A
0000240  20 E0 B6 B1 E0 B7 91 20 E0 B6 87 E0 B6 AF E0 B7
0000260  9A 20 E0 B6 87 E0 B6 BB E0 B7 99 E0 B6 B1 E0 B7
0000300  8A E0 B6 B1 E0 B7 9A 29 0D 0A 0D 0A
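
Such a dump is easy to reproduce. A minimal Python sketch (octal offsets,
sixteen hex bytes per line, od-style); it encodes only the parenthesized
Sinhala portion, whereas Ken's file also began with a UTF-8 BOM and held the
romanized line, so the offsets differ:

text = "(බලු වලිගේ උණ පුරුකේ දාලා හැදුවත් නෑ ඇදේ ඇරෙන්නේ)"
data = text.encode("utf-8")
for off in range(0, len(data), 16):
    row = " ".join(f"{b:02X}" for b in data[off:off + 16])
    print(f"{off:07o}  {row}")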



 As you see, this is a terrible mess and cannot be straightened, granted
 a few people use it, and there'll be more. What other choice do they have
 except Anglicizing? In Singhala, they say, balu valigee uµa
 purukee ðaalaa hæðuvaþ nææ æðee ærennee (බලු වලිගේ උණ පුරුකේ දාලා හැදුවත්
 නෑ ඇදේ ඇරෙන්නේ - I inserted all joiners, but can't guarantee if vowel
 signs would pop out). It means you cannot straighten a dog's tail even if
 you put it in a bamboo piece. You cannot fix Unicode Singhala and sadly,
 it is bringing down the language with it.




Re: Romanized Singhala got great reception in Sri Lanka

2014-03-16 Thread Naena Guru
of the Sanskrit hodiya. Rev. Fr. Theodore G. Perera's grammar book (1932)
and Rev. Fr. A. M. Gunasekera's book (1891) that dug up sinking Singhala
fully describe the writing system. Like most other languages, including
English before printing arrived in England, it is written phonetically.

Singhala was first romanized in the 1860s by Rhys Davids, in what is called
the PTS scheme, to print Pali (Magadhi) in the Latin script. This requires
letters with bars (macrons) and dots not found in common fonts. This scheme
is called PTS Pali. It is similar to IAST Sanskrit. It is impossible to type
these on the regular keyboard.

I freshly romanized Singhala by mapping its phonemes to the SAME area where
13 Western European languages mapped their alphabetic letters, within the
following Unicode code charts:
http://www.unicode.org/charts/PDF/U.pdf
http://www.unicode.org/charts/PDF/U0080.pdf

So, if that is 'creating a transcoding table', all Europeans did it, and I
do it too.


On Sun, Mar 16, 2014 at 12:36 AM, Philippe Verdy verd...@wanadoo.fr wrote:

 Don't you realize that what you are trying to create is completely out of
 the scope of Unicode, as it is simply another new 8-bit encoding, similar
 to what ISCII does for supporting multiple Indic scripts with a common
 encoding/transcoding table?

 The ISCII standard has shown its limitations: it cannot be enough to
 support all scripts correctly and completely, and it has lots of unsolved
 ambiguities for tricky cases, historic orthographies, or newer
 orthographies, which the UCS encoding supports better due to its larger
 character set and more precise character properties and algorithms.

 You are in fact creating a transcoding table... Except that you are mixing
 the concepts; and the Unicode and ISO technical committees working on the
 UCS don't need to handle new 8-bit encodings. And you'll soon experience
 the same problems as in ISCII and all other legacy 8-bit encodings: very
 poor INTEROPERABILITY due to version tracking or complex contextual rules...

 You may still want to promote it at some government or education
 institution, in order to promote it as a national standard, except that
 there's little chance it will ever happen when all countries in ISO have
 stopped working on standardization of new 8-bit encodings (only a few ones
 are maintained, but these are the most complex ones, used in China and Japan).

 Well, in fact only Japan now seems to be actively updating its legacy JIS
 standard, but only with the focus of converging it to use the UCS and
 solving ambiguities or some technical problems (e.g. with emojis used by
 mobile phone operators). Even China stopped updating its national standard
 by publishing a final mapping table to/from the full UCS (including for
 characters still not encoded in the UCS): this simplified the work because
 only one standard needs to be maintained instead of 2.

 Note that as long as there is no national standard supporting your
 proposed encoding, there is no chance that the font standards will adopt it.
 You may still want to register your encoding in the IANA registry, but
 you'll need to pass the RFC validation. And there are lots of technical
 details missing in your proposal before it can work with a standard mapping
 in fonts.

 There is a better chance for you to promote it only as a transliteration
 scheme, or as an input method for a keyboard layout (both are also not in
 the scope of the Unicode and ISO/IEC 10646 standards, though they could be
 in the scope of the CLDR project, which is not by itself a standard but
 just a repository of data, supported by a few standards)... Think about it.



 2014-03-16 5:12 GMT+01:00 Naena Guru naenag...@gmail.com:

 I made a presentation demonstrating Dual-script Singhala at National
 Science Foundation of Sri Lanka. Most of the attendees were government
 employees and media representatives; a few private citizens came too.

 Dual-script Singhala means romanized Singhala that can be displayed
 either in the Latin script or in the Singhala script using an Orthographic
 Smart Font. It is easy to input (phonetically) using a keyboard layout
 slightly altered from QWERTY. The font uses Standard Ligature feature
 liga of OpenType / OpenFont standard to display glyphs of Sanskrit
 ligatures as well as many Singhala letters. The font is supported across
 all OSs: Windows, Macintosh, Linux, iOS and Android. Dual-script Singhala
 is the proper and complete solution on the computer for the Singhala script
 used to write Singhala, Sanskrit and Pali languages. The same solution can
 be applied for all Indic languages.

 The government ministries, media and people welcomed it with enthusiasm
 and relief that there is something practical for Singhala. The response in
 the country was singularly positive, except for the person that
 filibustered the QA session of the presentation that spoke about the hard
 work done on Unicode Sinhala, clearly outside the subject matter of the
 presentation

Romanized Singhala got great reception in Sri Lanka

2014-03-15 Thread Naena Guru
I made a presentation demonstrating Dual-script Singhala at National
Science Foundation of Sri Lanka. Most of the attendees were government
employees and media representatives; a few private citizens came too.

Dual-script Singhala means romanized Singhala that can be displayed either
in the Latin script or in the Singhala script using an Orthographic Smart
Font. It is easy to input (phonetically) using a keyboard layout slightly
altered from QWERTY. The font uses Standard Ligature feature liga of
OpenType / OpenFont standard to display glyphs of Sanskrit ligatures as
well as many Singhala letters. The font is supported across all OSs:
Windows, Macintosh, Linux, iOS and Android. Dual-script Singhala is the
proper and complete solution on the computer for the Singhala script used
to write Singhala, Sanskrit and Pali languages. The same solution can be
applied for all Indic languages.

The government ministries, media and people welcomed it with enthusiasm and
relief that there is something practical for Singhala. The response in the
country was singularly positive, except for the person who filibustered
the Q&A session of the presentation, speaking about the hard work done on
Unicode Sinhala, clearly outside the subject matter of the presentation.

The result of the survey passed around was 100% as below (translated from
Singhala):

   1. I believe that Dual-script Singhala is convenient to me as it is
   implemented similar to English - Yes
   2. Today everyone uses Unicode Sinhala. It is easy and has no problems -
   No
   3. The cost of Unicode Sinhala should be eliminated by switching to
   Dual-script Singhala - Yes
   4. We should amend Pali text in the Tripitaka according to rulings of
   SLS1134 - No
   5. Digitizing old books is a very important thing - Yes
   6. We should focus on making this easy-to-use Dual-script Singhala
   method a standard - Yes

Please comment or send questions.


Re: interaction of Arabic ligatures with vowel marks

2014-01-08 Thread Naena Guru
Please see this page: (for IE, use v 2010 and up)
http://lovatasinhala.com/

The font is almost all ligatures. If you copy and inspect the text, you'll
notice that it is simply romanized Singhala. I am currently in Sri Lanka
demonstrating this. The people at the president's office and one of the
powerful ministers have seen it. They are elated that, after all, Singhala,
the most complex of 'Abugidas', is much like a Western European language and
amazingly computer- and user-friendly. This is contrary to how it was
portrayed to them by local academics and technocrats, causing the poor
country unnecessary debt.

The ideas of Abugida and Complex fade away if a font is made with a full
understanding of Unicode's description of ligatures and how they are
implemented by OpenType (now OpenFont). I believe that Arabic and Hebrew
can follow this model so that typing the script is simplified for users
without compromising orthography.


On Wed, Jun 12, 2013 at 8:39 AM, Stephan Stiller
stephan.stil...@gmail.com wrote:

 Hi,

 How is the placement of vowel marks around ligatures handled in Arabic
 text?

 Does anyone have good pointers on this topic?

 My guess is that this does not come up often (just like the topic of
 pointing for handwritten Hebrew), as vowel marks are mostly not added in
 ordinary text. Nonetheless, any text making heavy use of ligatures will
 from time to time need to add vowel marks for a foreign name or as a
 reading aid, and (as many of us know) the Quran is traditionally printed
 with vowel marks.

 I'm also wondering how font designers normally handle this. I think there
 are analogous questions for various ligature-heavy abugidas, so there must
 be an existing body of knowledge. There should be better answers than
 squeeze the vowels around the consonant clusters in whatever way seems
 most intuitive. Do traditional printing presses use extra metal types for
 such glyph clusters, or do they manually add and adjust the positioning of
 vowels?

 Stephan





Re: Interoperability is getting better ... What does that mean?

2013-01-08 Thread Naena Guru
Thank you for commenting and Happy New Year.

CP-1252 is a perfectly legal web character set, and nobody is going to
argue with you if you want to use it in legal ways. (I.e. writing
Latin script in it, not Sinhala.) But .

Okay, what is implied is that I am doing something illegal. Define what I
am doing that is illegal, and cite the rule and its purpose of preventing
what harm to whom.

May I ask if the following two are Latin script, English or Singhala?

1. This is written in English.
2. mee laþingaþa síhalayi.


For me, both are Latin script: 1 is English and 2 is Singhala (it says,
'this is romanized Singhala').

The following are the *only* Singhala-language web pages that pass HTML
validation (challenge me):
http://www.lovatasinhala.com/
They are in romanized Singhala.

The statement,

the death of most character sets makes everyone's systems smaller and
faster

is *FALSE*. Compare the sizes of the following two files that are copies of
a newspaper article. The top part in red has few more words in romanized
Singhala in the romanized Singhala file. Notice the size of each file:
1. http://ahangama.com/jc/uniSinDemo.htm  size:38,092 bytes
2. http://ahangama.com/jc/RSDemo.htm  size:18,922 bytes
As the page grows, the Unicode Sinhala version tends toward double the size
of its romanized Singhala version, because each Unicode Sinhala character is
two bytes against one byte per romanized character. When UTF-8 encoded for
transmission, Unicode Sinhala characters grow a further 50%, to three bytes
each; that is three times the per-character size of the romanized Singhala
text. So the Unicode Sinhala file consumes three times the bandwidth needed
to send the romanized Singhala file.
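
A minimal sketch of that per-character arithmetic, using only the Python
standard library (the file sizes above are the author's measurements and are
not reproduced here):

    ayanna = "\u0D85"                       # SINHALA LETTER AYANNA
    print(len(ayanna.encode("utf-16-le")))  # 2 bytes as a 16-bit code unit
    print(len(ayanna.encode("utf-8")))      # 3 bytes when UTF-8 encoded
    print(len("a".encode("latin-1")))       # 1 byte per romanized letter

Three bytes per Sinhala letter against one byte per romanized letter is the
three-to-one bandwidth ratio claimed.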

more likely to correctly show them the document instead of trash

Again *demonstrably WRONG*: Unicode Sinhala is trash on a machine that does
not have the fonts. It is trash also if the font used by the OS is
improperly made, as on the iPhone. It is generally trash because the
SLS1134 standard corrupts at least one writing convention (the brandy issue).
On the other hand, romanized Singhala is always readable whether you have
the font or not. It is not helpful to criticize Singhala related things
without making a serious effort to understand the issues. Blind men thought
different things about the elephant.

If you mean that everyone should start using 16-bit Unicode characters, I
have no objection to that. It would happen if and when all applications
implement it. I cannot fight that even if I want to. But I do not see users
of English doing anything different from what they are doing now, like my
typing now, I think, using 8-bit characters. (I can verify that by copying
it and pasting into a text editor.)

I showed that Singhala can be romanized and all the problems of the
ill-conceived Unicode Indic can be eliminated by carefully studying the
grammar of the language and romanizing. (I used the word 'transliterate'
earlier, but the correct word is transcribe.) I did it for Singhala and
made an OpenType font to show it perfectly in the traditional Singhala
script. So far, there is one RS smartfont and six Unicode fonts, even after
spending $20M for a foreign expert to tell how to make fonts, though it is
right on the web in the same language the expert spoke in.

My work irritates some, maybe because it is an affront to their belief that
they know all and decide all. Some feel let down that they could not think
of it earlier, and maybe write about a strange discovery like the 'Abugida'
and write a book on the nonsense. Most of all, I think it is just a
cultural block on this side of the globe.

As for Lankan technocrats, their worry is that the purpose of ICTA would
come unraveled.  I went there in November and it was revealed to me (by one
of its employees) that its purpose is to provide a single point of contact
for foreign vendors that can use local experts as their advocates.


On Thu, Jan 3, 2013 at 12:56 AM, Leif Halvard Silli 
xn--mlform-...@xn--mlform-iua.no wrote:

 Asmus Freytag, Mon, 31 Dec 2012 06:44:44 -0800:
  On 12/31/2012 3:27 AM, Leif Halvard Silli wrote:
  Asmus Freytag, Sun, 30 Dec 2012 17:05:56 -0800:
  The Web archive for this very list, needs a fix as well …
 
 
  The way to formally request any action by the Unicode Consortium is
  via the contact form (found on the home page).

 Good idea. Done!

 Turned out to only be - it seems to me - an issue of mislabeling the
 monthly index pages as ISO-8859-1 instead of UTF-8. Whereas the very
 messages themselves are archived correctly. And thus I made the request
 that they properly label the index pages.

 Happy new year!
 --
 leif h silli





Re: Interoperability is getting better ... What does that mean?

2013-01-01 Thread Naena Guru
It used to be that during the HTML 4 days, ISO-8859-1 was the default
character set for pages that used SBCS (those that stay within Basic Latin
and the Latin-1 Supplement). At least that is what the Validator
(http://validator.w3.org/) said.

(By the way, Unicode is quietly suppressing the Basic Latin block by
removing it from the Latin group at the top of the code charts page (
http://www.unicode.org/charts/) and hiding it under different names in the
lower part of the page.)

Now the validator complains, correctly, that some characters in those pages
do not belong to ISO-8859-1 if you use bullet points, the ellipsis, etc. It
says they come from Windows-1252. That is true. If you declare these pages as
UTF-8, then it throws off *all* Latin-1 characters and the web pages show
the character-not-found glyph.

Windows-1252 replaces the C1 control codes (bytes 0x80-0x9F) of ISO-8859-1
with some common punctuation marks (curly quotes, dashes, the euro sign)
and a few extra letters used by Western European languages.

There is one main consideration in the mind of the web developer: make the
file as small as possible. Try this: make a text file in Windows Notepad
and save it in the ANSI, Unicode and UTF-8 formats. The ANSI file
(Windows-1252) will be the smallest. Why should people make their pages
larger just to satisfy some people's idea of perfection? It reminds me of
the Plain Text and language-detection myths.
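
A minimal sketch of that Notepad experiment with Python's standard codecs
(Notepad's 'ANSI' is Windows-1252, its 'Unicode' is UTF-16 LE with a BOM,
and its UTF-8 also writes a BOM):

    text = "This is a test."
    print(len(text.encode("cp1252")))     # 15 bytes: the 'ANSI' file
    print(len(text.encode("utf-16")))     # 32 bytes: BOM + 16-bit code units
    print(len(text.encode("utf-8-sig")))  # 18 bytes: BOM + UTF-8 bytes

For pure Basic Latin text, the UTF-8 body is the same size as the ANSI one;
only the three-byte BOM differs.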


On Mon, Dec 31, 2012 at 8:44 AM, Asmus Freytag asm...@ix.netcom.com wrote:

 On 12/31/2012 3:27 AM, Leif Halvard Silli wrote:

 Asmus Freytag, Sun, 30 Dec 2012 17:05:56 -0800:
 The Web archive for this very list, needs a fix as well …



 The way to formally request any action by the Unicode Consortium is via
 the contact form (found on the home page).

 A./




Re: Compiling a list of Semitic transliteration characters

2012-09-20 Thread Naena Guru
[Sorry for this delayed response. I had it in the drafts box.]

INDIC:

Speculation is one thing; doing is another. I have *successfully* transliterated
Singhala. Devanagari is perhaps less complicated. I haven't had the time
and money to do it. The difference between the two is not even as much as
between common Latin and Cyrillic.

Statements like,

Using Unicode is recommended in preference to any code page because it has
better language support and is less ambiguous than any of the code pages.

are trying to assert untruths that people tend to believe without concrete
reasons. 'Better language support' and 'less ambiguous'?

That statement is by Microsoft right in the registration of Windows-1252
that plainly contravenes Unicode:
http://msdn.microsoft.com/en-US/goglobal/cc305145.aspx

All languages in the developed countries of the West, including English, use
Windows-1252! And all the others, who use double-byte codes etc., use UTF-8
to transport their codes across the web, because only single-byte codes are
trustworthy. We used Base64 for the same end in the '90s.

I agree that following ISCII, whatever it is, might be the problem. Even
so, that is no excuse for not researching enough and fixing the problem.
After all, the claim is that Unicode provides BETTER LANGUAGE SUPPORT and
LESS AMBIGUITY -- both of them were happily presented with Unicode Sinhala.
But what happened? Why do we see constant questions regarding Unicode Indic?

I researched how Sanskrit and Pali were transliterated during the
letterpress days and why they were so successful and without loss in
mapping. If that were not so, we wouldn't have a Sanskrit dictionary as
comprehensive as Monier-Williams available on the web. The transliteration
of Sanskrit is perfect, whether ITRANS or Harvard-Kyoto.

The same is true of PTS Pali. Half a century ago, I worked with Ven.
Nyanaponika (the ultimate BuJu) when printing Pali, and I know the real
technicalities letterpress printers faced. I know why certain diacritics
were chosen, and why some letters were chosen over others that seem
more logical.

I have demonstrated on the following page how romanized Singhala, IAST and
HK match perfectly; they are simple direct mappings (read the
JavaScript):
http://www.lovatasinhala.com/liyanna-e.php#sanskrit
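
To illustrate how direct such mappings are, here is a minimal sketch with a
few representative pairs only (the site's JavaScript is the actual mapping)
of Harvard-Kyoto to IAST conversion:

    # Illustrative subset of the HK -> IAST correspondence.
    HK_TO_IAST = {
        "A": "ā", "I": "ī", "U": "ū", "RR": "ṝ", "R": "ṛ",
        "M": "ṃ", "H": "ḥ", "G": "ṅ", "J": "ñ",
        "T": "ṭ", "D": "ḍ", "N": "ṇ", "z": "ś", "S": "ṣ",
    }

    def hk_to_iast(text):
        out, i = [], 0
        while i < len(text):
            for length in (2, 1):        # longest match first ("RR" before "R")
                chunk = text[i:i + length]
                if chunk in HK_TO_IAST:
                    out.append(HK_TO_IAST[chunk])
                    i += length
                    break
            else:
                out.append(text[i])      # letters that map to themselves
                i += 1
        return "".join(out)

    print(hk_to_iast("saMskRta"))        # -> saṃskṛta

The whole scheme is a lookup table and a longest-match rule; nothing more is
needed, which is the point being made above.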

It is presumptuous to say 'the rest of the post is irrelevant.' It has the
same attitude as the Microsoft statement I gave above: 'We know
what's best for you (and remember that we hold all the strings). So do as
we say.'

*halanta* means *hal anta*, ending consonant (*hal* = consonant).
*virAma* means closing, and in grammar it means the last consonant without a
vowel following it. *mAtra* means a measure; it is an IE cognate of measure,
meter, etc. In grammar it means the spoken length of a short vowel, the same
as mora in Latin. All three of these terms are used misleadingly in
Unicode documentation. Instead of 'language support', it mangles the
language.


SEMITIC LANGUAGES:
I checked the Arabic alphabet in Wikipedia. It is similar to the older
Indic writing, Brahmi in particular, where there is no vowel sign, but the
vowel is understood from the context (a, i, etc.).

On Fri, Sep 7, 2012 at 1:16 PM, Richard Wordingham 
richard.wording...@ntlworld.com wrote:

 On Fri, 7 Sep 2012 11:43:59 -0500
 Naena Guru naenag...@gmail.com wrote:

  Transliteration or Romanizing
 
  My first advice is not to embark on making solutions for languages
  that you do not know. Unicode ruined Indic and Singhala by making
  'solutions' for them by not doing any meaningful research and
  ignoring well-known Sanskrit grammar and previous solutions for Indic.

 The problems, if any, are due to thinking that the Indians understood
 what they were doing when they developed ISCII.  Such problems as
 there are appear to arise from a belief that all Indic scripts are like
 Devanagari.

  I romanized Singhala, probably the most complex script among all
  Indic, and made an orthographic font that in turn shows the
  transliterated text in its native script.
  http://www.lovatasinhala.com
 
  Some reasons for romanizing:
 snip
  3. Make the language accessible to those who are not familiar with the
  script

 The rest of the post is irrelevant.  Transliterations from Semitic
 languages have been established for this reason, and possibly because
 of costs of making and setting type.  One issue at hand is that there is
 not a *single* transliteration to hand, and certainly not a single
 pan-Semitic one.  Therefore I strongly doubt that an 8-bit code would
 encompass everything that was needed.

 Richard.




Re: Compiling a list of Semitic transliteration characters

2012-09-07 Thread Naena Guru
Transliteration or Romanizing

My first advice is not to embark on making solutions for languages that you
do not know. Unicode ruined Indic and Singhala by making 'solutions' for
them by not doing any meaningful research and ignoring well-known Sanskrit
grammar and previous solutions for Indic.

I romanized Singhala, probably the most complex script among all Indic, and
made an orthographic font that in turn shows the transliterated text in its
native script.
http://www.lovatasinhala.com

Some reasons for romanizing:
1. The current solution is hard to use and incomplete
2. A user friendly method on the computer for native users for their
language
3. Make the language accessible to those who are not familiar with the
script
4. Help in linguistic studies, take advantage of text to voice technologies
etc.

In order to romanize successfully, you need to select the best character
set. Unicode is a very bad choice because its codes are at least two bytes.
Do not be fooled by statements like 'use added bonus characters', which
lure you into the crippled double-byte area. Then some might try to scare
you by saying you are making rogue encodings. The only constructive
suggestion I had was to be aware of Latin semantics, which too was trivial.

In my opinion, the best character set is ISO-8859-1, which was modified by
Microsoft to Windows-1252. The following table shows both of them merged.
The area with yellow background is the part that was modified by Microsoft.
ISO-8859-1 had machine control commands in that area.
http://www.lovatasinhala.com/eds/charsets.htm
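
A minimal sketch of that modified area with Python's standard codecs (the
five unassigned Windows-1252 positions come out as replacement characters):

    # Bytes 0x80-0x9F: C1 control codes in ISO-8859-1, but punctuation and
    # extra letters (curly quotes, dashes, the euro sign, Œ, Š, Ž ...) in
    # Windows-1252.
    area = bytes(range(0x80, 0xA0))
    print(area.decode("latin-1"))                   # invisible control codes
    print(area.decode("cp1252", errors="replace"))  # printable characters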

Notice that capitals and simple letters are separately encoded. Capitals can
be used to indicate modified or closely related sounds to the ones used on
the regular keys. The AltGr or Ctrl+Alt shifted state gives you more options.
Remember that there are letters that are not seen often, such as þ, ð, æ,
etc. The key positioning in relation to English sounds is more important
than the fear of their unfamiliarity. I used þ and ð in the regular
positions of t and d, and moved t and d to AltGr-shifted positions. The
users hardly notice it, because the t and d keys are what they used in
Anglicizing anyway, and these now give a more accurate interpretation. By
the way, starting with Anglicizing and refining it is what would be easiest
for the users to adapt to.

The keyboards that have the ISO-8859-1 characters are US-International,
US-Extended and the Dead-key Keyboard on Windows, Macintosh and
Linux systems. All three of these OSes also have easy ways to add your own
customized keyboard. In my experience, it is best to strip off all keys not
used by a transliteration scheme from the customized keyboard, to avoid
typos. Consider whether ZWJ and/or ZWNJ and the non-breaking space might be
useful on the new keyboard.

As for directionality, I'd keep it L2R, and if you make fonts like I did,
the transformation might incorporate the direction switch.

Good luck!


Re: Compiling a list of Semitic transliteration characters

2012-09-07 Thread Naena Guru
Thank you, Philippe. So, what did you say?

On Fri, Sep 7, 2012 at 8:58 AM, Philippe Verdy verd...@wanadoo.fr wrote:

 2012/9/7 Leif Halvard Silli xn--mlform-...@xn--mlform-iua.no:
  The word Roman, can also refer to Greek. So it is best to avoid
  that term. ;-)

 The Roman empire was speaking a large set of languages (and writing in
 various scripts) from Europe to Asia and Africa, even if Latin was
 used in Rome, and written in the Latin script (but not only).

 But the conventional meaning of romanisation is that it is a
 transcription to the Latin script (independently of the target
 language).

 The concept of transliteration, rather than transcription, is in fact
 quite new in human history : the initial need was just to write how
 languages were pronounced, with more or less approximations, to match
 the way another language is written, read and pronounced (in the
 target phonology). So a transcription has always been lossy.

 But the real difference between transcription and transliteration is
 for another role : a transliteration attempts to preserve the maximum
 of the source language phonology and meaning, avoiding most
 ambiguities. So a transliteration occurs within the same language. A
 transliteration scheme is created when a language starts changing
 its standard script in some area. But even in that case it is
 extremely rare that this conversion will be lossless : there are
 frequent adaptation of the orthography, and some historic
 orthographies in the original script (such as mute letters or more
 frequently letters whose current phonology has changed considerably so
 that the original orthography in the source script is already far away
 from the actually spoken language, or because some historic
 distinctions are no longer heard and the transliteration scheme is
 representing the letters the same way : N-to-1 is then frequent as
 well).

 For some pairs or scripts, it is impossible to be 1-to-1, because the
 scripts work very differently : alphabets are not like abjads or
 akharas, and not like ideographic scripts. So adaptation is
 unavoidable. When a language changes its standard script, there is
 also very frequently an orthographic reform on the new script, so even
 the rules of transliterations contain a lot of new exceptions, to
 match the new orthography. When this change of script is just
 motivated to ease the learning of the language by people that are
 better aware of another script, the transliteration rules will often
 be more strict. It will be much stricter if this change of script is
 motivated by technical reasons (but people are generally not very well
 trained on how to make this conversion, so they will each use
 their own transliteration scheme, to approximate the language).

 For this reason, the distinction between lossy and lossless is not
 very relevant to make the distinction between a traditional
 transcription and a modern transliteration. My opinion is that the
 simplest conversions that try to avoid most ambiguities are just named
 transliteration and they occur within the same language in the same
 region. Transcriptions are more traditional and instead on focusing on
 the source language, they try to best approximate the phonology of
 another language in its current common orthography.

 Different needs, different rules, but even in both cases the rules are
 not followed exactly. None of them are lossless. But the distinction
 is there. There's no clearly defined separation line between
 transliteration schemes and transcription schemes, except by their
 intent to preserve a source language or best approximate another one.

 So the standard conversion of Chinese from Han ideographs to Bopomofo
 or Latin (with the Pinyin standard) could be called transliterations
 even if there's by evidence a lot of losses. Same thing about Romaji
 in Japanese. And even for Korean the standard conversion from the
 Hangul alphabet to Latin creates some ambiguities and is a bit lossy.

 Note also that a transcription also occurs within the same script :
 when you adapt an orthography to use other letters than in the
 original orthography, this is not a transliteration. For example when
 you transcribe French into an English context, you'll commonly convert
 ou into oo, or will disambiguate some s into z, or some c
 into k or s. The intent is to tell English native speakers how to
 read a word written in another language (e.g. you say that the French
 word paille should be read like the English pie. This is not a
 transliteration but a transcription).

 As well, when you convert the language into a phonetic alphabet like
 IPA, the process is definitely not a transliteration but a
 transcription, even if this occurs within the same Latin script (many
 people are arguing that IPA is not part of the Latin script as it does
 not contain letters, but symbols, and it is monocameral and it
 cannot follow the common typographic rules, in addition to the fact
 that it borrows 

Re: Sinhala naming conventions

2012-07-15 Thread Naena Guru
Mahesh,

Thank you. I like this line of discussion better than the constant effort to
condemn me for abstract crimes. I have not seen any standard whose
conditions stand in the way of proper implementation of Singhala on the
computer. There, my challenge stands:
1. to show where I hurt Singhala by romanizing;
2. to show how romanized Singhala violates any standard, and in what
specific way;
3. to show that romanized Singhala is inferior to Unicode Singhala in some
unique way that no other implementation of a language on the
computer displays.

SANSKRIT:
I use Monier-Williams steadily because, of late, I have come to doubt the
Lankan 'educated' class on the use of venerable Sanskrit. No one group of
people owns Sanskrit. I see it as our (South Asian) reserve for clarity of
expression. This is certainly true for Singhala.

I consider Monier-Williams first an Indian and second of British descent --
an authority on Sanskrit:
http://en.wikipedia.org/wiki/Monier_Monier-Williams

The dictionary he compiled is maintained by Cologne University (
http://en.wikipedia.org/wiki/Cologne_University_of_Applied_Sciences). It
has been online for three or four years now.
http://www.sanskrit-lexicon.uni-koeln.de/cgi-bin/tamil/recherche.  Before
that, I used a downloaded version. Two entries:


*lipi*f. (accord. to L. also %{lipI}) smearing , anointing c. (see
%{-kara}) ; painting , drawing L. ; writing , letters , alphabet , art or
manner of writing Ka1v. Katha1s. [902,3] ; anything written , manuscript ,
inscription , letter , document Naish. Lalit. ; outward appearance
(%{lipim-Ap} , with gen. , ` to assume the appearance of ' ;
%{citrAM@lipiM} %{nI} , ` to decorate beautifully ') Vcar.
*lipisaMkhyA*f. a number of written characters L.

There you have it, as it were, straight from the horse's mouth. I feel safe
using lipisaMkhyaa for 'text'.

Please read my other responses inline:

On Tue, Jul 10, 2012 at 1:52 PM, Mahesh T. Pai paiva...@gmail.com wrote:

 Naena Guru said on Tue, Jul 10, 2012 at 10:31:10AM -0500,:

   Another one I found was lipisaMkhyA, which means exactly what we
   term 'text'.

 I am not sure of the context, but, there is a tradition of using text
 to write numbers. (samkhya = numbers)

   Tamil Nadu where Indian linguists reside is closer to Colombo than
 Delhi I

 The late Sri P. V. Narasimha Rao, the former Prime Minister, knew 6 or
 7 non-Indian languages; and IIRC, he used to give interviews in
 Spanish (or was it Portuguese?). He was not a resident of Tamil Nadu,
 AFAIK.

Actually, I did not mean to put Southerners over Northerners (the latter of
whom brought Sanskrit to the South! -- agasti during KulaSekara's time?). It
is just that I see a lot of Tamil-like names mentioned. I think that there
is some language institute in Tamil Nadu that is much involved in linguistics.


 Or is knowledge of that many languages not sufficient to qualify him
 as a linguist?

Why split hairs with 'linguist'? Indeed, he is a SaD-bhASA-parama-Izvara =
(sandhi) SaDbhASA paramezvara. It was like the top honorific given to a
linguist.


 (PS:- Ah. Well -- PV was refered to as a polyglot; not a linguist
 - you win)

Let's not get trapped by English semantics. The movers among the users of a
branch of study can adjust connotations for their benefit, or so it seems
in the IT field. Did you notice that lately the word 'script' has been
redacted and 'character set' is used instead? What's with that?

Why do we generally understand 'transliteration' as shorthand for
transcribing from another script to the Latin script? It is because the
Latin script is recognized by most people, and it is the first script that
is used in the communication technology of a given era. It was so for
letterpress printing in the 1800s, and now for the computer. The plain truth
is that Latin-1, second to ASCII, is the best-settled coded script
(character set) on the computer. This is exactly why I romanized Singhala
into Latin-1. Verify that by going to the following web site and clicking
on the link that says, 'Latin Script':
www.lovatasinhala.com

Latin-1 is not the complete repertoire of codepoints for letters *derived
from the proper Latin script*. It is the first part only. I'd say with
confidence that 99.9% of well-established applications accept Latin-1 with
no problem. However, *most* of those applications *do not* accept the
accented Latin characters used by PTS Pali and IAST Sanskrit (the ones with
macron bars and dots).

One of the detractors here called the Latin characters outside Latin-1
'Added Bonus'. I have used these letters in the LIYANNA page of the web
site. There are separate conversion sections in it for Pali and Sanskrit.
Each of them has edit boxes to enter the language in one form and to get
it in the other(s).
Singhala:
http://www.lovatasinhala.com/liyanna.php#unisin (rom. Sing - Unicode
Sinhala)
Pali:
http://www.lovatasinhala.com/liyanna.php#pali (rom. Sing - PTS Pali)
Sanskrit:
http://www.lovatasinhala.com/liyanna.php#sanskrit (RS - HK - IAST)

As a Singhalese

Re: Charset declaration in HTML (was: Romanized Singhala - Think about it again)

2012-07-15 Thread Naena Guru
On Tue, Jul 10, 2012 at 11:58 PM, Leif Halvard Silli 
xn--mlform-...@xn--mlform-iua.no wrote:

 Naena Guru, Tue, 10 Jul 2012 01:40:19 -0500:

  HTML5 assumes UTF-8 as the character set if you do not declare one
  explicitly. My current pages are in HTML 4.

 There is in principle no difference between what HTML5-parsers assume
 and what HTML4-parsers assume: All of them default to the default
 encoding for the locale.

I see. That is, for the transliteration, the locale should be Sinhala
(Latin). Yes. I know that it is not official. I loathe the spelling
Sinhala. Oh, well, you cannot have it all.


  Notepad forced
  me to save the file in UTF-8 format. I ran it through W3C Validator. It
  passed HTML5 test with the following warning:
 
  [image: Warning] Byte-Order Mark found in UTF-8 File.

 I assume that you used the validator at


 http://validator.w3.org.

Yes, and it validated it. I was talking about BOM in a different context.
It showed up when I opened the file in HTML-Kit that was first created in
Notepad and saved under UTF-8. HTML-Kit Tools asked me to specify the
character set. It took it, but it messed up the macron and dot letters anyway.
What I was trying to emphasize was the fact that it is hard for those
people who try to make web pages in those 'character sets'. I have been
making web pages since the 1990s and never had these problems because they were
written by hand in English.



 But if you instead use the most updated HTML5-compatible validators at

 http://www.validator.nu
 or  http://validator.w3.org/nu/



 then will not get any warning just because your file uses the
 Byte-Order Mark. HTML5 explicitly allows you to use the BOM.

Thanks. This too validated all seven pages as HTML5 (I upgraded from HTML
4).




  The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to
 cause
  problems for some text editors and older browsers. You may want to
 consider
  avoiding its use until it is better supported.

 Weasel words from the validator. The notion about older browsers is
 not very relevant. How old are they? IE6 has no problems with the BOM,
 for instance. And that is probably one of the few, somewhat relevant,
 old browsers.

As I said before BOM was no problem for me.


 As for editors: If your own editor have no problems with the BOM, then
 what? But I think Notepad can also save as UTF-8 but without the BOM -
 there should be possible to get an option for choosing when you save
 it. Else you can use the free Notepad++. And many others. In VIM, you
 set or unset the BOM via the commands

 set bomb
 set nobomb

Yes, yes. I've seen it before. I have Notepad++. It intimidated me the
first time and I never used it, haha!

 --
 Leif H Silli



Re: Ewellic again (was: Re: Romanized Singhala - Think about it again)

2012-07-15 Thread Naena Guru
My error. Sorry, Doug.

On Sun, Jul 8, 2012 at 8:00 PM, Doug Ewell d...@ewellic.org wrote:

 Unicode character database goes from zero to some very big number. There
 are no holes in it to define character sets for somebody's fancy. Well,
 Doug Ewell did one for Esperanto expanding fuþorc.


 Ewellic is not futhorc. They are different scripts.

 From the Omniglot page on Ewellic (with *emphasis* added):
 The shape of Ewellic letters was *inspired by* the Runic and Cirth
 scripts, but shows greater (though still imperfect) regularity of form.

 --
 Doug Ewell | Thornton, Colorado, USA
 http://www.ewellic.org | @DougEwell ­





Re: Charset declaration in HTML (was: Romanized Singhala - Think about it again)

2012-07-15 Thread Naena Guru
Hey, Philippe,

Your input is much appreciated. So, in a nutshell, I don't have to worry.
One of these days I need to crunch down (minify) the CSS and JavaScript
pages. I left them readily readable so that techs like you could easily
read them in place in any browser without having to pretty print. The pages
are not big by any standard and they download pretty fast. Your earlier
point about WOFF is what I am going to try and tackle today (Sunday).

In the meanwhile, thanks again.

On Tue, Jul 10, 2012 at 11:32 PM, Philippe Verdy verd...@wanadoo.fr wrote:

 2012/7/10 Naena Guru naenag...@gmail.com

 I wanted to see how hard it is to edit a page in Notepad. So I made a
 copy of my LIYANNA page and replaced the character entities I used for
 Unicode Sinhala, accented Pali and Sanskrit with their raw letters. Notepad
 forced me to save the file in UTF-8 format. I ran it through W3C Validator.
 It passed HTML5 test with the following warning:

 [image: Warning] Byte-Order Mark found in UTF-8 File.

 The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to
 cause problems for some text editors and older browsers. You may want to
 consider avoiding its use until it is better supported.

 The BOM is the first character of the file. There are myriad hoops that
 non-Latin users go through to do things that we routinely do. This problem
 I saw right at the inception. I already know why romanizing is so good.
 Don't you?


 You should probably ignore this non-critical warning now; it is only for
 extremely strict compatibility with deprecated software that should have
 been updated long ago for obvious security and performance reasons.

 Those old browsers are deprecating fast, due to the massive and fast
 spread of security attacks and automatic security updates that close issues
 completely (instead of just by preventive virus detection based on code
 behavior or code patterns, which will never be complete and fast enough to
 react to these extremely frequent attacks).

 Older editors do not have the comfort that newer editors have. The memory
 usage of these newer editors is no longer a problem (notably for web
 developers, who have systems largely above what their average users have),
 and systems capable of running them have never been so cheap. In addition,
 memory and storage costs have dramatically decreased.

 We are more concerned about the bandwidth usage, so your web editing
 platform should include an optimisation process and converters that will
 automatically use a compact representation (numeric character references,
 for example, can be sent by your server as raw UTF-8; in addition the
 server can now support on-the-fly data compression over the HTTP sessions;
 there also exist frontend proxies that will do that for you without
 requiring you to change the development/editing methods you use).

 Most text editors, even in Linux, can now successfully open UTF-8 files
 starting with a BOM without complaining, just as Notepad has long done.
 And they allow you to change this edit mode before saving.

 Most text processors will silently discard the U+FEFF character (it should
 be safe to do that everywhere, given that U+FEFF should no longer be used
 for anything other than BOMs)

 [side note]
   But Notepad has another problem since long: it cannot successfully open
 a text file whose lines are terminated by LF only; it absolutely wants them
 to be converted to CR+LF sequences. This problem is much more severe
 than the use of a leading BOM.
   As well, Excel cannot successfully decode a UTF-8 encoded CSV file.
 But it can automatically recognize it if you use the import-data
 function instead. This is inconsistent (also, it still does not allow
 specifying how to convert numbers using dots instead of commas; when
 running it on a non-English user locale, you need to use a search/replace
 function manually; it does not allow selecting the date format for CSV file
 imports, making search/replace operations non-trivial on date fields; no
 question is asked of the user, it only uses implicit defaults even when
 they are wrong, as is most often the case with actual CSV files).
 [/side note]

 But it has nothing to do with your problem of romanization or behavior
 with Latin. BOMs are only absent from old 8-bit character sets that are no
 longer recommended in any modern Internet protocols, and from 7-bit ASCII,
 used only for internal technical data but not for any text intended to be
 read and translated.

 Only UTF-8 support is mandatory now. And that's fine. HTTP headers or URLs
 require a specific encoding, but web servers and design tools can take
 care of that.

 Everything else is optional and will require explicit metadata (the
 exceptions being UTF-16 and UTF-32, which are not well suited for
 interchanges across heterogeneous networks and independent realms, but are
 used mostly for internal processes, for which you absolutely don't need any
 byte order change, so for which

Re: Charset declaration in HTML (was: Romanized Singhala - Think about it again)

2012-07-10 Thread Naena Guru
Thank you Otto.

Sorry for the delay in replying. I spent the entire Sunday replying to the
Jaques twins.

You are absolutely right about the choice between ISO-8859-1 and UTF-8. I
shouldn't have said 'using ISO-8859-1 is advantageous over UTF-8.' It is
efficient if your pages are written in a language that uses single-byte
codepoints. When you mix in multi-byte codepoints, like you said, the
ideal is to have them in their raw form. But in practice, this is not as
easy as we think.

Actually, the trade-off is not great for me because I use very few
non-SBCS characters. Each two-byte character would end up as six or more
bytes as a hex character entity. If you want to control the look of your
web site, then you probably have to have expensive software to do it. As
for poor me, I use CSS, JavaScript and HTML inside HTML-Kit.
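
A minimal sketch of that entity arithmetic for one Sinhala code point:

    ch = "\u0D85"                    # SINHALA LETTER AYANNA
    entity = "&#x%X;" % ord(ch)      # the hex character entity "&#xD85;"
    print(len(entity))               # 7 bytes when sent as ASCII text
    print(len(ch.encode("utf-8")))   # 3 bytes when sent as raw UTF-8

So an entity costs roughly twice what raw UTF-8 does per character; that is
the trade-off being weighed here.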

HTML5 assumes UTF-8 as the character set if you do not declare one
explicitly. My current pages are in HTML 4.

As I said, I use HTML-Kit (and Tools). If I have raw Unicode Sinhala in the
HTML or JavaScript, it messes them up and gives you character-not-found for
them on the web page. I must have character entities if I need the comfort
of HTML-Kit. There are web sites that help you process your SBCS and
multi-byte mixed text to make character entities for non-Latin-1
characters. I used them when making my only page that has them (Liyanna).
Stop and think why there are such websites. (Search 'text to unicode'.) The
world outside Latin-1 is a harsh one.

If I want to have raw Unicode Sinhala, PTS Pali or IAST Sanskrit, I have to
use Notepad instead of HTML-Kit. It is hard to code without color-coded
text.

I wanted to see how hard it is to edit a page in Notepad. So I made a copy
of my LIYANNA page and replaced the character entities I used for Unicode
Sinhala, accented Pali and Sanskrit with their raw letters. Notepad forced
me to save the file in UTF-8 format. I ran it through W3C Validator. It
passed HTML5 test with the following warning:

[image: Warning] Byte-Order Mark found in UTF-8 File.

The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to cause
problems for some text editors and older browsers. You may want to consider
avoiding its use until it is better supported.

The BOM is the first character of the file. There are myriad hoops that
non-Latin users go through to do things that we routinely do. This problem
I saw right at the inception. I already know why romanizing is so good.
Don't you?

UTF-8 encoding is this RFC:
http://www.ietf.org/rfc/rfc2279.txt

This is the table it gives on the way UTF-8 encoding works:
   0000 0000-0000 007F   0xxxxxxx                                     === ASCII
   0000 0080-0000 07FF   110xxxxx 10xxxxxx                            === Latin-1 plus higher
   0000 0800-0000 FFFF   1110xxxx 10xxxxxx 10xxxxxx                   === Unicode Sinhala
   0001 0000-001F FFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
   0020 0000-03FF FFFF   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
   0400 0000-7FFF FFFF   1111110x 10xxxxxx ... 10xxxxxx

Observe that ASCII 'a' stays a single byte under UTF-8, a Latin-1 letter
such as 'á' goes from one coded byte to two, and Unicode Sinhala Ayanna
goes from two to three.
Unicode Sinhala: 0D80 - 0DFF
a = Hex 61 = Bin 0110 0001 - within the ASCII range, so
UTF-8 Encoding: 0110 0001 = Hex 61 (unchanged)

á = Hex E1 = Bin 1110 0001 -
UTF-8 Template: 110xxxxx 10xxxxxx
UTF-8 Encoding: 1100 0011 1010 0001 = Hex C3 A1

ayanna = Hex 0D85 = Bin 0000 1101 1000 0101 -
UTF-8 Template: 1110xxxx 10xxxxxx 10xxxxxx
UTF-8 Encoding: 1110 0000 1011 0110 1000 0101 = Hex E0 B6 85
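
A minimal sketch, standard library only, that verifies those three encodings:

    for ch in ("a", "\u00E1", "\u0D85"):    # a, á, Sinhala ayanna
        print("U+%04X ->" % ord(ch), ch.encode("utf-8").hex(" ").upper())
    # U+0061 -> 61
    # U+00E1 -> C3 A1
    # U+0D85 -> E0 B6 85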

Thanks for your input. It is appreciated.


On Wed, Jul 4, 2012 at 2:25 PM, Otto Stolz otto.st...@uni-konstanz.de wrote:

 Hello Naena Guru,

 on 2012-07-04, you wrote:

 The purpose of
 declaring the character set as iso-8859-1 than utf-8 is to avoid doubling
 and trebling the size of the page by utf-8. I think, if you have
 characters
 outside iso-8859-1 and declare the page as such, you get
 Character-not-found for those locations. (I may be wrong).


 You are wrong, indeed.

 If you declare your page as ISO-8859-1, every octet
 (aka byte) in your page will be understood as a Latin-1
 character; hence you cannot have any other character
 in your page. So, your notion of “characters outside
 iso-8859-1” is completely meaningless.

 If you declare your page as UTF-8, you can have
 any Unicode character (even PUA characters) in
 your page.

 Regardless of the charset declaration of your page,
 you can include both Numeric Character References
 and Character Entity References in your HTML source,
 cf., e.g., http://www.w3.org/TR/html401/charset.html#h-5.3 .
 These may refer to any Unicode character, whatsoever.
 However, they will take considerably more storage space
 (and transmission bandwidth) than the UTF-8 encoded
 characters would take.

 Good luck,
   Otto Stolz





Re: Sinhala naming conventions

2012-07-10 Thread Naena Guru
I did not see the original message here and may be off the subject.
However, this seems to be about making computer-related words.

Singhala is much more Indic than Sriramana thinks. (The oldest Brahmi was
found in Lanka.) Its phoneme inventory is near Devanagari's, except that
Devanagari has some Dravidian phonemes in addition and Singhala has the
(OE) æ sound. Sanskrit is the common core. Though Tamil is Dravidian, they
(and Germans) tend to be authoritative in Sanskrit. In this light, it makes
eminent sense, if Sanskrit is preferred, that Indians compile the base names
for terms, and Lankans follow. Lankans, perhaps because the technocracy is
a small, closed, guarded group, tend not to be too deliberative, and stamp
down public criticism to guard their mistakes.

English is not strange to South Asian natives. I do not know why we need
to hurry localizing where that locality is mostly academic (unless there is
a commercial advantage for globalizing business, which could be damaging,
as happened with Unicode Sinhala).

The ideal is for the terms to evolve by allowing the public to participate.
Let it take the natural course of time. Why should it be any different from
how it evolves in America for English? In Lanka, there is no discussion and
cooperation, only arrogant imposition.

In our experience, we have seen Lankan bureaucrats make howlers with
impunity. For example, mRdukaMga = 'soft parts' for software. (I am using
the HK Sanskrit transcription.) Instead, I use 'anavya' (as an Indian friend
suggested), a perfect fit for the idea of a program as I understand it as a
programmer. Then antarjAla = inter-net: but there is a subtle though
important difference between the connotations of net and network. We have
intuitive new Singhala words like පාපැදිය paapædiya for bicycle, meaning
what-you-pedal. There are words like තරු වැල þaru væla (the stars) or තරු
කැල þaru kæla (those stars) where වැල and කැල suggest the entire collection
and a given collection respectively. Maybe they could be used when
considering 'network'.

They published in SLS1134 a rule that says brandy should be transcribed
as බ්‍ර‌ැන්ඩි, which stands for the sound brunDi. Going by that erroneous
rule, the Sanskrit words krUra and bahuzruta, correctly spelled as ක්‍ර‌ෑර
and බහුශ්‍ර‌ැත, would now be pronounced as krææra and bahuzræþa (using OE).
And Pali brūhi (HK Sans brUhi) would now be spoken as bææhi.

Thanks to Monier-Williams, at this ripe old age I discovered 'lipi
lekhana' to mean stationery rather than documents. No wonder some Indian
linguists politely say that there is a Buddhist Sanskrit that misuses the
language. I think they mean us. Another one I found was lipisaMkhyA, which
means exactly what we term 'text'.

Tamil Nadu, where Indian linguists reside, is closer to Colombo than Delhi.
I think we should not consider political boundaries when it comes to matters
regarding language, and let India take the lead. (However, IMO, using
English terms is best, but not like 'dongle' for flash drive as they do in
Lanka.)

On Mon, Jul 9, 2012 at 10:14 PM, Shriramana Sharma samj...@gmail.com wrote:

 Changing the subject line.

 On Tue, Jul 10, 2012 at 7:19 AM, Harshula harsh...@gmail.com wrote:
  0D9A ක sinhala letter ka
  = SINHALA LETTER ALPAPRAANA KAYANNA

 Hi -- while I agree with Michael that it would be better to have had a
 uniform naming standard across all Indic scripts which are perhaps
 more globally and country-neutrally termed as Brahmic scripts (and
 Sinhala is certainly a Brahmic script) I think it is not totally
 inappropriate to use the native names since the main target audience
 is indeed the native users. Unicode is a global standard, and it would
 not be inappropriate to acquiesce to native users' perceptions in
 cosmetic matters such as user names (since these would not result in
 technical problems).

 The South East Asian scripts despite being Brahmic are certainly not
 named after the Indic pattern. So why not Sinhala use its native
 conventions as well? The Indic naming pattern was largely a result of
 the GOI's desire to have a uniform naming across *Indic* (!= Brahmic,
 right now) scripts to facilitate production of cross-script pan-Indic
 software. If they had given the states free rein in naming stuff, we
 would have had quite confusing naming standards within Indic scripts
 itself I'm sure. Many Tamilians would probably like to see TAMIL
 LETTER LLLA named ZHA, but the ZHA of native Tamil perception would
 not correspond to the ZHA of Bengali (or Assamese or whatever)
 perception, resulting in confusion for software makers and hence
 unsatisfactory software and discontent all around!

 One will certainly agree that there is (quite naturally) less
 interconnection of India and Sri Lanka than between the Indian states
 themselves. It is quite unlikely the GOI is going to invest money in
 producing Sinhala fonts/software. This being so, if the Sri Lankan
 Govt wished to have their native names in the global standard to
 facilitate 

Re: Sinhala naming conventions

2012-07-10 Thread Naena Guru
Michael is right.

For instance, there is no such thing called AL in Singhala. It is baby
babble dropping the h from hal ('hal' means consonant). I learned the sign
as hal kiriima. AL has been given official status by indifferent
technocrats. Recall that they said there are no Singhala numerals when the
books published from the 1800s to 2001 showed them clearly.

A Comprehensive Grammar of the Sinhalese Language by A.M. Gunasekara
(1891). Published by Asian Education Services - New Delhi and Madras, India.
හෙළ හොඩියේ වතගොත - පැරණි පොත් සමාගම - 2001
(hela hodiyee vaþagoþa - pærani poþ samaagama - 2001)
SLR 130.00 (about US$ 1.00)

I got the first book through Alibris.com, and I think I emailed or
called for the second one. This second one is all you need as a font maker,
Michael.
Publisher:
Paerani Poth Samaagama
198 Highlevel Road
Nugegoda, Sri Lanka
Phone 01-852911
Email: mob...@slt.net

On Tue, Jul 10, 2012 at 2:29 AM, Michael Everson ever...@evertype.comwrote:

 On 10 Jul 2012, at 08:13, Shriramana Sharma wrote:

  On Tue, Jul 10, 2012 at 12:39 PM, Harshula harsh...@gmail.com wrote:
  I haven't heard any complaints from native Sinhala speakers about the
  existing descriptions: e.g.
  
  0D9A ක SINHALA LETTER ALPAPRAANA KAYANNA
  = sinhala letter ka
 
  The existing names are fine for Sinhala speakers. Michael feels it would
 have been better if they had been named after the *Indian* Indic scripts.

 No, my view is that it would have been better if all the *Brahmic* scripts
 had the same naming conventions.

 Michael Everson * http://www.evertype.com/






Re: Sorting Pali in Tibetan Script

2012-07-08 Thread Naena Guru
Sorry to disconcert you, sir.

I just gave the Singhala order. Singhala is the script of Pali. The l
and ḷ together make sense. It is not in Sanskrit anyway. But the aṃ,
a, ā order is not the Singhala or Sanskrit order, not in any grammar book.
Anusvara comes after vowels and diphthongs. But you are determined to apply
your own order. That's your choice. This I wrote generally for others to
note.

On Sat, Jul 7, 2012 at 8:41 PM, Richard Wordingham 
richard.wording...@ntlworld.com wrote:

 On Sat, 7 Jul 2012 17:43:41 -0500
 Naena Guru naenag...@gmail.com wrote:

  This is the Pali sorting order in PTS Pali. The Last letter is the
  retroflex L:
  a ā i ī u ū
  e o
  aṃ aaṃ iṃ iiṃ uṃ uuṃ
  eṃ oṃ
  k kh g gh ṅ
  c ch j jh ñ
  ṭ ṭh ḍ ḍh ṇ
  t th d dh n
  p ph b bh m
  y r l v
  s h
  ḷ

 Disconcertingly, no.  ḷ follows l, sorted as a primary difference.
 Otherwise, the consonants do follow the normal Indic script order.

 Also, the first three vowels are aṃ, a, ā.

 This information wasn't what I needed, but thank you for the attempt
 to help.

 Richard.





Re: Sorting Pali in Tibetan Script

2012-07-08 Thread Naena Guru
Here is a more authoritative answer:

1. Pali is a subset of Sanskrit.
2. The Sanskrit sorting order is as Monier-Williams gives it, and it matches
the vyAkaraNa books.
http://www.sanskrit-lexicon.uni-koeln.de/scans/MWScan/tamil/index.html

You can place the dark el wherever you want to, because Sanskrit does not
have it.

Please respect the tradition.

Thank you.

On Sun, Jul 8, 2012 at 6:40 PM, Naena Guru naenag...@gmail.com wrote:

 Sorry to disconcert you, sir.

 I just gave the Singhala order. Singhala is the script of Pali. The l
 and ḷ together make sense. It is not in Sanskrit anyway. But the aṃ,
 a, ā order is not the Singhala or Sanskrit order, not in any grammar book.
 Anusvara comes after vowels and diphthongs. But you are determined to apply
 your own order. That's your choice. This I wrote generally for others to
 note.

 On Sat, Jul 7, 2012 at 8:41 PM, Richard Wordingham 
 richard.wording...@ntlworld.com wrote:

 On Sat, 7 Jul 2012 17:43:41 -0500
 Naena Guru naenag...@gmail.com wrote:

  This is the Pali sorting order in PTS Pali. The Last letter is the
  retroflex L:
  a ā i ī u ū
  e o
  aṃ aaṃ iṃ iiṃ uṃ uuṃ
  eṃ oṃ
  k kh g gh ṅ
  c ch j jh ñ
  ṭ ṭh ḍ ḍh ṇ
  t th d dh n
  p ph b bh m
  y r l v
  s h
  ḷ

 Disconcertingly, no.  ḷ follows l, sorted as a primary difference.
 Otherwise, the consonants do follow the normal Indic script order.

 Also, the first three vowels are aṃ, a, ā.

 This information wasn't what I needed, but thank you for the attempt
 to help.

 Richard.






Re: Sorting Pali in Tibetan Script

2012-07-07 Thread Naena Guru
This is the Pali sorting order in PTS Pali. The last letter is the
retroflex L:
a ā i ī u ū
e o
aṃ aaṃ iṃ iiṃ uṃ uuṃ
eṃ oṃ
k kh g gh ṅ
c ch j jh ñ
ṭ ṭh ḍ ḍh ṇ
t th d dh n
p ph b bh m
y r l v
s h
ḷ
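
For what it is worth, here is a minimal sketch (hypothetical word list; the
anusvara rows are omitted for brevity) of sorting by an explicit alphabet
order such as the one above, treating each digraph as one letter:

    ALPHABET = ["a", "ā", "i", "ī", "u", "ū", "e", "o",
                "k", "kh", "g", "gh", "ṅ", "c", "ch", "j", "jh", "ñ",
                "ṭ", "ṭh", "ḍ", "ḍh", "ṇ", "t", "th", "d", "dh", "n",
                "p", "ph", "b", "bh", "m", "y", "r", "l", "v", "s", "h", "ḷ"]
    RANK = {letter: i for i, letter in enumerate(ALPHABET)}

    def sort_key(word):
        key, i = [], 0
        while i < len(word):
            for length in (2, 1):      # read "kh", "dh", ... as one letter
                if word[i:i + length] in RANK:
                    key.append(RANK[word[i:i + length]])
                    i += length
                    break
            else:
                i += 1                 # skip anything unranked
        return key

    words = ["dhamma", "dāna", "khetta", "kāya"]   # hypothetical examples
    print(sorted(words, key=sort_key))
    # ['kāya', 'khetta', 'dāna', 'dhamma']

Whether ḷ sorts at the end or immediately after l, as a primary difference,
is exactly the point in dispute in this thread; the sketch simply takes the
order as given.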


On Sat, Jul 7, 2012 at 12:14 PM, Richard Wordingham 
richard.wording...@ntlworld.com wrote:

 Can someone please advise me as to the sorting of Pali as Pali in
 Tibetan script.  I need a prompt response rather than a complete
 treatment.  It is possible that I have misunderstood what I have
 been able to pull together.

 What I understand is the following:

 (a) The retroflex lateral ('LLA' in most Unicode encodings) is written
 U+0F63 TIBETAN LETTER LA, U+0F39 TIBETAN MARK TSA -PHRU, as at
 http://www.tipitaka.org/tibt/ .

 (b) For Pali, the retroflex lateral should be sorted as though a full
 letter, rather than as letter plus subscript.  This is general
 international practice, embodied in scripts that have LLA encoded as an
 independent letter, such as Sinhalese (backed up by SLS 1134:2004) and
 Thai (many dictionaries).

 (c) The long vowel II sorts at a primary level between the short vowel I
 and the short vowel U - general practice in Indic scripts, and captured
 by ISO 14651 and the Default Unicode Collation Element Table (DUCET).

 Now if I am correct, this does have an interesting processing effect.
 The syllable LLII, in NFD, will be written U+0F63, U+0F71 TIBETAN
 VOWEL SIGN AA, U+0F72 TIBETAN VOWEL SIGN I, U+0F39, so to collate LLII
 on the basis of the constituent consonant and vowel requires the
 discontiguous contraction U+0F63, U+0F39 and then the contraction
 U+0F71, U+0F72 from the skipped characters.  Version 6.1.0 of the
 Unicode Collation Algorithm requires the ability to do exactly this.
 However, it has been proposed that Version 6.2.0 *prohibit* this
 ability.

 The treatment of LL.HA could be interesting, but is not of urgent
 interest.

 Doubts are cast on my analysis by the rules for Tibetan collation
 given, for example at
 http://developer.mimer.com/collations/tibetan/Chilton_slides.pdf ,
 which states that U+0F71 is given a secondary weight and makes no
 mention of the long vowels, and certainly makes no mention of any LLA.

 If the description there is correct and complete, it seems that I should
 see a sort order

  LI U+0F63, U+0F72 < LLI U+0F63, U+0F72, U+0F39 <
  LII U+0F63, U+0F71, U+0F72 < LLII U+0F63, U+0F71, U+0F72, U+0F39.

 Is this the correct order for sorting as Tibetan?  The diacritics do
 seem to apply back-to-front.

 Richard.




Re: Romanized Singhala - Think about it again

2012-07-07 Thread Naena Guru
Thank you Goliath.

On another subject, I think the script you dreamed of as a boy is very
nearly fuþorc. foþorc is the (Old) English alphabet.

Thank you.

On Wed, Jul 4, 2012 at 1:54 PM, Doug Ewell d...@ewellic.org wrote:

 [removing cc list]


 Naena Guru wrote:

  On this 4th of July, let me quote James Madison:


 [quote from Madison irrelevant to character encoding principles snipped]


  I gave much thought to why many here at the Unicode mailing list reacted
 badly to my saying that Unicode solution for Singhala is bad.


 Unicode encodes Latin characters in their own block, and Sinhala
 characters in their own block. Many of us disagree with a solution to
 encode Sinhala characters as though they were merely Latin characters with
 different shapes, and agree with the Unicode solution to encode them as
 separate characters. This is a technical matter.

I see the problem. This is what confused Philippe too. This is primarily a
transliteration. Transliterations go from one script to another, not from
one Unicode code block (I said 'code page' earlier out of old habit) to
another. So, let's take the font issue out for the time being and
concentrate on the transliteration.

A transliteration scheme is a solution for a problem and has a technology
platform it is made for. The older (predecessors of) IAST Sanskrit and PTS
Pali were solutions made with letterpress printing in mind. They used dots
and bars for accents because those could be improvised easily in the
street-side printing presses. That was the 1800s. Suddenly, with computers,
accented letters became hard to get. HK Sanskrit made Sanskrit friendly for
the computer by limiting it to ASCII. Now, after electronic communication
became cleaner, we expanded the 7-bit set to the full-byte set. Now the
ISO-8859-1 set is available everywhere.




  Earlier I said the Plain Text idea is bad too.


 And many of us disagree with that rather vehemently as well, for many
 reasons.


  The responses came as attacks on *my* solution than in defense of Unicode
 Singhala.


 It's not personal unless you wish to make it personal. You came onto the
 Unicode mailing list, a place unsurprisingly filled with people who believe
 the Unicode model is a superior if not perfect character encoding model,
 and claimed that encoding Sinhala as if it were Latin (and requiring a
 special font to see the Sinhala glyphs) is a better model. Are you really
 surprised that some people here disagree with you? If you write to a Linux
 mailing list that Linux is terrible and Microsoft Windows is wonderful, you
 will see pushback there too.

 Here is a defense of Unicode Sinhala: it allows you, me, or anyone else to
 create, read, search, and sort plain text in Sinhala, optionally with any
 other script or combination of scripts in the same text, using any of a
 fairly wide variety of fonts, rendering engines, and applications.


  The purpose of designating naenaguru@‌‌gmail.com as a spammer is to
 prevent criticism.


 The list administrator, Sarasvati, can speak to this issue. Every mailing
 list, every single one, has rules concerning the conduct of posters. I note
 that your post made it to the list, though, so I'm not sure what you're on
 about.


  It is shameful that a standards organization belonging to corporations of
 repute resorts to censorship like bureaucrats and academics of little Lanka.


 Do not attempt to represent this as a David and Goliath battle between the
 big bad Unicode Consortium and poor little Sri Lanka or its citizens. This
 is a technical matter.


  I ask you to reconsider:
 As a way of explaining Romanized Singhala, I made some improvements to
 www.LovataSinhala.com. Mainly, it now has near the top of each page a
 link that says, ’switch the script’. That switches the base font of the
 body tag of the page between the Latin and Singhala typefaces. Please read
 the smaller page that pops up.


 The fundamental model is still one of representing Sinhala text using
 Latin characters, and relying on a font switch. It is still completely
 antithetical to the Unicode model.


  I also verified that I hadn’t left any Unicode characters outside
 ISO-8859-1 in the source code -- HTML, JavaScript or CSS. The purpose of
 declaring the character set as iso-8859-1 than utf-8 is to avoid doubling
 and trebling the size of the page by utf-8. I think, if you have characters
 outside iso-8859-1 and declare the page as such, you get
 Character-not-found for those locations. (I may be wrong).


 You didn't read what Philippe wrote. Representing Sinhala characters in
 UTF-8 takes *fewer* bytes, typically less than half, compared to using
 numeric character references like &#3523;&#3538;&#3458;&#3524;&#3517;
 &#3517;&#3538;&#3520;&#3539;&#3512;&#3495; &#3465;&#3524;&#3517;.


 Philippe Verdy obviously has spent a lot of time researching the web
 site and even went as far as to check the faults of the web service
 provider, Godaddy.com. He called my font a hack font without any proof of
 it.


 A font

Re: Romanized Singhala - Think about it again

2012-07-07 Thread Naena Guru
On Thu, Jul 5, 2012 at 6:51 AM, Philippe Verdy verd...@wanadoo.fr wrote:

 2012/7/5 Naena Guru naenag...@gmail.com:
 
 
  On Wed, Jul 4, 2012 at 11:33 PM, Philippe Verdy verd...@wanadoo.fr
 wrote:
 
  Anyway, consider the solutions already proposed in Sinhalese
 Wikipedia. There are various solutions proposed, including several
  input methods supported there. But the purpose of these solutions is
  always to generate  Sinhalese texts perfectly encoded with Unicode and
  nothing else.
 
  Thank you for the kind suggestion. The problem is Unicode Sinhala does not
  perfectly support Singhala! The solution is for Sinhala, not for Unicode!
  I am not saying Unicode has a bad intention, but it is an ill-conceived
  product. The fault is with Lankan technocrats who took the proposal as it
  was given and ever since have prevented public participation. My solution
  is 'perfectly encoded with Unicode'.
 
 
  Yes there may remain some issues with older OSes that have limited
  support for standard OpenType layout tables. But there's now no
  problem at all since Windows XP SP2. Windows 7 has the full support,
  and for those users that have still not upgraded from Windows XP,
  Windows 8 will be ready in next August with an upgrade cost of about
  US$ 40 in US (valid offer currently advertized for all users upgrading
  from XP or later), and certainly even less for users in India and Sri
  Lanka.
 
  The above are not any of my complaints.
  Per capita income in Sri Lanka is $2400. They are content with cell
  phones. The practical place for computers is the Internet Cafe. Linux is
  what the vast majority needs.
 
 
  And standard Unicode fonts with free licences are already available
  for all systems (not just Linux for which they were initially
  developed);
 
  Yes, only 4 rickety ones. Who is going to buy them anyway? Still, Iskoola
  Pota, made by Microsoft by copying a printed font, is the best. Check the
  'Plain Text' by mixing Singhala and Latin in the Arial Unicode MS font to
  see how pretty Plain Text looks. They spent $2 or 20 million for someone
  to come and teach them how to make fonts. (Search ICTA.lk.) Staying
  friendly with them is profitable. The World Bank backs you up too.
  Sometime in the 1990s when I was in Lanka, I tried to select a PC for my
  printer brother. We wanted to buy Adobe, Quark Express etc. The store
  keeper gave a list and asked us to select the programs. Knowing that they
  are expensive, I asked him first to tell me how much they cost. He said
  that he would install anything we wanted for free! On the same trip,
  coming back through Zurich, the guys tried to give me an illicit copy of
  Windows OS in appreciation for installing German and Italian (or French?)
  code pages on their computers.
 
  there even exists solutions for older versions of iPhone
  4. OR on Android smartphones and tablets.
 
  Mine works in them with no special solution. It works anywhere that
  supports OpenType -- no platform discrimination.
 
 
  No one wants to get back to the situation that existed in the 1980's
  when there was a proliferation of non-interoperable 8 bit encodings
  for each specific platform.
 
  I agree. Today, 14 languages, including English, French, German and
  Italian, all share the same character space called ISO-8859-1. Romanized
  Singhala uses the same. So, what's the fuss about? The font? Consider
  that as the oft-suggested IME. Haha!
 
 
  And your solution also does not work in multilingual contexts;
 
  If mine does not work in some multilingual context, none of the 14
  languages I mentioned above, including English and French, does either.
 
  it does
  not work with many protocols or i18n libraries for applications.
 
  i18n is for multi-byte characters. Mine are single-byte characters. As
  you see, the safest place is the SBCS.
 
  Or it
  requires specific constraints on web pages requiring complex styling
  everywhere to switch fonts.
 
  Did you see http://www.lovatasinhala.com? Maybe you are confusing
  Unicode Sinhala and romanized Singhala. Unicode Sinhala has a myriad of
  such problems. That is why it should be abandoned! Please look at the
  web site and say it more coherently, if I have misunderstood you.

 You are once again confusing the Sinhalese language with the Sinhalese
 script. Maybe Latin-1 is a good and sufficient script for
 transcribing the language. But Unicode is not made for standardizing
 transliterations. The script is what is being encoded, the way it is,
 even if this script is defective in some aspects for the language. As
 long as your transliteration scheme using Latin-letter encodings is
 showing Latin letters, it will be fine.

You are very kind. So now I have fulfilled your order by providing a link
on the right side of the page to get rid of the Singhala font.


 But a font that represents Latin letters using Sinhalese glyphs is
 definitely broken. It will not work within multilingual contexts
 except when using many font switches

Re: Romanized Singhala - Think about it again

2012-07-05 Thread Naena Guru
 of the entire Latin
repertoire. You don't have to tell that to me. I have traveled quite a bit
in the IT world. Don't be surprised if it is more than what you've seen.
(Did you forget that earlier you accused me of using characters outside
ISO-8859-1 while claiming I am within it? That is because you saw IAST and
PTS displayed. They use those wonderful letter symbols and diacritics you
are trying to tout.) Is there a problem with Asians using ISO-8859-1 code
space even for transliteration?


 The bonus will be that you can still write the Sinhalese
 language with a romanisation like yours,

Bonus?

but there's no need to
 reinvent the Sinhalese script

The Singhala script existed for many, many years before the English and
French adopted Latin. What I did was save it from the massacre going on
with Unicode Sinhala.

itself that your encoding is not even
 capable of completely supporting in all its aspects (your system only
 supports a reduced subset of the script).

What is the basis for this nonsense? (Little birds whispering in the
background. Watch out. They are laughing.)
My solution supports the entire script -- Singhala, Pali and Sanskrit, plus
two rare allophones of Sanskrit as well. Tell me what it lacks and I will
add it, haha! One time you said I assigned Unicode Sinhala characters to
the 'hack' font. What I do is assign Latin characters to Singhala
phonemes. That is called transliteration. There are no 'contextual
versions' of the same Singhala letters as you said earlier.

Ask your friends what their Singhala script has that mine lacks. Ask
them why they included only two ligatures when there are 15 of them. Ask
them how many Singhala letters there are.


 Even the legacy ISCII system (used in India) is better, because it is
 supported by a published open standard, for which there's a clear and
 stable conversion from/to Unicode.

My solution is supported by two standards: ISO-8859-1 and Open Type.
ISO-8859-1 is Basic Latin plus the Latin-1 Supplement part of the Unicode
standard.
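A quick check of that correspondence, in Python (my illustration here, not
part of either standard):

    # Every ISO-8859-1 byte value decodes to the Unicode code point of the
    # same number: U+0000..U+00FF, i.e. Basic Latin + the Latin-1 Supplement.
    data = bytes(range(256))
    text = data.decode("iso-8859-1")
    assert [ord(c) for c in text] == list(range(256))
    assert text.encode("iso-8859-1") == data  # and it round-trips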

Bottom line is this: If Latin-1 is good enough for English and French, it
is good enough for Singhala too. And if Open Type is good for English and
French, it is good for Singhala too.


 2012/7/5 Naena Guru naenag...@gmail.com:
  Philippe,
 
  My last message was partial. It went out by mistake. I'll try again. It
  takes very long for this old man.
 
 
  -- Forwarded message --
  From: Naena Guru naenag...@gmail.com
  Date: Wed, Jul 4, 2012 at 10:32 PM
  Subject: Re: Romanized Singhala - Think about it again
  To: verd...@wanadoo.fr
 
 
  Hi, Philippe. Thanks for keeping engaged in the discussion. Too little
 time
  spent could lead to misunderstanding.
 
 
  On Wed, Jul 4, 2012 at 3:42 PM, Philippe Verdy verd...@wanadoo.fr
 wrote:
 
  2012/7/4 Naena Guru naenag...@gmail.com:
   Philippe Verdy, obviously has spent a lot of time
 
  Not a lot of time... Sorry.
 
   researching the web site
   and even went as far as to check the faults of the web service
 provider,
   Godaddy.com.
 
  I did not even note that your hosting provider was that company. I
  just looked at the HTTP headers to look at the MIME type and charset
  declarations. Nothing else.
 
  I know that the browser tells it. It is not a big deal, WOFF is the
  compressed TTF, but TTF gets delivered. If and when GoDaddy fixes their
  problem, the pages get delivered faster. Or I can make that fix in a
  .htaccess file. No time!
 
 
   He called my font a hack font without any proof of it.
 
  It is really a hack. Your font assigns Sinhalese characters to Latin
  letters (or some punctuation) of ISO 8859-1.
 
  My font does not have anything to do with Singhalese characters, if you
  mean Unicode characters. You are very confusing.
  A character in this context is a datatype. In the 80s it was one byte in
  size, flagged so as not to be used in arithmetic. (We still did use it in
  arithmetic, to convert between capital and simple forms.) In the Unicode
  character database, a character is a numerical position. A Unicode
  Sinhala character is defined in hex [0D80 - 0DFF]. Unicode Sinhala
  characters represent an incomplete hotchpotch of ideas of letters,
  ligatures and signs. I have none of that in the font.
 
  I say and know that Unicode Sinhala is a failure. It inhibits the use of
  Singhala on the computer and the network. I do not concern myself with
  fixing it, because it cannot be fixed. The only thing I did in relation
  to it is to write an elaborate set of routines to *translate* (not map)
  between constructs of Unicode Sinhala characters and romanized Singhala.
  That is not in the font. The font has lookup tables.
 
  It also assigns
  contextual variants of the same abstract Sinhalese letters, to ISO
  8859-1 codes,
 
  What contexts cause what variants? Looks like you are saying Singhala
  letters cha
 
  plus glyphs for some ligatures of multiple Sinhalese
  letters to ISO 8859-1 codes, plus it reorders these glyphs so that
  they no longer match

Romanized Singhala - Think about it again

2012-07-04 Thread Naena Guru
Pardon me for including a CC list. These are people who have expressed
opinions for and against.

On this 4th of July, let me quote James Madison:
A zeal for different opinions concerning religion, concerning government,
and many other points, as well of speculation as of practice; an attachment
to different leaders ambitiously contending for pre-eminence and power; or
to persons of other descriptions whose fortunes have been interesting to
the human passions, have, in turn, divided mankind into parties, inflamed
them with mutual animosity, and rendered them much more disposed to vex and
oppress each other than to co-operate for their common good.

I gave much thought to why many here at the Unicode mailing list reacted
badly to my saying that the Unicode solution for Singhala is bad. Earlier I
said the Plain Text idea is bad too. The responses came as attacks on *my*
solution rather than in defense of Unicode Singhala. The purpose of
designating naenaguru@gmail.com as a spammer is to prevent criticism. It is
shameful
that a standards organization belonging to corporations of repute resorts
to censorship like bureaucrats and academics of little Lanka.
*I ask you to reconsider:*
As a way of explaining Romanized Singhala, I made some improvements to
www.LovataSinhala.com (http://www.lovatasinhala.com/). Mainly, it now has
near the top of each page a link that says, ’switch the script’. That
switches the base font of the body tag of the page between the Latin and
Singhala typefaces. *Please read the smaller page that pops up.*

I also verified that I hadn’t left any Unicode characters outside
ISO-8859-1 in the source code -- HTML, JavaScript or CSS. The purpose of
declaring the character set as iso-8859-1 rather than utf-8 is to avoid
doubling and trebling the size of the page by utf-8. I think, if you have
characters outside iso-8859-1 and declare the page as such, you get
Character-not-found for those locations. (I may be wrong.)
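That verification is easy to script. Here is a minimal Python sketch of the
kind of scan I mean (it assumes the files on disk are saved as UTF-8); it
reports every character above U+00FF:

    import sys

    def outside_latin1(path):
        # Yield (line number, character) for anything not storable in Latin-1.
        with open(path, encoding="utf-8") as f:
            for lineno, line in enumerate(f, 1):
                for ch in line:
                    if ord(ch) > 0xFF:
                        yield lineno, ch

    for path in sys.argv[1:]:
        for lineno, ch in outside_latin1(path):
            print(f"{path}:{lineno}: U+{ord(ch):04X} {ch!r}")

Run it over the HTML, JavaScript and CSS files; silence means everything
fits in ISO-8859-1.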

Philippe Verdy obviously has spent a lot of time researching the web site
and even went as far as to check the faults of the web service provider,
Godaddy.com. He called my font a hack font without any proof of it. It has
only characters relevant to romanized Singhala within the SBCS. Most of the
work was in the PUA and the look-up tables. I am reminded of Inspector
Clouseau, who has many gadgets and in the end finds himself to be the
culprit.

I will still read and try those other things Philippe suggests, when I get
time. What is important for me is to improve on orthography rules and add
more Indic languages -- Devanagari and Tamil coming up.

As for those who do not want to think rationally and think Unicode is a
religion, I can only point to my dilemma:
http://lovatasinhala.com/assayaa.htm

Have a Happy Fourth of July!


Re: Romanized Singhala - Think about it again

2012-07-04 Thread Naena Guru
Philippe, ask your friends why ordinary people Anglicize if Unicode Sinhala
is so great. See just one of many community forums: http://elakiri.com

I know you do not care about a language of 15 million people, but it
matters to them.

On Wed, Jul 4, 2012 at 10:46 PM, Philippe Verdy verd...@wanadoo.fr wrote:

 You are alone to think that. Users of the Sinhalese edition of
 Wikipedia do not need your hack or even webfonts to use the website.
 It only uses standard Unicode, with very common web browsers. And it
 works as is.
 For users that are not pre-equipped with the necessary fonts and
 browsers, Wikipedia indicates this very useful site:
 http://www.siyabas.lk/sinhala_how_to_install_in_english.html

I have two guys here in the US who asked me to help get rid of the Unicode
Sinhala that I helped them install from that 'very useful site'. Copies of
this message go to them. Actually, you do not need their special
installation if you have Windows 7. Windows XP needs an update of
Uniscribe, and Vista too. Their installation programs are faulty and
interfere with your OS settings.



 This solves the problem at least for older versions of Windows or old
 distributions of Linux (now all popular distributions support
 Sinhalese). No web fonts are even necessary (WOFF works only in
 Windows, but not in older versions of Windows with old versions of IE).

You mean WEFT? Now TTF (OTF) fonts are compressed into WOFF. I see that
Microsoft is finally supporting it. (At least my font downloads -- or maybe
it picks up the font on my computer? Now I am confused.)


 Everything is covered: working with TrueType and OpenType, adding an
 IME if needed. And then navigating on standard Sinhalese websites
 encoded with Unicode.


Philippe, try making a web page with Unicode Sinhala.


 Note that for versions of Windows with browsers older than IE6 there is
 no support, only because these older versions did not have the
 necessary minimum support for complex scripts. The alternative is to
 use another browser such as Firefox, which uses its own independent
 renderer that does not depend on Windows Uniscribe support. But these
 users are now extremely rare. Almost everyone now uses at least XP for
 Windows (Windows 95/98 are definitely dead), or uses a Mac, or a
 smartphone, or another browser (such as Firefox, Chrome, Opera).

I agree.


 Nobody except you supports your tricks and hacks. You come really too
 late, trying to solve a problem that no longer exists, as it has long
 since been solved for Sinhalese.

Mine is a comprehensive solution. It is a transliteration. Ask users who
have compared the two. Find ordinary Singhalese. They use Unicode Sinhala
to read news web sites. The rest of the time they Anglicize or write in
English.

Everything is covered here too, buddy. Adobe apps since 2004, Apple since
2004, Mozilla since 2006, all other modern browsers since 2010. MS Office
2010. AbiWord, Gnumeric, Linux -- the works. IE 8 and 9 partial. IE 10
full. So?


 2012/7/5 Naena Guru naenag...@gmail.com:
  Hi, Philippe. Thanks for keeping engaged in the discussion. Too little
 time
  spent could lead to misunderstanding.
 
 
  On Wed, Jul 4, 2012 at 3:42 PM, Philippe Verdy verd...@wanadoo.fr
 wrote:
 
  2012/7/4 Naena Guru naenag...@gmail.com:
   Philippe Verdy, obviously has spent a lot of time
 
  Not a lot of time... Sorry.
 
   researching the web site
   and even went as far as to check the faults of the web service
 provider,
   Godaddy.com.
 
  I did not even note that your hosting provider was that company. I
  just looked at the HTTP headers to look at the MIME type and charset
  declarations. Nothing else.
 
  I know that the browser tells it. It is not a big deal, WOFF is the
  compressed TTF, but TTF gets delivered. If and when GoDaddy fixes their
  problem, the pages get delivered faster. Or I can make that fix in a
  .htaccess file. No time!
 
 
   He called my font a hack font without any proof of it.
 
  It is really a hack. Your font assigns Sinhalese characters to Latin
  letters (or some punctuation) of ISO 8859-1.
 
  My font does not have anything to do with Singhalese characters, if you
  mean Unicode characters. You are very confusing.
  A character in this context is a datatype. In the 80s it was one byte in
  size, flagged so as not to be used in arithmetic. (We still did use it in
  arithmetic, to convert between capital and simple forms.) In the Unicode
  character database, a character is a numerical position. A Unicode
  Sinhala character is defined in hex [0D80 - 0DFF]. Unicode Sinhala
  characters represent an incomplete hotchpotch of ideas of letters,
  ligatures and signs. I have none of that in the font.
 
  I say and know that Unicode Sinhala is a failure. It inhibits the use of
  Singhala on the computer and the network. I do not concern myself with
  fixing it, because it cannot be fixed. The only thing I did in relation
  to it is to write an elaborate set of routines to *translate* (not map)
  between constructs of Unicode Sinhala

Re: Offlist: complex rendering

2012-06-18 Thread Naena Guru
 was through
US-International and shapes were automatic.

There is a very important function of the font other than the visual help
for typing and reading. It also upholds and preserves dying orthographies.
It should be developed into a true orthographic font. (Help and time?).

Singhala has three orthographies. Sinhala words do not have joined letters.
Sanskrit and Pali have a set of double and treble conjoint letters. In
addition, Pali eliminates the hal sign within the words and allows only the
halant. Pali needs its own fonts.

Below are links to two files that show the first paragraph of this web page:
http://www.divaina.com/2012/06/17/scholast.html
Unicode Sinhala:
http://ahangama.com/sing/DBS.htm (4 kB)
Romanized Singhala:
http://ahangama.com/sing/DSS.htm (1 kB)

Compare the shape formation and the sizes of the files. How much bandwidth
is taken for the Unicode Sinhala file to go as UTF-8? 6 kB! Six times the
romanized file. Beyond that, imagine how that Unicode Singhala page was
made from scratch and how many more steps were needed to get there than for
the Latin-1 page. If you closely inspect the original page from Divaina,
you see that they did not input Unicode Sinhala directly but used an
intermediary step. (There are two stray English letters.) These are things
that matter for ordinary citizens, not university dons paid and venerated
by those same poor citizens.
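The per-character arithmetic behind that difference is simple; a small
Python illustration (the romanization shown is hypothetical, just to make
the point):

    sinhala = "\u0D85\u0D9A\u0DD4\u0DBB\u0DD4"   # five Sinhala code points
    roman = "akuru"                               # a hypothetical romanization
    print(len(sinhala.encode("utf-8")))           # 15 bytes: 3 per code point
    print(len(roman.encode("iso-8859-1")))        # 5 bytes: 1 per letter

Each Sinhala code point costs three bytes in UTF-8 where a romanized letter
costs one; presumably the rest of the six-fold difference comes from markup
and the extra sign characters Unicode Sinhala needs.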

I have only limited time. So please do not expect me to reply to the entire
barrage of attack. (I think I am already banned as a spammer for repeating
myself -- or so that the unpleasant truth is not heard?)

Thank you.

PS:
I added Donald Gaminitilake to the list, as he had a third solution for
Singhala and was one I watched get pummeled by the crowd in Lanka.

On Mon, Jun 18, 2012 at 2:09 AM, Ruvan Weerasinghe a...@ucsc.cmb.ac.lk wrote:

 Not that it is of much consequence, but even the website cited for
 'Anglicizing' Sinhala (http://www.sinhalaya.com/) now has a majority of
 content in UNICODE! I couldn't really find posts using images in this site
 - not that I tried hard to find them!

 5 years ago, this argument may have had some attention by those not
 conversant with UNICODE, but today, with the multitude of blogs, wikis
 (the Sinhala Wikipedia has 7k entries now, in case you didn't know) and
 Google search in Sinhala, only the mischievous could insist we need to go
 back.

 Incidentally, please visit the following link to see the hit counts for
 common Sinhala words in a google search in August/September 2009. Clicking
 any of the words will give you a google hit count for that word today. This
 should give a good idea of the spread of UNICODE Sinhala:

 http://ucsc.cmb.ac.lk/wiki/index.php/LTRL:Web_Statistics_(Online_Sinhala)/August_September


 (e.g. the word අනුරාධපුරය which had 2500 - 4000 hits that time, has over
 32,000 today; the word ඉදිරිපත් which had 60k+ hits then has over 600k hits
 now)

 Regards.


 Ruvan Weerasinghe
 University of Colombo School of Computing
 Colombo 00700,
 Sri Lanka.

 Web:http://www.ucsc.lk
 Phone:  +94112158953; Fax:+94112587239

 --

 *From: *Harshula harsh...@gmail.com
 *To: *Naena Guru naenag...@gmail.com, jc ahang...@gmail.com
 *Cc: *Tom Gewecke t...@bluesky.org, unicode@unicode.org, Tissa
 Dharmagunaratne tiss...@aol.com, Ranjith Ruberu 
 antonyrub...@hotmail.com, Kusum Perera kusum...@gmail.com, Bertie
 Fernando bertieferna...@hotmail.com, Ruvan Weerasinghe 
 a...@ucsc.cmb.ac.lk, Gihan Dias gi...@cse.mrt.ac.lk, Wasantha
 Deshapriya wasan...@icta.lk
 *Sent: *Monday, June 18, 2012 10:07:14 AM
 *Subject: *Re: Offlist: complex rendering


 Hi JC,

 You have been making the same allegations for more than half a decade.
 Now you have moved on to a new forum, the Unicode Consortium. The
 reality is that all the professionals and academics that work in
 computational linguistics, Sinhala localization, etc in Sri Lanka are
 on-board with Unicode Sinhala. We are seeing research and applications
 developed on top of Unicode Sinhala.

 IIRC, back then you were unable to demonstrate the shortcomings of
 Unicode Sinhala that your scheme solved. If you have complaints about
 operating systems that do not implement Unicode Sinhala correctly,
 please contact the specific company.

 cya,
 #

 On Fri, 2012-06-15 at 01:45 -0500, Naena Guru wrote:
  Tom,
 
  Thank you for taking an interest in this matter.
 
  You said,
 
  Mapping multiple scripts to Latin-1 codepoints is contrary to the most
  basic principles of Unicode and represents a backwards technology leap
  of 20 years or more.
 
 
  Well, do you otherwise agree that the transliteration is good? It can
  be typed easily, and certainly not like the Unicode Indic
  transliteration that is only good for Aliens to discover some day.
 
  Unicode has a principle about shapes assigned to characters. It is the
  opposite of what you said. At the time I started this project Unicode
  version 2 specifically said that it does

Re: Offlist: complex rendering

2012-06-15 Thread Naena Guru
Tom,

Thank you for taking an interest in this matter.

You said,

Mapping multiple scripts to Latin-1 codepoints is contrary to the most
basic principles of Unicode and represents a backwards technology leap of
20 years or more.

Well, do you otherwise agree that the transliteration is good? It can be
typed easily, and is certainly not like the Unicode Indic transliteration,
which is only good for aliens to discover some day.

Unicode has a principle about shapes assigned to characters. It is the
opposite of what you said. At the time I started this project Unicode
version 2 specifically said that it does not define shapes. That is the
reason I tried it.

Think of it as a help for the person that types. I tested it on real
people. They are unaware that the underlying codes are that of Latin. They
are surprised and elated.

So, if you are so averse to changing the shapes of Latin-1, what would you
say about Fraktur and Gaelic, which the standard specifically said are
based on Latin-1 but have different shapes?

You said,
It doesn't seem realistic to me that it could ever see acceptance, and I'm
a bit surprised that you continue to devote your talents to promoting it.
Is there some reason you consider it to be promising nonetheless?

(Thank you for calling me talented. I am not).

It depends on whose acceptance you are talking about. You'll understand if
you are a Singhalese, Tom. The leap 20 years back is what we need. Unicode
parked us in a cul de sac. BTW, I haven't even started to promote it. I
want the IT community to say this works, as it really does.

Think about why people Anglicize on this very popular web site:
www.sinhalaya.com
There are many such sites. (Try elakiri.com.)

You will see some Unicode Sinhala, but most posts are written using hack
fonts and made into graphics to post. The Lankan government is so worried
that it has launched a program to teach English to everyone, perhaps
seeing the demise of Singhala due to digital creep. (Wisdom of
politicians!).

Also look at the web site of the IT agency of the government:
http://www.icta.lk/
How much prominence did they give the language of the 70%?

The bureaucrats are giving themselves medals. (See the pictures.) They are
making laws forcing government employees to use Unicode Singhala, because
the employees are reluctant. It's a Third World country. The literacy rate
is 90% plus, not a little India. But the people are docile. They depend on
the government to tell them what to do. The bureaucracy in return depends
on the West to tell it what is right. The technocrats call themselves
යුනිකේත (Love UNI!).

Yes, Tom, I do have a very good reason. I know it because I am a
Singhalese. It is *practical*, and it is being accepted and commended by
everyone I have shown it to. If English, German, Spanish, Icelandic,
Danish etc. use Latin-1, and if Singhala *can* perfectly map to Latin-1,
why shouldn't it? That is called transliteration. Recall that English
fully romanized around the year 600.

Singhala is a minority language that is scheduled to be executed, and
Unicode is unwittingly the reason.

Brahmi probably is Old Singhala. The oldest Brahmi was found in Shree
Langkaa (Sri Lanka) 2-3 centuries before it was seen in India. Some say
the Singhalese founded the Mayans. (What a chauvinist!) So, let's give it
a boost before the world ends.

I need the support of Unicode, which is like World Government for
Laangkans. This is what I want Unicode to judge:


   - Is the transliteration practical?
   - Do I have a round trip conversion with precious Unicode Sinhala? (A
     toy sketch of what I mean by round trip follows.)
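A toy Python sketch of the round trip (the two-entry table is illustrative
only, not the actual routines, which cover the whole script):

    # Toy translation table: Unicode Sinhala <-> romanization.
    # Longer sequences must be listed first.
    PAIRS = [("\u0D9A\u0DD2", "ki"), ("\u0D9A", "ka")]

    def to_roman(text):
        for sinh, rom in PAIRS:
            text = text.replace(sinh, rom)
        return text

    def to_sinhala(text):
        for sinh, rom in PAIRS:
            text = text.replace(rom, sinh)
        return text

    s = "\u0D9A\u0DD2\u0D9A"             # Sinhala 'ki' + 'ka'
    assert to_sinhala(to_roman(s)) == s  # the round trip holds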

Help us, Tom.

This message is getting too long. I can list the pros and cons of Dual-script
Singhala and Unicode Sinhala to convince any techie why we should forget
Unicode Sinhala.

Let me end with a quote from SICP
http://mitpress.mit.edu/sicp/full-text/book/book.html
Educators, generals, dieticians, psychologists, and parents program.
Armies, students, and some societies are programmed. An assault on large
problems employs a succession of programs, most of which spring into
existence en route. These programs are rife with issues that appear to be
particular to the problem at hand. To appreciate programming as an
intellectual activity in its own right you must turn to computer
programming; you must read and write computer programs -- many of them. It
doesn't matter much what the programs are about or what applications they
serve. What does matter is how well they perform and how smoothly they fit
with other programs in the creation of still greater programs. *The
programmer must seek both perfection of part and adequacy of collection.*

Do we want to be programmed or be programmers? Is the collection adequate?

Best regards,

JC


On Thu, Jun 14, 2012 at 8:08 AM, Tom Gewecke t...@bluesky.org wrote:

 naenaguru wrote:

 Map sounds to QWERTY extended key layouts adding non-English letters -
 Result: strict, rule based alphabet extending from ASCII to Latin-1 -


 Mapping multiple scripts to Latin-1 codepoints is contrary to the most
 basic 

Re: complex rendering (was: Re: Mandombe)

2012-06-13 Thread Naena Guru
Extremely interesting! Ordinary people need practical solutions. Thank you
for trying.

I made the first smartfont for Singhala in 2004. (Perhaps the first ever
for any written language -- a little bragging there.) Only SIL.org WorldPad
could show it then. Now, starting with Windows Notepad, Office 2010,
AbiWord and all the browsers *except IE* render the complex letters
perfectly. (Er, Google Chrome, nearly there):
http://www.lovatasinhala.com (hand coded)
and
http://www.ahangama.com/ (WordPress blog).

I think Anderson should try Open Type and MS Volt like I did.

Like him, I am no typographer. It is the programming that is most important
in getting the orthography out. It took me nine months, day and night, to
get it to where it is now. Redesigning glyphs or making other font faces is
trivial, now that the rendering rules are defined.

My approach was as follows (it applies to Indic and perhaps Arabic and
Hebrew too); a small sketch of the last step follows the list:
- Start by observing Anglicizing.
- Map sounds to QWERTY extended key layouts, adding non-English letters.
- Result: a strict, rule-based alphabet extending from ASCII to Latin-1.
- Font: program all shapes, base as well as conjoint, into the PUA.
- Program look-up tables according to the orthography rules in the grammar
  of the language.
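As a rough Python model of that last step (the PUA values and sequences
here are hypothetical; in the real font the lookups live in OpenType
tables, not in code):

    # Hypothetical assignments: romanized letter sequences -> PUA glyph codes.
    RULES = {"ka": 0xE001, "k": 0xE000}
    SEQS = sorted(RULES, key=len, reverse=True)   # longest match first

    def shape(text):
        # Greedy substitution, the way a font lookup would apply.
        out, i = [], 0
        while i < len(text):
            for seq in SEQS:
                if text.startswith(seq, i):
                    out.append(chr(RULES[seq]))
                    i += len(seq)
                    break
            else:
                out.append(text[i])
                i += 1
        return "".join(out)

    print([hex(ord(c)) for c in shape("ka")])     # ['0xe001']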



On Mon, Jun 11, 2012 at 5:27 PM, vanis...@boil.afraid.org wrote:

 From: Szelp, A. Sz. a.sz.szelp_at_gmail.com
  On Mon, Jun 11, 2012 at 10:58 AM, Stephan Stiller
  sstiller_at_stanford.eduwrote:
 
  
   This is interesting only if the encodable elements would be different -
   remember, Unicode is not a font standard.
  
   +1; rendering can be so much more complex than encoding. I'd really
 like
   to see a successful renderer for Nastaliq, (vertical) Mongolian, or
   Duployan. (What *are* the hardest writing systems to render?)
  
  
  Vertical mongolian does not seem to be harder to render _conceptually_
  than, let's say, simple arabic. It's more the architectural limitations
 of
  rendering engines that seem to limit its availability, and the
 intermixing
  with horizontal text. For Nastaliq, Thomas Milo's DecoType is miraculous:
  it's hard, but given the good job they did, obviously not impossible. —
  Well, I don't know about Duployan.
 
  /Sz

 I guess this is my invitation to chime in. I'm close to releasing a beta
 of a
 Graphite engine for (Chinook repertoire) Duployan, using a PUA encoding.
 By the
 release of 6.3/7.0, we should have a working implementation of Unicode
 Duployan/shorthand rendering for Graphite enabled applications. Like a
 Nastaliq
 implementation, it's convoluted and involved, but not impossible. It will
 not,
 however, be nearly as beautiful as DecoType; I'm not a designer at heart,
 and a
 Duployan implementation as stunning as Milo's Nastaliq will require the
 skills
 of people several orders of magnitude more talented than I.

 -Van Anderson





Re: Unicode, SMS and year 2012

2012-04-28 Thread Naena Guru
Hi Cristian,

This is a bit of a deviation from the issues you raise, but it relates to
the subject in a different way.

The SMS char set does not seem to follow Unicode. How I see Unicode is as a
set of character groups: 7-bit, 8-bit (which extends and replaces 7-bit),
16-bit, and CJKV, which uses some sort of 16-bit pairing. As Unicode says,
they are just numeric codes assigned to letters or whatever other ideas. It
is the task of the devices to decide what they are and to show them.

You say that there are only two character sets in GSM: 7-bit, which is a
reassignment of codes to a selection of Latin letter shapes, and 16-bit for
the rest. It appears as if they decided that a certain set of letters is
common to some preferred markets, and that it is efficient to reassign the
established Unicode characters to these newly selected letter shapes. Had
they simply used the 8-bit ISO-8859-1 set, the number of characters per SMS
would be limited to 140 instead of 160. (Is that why Twitter limits the
number of chars to 140?) Of course, that would not have included some users
whose letters are 16-bit characters under Unicode.
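The capacity numbers all fall out of the 140-byte SMS payload; a quick
Python check:

    # One SMS payload is 140 bytes = 1120 bits.
    PAYLOAD_BITS = 140 * 8
    for bits in (7, 8, 16):
        print(f"{bits}-bit characters: {PAYLOAD_BITS // bits} per message")
    # 7-bit: 160, 8-bit: 140, 16-bit (UCS-2): 70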

I made a comprehensive transliteration for the Singhala script
(Singhala+Sanskrit+Pali). It shows perfectly when 'dressed' with a
smartfont. The following are two web sites that illustrate this solution
(every character is ISO-8859-1, except for the occasional ZWNJ, which
actually should be 8-bit NBH that somebody decided to leave undefined. Use
any browser except IE. IE does not understand Open Type)
http://www.lovatasinhala.com (hand coded)
http://www.ahangama.com/  (WordPress blog)

All Indic languages could be transliterated this way. It makes Indic
similar to Latin-based European languages, with intuitive typing and
orthographic results, which Unicode Sinhala can't do. It takes about half
the bandwidth to transmit compared with the double-byte set. I just noticed
that transliterated Singhala would not be fully covered by SMS 7-bit,
because some Unicode 8-bit characters are not in that set.

Looking at my iPhone, I see that the International icon brings up
key-layout plus font pairs. I think what they should do is to separate
fonts and key-layouts. This way, the user could select the key layout for
input and whatever font they want to use to show it. The next thing I am
going to say made many readers here very angry, but may I say it again?

The idea of the Last Resort Font, which makes basic editors Plain Text, is
a ploy to brag that the computer can show all the world's languages, most
of which you cannot read anyway. Text runs of foreign languages should show
as a series of Glyph Not Found characters or the specific hint glyph of a
language. The user of a foreign language would know where to download fonts
for their native language. In the small market of Singhala, no font is
present that goes typographically well with Arial Unicode. There is no
incentive or money to make beautiful fonts for a minority language like
Singhala. The plain-text result for Singhala is ugly. The OS makers
unnecessarily made hodge-podge Last Resort Fonts.

I hope both the mobile device industry and the PC side separate fonts and
characters and allow the users to decide the default font sets in their
devices. This is eminently rational because the rendering of the font
happens locally, whereas the characters travel across the network. This
will also help those who, like me, understand that their language is
better served by a transliteration solution than by a convoluted
double-byte solution that discourages the natives from using their script.

Actually, this is causing bilingual Singhalese to abandon their native
language. The government is putting special emphasis on English, as
Singhala is terribly difficult to use in the modern setting. This is a
grave problem for a society with a near-100% literacy rate, and of just a
few million people.


On Fri, Apr 27, 2012 at 3:06 AM, Cristian Secară or...@secarica.ro wrote:

 A few years ago there was a discussion here about Unicode and SMS
 (Subject: Unicode, SMS, PDA/cellphones). Then and now the situation is
 the same, i.e. a SMS text message that uses characters from the GSM
 character set can include 160 characters per message (stream of 7 bit ×
 160), whereas a message that uses everything else can include only 70
 characters per message (stream of UCS2 16 bit × 70).

 Although my language (Romanian) was and is affected by this
 discrepancy, then I was skeptical about the possibility to improve
 something in the area, mostly because at that time both the PC and
 mobile market suffered from other critical language problems for me
 (like missing glyphs in fonts, or improper keyboard implementations).

 Things evolved and now the perspectives are much better. Regarding the
 SMS, at that time Richard Wordingham pointed that the SCSU might be a
 proper solution for the SMS encoding [when it comes to non-GSM
 characters].

 Recently I studied as many aspects as I could about the SMS
 standardization, in a step that I started approx a 

Re: combining: half, double, triple et cetera ad infinitum

2011-11-14 Thread Naena Guru
Unicode was created for a commercial reason, particularly for the benefit
of its directors. The idea of Plain Text is not anything practical but was
used as a means of attracting supporters, who for the most part hadn't had
any experience with computers.

The following line is Unicode text:
මේ අකුරු ලියා ඇත්තේ යුනිකෝඩ් අකුරෙනි.

I bet most of you see it as a row of Character-not-found glyphs. Some would
see it in the non-Latin script, but still separated into meaningless
components that go to make up letters.

So, the headache you are talking about within Latin becomes a tumor that
drives you insane once outside the SBCS.

On Mon, Nov 14, 2011 at 6:37 AM, QSJN 4 UKR qsjn4...@gmail.com wrote:

 Why did the Unicode Consortium think that a combination of one base
 character and a few combining characters is possible, but a combination
 of a few base characters with one combining character is not?
 E.g. U+0483 tilde has to cover a number. A whole number! Not one
 figure!! What do we have it for, if we can't use it the right way?
 Half combining marks are one approach... And what if we are not dealing
 with a simple horizontal line? What if we need the same tilde above a
 number, or a titlo above an abbreviation? Is text with numbers and
 abbreviations not plain text?
 An idea (a bad one, I think, but there are many other stupid things in
 Unicode): add some new combining characters with a new ccc = word-above,
 word-below, word-over (ZWS works as a word separator if needed).




Re: Subtitles in Indic/Arabic/other scripts requiring CTL

2011-11-14 Thread Naena Guru
Shriramana,

The question you raise relates to the problem of font rendering. According
to the Open Type standard, each script, i.e. Latin, Tamil etc., has its
own rules on how letters are constructed and displayed. For instance, when
you write 'ke' in Tamil, the kavanna is preceded by the kombu. That is a
'Tamil-specific' rule. The same rule is applied separately to Devanagari,
Sinhala etc., because each language has its own code page.

All this makes each script have its own set of font rendering rules. If
some application does not know the Tamil font rendering rules, then you
won't get the result you want.

Getting back to the example I started with: kombu is a sign not tied
directly to a vowel, and kavanna is a base letter. Why on earth did kombu
get a codepoint equal in status to a letter? Initially, the Indic problem
was interpreted taking the Latin alphabetic writing system as the basis. We
simply define the alphabet and string the letters (shapes) together to make
the text.
text.

When they tried to encode Indic, they were completely thrown off the track
by the way vyaakarana books showed the hodiya in the native script. They
should have looked at transliteration schemes such as HK-Sanskrit.

The hodiya is the phoneme chart built around Sanskrit. In it, the hal
letters, or consonants, were displayed as the shapes of ka, ga etc.
Originally in Indic, both the pure consonant and the letter with 'a'
implicit had the same form. The text in the vyaakarana book itself, if
anyone cared to read it, treated hal letters as consonants and not as some
mysterious thing that carried an 'a' that had to be smoked out with a
virama. Originally people did not use the virama or halant (or pulli in
Tamil) inside words. Halant and virama mean: this mark indicates that the
last letter of the word is a consonant.

In Singhala, the Tamil word 'uppu' would be written u+p-pu, where p and pu
*touched* indicating that the p is a pure consonant. When you write megam,
the ending m would have a hal kiriimee lakuna (pulli, the virama). If it
did not have the pulli, then it would be megama. I am not sure how Tamil
dealt with this, but searching for 'Tamil palm leaf book', you can find
many pictures of old manuscripts that you could inspect to see how Tamil
had it.

In planning the Unicode code page, we look at Tamil and decide, okay, 'ke'
needs a kombu and kavanna. Let's then give a codepoint to kombu as well.
Now write k followed by e and we get kavanna followed by kombu -- wrong
order. Therefore, we write a special rule for Tamil saying that if you
follow k with e, replace the k already on screen with e and k, in that
transposed order; some more reasoning and complications get the result.
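A toy Python sketch of that transposition (my illustration, not how any
real shaping engine is implemented):

    # Tamil vowel signs e, ee, ai (U+0BC6..U+0BC8) display before the
    # consonant they logically follow; swap them into visual order.
    PRE_BASE = {"\u0BC6", "\u0BC7", "\u0BC8"}

    def visual_order(text):
        chars = list(text)
        for i in range(1, len(chars)):
            if chars[i] in PRE_BASE:
                chars[i - 1], chars[i] = chars[i], chars[i - 1]
        return "".join(chars)

    ke = "\u0B95\u0BC6"                        # logical order: ka + e-sign
    print(visual_order(ke) == "\u0BC6\u0B95")  # True: kombu before kavanna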

Unicode is a disaster for our languages. You cannot sort, backspace-delete,
search and replace and so on as we do with English -- no telephone books,
at least for Singhala, my native language.

Every application knows, or is supposed to know, the Latin font rendering
rules. You test your program first with Latin, right? The latest rules for
font making are in the Open Type standard published in 2003 by Microsoft.
Nearly all new versions of common programs understand the Latin script
rules in the Open Type standard, except Internet Explorer.

The solution is to first transliterate Tamil into the Latin script and to
write a smartfont. Then nearly all recent versions of programs will let you
show Tamil correctly, except Internet Explorer. Internet Explorer does not
understand the Open Type standard. (Microsoft wrote the standard.)

People in the West, and those who came to the West, do not know the finer
points of our languages. The onus is on us to get our solution, Shriramana,
and it is not hard.

I did this for Singhala. See the following web site. It is all romanized
Sinhala in the background and shown in the complex native script. Do not
use Internet Explorer. If you use Firefox, you may have to click on a
second page to see the native Singhala smartfont dress the Latin script.
This obviates the need for a separate code page for Singhala:
http://www.lovatasinhala.com/liyanna.php


On Fri, Nov 11, 2011 at 8:27 AM, Shriramana Sharma samj...@gmail.com wrote:

 My father (Dr T Vasudevan, cc-ed here) is working on a spoken (actually
 video) tutorial project using Indian languages as the instruction medium.
 He would like to add Indic language subtitles in (obviously) Indic scripts.
 For now, Tamil, both as the language he is working on (as it is our mother
 tongue) as well as the Indic script which is simplest in terms of CTL.

 However it seems current video players (at least two OSS ones -- MPlayer
 and VideoLAN that they are using in that project) do not support CTL.

 In speaking about this with another friend of mine he was wondering
 whether Arabic people have worked on getting Arabic subtitles, as it would
 also involve CTL.

 Can anyone shed light on CTL in subtitles? Perhaps Arabic or hopefully
 Indic?

 --
 Shriramana Sharma




Re: Purpose of plain text (WAS: Re: combining: half, double, triple et cetera ad infinitum)

2011-11-14 Thread Naena Guru
If you use the same underlying text code, you could still differentiate
languages by fonts. It is a different way of looking at the problem. This
is true in the case of transliterated Indic. I haven't given thought as to
how that could be implemented with BIDI. When you compare the benefits
Indic would obtain through transliteration against having its own code
sets, you see that living in the SBCS has clear advantages over the DBCS. I
have done it, tested it and proven it. In simple terms, it reduces your
dependency.

As for making money, some people are better placed to make money and
produce things mainly for the sake of making money. Others might spend
effort and their own money knowing they have no good chance of making
money, but their work benefits other people or helps extricate them from a
bad situation they have fallen into.

Making solutions should not be solely about getting rich. We only need to
look around to see what such selfish action has done. According to
marketing strategy, if you can create a need, you can then craft the
solution. Look at TV commercials.

Thank you.

On Mon, Nov 14, 2011 at 10:18 AM, CE Whitehead cewcat...@hotmail.com wrote:

  Hi.
 From: Naena Guru naenaguru_at_gmail.com
 Date: Mon, 14 Nov 2011 09:30:40 -0600

   Unicode was created for a commercial reason, particularly for the
 benefit
   of its directors.
 I expect that it benefits more than just its directors (I don't expect
 anyone to act completely pro-bono; people have to get something out of what
 they do; sorry for commenting on this).
  The idea of Plain Text is not anything practical but was

  used as a means of attracting supporters, who for the most part hadn't
 had
   any experience with computers.
 I do believe that plain text is used in input forms, and the bidi
 algorithm at least becomes quite important here.  (Someone correct me if
 this is in error.)
  . . .


 Best,



 --C. E. Whitehead
 cewcat...@hotmail.com




Re: Purpose of plain text (WAS: Re: combining: half, double, triple et cetera ad infinitum)

2011-11-14 Thread Naena Guru
If it came out as though Unicode's only goal is making money, that is not
what I meant to say. Nothing can be such. You sell something for the
buyer's benefit, right? I apologize if you feel hurt over it. However, it
is probably the main objective. Who works for nothing, except odd crazies
like me?

Many voluntary contributions benefit the owners of Unicode. Unicode is open
for people like me to speak at. In contrast, the agency in Sri Lanka (ICTA)
is closed to the public. Its inspiration is the communist system. That
forces me to work on my own. My situation is like helping someone who
bought a lemon thinking it is a great Cadillac. I say that the bus others
are traveling in is good for him too.

When, years back, I asked why ligatures formed inside Notepad and not
inside Word, I got the clear reply that it was owing to a business
decision.

Let me try to say clearly what I want to say:
1. Unicode came up with the idea of one codepoint for one letter of any
language.
2. The justification was that in one text stream you could show all the
different languages. At least that is what I understood.
3. Point 2 above is not practical and does not work even now, after so many
years.
4. That the Indic code pages do not work so well for text processing is not
the fault of Unicode but that of the user groups.
5. However, technology arrived in those countries too late for actual
users, not bureaucrats, to understand the mistakes.
6. Therefore, I say that there was an undue push by Unicode to complete the
standard, by issuing ultimatums for registering ligatures etc.

Having said all that, all is not so bad. I say transliterate to Latin and
make smartfonts. It is a proven success.

You said, 'This thread presumes that display is, by orders of magnitude,
the most important aspect of text processing.' That is perfectly met by
using smartfonts.
As for 'every other operation that could be performed on text' being
secondary, that is beautifully met with fonts too.
I do not understand what you meant by 'jury-rigged to accommodate visual
display order'. Did you mean using unexpected shapes for Latin codes? If
you meant that, how do you justify earlier versions of the Unicode standard
specifically explaining that codepoints do not represent shapes, and that
Fraktur and Gaelic could very well use Latin as their underlying codes?

I think the ability to use text on the computer in the way you expect text
to behave is very important. For instance, if you have shape
representations mapped to code clusters, scanned text could be more
accurately digitized.

May peace be with you.


On Mon, Nov 14, 2011 at 2:51 PM, Doug Ewell d...@ewellic.org wrote:

 This thread presumes that display is, by orders of magnitude, the most
 important aspect of text processing, and that every other operation that
 could be performed on text is secondary.  Different writing systems all
 represented as font changes, using the same character codes.  Backing
 text streams jury-rigged to accommodate visual display order.  This is
 where the real insanity lies, for anyone who needs to do more with text
 than display it.

 The continued accusations that Unicode has been, and is only, a
 money-making endeavor for someone is too reprehensible and too far
 removed from reality to merit a response.

 --
 Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14
 www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell ­





Re: Continue: Glaring mistake in the code list for South Asian Script / Naena Guru

2011-11-14 Thread Naena Guru
On Thu, Nov 10, 2011 at 3:14 AM, delex r del...@indiatimes.com wrote:

 Naena Guru wrote:

 I have not read the entire thread of this conversation. It looks as if
 the debate has reached a level of acrimony.
 It has been available since the first week of September 2011 -- since 8
 September 2011, to be exact. You may go through it and the various
 responses and counter-responses. Thanks for your interest. "Acrimony"?
 Probably not. I would rather call it "sincere criticism".

Good attitude


  was abandoned because Indians could not come to a consensus. As a
 result, the rest of Indian languages including Tamil and Singhala were not
 even taken up for consideration.
 I am an Indian. Are you too?

Singhalese.


 Unicode had a different idea. The driving principle of Unicode was the
 Plain Text idea. “Every letter of every language on earth would have its
 own proud code point uniquely its own. [Imagine a time capsule when we go
 down in flames
 Yes. I accept it was a good and proud idea of the present generation of
 mankind.
  Unicode will be there to tell about the great human civilization]!”
 Ya, perhaps this is the reason I am so eagerly writing: so that we
 (Unicode) record and tell the right and correct things to future
 generations. If after an Armageddon no computer is left in this world,
 there is a chance some written documents may survive so that they can
 start from scratch. "They should know that the 'Assamese' language is not
 written in the 'Bengali' script." It is actually the other way round. That
 is, just replace "Assamese" with "Bengali" and "Bengali" with "Assamese"
 in the above quotation.

No idea what you are talking about


 Now, Unicode wanted the help of ISO to get to the authorities of those
 countries where government bureaucracies, not businesses mattered.
 Our authorities actually recognized the uniqueness of the “Assamese”
 script with unique and extra characters. Check ISCII  (Indian standard
 codes for Information……) documentation

Not my subject


 ..If you know how Third World governments work, you know what I mean.
 Lots of leg pulling among others

Yes, yes.



 The problem is that the Unicode scheme has divided the world and
 categorized scripts as computer friendly and barbarian.
 Well I am trying to tell Unicode that “Assamese”  is  not a barbarian
 language without a script.

Good luck, but see this:
http://www.lovatasinhala.com/index.php
That is complex and also simple Latin text.

You do not need to struggle with Unicode if you simply transliterate and
make a font. They, and the officials at home, all get indignant when you
point out their errors.

This method has the least dependency, and it stays with the most favored
languages, though some might not like the lowly company.


Re: combining: half, double, triple et cetera ad infinitum

2011-11-14 Thread Naena Guru
Thank you Asmus,

I made many here very angry. They have put a lot into bringing up the
standard, and therefore it is understandable. Evidence in these things can
never be proven; it only makes people madder. Besides, I worded it wrongly,
giving the impression that that is the only motive. That is not so. Many
volunteers sweated through it. On the other hand, no company would send
people to work at Unicode if it did not have an economic interest.

I should not have even said anything here as I know that there is an
alternative approach that does not hurt Unicode and, hopefully, its fans.
It is the combination of the most viable and stable portion of Unicode,
which is Latin-1 or the SBCS, and the Open Type standard. It is a
standards-compliant and elegant solution. See it here:
www.lovatasinhala.com/
Do not use IE. If you use Firefox, sometimes you need to pick another page
to see the complex script.

On Mon, Nov 14, 2011 at 2:29 PM, Asmus Freytag asm...@ix.netcom.com wrote:

  On 11/14/2011 7:30 AM, Naena Guru wrote:

 Unicode was created for a commercial reason, particularly for the benefit
 of its directors.


 This statement, not backed up by evidence, indicates a rather rudimentary
 understanding of the forces that were behind the creation of the universal
 character set. Coming as it does without details, it doesn't add much to
 the discussion.

 I can understand that some people are frustrated, when, over two decades
 after the basic design was hashed out, the implementation is still not
 seamless.

 The reason for that has to be sought in the inherent complexity of writing
 systems. Even apparently very simple writing systems can have surprising
 complexity when you try to support all areas of use and in high quality
 typography.

 Unicode is not just a character set, it provides a common framework for
 organizing and formalizing much knowledge about writing systems. As a
 result, there is now far more information and more accessible information
 about writing systems than when Unicode was started. Had all this
 information been available 20 to 25 years ago, there's a fair chance that
 the design of a universal character set would have differed in some aspects
 from what we now know as Unicode.

 But even with the benefit of such hindsight, the sheer complexity of
 writing systems remains. This complexity means that there will most likely
 never be any implementation that supports all writing systems to their
 fullest (highest quality). Every practical implementation will subset
 somewhere, and that means there's no guarantee that any two implementations
 will faithfully interoperate.

 It might be that a different design would have made some implementations
 easier, but I strongly suspect that the limitations I described here are
 fundamental, so that the expectation would be that a different design would
 have merely made other tasks more difficult while making certain ones
 easier.

 A./



Re: combining: half, double, triple et cetera ad infinitum

2011-11-14 Thread Naena Guru
For you to be able to read Unicode Sinhala, two things must happen. First,
as for all text, the font has to be present. The systems sold outside Sri
Lanka do not have a Sinhala Unicode font, which is understandable because
the community is very small.

Then there are cases where, even if you have the font, the characters do
not combine. You all may have seen the Sinhala font, but how do I know
whether you could actually read it? Here is a screen shot someone sent me
two days back, asking me to fix it. I could not help the person, and that
is tragic.

https://docs.google.com/leaf?id=0B0OoSILYvguZZDMzYzE1MDUtMjk2OS00YWI3LTk3NGItMDAyYzY2MmYwNWEyhl=en_US

The first characters you see in this screen shot are ශ්‌රී. These are two
letters that should combine to form this letter: ශ්‍රී. The combinations do
not happen anywhere, and it is gibberish for the reader.
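For anyone checking the code points: assuming the usual spelling of
'shrii', the two strings above differ only in the invisible joiner. U+200C
(ZWNJ) blocks the conjunct; U+200D (ZWJ) requests it. A quick Python dump:

    broken  = "\u0DC1\u0DCA\u200C\u0DBB\u0DD3"  # sha, al-lakuna, ZWNJ, ra, ii
    ligated = "\u0DC1\u0DCA\u200D\u0DBB\u0DD3"  # sha, al-lakuna, ZWJ, ra, ii
    for label, s in (("broken", broken), ("ligated", ligated)):
        print(label, " ".join(f"U+{ord(c):04X}" for c in s))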

As far as I can understand, he has Uniscribe installed through 'files for
complex scripts'.

On Mon, Nov 14, 2011 at 6:00 PM, Tom Gewecke t...@bluesky.org wrote:


 On Nov 14, 2011, at 3:39 PM, Naena Guru wrote:

 
  I made many here very angry.

 Not into anger myself, but you certainly lost my respect when you insisted
 we could not read Unicode Sinhala text on our machines

The question remains.


Re: Re: Continue: Glaring mistake in the code list for South Asian Script//Reply to Kent Karlsson

2011-11-03 Thread Naena Guru
I have not read the entire thread of this conversation. It looks as if the
debate has reached a level of acrimony. We need to inspect the background
to this entire controversy to come to a rational understanding.

In the nineties, there was a tug-of-war between ISO and Unicode on how to
digitize scripts like Indic. The ISO-8859 committee wanted to share the
code points from U+0080 to U+00FF, or the Latin-1 Supplement, among the
Western Europeans and anyone else who wanted to define their codes.
ISO-8859-12 (Devanagari) was abandoned because the Indians could not come
to a consensus. As a result, the rest of the Indian languages, including
Tamil and Singhala, were not even taken up for consideration.

Unicode had a different idea. The driving principle of Unicode was the
Plain Text idea. “Every letter of every language on earth would have its
own proud codepoint uniquely its own. [Imagine a time capsule when we go
down in flames. Unicode will be there to tell about the great human
civilization]!”

During a meeting with the Europeans held in America, Unicode's idea won. (A
juicier version of this story was on the Unicode web site but has since
been removed.) Now, Unicode wanted the help of ISO to get to the
authorities of those countries where government bureaucracies, not
businesses, mattered.

What motivations did Americans, Europeans and South Asians have about this
entire endeavor? Americans wanted to expand business to the outside world
(the good word is Globalizing). The Europeans had their own domineering
commercial control over the former colonies, yet with a sentimental
affection.

In South Asia, where the literacy rate was very low (except in Sri Lanka),
only a minority of the English-educated elite had any exposure to
computers. The Indic region was still struggling with the typewriter. There
was no public awareness of what was happening in America called Unicode. If
anything, it was some piece of news that some esoteric thing was going on,
where the Americans with their great scientific prowess were going to give
them something grand -- a white elephant, perhaps. At least in Sri Lanka,
the motivation was the World Bank's offer of loans. They got a European who
could not read Sinhala to sign for the standard. Other than the promise of
cannot read Sinhala to sign for the standard. Other than the promise of
money and jobs, they had no idea where they were going. If you know how
Third World governments work, you know what I mean.

With twelve years of making the Arial font etc., we arrived at the
conclusion that one default font with all the letters of the world is not
practical, and is kind of ridiculous. We now have the SBCS, DBCS etc. to
spread several scripts across, and, at least in Sri Lanka, bureaucrats as
surrogates to give excuses and reassurances to the public. ('Computers have
gone 16-bit', 'The only standard is Unicode; the others are hack jobs'.)
They are contemplating making laws to force Unicode Sinhala on people!

All mature software is written for the SBCS. There is no incentive for
those whose programs work so well to hazard rewriting their programs to
accommodate the DBCS. Compiler makers are encouraging programmers to
globalize and write their programs for Unicode, but who is going to pay to
overhaul 30-to-40-year-old programs that have stabilized and reached near
perfection?
Besides, software piracy was the order of the day outside America.

Presently, the Indian and Lankan general public has arrived at a point
where they are able to use their languages on the computer. People want to
communicate using their computers in their native languages. Unicode Indic
is very hard to use. It is nothing like English. It requires so many new
things like word processing programs, physical keyboards and such. Unicode
came way before they were ready, and they have now become victims. It is
long past the Korean debacle. Now Unicode is set in concrete.

The problem is that the Unicode scheme has divided the world and
categorized scripts as computer-friendly and barbarian.

I do not know about CJKV, but Indic would have been much better off had
they made their standards within the SBCS. I tested this for Sinhala, and
it is a great success. See it at the following link (please do not use
Explorer because it does not support full-font downloading or Open Type
font rendering):
http://www.lovatasinhala.com/

Sinhala is one of, if not *the*, most complex of the Indic scripts and has
two major orthographies, Mixed Sinhala and Pali. I studied the division of
the vyaakarana (grammar) of Sinhala / Sanskrit writing and made a
comprehensive transliteration onto ISO-8859-1 with no loss. And then I made
a smartfont to dress the Latin encodings in the native script. People use
this system unaware that the underlying code is 'English'. It is a
refinement of
Anglicizing. You type the way you speak and the orthographic font shows it
magically in its full complexity. It has been in existence since 2005
despite the government bureaucrats disparaging it.

Sinhala includes Sanskrit at the core of its phoneme chart or Hodiya. It
also covers Pali. You use this system just like 

How is NBH (U0083) Implemented?

2011-08-01 Thread Naena Guru
The Unicode character NBH (No Break Here, U+0083) is understood as a hidden
character that is used to keep two adjoining visual characters from being
separated in operations such as word wrapping. It seems to be similar to
ZWNJ (Zero Width Non-Joiner, U+200C) in that it can prevent the automatic
formation of a ligature as programmed in a font. However, it seems to me
that an NBH evokes a question mark (?). Is this an oversight by
implementers, or am I making wrong assumptions?

There is also NBSP (No-Break Space, U+00A0), which I think has to be mapped
to the space character in fonts, and which glues two letters together with
a space. If you do not want a space between two letters and also want to
prevent glyph substitutions from happening, then NBH seems to be the
correct character to use.

NBH is more appropriate for use within ISO-8859-1 characters than ZWNJ,
because the latter is double-byte. Programs that handle SBCS well ought to
be afforded the use of NBH, as it is an SBCS character. Or am I completely
mistaken here?
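The database properties of the three characters can be checked from Python;
note that U+0083 is a C1 control code (category Cc), which may be why some
implementations show a fallback glyph for it:

    import unicodedata

    for cp in (0x0083, 0x00A0, 0x200C):
        ch = chr(cp)
        name = unicodedata.name(ch, "<no name: C1 control>")
        print(f"U+{cp:04X}  {name}  category={unicodedata.category(ch)}")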

Thank you.