Re: Indic Devanagari Query

2003-01-29 Thread Keyur Shroff
Hi Aditya,

--- Aditya Gokhale [EMAIL PROTECTED] wrote:
 I had few query regarding representation of Devanagari script in
 Unicode
 (Code page - 0x0900 - 0x097F). Devanagari is a writing script, is used in
 Hindi, Marathi and Sanskrit languages. I have following questions - 
 
 
 In the same script code page, how do I use these two different Glyphs, to
 represent the same character ? Is there any way by which I can do it in
 an Open type font and Free type font implementation ?

Yes, it is certainly possible with an OpenType font. Please note that
FreeType is not a font format; it is a rendering library used to rasterize
different kinds of fonts, including TrueType and OpenType fonts.

In an OpenType font, you can include all glyphs with alternate shapes and
then select one of them depending upon the script and language. The
application should specify the script and language tags when sending
character codes to the OpenType rendering library/engine. All substitutions
will then take place depending on the language and/or script selection.
There should be a default script in the font. Similarly, there will be a
default language for that script, which is used as the fallback if the
application does not specify which language to use for processing.

From the list of alternate glyphs, you may want to use the glyph for the
default language as the entry in the cmap table. This default glyph can then
be substituted by an alternate glyph depending upon the language
specification. You have to use the GSUB table and write a language-dependent
lookup for the substitution.
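
The selection logic described above can be sketched as a toy model. This is
not a real font or shaping API: the glyph names ("la.mar", "sha.mar") and the
LOCALIZED_FORMS table are hypothetical, and a real font would express the
same idea as language-dependent GSUB lookups. Only the tags ("deva", "MAR",
"SAN") follow OpenType naming.

```python
# Toy model of language-dependent glyph substitution (not a real font API).
# Glyph names such as "la.mar" are hypothetical placeholders.

# cmap: code point -> glyph for the default language
CMAP = {"\u0932": "la", "\u0936": "sha"}

# (script tag, language tag) -> {default glyph: alternate glyph}
LOCALIZED_FORMS = {
    ("deva", "MAR"): {"la": "la.mar", "sha": "sha.mar"},
    ("deva", "SAN"): {"la": "la.mar", "sha": "sha.mar"},
}


def shape(text, script="deva", language=None):
    """Map characters to glyph names, applying language-specific alternates.

    If no language is given (or it has no entry), the default glyphs
    from the cmap are used unchanged.
    """
    substitutions = LOCALIZED_FORMS.get((script, language), {})
    glyphs = []
    for ch in text:
        glyph = CMAP.get(ch, ".notdef")
        glyphs.append(substitutions.get(glyph, glyph))
    return glyphs


print(shape("\u0932\u0936"))                  # default language
print(shape("\u0932\u0936", language="MAR"))  # Marathi alternates
```

The character codes stay the same in both calls; only the language tag
supplied by the application changes which glyphs are chosen.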

 
 2. Implementation Query - 
 In an implementation where I need to send / process Hindi, Marathi
 and Sanskrit data, how do I differentiate between languages (Hindi,
 Marathi and Sanskrit). Say for example, I am writing a translation
 engine, and I want to translate a document having Hindi, Marathi and
 Sanskrit Text in it, how do I know from the code points between 0x0900
 and 0x097F, that the data under perusal is Hindi / Marathi / Sanskrit ?

Unicode is not divided into code pages. Unlike some older encodings, the
entire Unicode standard is effectively a single code page. However, for
readability and quick reference, the code charts are divided into
sections, which you might be interpreting as code pages.

 I would suggest that we should give different code pages for Marathi,
 Hindi and Sanskrit. May be current code page of Devanagari can be traded
 as Hindi and two new code pages for Marathi and Sanskrit be added. This
 could solve these issues. If there is any better way of solving this, any
 one suggest.


Unicode gives code points to scripts only, not to languages. In fact it is
not desirable to give code points to individual languages that share the
same script. Also, Unicode encodes characters, which have abstract meaning
and properties; it does not encode glyphs. The glyph shapes shown in the
Unicode charts are given just for convenience and do not prescribe the
shapes to be used in a font. The shape of the glyph for a Unicode character
may vary from one font to another. Since it is already possible to select
the proper glyph(s) depending upon the language selection, this scheme is
suitable for all Indian languages.


 
 
 3. Character codes for jna, shra, ksh - 
 
 In Sanskrit and Marathi jna, shra and ksh are considered as separate
 characters and not ligatures. How do we take care of this ? Can I get
 over all views on the matter from the group ? In my opinion they should
 be given different code points in the specific language code page.
 Please find below the character glyphs - 
 
 jna
 shra
 ksh

All of the above can be composed through following consonant clusters:
  jna - ja halant nya
  shra - sha halant ra
  ksh - ka halant ssha

The point that the above sequences are considered characters in some
Indian languages has merit. If there is demand from native speakers,
a proposal can be submitted to Unicode. There is a predefined
procedure for proposal submission. Once this is discussed with the concerned
people and agreed upon, these ligatures could be added to the Devanagari
script itself, because the Devanagari script represents all three languages
you mentioned, namely Sanskrit, Marathi, and Hindi. Meanwhile you can write
rules for composing them from the consonant clusters.
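
For reference, the three clusters listed above can be written out as code
point sequences. This is only a small sketch; the labels "jna", "shra" and
"ksh" are the informal names used in this thread, and the helper describe()
is introduced here for illustration.

```python
import unicodedata

HALANT = "\u094D"  # DEVANAGARI SIGN VIRAMA

# Each cluster is a sequence of three code points, not a single character.
CLUSTERS = {
    "jna":  "\u091C" + HALANT + "\u091E",  # JA + halant + NYA
    "shra": "\u0936" + HALANT + "\u0930",  # SHA + halant + RA
    "ksh":  "\u0915" + HALANT + "\u0937",  # KA + halant + SSA
}


def describe(sequence):
    """Return the Unicode character names for each code point."""
    return [unicodedata.name(ch) for ch in sequence]


for label, seq in CLUSTERS.items():
    print(label, seq, describe(seq))
```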

Regards,
Keyur


__
Do you Yahoo!?
Yahoo! Mail Plus - Powerful. Affordable. Sign up now.
http://mailplus.yahoo.com




Re: Indic Devanagari Query

2003-01-29 Thread Keyur Shroff
Hi,

I forgot to reply to the implementation query. The reply is inline.

--- Aditya Gokhale [EMAIL PROTECTED] wrote:
 2. Implementation Query - 
 In an implementation where I need to send / process Hindi, Marathi
 and Sanskrit data, how do I differentiate between languages (Hindi,
 Marathi and Sanskrit). Say for example, I am writing a translation
 engine, and I want to translate a document having Hindi, Marathi and
 Sanskrit Text in it, how do I know from the code points between 0x0900
 and 0x097F, that the data under perusal is Hindi / Marathi / Sanskrit ?
 I would suggest that we should give different code pages for Marathi,
 Hindi and Sanskrit. May be current code page of Devanagari can be traded
 as Hindi and two new code pages for Marathi and Sanskrit be added. This
 could solve these issues. If there is any better way of solving this, any
 one suggest.

Instead of changing, or recommending a change to, an encoding standard,
your problem can best be solved in your application. You can use tags in
your text to specify the language. Unicode also provides a mechanism for
tagging text, but its use is highly discouraged. So you can use a markup
language similar to XML or HTML to mark language boundaries. Then parse
your text, identify the language boundaries, and do further processing
depending upon the language.
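
As a minimal sketch of the tagging approach, assuming the text is wrapped in
XML with xml:lang attributes: the element names "doc" and "span" and the
one-letter sample texts are placeholders chosen for this example, not part
of any standard schema.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Placeholder document: each span carries its own language tag.
DOCUMENT = (
    "<doc>"
    '<span xml:lang="hi">\u0915</span>'
    '<span xml:lang="mr">\u0916</span>'
    '<span xml:lang="sa">\u0917</span>'
    "</doc>"
)


def language_runs(xml_text):
    """Return (language, text) pairs in document order."""
    root = ET.fromstring(xml_text)
    return [(el.get(XML_LANG), el.text or "") for el in root.iter("span")]


for lang, text in language_runs(DOCUMENT):
    # Dispatch to language-specific processing here.
    print(lang, text)
```

The code points alone do not identify the language; the markup does, so the
translation step can be selected per run.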

If you don't want to use tags in your text, then you can predict the
language with some heuristic. The heuristic can be based on language
properties that differ among the three languages. In this case
your processing is divided into two phases. The first phase applies a
heuristic rule to identify language boundaries in the plain text, and the
second is the actual processing of the text for translation. But beware
that the result will not always be accurate with such heuristic
processing. Hence the use of tags is recommended.

Regards,
Keyur






Re: Indic Devanagari Query

2003-01-29 Thread Aditya Gokhale

Hello,
Thanks for the reply. I will check the points you made as far as the
font issues are concerned. We all know how jna, shra and ksh are formed in
Unicode and ISCII, but the point I wanted to make was that if we have to
sort / search / process data in Devanagari script, then we have to keep
track of at least three characters and not one. This becomes tedious, though
not impossible. If a single code point were present, it would be very easy
to process.
With regard to predicting the language by using some heuristic, in my
opinion it is a very risky solution, at least when I don't have much
information at stage one of my application. I am running an OCR engine on a
Devanagari page and then tagging the language based on the formatting. So I
think tagging, as I am doing right now, is a better solution. I also agree
with the views expressed by Asmus Freytag, that if we go on including all
the 6000 languages, it will be practically impossible to cross-correlate
these 'code pages'.

-Aditya






RE: Indic Devanagari Query

2003-01-29 Thread Marco Cimarosti
Aditya Gokhale wrote:
 Hello Everybody,
 I had few query regarding representation of Devanagari 
 script in Unicode

All your questions are FAQs, so I'll just reference the entries which
answer them.

 (Code page - 0x0900 - 0x097F). Devanagari is a writing 
 script, is used in Hindi, Marathi and Sanskrit languages. I 
 have following questions - 

Unicode has no code pages:
http://www.unicode.org/faq/basic_q.html#18

 1. In Marathi and Sanskrit language two characters glyphs of 
 'la' and 'sha' are represented differently as shown in the 
 image below - 
  (First glyph is 'la' and second one is 'sha')
 as compared to Hindi where these character glyphs are 
 represented as shown in the image below - 
 (First glyph is 'la' and second one is 'sha')

Unicode encodes (abstract) characters, not glyphs:
http://www.unicode.org/faq/han_cjk.html#3

(This FAQ is in the Chinese/Japanese/Korean section because it is more often
raised for Chinese ideograms.)

 In the same script code page, how do I use these two 
 different Glyphs, to represent the same character ? Is there 
 any way by which I can do it in an Open type font and Free 
 type font implementation ?

Unicode's requirements for fonts:
http://www.unicode.org/faq/font_keyboard.html#1

A few links to OpenType stuff:
http://www.unicode.org/faq/font_keyboard.html#4

 2. Implementation Query - 
 In an implementation where I need to send / process 
 Hindi, Marathi and Sanskrit data, how do I differentiate 
 between languages (Hindi, Marathi and Sanskrit). Say for 
 example, I am writing a translation engine, and I want to 
 translate a document having Hindi, Marathi and Sanskrit Text 
 in it, how do I know from the code points between 0x0900 and 
 0x097F, that the data under perusal is Hindi / Marathi / Sanskrit ?

What you need here is some sort of language tagging:
http://www.unicode.org/faq/languagetagging.html

 I would suggest that we should give different code pages 
 for Marathi, Hindi and Sanskrit. May be current code page of 
 Devanagari can be traded as Hindi and two new code pages for 
 Marathi and Sanskrit be added. This could solve these issues. 
 If there is any better way of solving this, any one suggest.

Characters are encoded per script, not per language:
http://www.unicode.org/faq/basic_q.html#17

 3. Character codes for jna, shra, ksh - 
 
 In Sanskrit and Marathi jna, shra and ksh are considered as 
 separate characters and not ligatures. How do we take care of 
 this ? Can I get over all views on the matter from the group 
 ? In my opinion they should be given different code points in 
 the specific language code page.
 Please find below the character glyphs - 

Unicode encodes Indic analytically:
http://www.unicode.org/faq/indic.html#17

 thanks,

For more details about Devanagari in Unicode, see Chapter 9 of the Standard:
http://www.unicode.org/uni2book/ch09.pdf

_ Marco




Re: Indic Devanagari Query

2003-01-29 Thread Keyur Shroff

--- Asmus Freytag [EMAIL PROTECTED] wrote:

 
 All of the above can be composed through following consonant clusters:
jna - ja halant nya
shra - sha halant ra
ksh - ka halant ssha
 
 The point that the above sequences are considered as characters in some
 of
 the Indian languages has merit. If there is demand from native speakers
 then a proposal can be submitted to Unicode. There is a predefined
 procedure for proposal submission. Once this is discussed with concerned
 people and agreed upon then these ligatures can be added in Devanagari
 script itself because Devenagari script represent all three languages
 you
 mentioned namely Sanskrit, Marathi, and Hindi. Meanwhile you can write
 rules for composing them from the consonant clusters.
 
 I wouldn't go so far. The fact that clusters belong together is something
 
 that can be handled by the software. Collation and other data processing 
 needs to deal with such issues already for many other languages. See 
 http://www.unicode.org/reports/tr10 on the collation algorithm.

I beg to differ with you on this point. Merely having some provision for
composing a character doesn't mean that the character is not a candidate
for inclusion as a separate code point. India is a big country with millions
of people who are geographically divided and speak a variety of languages.
Sentiments are attached to cultures, which may vary from one geographical
area to another. So when one of the many languages sharing the same
script dominates the entire encoding for the script, other groups of
people may feel that their language has not been represented properly in
the encoding. While Unicode encodes scripts only, the aim was to provide
sufficient representation to as many languages as possible.

In Unicode many characters have been given code points regardless of the
fact that the same character could have been rendered through some
composition mechanism. This includes Indic scripts as well as other scripts.
For example, in the Devanagari script some code points are allocated to
characters (consonant + nukta) even though the same characters could be
produced with a combination of the consonant and the nukta. Similarly, in
the Latin-1 range [U+0080-U+00FF] there are a few characters which can be
produced otherwise. That is why text should be normalized to either the
precomposed or the decomposed character sequence before further processing
in operations like searching and sorting.
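
A short sketch of that normalization step, using Python's standard
unicodedata module. The helper name same_text() is made up for this example;
the sample characters are DEVANAGARI LETTER QA (U+0958) versus KA + NUKTA,
and LATIN SMALL LETTER A WITH DIAERESIS (U+00E4) versus a + COMBINING
DIAERESIS.

```python
import unicodedata

precomposed = "\u0958"        # DEVANAGARI LETTER QA
decomposed = "\u0915\u093C"   # KA + NUKTA


def same_text(a, b, form="NFD"):
    """Compare two strings after bringing both to one normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# A raw comparison sees two different code point sequences.
print(precomposed == decomposed)

# After normalization the two spellings compare as equal.
print(same_text(precomposed, decomposed))

# The same applies in the Latin-1 range.
print(same_text("\u00E4", "a\u0308"))
```

Either form works for comparison, as long as both operands are normalized
the same way before searching or sorting.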

Also, processing of text often depends on the smallest addressable
unit of the language. Again, as discussed in earlier e-mails, this may vary
from one language to another in the same script. Consider a case where a
language processor/application wants to count the number of characters in
some text in order to find the number of keystrokes required to input the
text. Further assume that the API functions used for this purpose are based
on either WChar (wide characters) or UTF-8. In this case it is necessary
to assign the character, say Kssha, to the class 'consonant'. Since
assignment to this class applies to a single code point (the
smallest addressable unit) and not to a sequence of codes, it is
necessary to have a single code point for the character Kssha.

This is my understanding. Please enlighten me if I am wrong.

Regards,
Keyur






Suggestions in Unicode Indic FAQ

2003-01-29 Thread Keyur Shroff
Hello,

There are a few discrepancies in the Indic FAQ. Though they were reported
earlier by Andy White, I see they are still present in the FAQ. I also
clarified them, but by mistake I sent the mail to the Yahoo group where this
mailing list is archived, and hence my mail never reached this mailing list.
You can refer to the link http://groups.yahoo.com/group/unicode/message/16352


The following are the suggestions.

SUGGESTION-1:

In the FAQ
   http://www.unicode.org/faq/indic.html#2
it is mentioned that 

ISCII:             Unicode:
Halant + Halant    Halant + ZWJ

produce a similar result. This is wrong. In ISCII, Halant+Halant is known as
explicit halant, and its Unicode equivalent sequence is Halant+ZWNJ. So ZWJ
should be replaced by ZWNJ.
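
The corrected mapping can be sketched as a small converter. To keep the
example independent of any particular ISCII byte table, the input is a list
of symbolic token names ("KA", "HALANT", ...) rather than raw ISCII bytes;
the token names and the LETTERS table are assumptions made for this sketch.

```python
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA (halant)

# Symbolic ISCII letter tokens -> Unicode characters (illustrative subset).
LETTERS = {"KA": "\u0915", "RA": "\u0930", "SSA": "\u0937"}


def iscii_tokens_to_unicode(tokens):
    """Convert symbolic ISCII tokens, mapping HALANT HALANT to virama + ZWNJ."""
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "HALANT":
            out.append(VIRAMA)
            # A doubled halant is the ISCII explicit halant.
            if i + 1 < len(tokens) and tokens[i + 1] == "HALANT":
                out.append(ZWNJ)
                i += 1
        else:
            out.append(LETTERS[tok])
        i += 1
    return "".join(out)


plain = iscii_tokens_to_unicode(["KA", "HALANT", "SSA"])
explicit = iscii_tokens_to_unicode(["KA", "HALANT", "HALANT", "SSA"])
print([f"U+{ord(c):04X}" for c in plain])
print([f"U+{ord(c):04X}" for c in explicit])
```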


SUGGESTION-2:

In the FAQ
   http://www.unicode.org/faq/indic.html#16

It is mentioned that the following are equivalent:

ISCII            Unicode
KA halant INV    KA virama ZWJ
RA halant INV    RAsup (i.e., repha)

In fact there is no way in Unicode to produce RAsup directly, i.e., without
using a base consonant. The sequence RA virama ZWJ will actually produce
half-RA (or eyelash-RA), which is commonly used in Marathi. Eyelash-RA can
also be produced with the sequence RA Halant Nukta, both in ISCII
(where it is known as soft halant) and in Unicode (just for conformance
with ISCII).

Also, in the same answer the following sequence is recommended.

ISCII            Unicode
INV halant RA    SPACE virama RA (RAsub)



SUGGESTION-3:

Use of the SPACE character as a consonant may create problems for a state
machine which finds language/syllable boundaries. In fact we need a code
point for an invisible consonant (similar to INV in ISCII) in Unicode, which
would solve this problem.

After inclusion of INV character the following can be recommended.

ISCII            Unicode
KA halant INV    KA virama INV
RA halant INV    RA virama INV (i.e., repha)
INV halant RA    INV virama RA (RAsub)

The INV character in Unicode can also be used for displaying dependent
vowel matras without dotted circle.

Unicode
INV Vowel sign O
INV Vowel sign AI

etc. This can replace existing definition of SPACE as invisible consonant
depending on the context.

Any other pointers!!?

- Keyur






Re: Indic Devanagari Query

2003-01-29 Thread John Cowan
Keyur Shroff scripsit:

 Sentiments are attached with cultures which may vary from one geographical
 area to another. So when one of the many languages falling under the same
 script dominate the entire encoding for the script, then other group of
 people may feel that their language has not been represented properly in
 the encoding. 

Indeed, they may have such beliefs, but those beliefs are based on two
incorrect notions: that what the charts show is normative, and that the
codepoint is the proper unit of processing.

 In Unicode many characters have been given codepoints regardless of the
 fact that the same character could have been rendered through some compose
 mechanism. 

In every case this was done for backward compatibility with existing
encodings.  No new codepoints of this type will be added in future.

 That is why the text should be normalized to either pre-composed or
 de-composed character sequence before going for further processing in
 operations like searching and sorting.

The collation algorithm makes allowance for these points.
It will be quite typical to tailor the algorithm to take language-specific
rules into account.
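
To illustrate what such a tailoring amounts to, here is a deliberately
simplified sort key. It is not the Unicode Collation Algorithm: it only shows
a multi-code-point cluster being treated as one sorting unit. The choice to
sort the clusters after all single letters is a hypothetical rule picked for
the example, and collation_key() is a name invented here.

```python
KSSA = "\u0915\u094D\u0937"  # KA + virama + SSA
JNYA = "\u091C\u094D\u091E"  # JA + virama + NYA


def collation_key(word, contractions=(KSSA, JNYA)):
    """Build a sort key in which each listed cluster is one unit.

    Hypothetical rule: clusters sort after every single code point,
    in the order given by `contractions`.
    """
    key = []
    i = 0
    while i < len(word):
        for rank, cluster in enumerate(contractions):
            if word.startswith(cluster, i):
                key.append((1, rank))       # weight for the whole cluster
                i += len(cluster)
                break
        else:
            key.append((0, ord(word[i])))   # plain code point weight
            i += 1
    return key


words = ["\u0915\u094D\u0937\u092E\u093E", "\u0939\u0935\u093E", "\u0915\u092E\u0932"]
print(sorted(words))                     # plain code point order
print(sorted(words, key=collation_key))  # cluster treated as one letter
```

The stored text is identical in both cases; only the comparison rule
changes, which is why no extra code point is needed for sorting.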

 Also, many times processing of text depends on the smallest addressable
 unit of that language. Again as discussed in earlier e-mails this may vary
 from one language to another in the same script. Consider a case when a
 language processor/application wants to count the number of characters in
 some text in order to find number of keystrokes required to input the text.

This will not work without knowledge of the keyboard layout in any case.
To enter Latin-1 characters on the Windows U.S. keyboard requires 5 keystrokes,
but they are represented by one or two Unicode characters.

-- 
Henry S. Thompson said, / Syntactic, structural,   John Cowan
Value constraints we / Express on the fly.         [EMAIL PROTECTED]
Simon St. Laurent: Your / Incomprehensible         http://www.reutershealth.com
Abracadabralike / schemas must die!                http://www.ccil.org/~cowan




Re: Indic Devanagari Query

2003-01-29 Thread Michael Everson
At 02:13 -0800 2003-01-29, Keyur Shroff wrote:

I beg to differ with you on this point. Merely having some provision for
composing a character doesn't mean that the character is not a candidate
for inclusion as separate code point.


Yes, it does.


India is a big country with millions of people geographically 
divided and speaking variety of languages. Sentiments are attached 
with cultures which may vary from one geographical area to another. 
So when one of the many languages falling under the same script 
dominate the entire encoding for the script, then other group of 
people may feel that their language has not been represented 
properly in the encoding.

A lot of these feelings are simply WRONG, and that has to be faced. 
The syllable KSSA may be treated as a single letter, but this does 
not change the fact that it is a ligature of KA and SSA and that it 
can be represented in Unicode by a string of three characters.

In Unicode many characters have been given codepoints regardless of the
fact that the same character could have been rendered through some compose
mechanism. This includes Indic scripts as well as other scripts. For
example, in Devanagari script some code points are allocated to characters
(ConsonantNukta) even though the same characters could be produced with
combination of the consonant and Nukta.


There are historical and compatibility reasons that most of this 
stuff, as well as the similar stuff in the Latin range, were encoded. 
At one point some years ago the line was drawn, normalization was 
enacted, and that was that.

Also, many times processing of text depends on the smallest addressable
unit of that language. Again as discussed in earlier e-mails this may vary
from one language to another in the same script. Consider a case when a
language processor/application wants to count the number of characters in
some text in order to find number of keystrokes required to input the text.


I can't think of any reason why this would be useful. And what if you 
were not typing, but speaking to your computer? Then there would be 
no keystrokes at all!

Further assume that API functions used for this purpose are based on either
WChar (wide characters) or UTF-8. In this case it is very much necessary
that you assign the character, say Kssha, to the class consonant. Since
assignment to this class consonant applies to single code point (the
smallest addressable unit) and not to the sequence of codes, it is very
much necessary to have single code point for the character Kssha.


We are not going to encode KSSA as a single character. It is a 
ligature of KA and SSA, and can already be represented in Unicode. 
You need to handle this consonant issue with some other protocol.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com



RE: Suggestions in Unicode Indic FAQ

2003-01-29 Thread Marco Cimarosti
Keyur Shroff wrote:
 In the FAQ
http://www.unicode.org/faq/indic.html#16
 
 It is mentioned that following are equivalent
 
 ISCII            Unicode
 KA halant INV    KA virama ZWJ
 RA halant INV    RAsup (i.e., repha)

The last line is really bizarre! I would agree that it is plain wrong...

What is supposed to appear in the "Unicode" column is the Unicode *encoding*
equivalent to "RA halant INV" in the ISCII column. But "RAsup (i.e.,
repha)" is the description of a *glyph*.

 In fact there is no way in Unicode to produce RAsup directly, 
 i.e., without using base consonant. [...]

I agree. This issue has been raised several times, and several viable
solutions have been proposed, but I don't recall Unicode officials
ever acknowledging the problem.

But probably this has been noted down and discussed. I hope to see an
official solution in TUS 4.0.

 SUGGESTION-3:
 
 Use of SPACE character as consonant may create problem for 
 state machine which finds language/syllable boundary.
 In fact we need a codepoint for one invisible consonant
 (similar to INV in ISCII) in Unicode which can solve
 this problem with Unicode.
 
 After inclusion of INV character the following can be recommended.
 
 ISCII            Unicode
 KA halant INV    KA virama INV
 RA halant INV    RA virama INV (i.e., repha)
 INV halant RA    INV virama RA (RAsub)

Why not represent INV with a double ZWJ? E.g.:

ISCII            Unicode
KA halant INV    KA virama ZWJ ZWJ
RA halant INV    RA virama ZWJ ZWJ (i.e., repha)
INV halant RA    ZWJ ZWJ virama RA (RAsub)

This has the advantage that the most common sequences will work OK also on
old display engines implemented *before* the double-ZWJ convention is
introduced.

E.g., sequence KA virama ZWJ ZWJ works well also on an old engine, for the
simple reason that the first ZWJ is enough to do the work, and  the second
ZWJ is invisible.

Of course, an old engine will still display a RA[eyelash] for RA virama
ZWJ ZWJ, but that is not worse than displaying RA+virama followed by a
white box, which is what would happen with your new INV character.

_ Marco




RE: Suggestions in Unicode Indic FAQ

2003-01-29 Thread Kent Karlsson

 The [new] INV character in Unicode can also be used for displaying dependent
 vowel matras without dotted circle.

A space followed by a dependent vowel sign should display just the
dependent vowel sign, no dotted circle.  Indeed, (except for a show
invisibles mode, or a character chart display mode) no (Indic or other)
text that does not contain the *character* DOTTED CIRCLE should ever
display a dotted circle as part of the displayed text. Systems that
do display a dotted circle (in normal display mode) where there is
no such *character* in the displayed text are buggy!

/Kent K

(B.t.w. the chart dotted circle glyph for combining characters
looks a bit different from the (normal) glyph for DOTTED CIRCLE.)




RE: Suggestions in Unicode Indic FAQ

2003-01-29 Thread Keyur Shroff

--- Marco Cimarosti [EMAIL PROTECTED] wrote:

 Why not representing INV with a double ZWJ? E.g.:
 
   ISCII            Unicode
   KA halant INV    KA virama ZWJ ZWJ
   RA halant INV    RA virama ZWJ ZWJ (i.e., repha)
   INV halant RA    ZWJ ZWJ virama RA (RAsub)
 
 This has the advantage that the most common sequences will work OK also
 on
 old display engines implemented *before* the double-ZWJ convention is
 introduced.
 
 E.g., sequence KA virama ZWJ ZWJ works well also on an old engine, for
 the
 simple reason that the first ZWJ is enough to do the work, and  the
 second ZWJ is invisible.
 
 Of course, an old engine will still display a RA[eyelash] for RA
 virama
 ZWJ ZWJ, but that is not worse than displaying RA+virama followed by a
 white box, which is what would happen with your new INV character.

Certainly. This looks more promising because even RAsub has two alternate
forms. One form is used with consonants KA, KHA, GHA, etc and the other
form is used with consonants TTA, TTHA, DDA, DDHA, etc. With your ZWJ based
scheme we can insert as many ZWJ as we wish to produce all possible
alternate forms!

But sometimes a user may want a visual representation of these symbols in
two different ways: with a dotted circle and without one. An example of
this could be RAsup on top of a dotted circle and RAsup on top of a space
character. The current use of the space character to eliminate the dotted
circle is really painful and may create problems in determining language
and syllable boundaries. The main problem with the space character is that,
unlike ZWJ/ZWNJ/Dotted Circle, it falls within the range of another
important script, Latin. Ultimately it may affect any text processing that
uses Unicode characters to find language boundaries. Use of an INV character
can solve all these problems in one shot. We can put it in the consonant
class, which can help text processing applications. Moreover, it will be
difficult to provide upward compatibility all the time, even though it is
desirable. Implementations of Unicode will need to be upgraded with every
introduction of new glyphs or rules. Otherwise applications have to
explicitly declare the version of Unicode used in the implementation.

- Keyur






RE: Indic Devanagari Query

2003-01-29 Thread Kent Karlsson


  I wouldn't go so far. The fact that clusters belong together is something
  that can be handled by the software. Collation and other data processing 
  needs to deal with such issues already for many other languages. See 
  http://www.unicode.org/reports/tr10 on the collation algorithm.
 
 I beg to differ with you on this point. Merely having some provision for
 composing a character doesn't mean that the character is not a candidate
 for inclusion as separate code point. 

At this point, having some provision for composing a particular letter
is very much preventing it from being encoded at a separate code position.
This is due mostly to the fixation of normal forms (except for very rare
error corrections).

 In Unicode many characters have been given codepoints regardless of the
 fact that the same character could have been rendered through some compose
 mechanism. This includes Indic scripts as well as other scripts. For

For legacy reasons, yes.  These reasons no longer apply for
not-yet-encoded compositions.

 Also, many times processing of text depends on the smallest addressable
 unit of that language. Again as discussed in earlier e-mails this may vary
 from one language to another in the same script. Consider a case when a
 language processor/application wants to count the number of characters in
 some text in order to find number of keystrokes required to input the text.

You cannot find the number of keystrokes that way.  Not even 
if you know which keyboard (and disregarding backspace).  E.g.
ä can be produced by one or two (or more, if you count hex input)
keystrokes on (most) Swedish keyboards.

 Further assume that API functions used for this purpose are based on either
 WChar (wide characters) or UTF-8. In this case it is very much necessary
 that you assign the character, say Kssha, to the class consonant. Since
 assignment to this class consonant applies to single code point (the
 smallest addressable unit) and not to the sequence of codes, it is very
 much necessary to have single code point for the character Kssha.

No, that is not the case.  E.g. Hungarian (Magyar) has gy, ny, ly
(and more) as letters (look in a Hungarian dictionary, and its headings).
Similarly, Albanian has dh, rr, th (and more) as letters. None of
these combinations are candidates for single code point allocation.  For 
compatibility reasons the Dutch ij got a single code point, but it
is better to just use i followed by j (though that has some
difficulties; e.g. the titlecase of ijs is IJs, not Ijs).
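
A rough sketch of how software can count such letters without any dedicated
code points: a greedy matcher that treats listed multi-code-point sequences
as single letters. The function name count_letters() and the letter lists are
assumptions for illustration; a real implementation would also need
language-specific rules for cases where the sequence is not one letter.

```python
KSSA = "\u0915\u094D\u0937"  # KA + virama + SSA


def count_letters(text, multi_letters):
    """Count letters, treating each listed sequence as a single letter."""
    # Try longer candidates first so the longest match wins.
    candidates = sorted(multi_letters, key=len, reverse=True)
    count = 0
    i = 0
    while i < len(text):
        for letter in candidates:
            if text.startswith(letter, i):
                i += len(letter)
                break
        else:
            i += 1  # no multi-code-point letter here: one code point
        count += 1
    return count


HUNGARIAN = ["gy", "ny", "ly"]
print(len("magyar"), count_letters("magyar", HUNGARIAN))  # code points vs. letters
print(len(KSSA), count_letters(KSSA, [KSSA]))
```

The class of a letter is then assigned to the matched sequence, not to an
individual code point.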

/Kent K





RE: Suggestions in Unicode Indic FAQ

2003-01-29 Thread Keyur Shroff

--- Kent Karlsson [EMAIL PROTECTED] wrote:
 
 A space followed by a dependent vowel sign should display just the
 dependent vowel sign, no dotted circle.  Indeed, (except for a show
 invisibles mode, or a character chart display mode) no (Indic or
 other)
 text that does not contain the *character* DOTTED CIRCLE should ever
 display a dotted circle as part of the displayed text. Systems that
 do display a dotted circle (in normal display mode) where there is
 no such *character* in the displayed text are buggy!

In Indic scripts, any sign that appears in text without a
valid consonant base may be rendered with a dotted circle as a fallback
mechanism (Section 5.14, Rendering Nonspacing Marks,
http://www.unicode.org/uni2book/ch05.pdf). Any system implementing this as
its default behaviour should not be considered buggy. What the
default rendering behaviour should be (i.e., show hidden or not) may vary
from one script to another and also depends on implementation policy.

For scripts other than Indic scripts, it may be useful to render the
nonspacing mark without dotted circle because even after rendering it as an
overlap glyph, the result is recognizable. However, for Indic scripts use
of dotted circle is very useful as default behaviour since it gives
immediate feedback to the user that there may be some defective combining
character in the text. Most of the time such errors are unintentional
rather than intentional.

Unicode has a provision to remove this dotted circle. The space character
is used to indicate to the fallback mechanism that no dotted circle should
be used while rendering a stand-alone sign which is normally attached to
other characters. This is useful when a user wants to display the
sign without any circle. Also, with this scheme it is possible to show some
combining marks with a dotted circle and some without.
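
The fallback behaviour described above could be modelled roughly as follows.
This is a sketch of one possible display policy, not what any particular
rendering engine does: here a combining mark counts as lacking a base only at
the start of the text or directly after a line break or tab, and a preceding
SPACE suppresses the dotted circle. The function name add_fallback_bases()
is invented for this example.

```python
import unicodedata

DOTTED_CIRCLE = "\u25CC"


def is_mark(ch):
    """True for combining marks (general categories Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")


def add_fallback_bases(text):
    """Insert a dotted circle before marks that have nothing to attach to.

    Simplified policy (an assumption of this sketch): a mark is treated as
    defective only at the start of the text or after a line break or tab.
    A mark after SPACE is left alone, so it displays without a circle.
    """
    out = []
    prev = None
    for ch in text:
        if is_mark(ch) and (prev is None or prev in "\n\r\t"):
            out.append(DOTTED_CIRCLE)
        out.append(ch)
        prev = ch
    return "".join(out)


VOWEL_SIGN_O = "\u094B"
print(add_fallback_bases(VOWEL_SIGN_O))           # circle added
print(add_fallback_bases(" " + VOWEL_SIGN_O))     # space as base, no circle
print(add_fallback_bases("\u0915" + VOWEL_SIGN_O))  # normal syllable, unchanged
```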

- Keyur






RE: Suggestions in Unicode Indic FAQ

2003-01-29 Thread Marco Cimarosti
Keyur Shroff wrote:
 But sometimes a user may want visual representation of these 
 symbols in two different ways: with dotted circle and
 without dotted circle.

Why not use a dotted circle character explicitly, when you want to see one?

 Example of
 this could be RAsup on top of dotted circle and RAsup on top of space
 character. Current use of space character to eliminate dotted 
 circle is really painful and may create problems in determining 
 language and syllable boundaries.

Language or syllable boundaries have nothing to do with this. These special
sequences should *never* be part of any syllable or word in any language:
they are just a way of showing the shape of a glyph, to be used when, e.g.,
talking about typography or spelling.

 The main problem with space character is that unlike
 ZWJ/ZWNJ/Dotted Circle, it falls within the range of other 
 important script Latin. 

Plain wrong! White-space characters and punctuation do not belong to any
script: characters such as " ", "!" and "?" are used for many scripts and
languages. Even the danda punctuation, which is in the Devanagari range,
does not belong to Devanagari: it is also used for other Indic scripts.
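
As a rough illustration of the point above, here is a minimal Java sketch
that asks the JDK which Script property value it assigns to a few code
points. The class and method names are made up for this example, and the
answers depend on the Unicode version bundled with the JDK in use.

```java
import java.lang.Character.UnicodeScript;

final class ScriptOfDemo {
    /** Returns the Script property value the running JDK reports for a code point. */
    static UnicodeScript scriptOf(int codePoint) {
        return UnicodeScript.of(codePoint);
    }

    public static void main(String[] args) {
        // SPACE, EXCLAMATION MARK, QUESTION MARK, DEVANAGARI DANDA, DEVANAGARI LETTER KA
        int[] samples = {0x0020, 0x0021, 0x003F, 0x0964, 0x0915};
        for (int cp : samples) {
            System.out.printf("U+%04X %s -> %s%n", cp, Character.getName(cp), scriptOf(cp));
        }
    }
}
```

On current JDKs the space, the punctuation and the danda are reported as
COMMON, while the letter KA is reported as DEVANAGARI.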

 Use of INV character in one shot can solve all these
 problems. We can put it in consonant class which
 can help text processing applications. [...]

How can calling a consonant something which has nothing to do with
consonants help anybody doing anything?

_ Marco




Re: Indic Devanagari Query

2003-01-29 Thread Christopher John Fynn
 Michael Everson wrote:

 At 02:13 -0800 2003-01-29, Keyur Shroff wrote:
 I beg to differ with you on this point. Merely having some provision for
 composing a character doesn't mean that the character is not a candidate
 for inclusion as separate code point.
 
 Yes, it does.
 
 India is a big country with millions of people geographically 
 divided and speaking variety of languages. Sentiments are attached 
 with cultures which may vary from one geographical area to another. 
 So when one of the many languages falling under the same script 
 dominate the entire encoding for the script, then other group of 
 people may feel that their language has not been represented 
 properly in the encoding.

 A lot of these feelings are simply WRONG, and that has to be faced. 
 The syllable KSSA may be treated as a single letter, but this does 
 not change the fact that it is a ligature of KA and SSA and that it 
 can be represented in Unicode by a string of three characters.

Of course an anomaly is that KSSA *is* encoded in the Tibetan
block at U+0F69. In normal Tibetan or Dzongkha words KSSA
U+0F69 (or the combination U+0F40 U+0FB5) does not occur
- AFAIK it is *only* used when writing Sanskrit words containing
KSSA in Tibetan script.

I had thought that the argument for including KSSA as a separate
character in the Tibetan block (rather than only having U+0F40 and
U+0FB5) was originally for compatibility / cross mapping with
Devanagari and other Indic scripts.

- Chris






Re: Indic Devanagari Query

2003-01-29 Thread Rick McGowan
Aditya Gokhale wrote:

 1. In Marathi and Sanskrit language two characters glyphs of
 'la' and 'sha' are represented differently as shown in the
 image below -

Actually, for everyone's information: these allographs for Marathi were  
recently brought to our attention, and Unicode 4.0 will have a mention of  
the allographs, including pictures of the variant glyphs.

Rick





RE: Suggestions in Unicode Indic FAQ

2003-01-29 Thread Kent Karlsson

Keyur Shroff wrote
 Kent Karlsson [EMAIL PROTECTED] wrote:
  
  A space followed by a dependent vowel sign should display just the
  dependent vowel sign, no dotted circle.  Indeed, (except for a show
  invisibles mode, or a character chart display mode) no (Indic or
  other)
  text that does not contain the *character* DOTTED CIRCLE should ever
  display a dotted circle as part of the displayed text. Systems that
  do display a dotted circle (in normal display mode) where there is
  no such *character* in the displayed text are buggy!
 
 In Indic scripts any sign that appear in text not in 
 conjunction with a
 valid consonant base may be rendered with dotted circle as fallback
 mechanism (Section 5.14 Rendering Nonspacing Marks
 http://www.unicode.org/uni2book/ch05.pdf).

I don't know where you find support for that position in that text.
Can you please quote?  There are no invalid base consonants for
any dependent vowel (for Indic scripts; similarly for any other script).

 Any system implementing this as
 default behaviour should not be considered buggy.

Indeed they are.  And it should certainly not be default behaviour.

Any combining characters can be placed on any base characters without
there being any dotted circles displayed.  In particular, any combining
Devanagari characters (note: including, in principle, several dependent
vowels, even if that does not occur in any (existing) orthography) can
be placed on any Devanagari base character as well as SPACE (and other
punctuation). What should result is a reasonable composed glyph, no
dotted circle in sight (except in show invisibles mode, which I'm not
discussing here). Spelling errors should be indicated otherwise, since
they are of a very different nature.
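
To make the two sequences under discussion concrete, here is a small Java
sketch that builds "SPACE + sign" and "DOTTED CIRCLE + sign" as they would
sit in the backing store. It only constructs and dumps the strings; how a
given font or rendering engine draws them is outside its scope. The class
and helper names are hypothetical.

```java
final class StandaloneSignDemo {
    static final int SPACE = 0x0020;
    static final int DOTTED_CIRCLE = 0x25CC;
    static final int VOWEL_SIGN_E = 0x0947; // DEVANAGARI VOWEL SIGN E

    /** Builds a two-code-point sequence: a base followed by a combining sign. */
    static String onBase(int base, int sign) {
        return new StringBuilder().appendCodePoint(base).appendCodePoint(sign).toString();
    }

    /** Lists the code points of a string as "U+XXXX U+XXXX ...". */
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // The dotted circle is part of the text only in the second sequence.
        System.out.println(hex(onBase(SPACE, VOWEL_SIGN_E)));
        System.out.println(hex(onBase(DOTTED_CIRCLE, VOWEL_SIGN_E)));
    }
}
```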

 For scripts other than Indic scripts, it may be useful to render the
 nonspacing mark without dotted circle because even after 
 rendering it as an
 overlap glyph, the result is recognizable. However, for Indic 
 scripts use
 of dotted circle is very useful as default behaviour since it gives
 immediate feedback to the user that there may be some 
 defective combining
 character in the text. Most of the time such errors are unintentional
 rather than intentional.

No combination of base + combining characters is defective per se.
Even if the scripts are different within the combining sequence.
(Note also that the 0300 block of combining characters is script
independent.) Spelling errors are something else entirely.

 Unicode has provision to remove this dotted circle.

I'm not sure what you are talking about here.

 Space 
 character is used
 to give indication to fallback mechanism that no dotted 
 circle should be
 used while rendering this stand alone sign which is normally 
 attached to
 other characters. This is useful when sometimes user want to 
 display the
 sign without any circle. Also, with this scheme it is 
 possible to show some
 combining marks with dotted circle and some without dotted circle.

The fallback mechanism discussed in section 5.14 of TUS 3.0 is
the use of less than ideal (typographically!) mechanisms to display
an *approximation* of the glyph(s) for the combining sequence.

An exceedingly bad approximation is displaying a dotted circle as a
fake base (again: disregarding show invisibles, or chart modes,
which, however, should be consistent and show a dotted circle fake
base for ALL combining characters occurring in the text).  The use
of this exceedingly bad approximation (in normal display mode) does
in no way indicate that the combining sequence is at all defective.
It may indicate that the display engine (or the font) is defective...

/Kent K





RE: Indic Devanagari Query

2003-01-29 Thread Marco Cimarosti
Christopher John Fynn wrote:
 I had thought that the argument for including KSSA as a separate
 character in the Tibetan block (rather than only having U+0F40 and 
 U+0FB5) was originally for compatibility / cross mapping with 
 Devanagari and other Indic scripts.  

Which is not a valid reason either, considering that U+0F69 and the
combination U+0F40 U+0FB5 are *canonically* equivalent. This means that
normalizing applications are not allowed to treat U+0F69 differently from
U+0F40 U+0FB5, including displaying them differently or mapping them
differently to something else.
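
The equivalence can be checked with a short Java sketch using the standard
java.text.Normalizer class: two strings are canonically equivalent when
their NFD forms are identical. The class and method names below are
invented for this example.

```java
import java.text.Normalizer;

final class KssaEquivalenceDemo {
    static final String PRECOMPOSED = "\u0F69";    // TIBETAN LETTER KSSA
    static final String SEQUENCE = "\u0F40\u0FB5"; // LETTER KA + SUBJOINED LETTER SSA

    /** Two strings are canonically equivalent if their NFD forms match. */
    static boolean canonicallyEquivalent(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFD)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFD));
    }

    public static void main(String[] args) {
        System.out.println(canonicallyEquivalent(PRECOMPOSED, SEQUENCE));
    }
}
```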

_ Marco




RE: Suggestions in Unicode Indic FAQ

2003-01-29 Thread Keyur Shroff

--- Marco Cimarosti [EMAIL PROTECTED] wrote:
 Keyur Shroff wrote:
  But sometimes a user may want visual representation of these 
  symbols in two different ways: with dotted circle and
  without dotted circle.
 
 Why not use a dotted circle character explicitly when you want to see
 one?

Note that whenever I mention the words "combining mark" I am really talking
about vowel signs (matras) and other modifiers in Indic scripts, which are
script dependent. I am sorry if I have confused you with the combining
diacritical marks in the block [U+0300-U+036F], which I really didn't mean.

Let me give a proper example this time. Consider a Vowel Sign E [U+0947]
appearing after any non-consonant character. This sign is generally
attached to consonants. It has zero advance width with a negative left
side bearing in the font. Clearly, since in this case the sign is not
preceded by any consonant base, it has to be rendered using one of the
mechanisms specified for fallback rendering of non-spacing marks. If we
render it with a space, as you said, then we have to insert a space
character at the time of fallback rendering (which can be taken care of in
the rendering pipeline) even though the space character is not present in
the backing store of the application. Now, in order to render it with a
dotted circle, if we introduce the circle in the text before this sign,
then the circle is also an invalid base for this Vowel Sign E. As a result,
fallback rendering will again take place, rendering the circle and the
vowel sign positionally separate. In this case the dotted circle will
appear first, followed by the vowel sign (matra) on top of a space
character.
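
The situation described above (a sign with no consonant before it) can be
detected in the backing store with a sketch like the following Java code.
It is only an illustration: treating "any letter" as a valid base is an
assumption made for this example, not a rule taken from the Unicode
Standard, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class OrphanSignFinder {
    /** True for nonspacing and spacing combining marks (general categories Mn, Mc). */
    static boolean isCombiningSign(int cp) {
        int type = Character.getType(cp);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK;
    }

    /** Assumption for this sketch only: any letter counts as a base. */
    static boolean isBase(int cp) {
        return Character.isLetter(cp);
    }

    /** Returns the char offsets of signs that are not attached to a base. */
    static List<Integer> orphanSignOffsets(String text) {
        List<Integer> offsets = new ArrayList<>();
        boolean haveBase = false;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isCombiningSign(cp)) {
                if (!haveBase) {
                    offsets.add(i); // a sign with nothing suitable before it
                }
            } else {
                haveBase = isBase(cp); // marks that follow attach to this character
            }
            i += Character.charCount(cp);
        }
        return offsets;
    }

    public static void main(String[] args) {
        System.out.println(orphanSignOffsets("\u0915\u0947")); // KA + VOWEL SIGN E
        System.out.println(orphanSignOffsets(" \u0947"));      // SPACE + VOWEL SIGN E
        System.out.println(orphanSignOffsets("\u25CC\u0947")); // DOTTED CIRCLE + VOWEL SIGN E
    }
}
```

Under this assumption both SPACE and DOTTED CIRCLE are reported as
non-bases; whether a renderer should then treat the sign as needing
fallback rendering is exactly the policy question debated in this thread.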

If you know any other way to solve this problem then please explain. Also
let me know if I have misinterpreted the text written in Unicode standard.


 
  Example of
  this could be RAsup on top of dotted circle and RAsup on top of space
  character. Current use of space character to eliminate dotted 
  circle is really painful and may create problems in determining 
  language and syllable boundaries.
 
 Languages or syllable boundaries have nothing to do with this. These
 special
 sequences should *never* be part of any syllable or word in any language:
 they are just a way of showing the shape of a glyph, to be used when,
 e.g., talking about typography or spelling.

Then how can we take care of the fallback mechanism?


Thanks for taking the pains to answer my queries :-)

- Keyur







urban legends just won't go away!

2003-01-29 Thread Barry Caplan
http://archive.devx.com/free/tips/tipview.asp?content_id=4151

Who knew that in this day and age flipping bits to change case is still
publishable? (This is from today!)
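
For contrast, a minimal Java sketch of why the bit trick in the tip quoted
below is ASCII-only, next to the JDK's own case mapping. The class and
method names are made up for this example; the JDK calls are the standard
String.toLowerCase(Locale) and String.toUpperCase(Locale).

```java
import java.util.Locale;

final class CaseMappingDemo {
    /** The bit trick: only does anything for the ASCII letters A-Z. */
    static char asciiOnlyLower(char c) {
        return (c >= 'A' && c <= 'Z') ? (char) (c | 0x20) : c;
    }

    /** Locale-aware, whole-string lowercasing provided by the JDK. */
    static String lower(String s, Locale locale) {
        return s.toLowerCase(locale);
    }

    public static void main(String[] args) {
        // U+00C9 (E with acute) is outside A-Z, so the bit trick leaves it alone.
        System.out.println((int) asciiOnlyLower('\u00C9') == 0x00C9);
        // The JDK maps it to U+00E9.
        System.out.println(lower("\u00C9", Locale.ROOT).equals("\u00E9"));
        // In Turkish, capital I lowercases to dotless i (U+0131), not to 'i'.
        System.out.println(lower("I", Locale.forLanguageTag("tr")).equals("\u0131"));
        // Uppercasing sharp s changes the string length, so a char-by-char
        // mapping cannot express it.
        System.out.println("\u00DF".toUpperCase(Locale.ROOT));
    }
}
```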

Barry Caplan
www.i18n.com
Vendor Showcase: http://Showcase.i18n.com


--

Use Logical Bit Operations to Changing Character Case


This is a simple example demonstrating my own personal method.

// to lower case
  public char lower(int c)
  {
    return (char)((c >= 65 && c <= 90) ? c |= 0x20 : c);
  }

// to upper case
  public char upper(int c)
  {
    return (char)((c >= 97 && c <= 122) ? c ^= 0x20 : c);
  }
/*
 If I would I could create a method for converting an entire
string to lower, like this:
*/
  public String getLowerString(String s)
  {
    char[] c = s.toCharArray();
    char[] cres = new char[s.length()];
    for (int i = 0; i < c.length; ++i)
      cres[i] = lower(c[i]);
    return String.valueOf(cres);
  }
/*
even converting in capital:
*/
  public String capital(String s)
  {
    return String.valueOf(upper(s.toCharArray()[0])).concat(s.substring(1));
  }
/* using it */
  public static void main(String args[])
  {
    x xx = new x();
    System.out.println(xx.getLowerString("LOWER: " + "FRAME"));
    System.out.println(xx.upper('f'));
    System.out.println(xx.capital("randomaccessfile"));
  }