At 11:54 PM 1/28/03 -0800, Keyur Shroff wrote:
--- Aditya Gokhale <[EMAIL PROTECTED]> wrote:

>
> 2. Implementation Query -
>     In an implementation where I need to send / process Hindi, Marathi
> and Sanskrit data, how do I differentiate between languages (Hindi,
> Marathi and Sanskrit). Say for example, I am writing a translation
> engine, and I want to translate a document having Hindi, Marathi and
> Sanskrit Text in it, how do I know from the code points between 0x0900
> and 0x097F, that the data under perusal is Hindi / Marathi / Sanskrit ?

Unicode is not divided into code pages. Unlike some older encodings, there is
only one code page for the entire Unicode standard. However, for better
readability and quick reference, the code charts have been divided into
different sections, which you might interpret as code pages.

This seems similar to the question: how can one tell from text using characters in the ranges 0020-007E and 00A0-00FF whether the text is in Danish, German or French?

It turns out that there are several kinds of approaches. One class of approaches looks at the different distribution of letters for the different languages. Letter frequency, pair and triplet distribution, and in some sense also the 'short word' method are all of this type. For languages that use the same script, but that are otherwise not too similar, such methods work well.
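
To make the first class of approaches concrete, here is a minimal sketch of trigram-based guessing for the Danish / German / French case. The function names, the toy training sentences and the choice of cosine similarity are all assumptions made for illustration; a usable profile would be built from far more text per language.

```python
from collections import Counter
import math

def trigram_profile(text):
    """Count overlapping character trigrams, padded with a space at each end."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine_similarity(a, b):
    """Cosine of the angle between two trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def guess_language(text, profiles):
    """Return the language whose profile is most similar to the text."""
    sample = trigram_profile(text)
    return max(profiles, key=lambda lang: cosine_similarity(sample, profiles[lang]))

# Toy training samples (hypothetical; far too small for real use).
TRAINING = {
    "da": "det er en god dag og jeg er glad for at være her i dag",
    "de": "das ist ein guter Tag und ich bin froh heute hier zu sein",
    "fr": "c'est une bonne journée et je suis content d'être ici aujourd'hui",
}
PROFILES = {lang: trigram_profile(text) for lang, text in TRAINING.items()}

print(guess_language("ich bin heute hier", PROFILES))
```

The same code works unchanged on Devanagari text, since it only compares sequences of code points; what changes is the training data.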

Another class of approaches uses unique letters and other unique features of each language to make the distinction. Some aspects of the short word method could be classed here as well.
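
The second class of approaches can be sketched as a simple lookup of telltale letters. The letter sets below are illustrative assumptions, deliberately incomplete, and limited to the 00A0-00FF range; they are not an authoritative list for any of the three languages.

```python
# Hypothetical "telltale" letters per language (illustrative, incomplete).
DISTINCTIVE = {
    "da": set("æøå"),
    "de": set("ßäü"),
    "fr": set("çèêàù"),
}

def languages_by_distinctive_letters(text, table=DISTINCTIVE):
    """Return the languages whose telltale letters occur in the text."""
    found = set(text.lower())
    return sorted(lang for lang, letters in table.items() if letters & found)

print(languages_by_distinctive_letters("smørrebrød"))
print(languages_by_distinctive_letters("hello"))
```

An empty result means the text contains none of the listed letters, so a statistical method would still be needed.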

In the case at hand, if all three languages share the same alphabet in full, then the first class of methods must be used. Automatic recognition of a language can be combined with keeping track of the keyboard layout used to type a document and other information about a user's or document's context in order to make determination of the language reliable and easy for the user.


>     I would suggest that we should give different code pages for Marathi,
> Hindi and Sanskrit. Maybe the current code page of Devanagari can be treated
> as Hindi and two new code pages for Marathi and Sanskrit be added. This
> could solve these issues. If there is any better way of solving this,
> anyone please suggest.

There are 6000 languages (or so). If all were written, with an average of 100 characters each, encoding each character separately for each language would mean 600,000 characters. It is true that many languages are not written, but many writing systems used for a variety of languages have many more than 100 symbols.

Allowing each language a duplicate copy of the script it uses means forcing everybody to know precisely what language each word is in; otherwise the character codes are wrong. And what about quoted foreign words, borrowed foreign words, or foreign words that are almost, but not quite, assimilated? Which 'code pages' would these use?

Finally, it would then be necessary to cross-correlate these 'code pages' for some forms of searching. With possibly up to 6,000 of them, there are nearly 18 million possible pairwise correlations between all these code pages. In short, that was the problem that Unicode was invented to solve in the first place.
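
For anyone who wants to check the arithmetic, a short sketch follows. The inputs are the rough figures used above (6000 languages, 100 characters each), not measured data, and the variable names are my own.

```python
import math

languages = 6000          # rough figure used above
chars_per_language = 100  # assumed average used above

# One private copy of every character for every language.
duplicated_characters = languages * chars_per_language

# Number of unordered pairs of per-language code pages to cross-correlate.
code_page_pairs = math.comb(languages, 2)

print(duplicated_characters)  # 600000
print(code_page_pairs)        # 17997000
```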

Sharing a single encoding of a script across all languages that use it is usually not a problem. The technology that can handle the fine details of display and other issues that arise from this approach exists (e.g. OpenType for fonts and the rendering engines that support the relevant features).

> 3. Character codes for jna, shra, ksh -
>
> In Sanskrit and Marathi jna, shra and ksh are considered as separate
> characters and not ligatures. How do we take care of this ? Can I get
> over all views on the matter from the group ? In my opinion they should
> be given different code points in the specific language code page.
> Please find below the character glyphs -
>
> jna
> shra
> ksh

All of the above can be composed through the following consonant clusters:
  jna -> ja halant nya
  shra -> sha halant ra
  ksh -> ka halant ssha
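
In terms of code points, the three sequences above can be written out as in the sketch below. The dictionary and helper names are assumptions for illustration; note that the Unicode character names are VIRAMA for halant and SSA for the letter written "ssha" above.

```python
import unicodedata

VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA (halant)

# Consonant + virama + consonant sequences for the three clusters.
CLUSTERS = {
    "jna": "\u091C" + VIRAMA + "\u091E",   # JA + virama + NYA
    "shra": "\u0936" + VIRAMA + "\u0930",  # SHA + virama + RA
    "ksh": "\u0915" + VIRAMA + "\u0937",   # KA + virama + SSA
}

def describe(sequence):
    """List the Unicode character name of each code point in the sequence."""
    return [unicodedata.name(ch) for ch in sequence]

for label, sequence in CLUSTERS.items():
    print(label, sequence, describe(sequence))
```

Whether each sequence displays as a single conjunct glyph depends on the font and rendering engine, not on the encoding.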

The point that the above sequences are considered as characters in some of
the Indian languages has merit. If there is demand from native speakers
then a proposal can be submitted to Unicode. There is a predefined
procedure for proposal submission. Once this is discussed with the people
concerned and agreed upon, these ligatures can be added to the Devanagari
script itself, because the Devanagari script represents all three languages
you mentioned, namely Sanskrit, Marathi, and Hindi. Meanwhile you can write
rules for composing them from the consonant clusters.

I wouldn't go so far. The fact that clusters belong together is something that can be handled by the software. Collation and other data processing need to deal with such issues already for many other languages. See http://www.unicode.org/reports/tr10 on the collation algorithm.
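
As a minimal sketch of what "handled by the software" can mean, the function below splits a string into units, keeping the three clusters together as single letters. The list of clusters and the function name are hypothetical; this is similar in spirit to treating a sequence as one unit for sorting, but it is not an implementation of the collation algorithm itself.

```python
# Sequences to be treated as single letters (hypothetical tailoring list).
CONJUNCT_LETTERS = (
    "\u091C\u094D\u091E",  # jna:  JA + virama + NYA
    "\u0936\u094D\u0930",  # shra: SHA + virama + RA
    "\u0915\u094D\u0937",  # ksh:  KA + virama + SSA
)

def split_units(text, units=CONJUNCT_LETTERS):
    """Split text into code points, keeping the listed clusters whole."""
    result = []
    i = 0
    while i < len(text):
        for unit in units:
            if text.startswith(unit, i):
                result.append(unit)
                i += len(unit)
                break
        else:
            # No listed cluster starts here, so emit a single code point.
            result.append(text[i])
            i += 1
    return result

# KA + virama + SSA, then MA, then vowel sign AA.
print(split_units("\u0915\u094D\u0937\u092E\u093E"))
```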

A./
