[Bug 27055] Devanagari and Arabic combining character handling

bugzilla-daemon Thu, 05 Jan 2012 22:29:05 -0800

https://bugzilla.wikimedia.org/show_bug.cgi?id=27055


Siddhartha Ghai <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[email protected]

--- Comment #3 from Siddhartha Ghai <[email protected]> 2012-01-06 06:28:55 UTC ---
(In reply to comment #1)
> The discussion can be seen here, but here are the diacritics and characters
> provided to me:
> 
> 
> Hindi:
> First of all, the pairs with nuqta (a dot underneath) and without it should be
> searchable the same way Roman letters with diacritics and without are
> searchable.
>     * क़/क ख़/ख ग़/ग ज़/ज फ़/फ ड़/ड ढ़/ढ 
> The letters are not identical, but they should be treated as equivalent so
> that if a user typed खून, ख़ून would also be listed.
>     * Words containing diacritics ॉ (candra), ् (virama) should be equal to
> those without them: चॉकलेट / चाकलेट, सन् / सन. Similar to the way English 
> word entries with a space are equal to those having a hyphen (-) between them.
> ----
> Arabic:
>     * Different forms of alif: ا, أ‎, إ‎, ﺁ‎ and ٱ‎‎ should be searchable
> together, e.g. أمس and امس, etc.
>     * Words containing any of these diacritics could be searchable as if they
> don't have them and the other way around: 
> ـَ fatHa, ـِ kasra, ـُ Damma, ـْ sukuun, ـّ shadda, ـٰ dagger 'alif. 
> ----
>     * ـٌ tanwiin al-Damm (تنوين الضم) 
>     * ـٍ tanwiin al-kasr (تنوين الكسر) 
>     * ـً tanwiin al-fatH (تنوين الفتح) 
> ----
> Persian often uses a zero-width nonjoiner (& # x200C;) as in ویکی‌پدیا. People
> who don’t know how to access it tend to substitute a space: ویکی پدیا. It’s a
> misspelling, but lots of people can’t help it.
> 
> In languages like Khmer and Thai that do not use word spaces, there is often a
> zero-width space (& # x200B;) as in តើអ្នកនិយាយភាសាអង់គ្លេសទេ. More often
> than not, it is simply left out (តើអ្នកនិយាយភាសាអង់គ្លេសទេ). Both spellings 
> are correct.
> 
> I think Anatoli neglected to mention the word-final Arabic pair ه/ة. The final
> letter ة may be typed as ه.

Actually, चॉकलेट can also be written as चौकलेट or चोकलेट. However, every
spelling other than चॉकलेट is grammatically incorrect. Still, if an equivalence
is to be added, it should be between चॉकलेट and चौकलेट, not चाकलेट. The reason
is that equating ॉ with ा would also introduce many unwanted equivalences, such
as हॉल (hall) and हाल (the condition someone is in).

The handling of halant/viram is correctly stated as an equivalence. However,
there is more to it. Five characters in Hindi, when followed by a halant, can
be replaced by an anuswara on the preceding character. All five represent nasal
sounds, which an anuswara can represent. For example: सङ्गीत/संगीत, सम्वत/संवत.

The five characters are ङ ञ ण न म

However, an anuswara cannot be equated to all five in every case, since each
has a different sound.
A grammatical rule decides which one applies. The rule depends on the character
that follows the nasal. Case by case:

क ख ग घ are preceded by ङ
च छ ज झ are preceded by ञ
ट ठ ड ढ are preceded by ण
त थ द ध are preceded by न
प फ ब भ are preceded by म

Note that this mirrors the Unicode code point order. The four letters of each
group come, in the stated order, immediately before the respective nasal letter.

So, if I type in सन्, I would expect संतान to show up, but not संभव.
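
As a rough illustration of the rule above (a sketch only, not a description of
how the search backend works; the names nasal_for and expand_anusvara are made
up for this example), the nasal of each group can be derived from the code
point order, and an anuswara can be expanded into the explicit nasal plus
halant:

```python
VIRAMA = "\u094d"    # halant/viram
ANUSVARA = "\u0902"  # anuswara

# Code points of the first letter of each group: क च ट त प.
# Each group is four consonants followed directly by its nasal.
GROUP_STARTS = [0x0915, 0x091A, 0x091F, 0x0924, 0x092A]


def nasal_for(consonant):
    """Return the nasal of the group `consonant` belongs to, else None."""
    cp = ord(consonant)
    for start in GROUP_STARTS:
        if start <= cp <= start + 3:
            return chr(start + 4)
    return None


def expand_anusvara(word):
    """Replace each anuswara by nasal + halant when the next letter decides it."""
    out = []
    for i, ch in enumerate(word):
        nxt = word[i + 1] if i + 1 < len(word) else ""
        nasal = nasal_for(nxt) if (ch == ANUSVARA and nxt) else None
        out.append(nasal + VIRAMA if nasal else ch)
    return "".join(out)
```

With this, संतान expands to सन्तान, which starts with सन्, while संभव expands
to सम्भव, which does not. An anuswara before a letter outside the five groups
(for example व) is left unchanged, which is an assumption of this sketch.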

However, restricting the equivalence this way is the ideal case with perfect
grammar. In actual usage, न् is also used in place of ङ्, ञ् and ण्, but not
म्, since that is an entirely different sound. So, if I type in सन्, I would
also expect संगीत, संजय and संडे to show up, but still not संभव. I hope I have
explained this clearly enough.
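
The looser, real-world matching could be sketched roughly as follows. This is
only an illustration under assumptions: search_key and matches are hypothetical
names, prefix matching is chosen just for the example, and the folding treats
ङ्, ञ् and ण् as न् while keeping म् apart, as described above.

```python
VIRAMA = "\u094d"    # halant/viram
ANUSVARA = "\u0902"  # anuswara
NA = "\u0928"        # न
MA = "\u092e"        # म
LOOSE_NASALS = {"\u0919", "\u091e", "\u0923"}  # ङ ञ ण

# Nasal that an anuswara stands for before each consonant, after folding.
NASAL_BEFORE = {}
for first in (0x0915, 0x091A, 0x091F, 0x0924):  # groups of क च ट त
    for cp in range(first, first + 4):
        NASAL_BEFORE[chr(cp)] = NA
for cp in range(0x092A, 0x092E):                # प फ ब भ
    NASAL_BEFORE[chr(cp)] = MA


def search_key(text):
    """Fold anuswara and ङ्/ञ्/ण् to न्, but keep म् distinct."""
    out = []
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch == ANUSVARA and nxt in NASAL_BEFORE:
            out.append(NASAL_BEFORE[nxt] + VIRAMA)
        elif ch in LOOSE_NASALS and nxt == VIRAMA:
            out.append(NA)  # the halant itself is copied on the next step
        else:
            out.append(ch)
    return "".join(out)


def matches(query, word):
    """True if `word` starts with `query` once both are folded."""
    return search_key(word).startswith(search_key(query))
```

Here matches("सन्", word) is true for संतान, संगीत, संजय and संडे, and false
for संभव. The nasals are folded only when a halant follows, so a plain ण inside
a word is not touched.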

PS: The nuqta stuff is correct.
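
For what it is worth, the nuqta equivalence could be sketched like this. It is
a minimal example under assumptions, not an existing search feature: fold_nuqta
is a hypothetical name, and it relies on Unicode canonical decomposition (NFD)
splitting the precomposed nuqta letters into base letter plus nuqta.

```python
import unicodedata

NUQTA = "\u093c"  # Devanagari sign nukta


def fold_nuqta(text):
    """Return `text` with every Devanagari nuqta removed.

    NFD first splits precomposed letters such as U+0959 into the base
    letter followed by U+093C, so both spellings fold to the same string.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return decomposed.replace(NUQTA, "")
```

Both encodings of ख़ून then fold to खून. Note that NFD also decomposes accented
letters in other scripts, so a real implementation would probably restrict the
step to Devanagari text.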

-- 
Configure bugmail: https://bugzilla.wikimedia.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
_______________________________________________
Wikibugs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
