[Bug 27055] Devanagari and Arabic combining character handling

2014-02-20 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=27055

Nik Everett  changed:

   What|Removed |Added

   Keywords||utf8

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 27055] Devanagari and Arabic combining character handling

2014-02-14 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=27055

Chad H.  changed:

   What|Removed |Added

   Assignee|rain...@eunet.rs|wikibugs-l@lists.wikimedia.org



[Bug 27055] Devanagari and Arabic combining character handling

2014-02-14 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=27055

Chad H.  changed:

   What|Removed |Added

 CC||dga...@wikimedia.org,
   ||innocentkil...@gmail.com,
   ||neverett+bugzilla@wikimedia.org
  Component|MWSearch|CirrusSearch

--- Comment #6 from Chad H.  ---
Needs reassessment with Cirrus.



[Bug 27055] Devanagari and Arabic combining character handling

2014-02-13 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=27055

--- Comment #5 from Andre Klapper  ---
(In reply to Dave Ross from comment #1)
> Arabic:
> * Different forms of alif: ا, أ‎, إ‎, ﺁ‎ and ٱ‎‎ should be searchable
> together, e.g. أمس and امس, etc.

Searching for امس (default backend):
https://ar.wikipedia.org/w/index.php?title=%D8%AE%D8%A7%D8%B5%3A%D8%A8%D8%AD%D8%AB&profile=default&search=امس&fulltext=Search&uselang=en
There is a page named "امس" on this wiki.

Searching for أمس (default backend):
https://ar.wikipedia.org/w/index.php?title=%D8%AE%D8%A7%D8%B5%3A%D8%A8%D8%AD%D8%AB&profile=default&search=أمس&fulltext=Search&uselang=en
Create the page "أمس" on this wiki!

Searching for أمس (with srbackend=CirrusSearch):
https://ar.wikipedia.org/w/index.php?title=%D8%AE%D8%A7%D8%B5%3A%D8%A8%D8%AD%D8%AB&profile=default&search=أمس&fulltext=Search&uselang=en&srbackend=CirrusSearch
Create the page "أمس" on this wiki!
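
For illustration only: the two queries above differ in a single code point
(U+0623 versus U+0627), which would explain the different results. A minimal
Python sketch of the alif folding requested in comment #1 might look like the
following. The fold table and the name fold_alif are assumptions made for this
example; this is not CirrusSearch or MWSearch code.

```python
import unicodedata

# Hypothetical fold table: alif variants from comment #1 -> bare alif (U+0627).
ALIF_VARIANTS = {
    "\u0623": "\u0627",  # alef with hamza above
    "\u0625": "\u0627",  # alef with hamza below
    "\u0622": "\u0627",  # alef with madda above
    "\u0671": "\u0627",  # alef wasla
}
FOLD = str.maketrans(ALIF_VARIANTS)


def fold_alif(text: str) -> str:
    """Map the alif variants in ALIF_VARIANTS to a bare alif."""
    # NFKC first, so that presentation forms such as U+FE81 are turned
    # into their base letters (U+0622) before the table is applied.
    return unicodedata.normalize("NFKC", text).translate(FOLD)


if __name__ == "__main__":
    with_hamza = "\u0623\u0645\u0633"  # the query that found no page
    bare = "\u0627\u0645\u0633"        # the query that found a page
    print(fold_alif(with_hamza) == fold_alif(bare))
```

If both the index and the query were folded this way, the two searches would
resolve to the same term.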



[Bug 27055] Devanagari and Arabic combining character handling

2012-01-05 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=27055

--- Comment #4 from Siddhartha Ghai 2012-01-06 06:32:59 UTC ---
Bug 33548 is related to this. It's about the appearance of Devanagari diacritics
in the "did you know" results.



[Bug 27055] Devanagari and Arabic combining character handling

2012-01-05 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=27055

Siddhartha Ghai  changed:

   What|Removed |Added

 CC||siddhartha.g...@gmail.com

--- Comment #3 from Siddhartha Ghai 2012-01-06 06:28:55 UTC ---
(In reply to comment #1)
> The discussion can be seen here, but here are the diacritics and characters
> provided to me:
> 
> 
> Hindi:
> First of all, the pairs with nuqta (a dot underneath) and without it should be
> searchable the same way Roman letters with diacritics and without are
> searchable.
> * क़/क ख़/ख ग़/ग ज़/ज फ़/फ ड़/ड ढ़/ढ 
> The letters are not identical, but they should match in search, so that if a
> user typed खून, ख़ून would also be listed.
> * Words containing diacritics ॉ (candra), ् (virama) should be equal to
> those without them: चॉकलेट / चाकलेट, सन् / सन. This is similar to the way
> English entries with a space are equal to those with a hyphen (-).
> 
> Arabic:
> * Different forms of alif: ا, أ‎, إ‎, ﺁ‎ and ٱ‎‎ should be searchable
> together, e.g. أمس and امس, etc.
> * Words containing any of these diacritics could be searchable as if they
> don't have them and the other way around: 
> ـَ fatHa, ـِ kasra, ـُ Damma, ـْ sukuun, ـّ shadda, ـٰ dagger 'alif. 
> 
> * ـٌ tanwiin al-Damm (تنوين الضم) 
> * ـٍ tanwiin al-kasr (تنوين الكسر) 
> * ـً tanwiin al-fatH (تنوين الفتح) 
> 
> Persian often uses a zero-width nonjoiner (U+200C) as in ویکی‌پدیا. People
> who don’t know how to access it tend to substitute a space: ویکی پدیا. It’s a
> misspelling, but lots of people can’t help it.
> 
> In languages like Khmer and Thai that do not use word spaces, there is often a
> zero-width space (U+200B) as in តើអ្នកនិយាយ​ភាសាអង់គ្លេស​ទេ. More often
> than not, it is simply left out (តើអ្នកនិយាយភាសាអង់គ្លេសទេ). Both spellings
> are correct.
> 
> I think Anatoli neglected to mention the word-final Arabic pair ه/ة. The final
> letter ة may be typed as ه.

Actually चॉकलेट can also be written as चौकलेट or चोकलेट. However, everything
other than चॉकलेट is grammatically incorrect. If an equivalence is to be added,
it should be between चॉकलेट and चौकलेट, not चाकलेट, because the latter would
also introduce a lot of unwanted equivalences, such as हॉल (hall) and हाल (the
condition someone is in).

The handling of halant/virama is correctly stated as an equivalence. However,
there is more to it. Five characters in Hindi, when followed by a halant, can be
replaced by an anuswara on the preceding character. All five represent nasal
sounds, which an anuswara can also represent. For example: सङ्गीत/संगीत,
सम्वत/संवत.

The five characters are ङ ञ ण न म

But not all cases of anuswara can be equated to each one, since each has a
different sound.
There is a grammatical rule which decides this. The rule depends on the
character next to these five characters. On a case basis:

क ख ग घ are preceded by ङ
च छ ज झ are preceded by ञ
ट ठ ड ढ are preceded by ण
त थ द ध are preceded by न
प फ ब भ are preceded by म

Note that this is similar to the Unicode code point order: the four letters
come, in the stated order, before their respective nasal letter.

So, if I type in सन्, I would expect संतान to show up, but not संभव.

However, this restriction describes the ideal case with perfect grammar. In
actual usage, न् is also used in place of ङ्, ञ् and ण्, but not म्, since that
is an entirely different sound. So, if I type in सन्, I would also expect संगीत,
संजय, संडे to show up, but still not संभव. I hope this is clear enough.

PS: The nuqta stuff is correct.
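
The varga rule described in this comment can be sketched roughly as below.
This is a minimal Python illustration under stated assumptions: the function
name to_anusvara and the lenient flag are hypothetical, only whole words are
handled (not prefixes such as a bare सन्), and it is not code from any search
backend.

```python
VIRAMA = "\u094D"    # halant
ANUSVARA = "\u0902"

# Nasal -> the four consonants of its row, as listed in the comment above.
VARGA = {
    "ङ": "कखगघ",
    "ञ": "चछजझ",
    "ण": "टठडढ",
    "न": "तथदध",
    "म": "पफबभ",
}


def to_anusvara(word: str, lenient: bool = False) -> str:
    """Rewrite nasal + halant as an anusvara where the rule allows it.

    Strict mode only accepts a nasal before a consonant of its own row.
    Lenient mode additionally accepts न before the rows of ङ, ञ and ण,
    reflecting the actual usage described above. म is never widened.
    """
    out = []
    i = 0
    while i < len(word):
        ch = word[i]
        following = word[i + 2] if i + 2 < len(word) else ""
        if ch in VARGA and word[i + 1:i + 2] == VIRAMA and following:
            allowed = VARGA[ch]
            if lenient and ch == "न":
                allowed += VARGA["ङ"] + VARGA["ञ"] + VARGA["ण"]
            if following in allowed:
                out.append(ANUSVARA)  # replaces both the nasal and the halant
                i += 2
                continue
        out.append(ch)
        i += 1
    return "".join(out)


if __name__ == "__main__":
    print(to_anusvara("सङ्गीत"))
    print(to_anusvara("सन्गीत", lenient=True))
```

With these tables, न followed by a halant and भ is left unchanged in both
modes, which matches the expectation that सन् should not lead to संभव.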



[Bug 27055] Devanagari and Arabic combining character handling

2011-09-06 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=27055

Niklas Laxström  changed:

   What|Removed |Added

 CC||niklas.laxst...@gmail.com

--- Comment #2 from Niklas Laxström 2011-09-06 11:28:46 UTC ---
Just adding a note that stripping diacritics from Latin letters is not always
the correct thing to do. We clearly need to support different models for
different languages.
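
As a rough sketch of what per-language models could mean in practice, the
Python example below strips combining marks from Latin text except for letters
that a given language treats as distinct. The PRESERVE table, the choice of
Finnish as an example, and the name fold_latin are assumptions made for
illustration only.

```python
import unicodedata

# Hypothetical per-language settings: letters that must survive folding.
# Finnish is used as an example of a language where ä, ö and å are
# separate letters rather than accented variants.
PRESERVE = {
    "en": set(),
    "fi": set(unicodedata.normalize("NFC", "äöåÄÖÅ")),
}


def fold_latin(text: str, lang: str) -> str:
    """Strip combining marks, except on letters preserved for `lang`."""
    keep = PRESERVE.get(lang, set())
    out = []
    for ch in unicodedata.normalize("NFC", text):
        if ch in keep:
            out.append(ch)
            continue
        # Decompose the character and drop its combining marks.
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed
                           if not unicodedata.combining(c)))
    return "".join(out)


if __name__ == "__main__":
    print(fold_latin("näkö", "en"))
    print(fold_latin("näkö", "fi"))
```

The same word is folded for English but left intact for Finnish, while other
accents are still stripped in both cases.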



[Bug 27055] Devanagari and Arabic combining character handling

2011-09-06 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=27055

Santhosh Thottingal  changed:

   What|Removed |Added

   Keywords||i18n
 CC||santhosh.thottingal@gmail.com



[Bug 27055] Devanagari and Arabic combining character handling

2011-02-05 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=27055

--- Comment #1 from Dave Ross 2011-02-05 15:41:31 UTC ---
The discussion can be seen here, but here are the diacritics and characters
provided to me:


Hindi:
First of all, the pairs with nuqta (a dot underneath) and without it should be
searchable the same way Roman letters with diacritics and without are
searchable.
* क़/क ख़/ख ग़/ग ज़/ज फ़/फ ड़/ड ढ़/ढ 
The letters are not identical, but they should match in search, so that if a
user typed खून, ख़ून would also be listed.
* Words containing diacritics ॉ (candra), ् (virama) should be equal to
those without them: चॉकलेट / चाकलेट, सन् / सन. This is similar to the way
English entries with a space are equal to those with a hyphen (-).

Arabic:
* Different forms of alif: ا, أ‎, إ‎, ﺁ‎ and ٱ‎‎ should be searchable
together, e.g. أمس and امس, etc.
* Words containing any of these diacritics could be searchable as if they
don't have them and the other way around: 
ـَ fatHa, ـِ kasra, ـُ Damma, ـْ sukuun, ـّ shadda, ـٰ dagger 'alif. 

* ـٌ tanwiin al-Damm (تنوين الضم) 
* ـٍ tanwiin al-kasr (تنوين الكسر) 
* ـً tanwiin al-fatH (تنوين الفتح) 

Persian often uses a zero-width nonjoiner (U+200C) as in ویکی‌پدیا. People
who don’t know how to access it tend to substitute a space: ویکی پدیا. It’s a
misspelling, but lots of people can’t help it.

In languages like Khmer and Thai that do not use word spaces, there is often a
zero-width space (U+200B) as in តើអ្នកនិយាយ​ភាសាអង់គ្លេស​ទេ. More often
than not, it is simply left out (តើអ្នកនិយាយភាសាអង់គ្លេសទេ). Both spellings are
correct.

I think Anatoli neglected to mention the word-final Arabic pair ه/ة. The final
letter ة may be typed as ه.
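
Several of the requests above can be sketched as a single folding step. The
Python example below is only an illustration under stated assumptions: the
name fold_for_search is hypothetical, the alif forms and the ه/ة pair are
deliberately left out, and treating the zero-width nonjoiner as a space is one
possible reading of the Persian case, not an established rule.

```python
import unicodedata

NUKTA = "\u093C"   # Devanagari nuqta
ZWNJ = "\u200C"    # zero-width nonjoiner
ZWSP = "\u200B"    # zero-width space

# Arabic marks named above: the three tanwin forms, fatha, damma, kasra,
# shadda and sukun (U+064B to U+0652), plus dagger alif (U+0670).
ARABIC_MARKS = {chr(c) for c in range(0x064B, 0x0653)} | {"\u0670"}

DROP = ARABIC_MARKS | {NUKTA, ZWSP}


def fold_for_search(text: str) -> str:
    """Fold nuqta, Arabic vowel marks and zero-width characters."""
    # Decompose first, so that precomposed nuqta letters such as U+0959
    # expose their nuqta as a separate code point.
    text = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in text if ch not in DROP)
    # Assumption: a ZWNJ and a substituted space should compare equal.
    text = text.replace(ZWNJ, " ")
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    with_nuqta = "\u0916\u093C\u0942\u0928"
    without_nuqta = "\u0916\u0942\u0928"
    print(fold_for_search(with_nuqta) == without_nuqta)
```

Applying the same function to both indexed text and queries would make the
pairs listed above compare equal, at the cost of also merging words that differ
only in these marks.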
