[Bug 27055] Devanagari and Arabic combining character handling
https://bugzilla.wikimedia.org/show_bug.cgi?id=27055

Nik Everett changed:

What     |Removed |Added
Keywords |        |utf8

--
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
[Bug 27055] Devanagari and Arabic combining character handling
https://bugzilla.wikimedia.org/show_bug.cgi?id=27055

Chad H. changed:

What     |Removed          |Added
Assignee |rain...@eunet.rs |wikibugs-l@lists.wikimedia.org
[Bug 27055] Devanagari and Arabic combining character handling
https://bugzilla.wikimedia.org/show_bug.cgi?id=27055

Chad H. changed:

What      |Removed  |Added
CC        |         |dga...@wikimedia.org, innocentkil...@gmail.com, neverett+bugzilla@wikimedia.org
Component |MWSearch |CirrusSearch

--- Comment #6 from Chad H. ---
Needs reassessment with Cirrus.
[Bug 27055] Devanagari and Arabic combining character handling
https://bugzilla.wikimedia.org/show_bug.cgi?id=27055

--- Comment #5 from Andre Klapper ---
(In reply to Dave Ross from comment #1)
> Arabic:
> * Different forms of alif: ا, أ, إ, ﺁ and ٱ should be searchable
> together, e.g. أمس and امس, etc.

امس : search=امس
https://ar.wikipedia.org/w/index.php?title=%D8%AE%D8%A7%D8%B5%3A%D8%A8%D8%AD%D8%AB&profile=default&search=امس&fulltext=Search&uselang=en
There is a page named "امس" on this wiki.

أمس :
https://ar.wikipedia.org/w/index.php?title=%D8%AE%D8%A7%D8%B5%3A%D8%A8%D8%AD%D8%AB&profile=default&search=أمس&fulltext=Search&uselang=en
Create the page "أمس" on this wiki!

أمس :
https://ar.wikipedia.org/w/index.php?title=%D8%AE%D8%A7%D8%B5%3A%D8%A8%D8%AD%D8%AB&profile=default&search=أمس&fulltext=Search&uselang=en&srbackend=CirrusSearch
Create the page "أمس" on this wiki!
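The alif equivalence requested above can be illustrated with a small folding function. This is only a sketch of the general idea, not how MWSearch or CirrusSearch handle it. The names ALIF_VARIANTS and fold_alif are made up for this example, and applying NFKC first (to turn presentation forms such as U+FE81 into ordinary letters) is an assumption of this sketch.

```python
import unicodedata

# Alif variants named in the bug report, each mapped to bare alif (U+0627).
# Hypothetical table for illustration only.
ALIF_VARIANTS = str.maketrans({
    "\u0623": "\u0627",  # alif with hamza above
    "\u0625": "\u0627",  # alif with hamza below
    "\u0622": "\u0627",  # alif with madda above
    "\u0671": "\u0627",  # alif wasla
})


def fold_alif(text: str) -> str:
    """Return text with the listed alif variants replaced by bare alif."""
    # NFKC maps Arabic presentation forms (for example U+FE81) to the
    # ordinary letters and composes alif + combining hamza/madda.
    normalized = unicodedata.normalize("NFKC", text)
    return normalized.translate(ALIF_VARIANTS)
```

If both the indexed title and the query were passed through such a function, the two spellings tested above would produce the same key.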
[Bug 27055] Devanagari and Arabic combining character handling
https://bugzilla.wikimedia.org/show_bug.cgi?id=27055

--- Comment #4 from Siddhartha Ghai 2012-01-06 06:32:59 UTC ---
Bug 33548 is related to this. It's about the appearance of Devanagari diacritics in the "did you know" results.
[Bug 27055] Devanagari and Arabic combining character handling
https://bugzilla.wikimedia.org/show_bug.cgi?id=27055

Siddhartha Ghai changed:

What |Removed |Added
CC   |        |siddhartha.g...@gmail.com

--- Comment #3 from Siddhartha Ghai 2012-01-06 06:28:55 UTC ---
(In reply to comment #1)
> The discussion can be seen here, but here are the diacritics and characters
> provided to me:
>
> Hindi:
> First of all, the pairs with nuqta (a dot underneath) and without it should be
> searchable the same way Roman letters with diacritics and without are
> searchable.
> * क़/क ख़/ख ग़/ग ज़/ज फ़/फ ड़/ड ढ़/ढ
> The letters are not identical but So that if a user typed खून, ख़ून would also
> be listed.
> * Words containing diacritics ॉ (candra), ् (virama) should be equal to
> those without them: चॉकलेट / चाकलेट, सन् / सन. Similar to the way English
> words entries with a space are equal to those having a hyphen (-) between
> them.
>
> Arabic:
> * Different forms of alif: ا, أ, إ, ﺁ and ٱ should be searchable
> together, e.g. أمس and امس, etc.
> * Words containing any of these diacritics could be searchable as if they
> don't have them and the other way around:
> ـَ fatHa, ـِ kasra, ـُ Damma, ـْ sukuun, ـّ shadda, ـٰ dagger 'alif.
>
> * ـٌ tanwiin al-Damm (تنوين الضم)
> * ـٍ tanwiin al-kasr (تنوين الكسر)
> * ـً tanwiin al-fatH (تنوين الفتح)
>
> Persian often uses a zero-width nonjoiner (&#x200C;) as in ویکیپدیا. People
> who don’t know how to access it tend to substitute a space: ویکی پدیا. It’s a
> misspelling, but lots of people can’t help it.
>
> In languages like Khmer and Thai that do not use word spaces, there is often a
> zero-width space (&#x200B;) as in តើអ្នកនិយាយភាសាអង់គ្លេសទេ. More often
> than not, it is simply left out (តើអ្នកនិយាយភាសាអង់គ្លេសទេ). Both spellings
> are correct.
>
> I think Anatoli neglected to mention the word-final Arabic pair ه/ة. The final
> letter ة may be typed as ه.

Actually, चॉकलेट can also be written as चौकलेट or चोकलेट. However, everything other than चॉकलेट is grammatically incorrect. But if equivalence is to be added, it should be between चॉकलेट and चौकलेट, not चाकलेट. The reason is that equating ॉ with ा would introduce a lot of unwanted equivalences as well, like हॉल (hall) and हाल (the condition someone is in).

The handling for halant/viram is correctly stated as equivalence. However, there is more to it. Five characters in Hindi, when followed by a halant, can be replaced by an anuswara on the preceding character. All five represent nasal sounds, which can be represented by an anuswara. For example: सङ्गीत/संगीत, सम्वत/संवत.

The five characters are ङ ञ ण न म.

But not every anuswara can be equated to each of them, since each has a different sound. A grammatical rule decides which one applies, and the rule depends on the character that follows the nasal. Case by case:

क ख ग घ are preceded by ङ
च छ ज झ are preceded by ञ
ट ठ ड ढ are preceded by ण
त थ द ध are preceded by न
प फ ब भ are preceded by म

Note that this is similar to the UTF-8 encoding order (Unicode code point order): the four letters come, in the stated order, before their respective nasal letter.

So, if I type in सन्, I would expect संतान to show up, but not संभव.

However, this restriction is the ideal case with perfect grammar. In actual usage, न् has also been used in place of ङ्, ञ् and ण्, but not म्, since that is an entirely different sound. So, if I type in सन्, I would also expect संगीत, संजय, संडे to show up, but still not संभव.

I hope this is clear enough.

PS: The nuqta stuff is correct.
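The nasal rule described in comment #3 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not an existing MediaWiki or search-backend feature. VARGA, allowed_followers, to_anusvara and prefix_matches are hypothetical names. The lenient flag models the "actual usage" case in which न् also stands in for ङ्, ञ् and ण्. The सम्वत/संवत example is not covered, because व is not in the five-row table.

```python
VIRAMA = "\u094D"    # halant
ANUSVARA = "\u0902"

# Nasal consonant -> the four consonants of its class, per the table above.
VARGA = {
    "ङ": "कखगघ",
    "ञ": "चछजझ",
    "ण": "टठडढ",
    "न": "तथदध",
    "म": "पफबभ",
}


def allowed_followers(nasal, lenient):
    """Consonants before which nasal + halant may be written as anusvara."""
    followers = VARGA[nasal]
    if lenient and nasal == "न":
        # Common usage: न् is also written where ङ्, ञ् or ण् is correct.
        followers += VARGA["ङ"] + VARGA["ञ"] + VARGA["ण"]
    return followers


def to_anusvara(text, lenient=False):
    """Rewrite nasal + halant + allowed consonant as anusvara + consonant."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if (ch in VARGA and i + 2 < len(text) and text[i + 1] == VIRAMA
                and text[i + 2] in allowed_followers(ch, lenient)):
            out.append(ANUSVARA)
            i += 2  # skip nasal and halant; the next consonant is kept
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def prefix_matches(query, word, lenient=True):
    """True if word starts with query, where a query-final nasal + halant
    may match an anusvara followed by an allowed consonant."""
    norm_word = to_anusvara(word, lenient)
    if len(query) >= 2 and query[-1] == VIRAMA and query[-2] in VARGA:
        nasal = query[-2]
        stem = to_anusvara(query[:-2], lenient) + ANUSVARA
        if norm_word.startswith(stem) and len(norm_word) > len(stem):
            if norm_word[len(stem)] in allowed_followers(nasal, lenient):
                return True
    return word.startswith(query)
```

With lenient=True, the query सन् matches संतान, संगीत, संजय and संडे but not संभव. With lenient=False only संतान matches, which is the strict grammatical case.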
[Bug 27055] Devanagari and Arabic combining character handling
https://bugzilla.wikimedia.org/show_bug.cgi?id=27055

Niklas Laxström changed:

What |Removed |Added
CC   |        |niklas.laxst...@gmail.com

--- Comment #2 from Niklas Laxström 2011-09-06 11:28:46 UTC ---
Just adding a note that stripping diacritics from Latin letters is not always the correct thing to do. It is obvious that we need to support different models for different languages.
[Bug 27055] Devanagari and Arabic combining character handling
https://bugzilla.wikimedia.org/show_bug.cgi?id=27055

Santhosh Thottingal changed:

What     |Removed |Added
Keywords |        |i18n
CC       |        |santhosh.thottingal@gmail.com
[Bug 27055] Devanagari and Arabic combining character handling
https://bugzilla.wikimedia.org/show_bug.cgi?id=27055

--- Comment #1 from Dave Ross 2011-02-05 15:41:31 UTC ---
The discussion can be seen here, but here are the diacritics and characters provided to me:

Hindi:
First of all, the pairs with nuqta (a dot underneath) and without it should be searchable the same way Roman letters with diacritics and without are searchable.
* क़/क ख़/ख ग़/ग ज़/ज फ़/फ ड़/ड ढ़/ढ
The letters are not identical, but they should be treated as equivalent so that if a user typed खून, ख़ून would also be listed.
* Words containing the diacritics ॉ (candra) and ् (virama) should be equal to those without them: चॉकलेट / चाकलेट, सन् / सन. This is similar to the way English entries with a space are equal to those having a hyphen (-) between the words.

Arabic:
* Different forms of alif: ا, أ, إ, ﺁ and ٱ should be searchable together, e.g. أمس and امس, etc.
* Words containing any of these diacritics could be searchable as if they don't have them and the other way around:
ـَ fatHa, ـِ kasra, ـُ Damma, ـْ sukuun, ـّ shadda, ـٰ dagger 'alif.

* ـٌ tanwiin al-Damm (تنوين الضم)
* ـٍ tanwiin al-kasr (تنوين الكسر)
* ـً tanwiin al-fatH (تنوين الفتح)

Persian often uses a zero-width nonjoiner (&#x200C;) as in ویکیپدیا. People who don’t know how to access it tend to substitute a space: ویکی پدیا. It’s a misspelling, but lots of people can’t help it.

In languages like Khmer and Thai that do not use word spaces, there is often a zero-width space (&#x200B;) as in តើអ្នកនិយាយភាសាអង់គ្លេសទេ. More often than not, it is simply left out (តើអ្នកនិយាយភាសាអង់គ្លេសទេ). Both spellings are correct.

I think Anatoli neglected to mention the word-final Arabic pair ه/ة. The final letter ة may be typed as ه.
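Several of the equivalences requested in comment #1 amount to dropping specific code points before comparison. The following is a rough sketch of that idea, assuming Python's standard unicodedata module. The function names fold_marks and spacing_key are invented for this example and do not describe any existing search backend. The candra, virama, alif and ه/ة cases are deliberately left out, since they need substitution rules rather than simple removal. Dropping ordinary spaces in spacing_key is an assumption made only so that the three Persian spellings compare equal.

```python
import unicodedata

NUQTA = "\u093C"  # Devanagari sign nukta

# Arabic marks listed in the report.
ARABIC_MARKS = {
    "\u064B",  # tanwiin al-fatH
    "\u064C",  # tanwiin al-Damm
    "\u064D",  # tanwiin al-kasr
    "\u064E",  # fatHa
    "\u064F",  # Damma
    "\u0650",  # kasra
    "\u0651",  # shadda
    "\u0652",  # sukuun
    "\u0670",  # dagger 'alif
}

ZERO_WIDTH = {"\u200B", "\u200C"}  # zero-width space, zero-width non-joiner


def fold_marks(text):
    """Drop the nuqta and the listed Arabic marks from text."""
    # NFD splits precomposed nuqta letters (such as U+0959) into
    # base letter + U+093C, so both encodings are handled alike.
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(
        ch for ch in decomposed
        if ch != NUQTA and ch not in ARABIC_MARKS
    )
    # Recompose whatever was not removed.
    return unicodedata.normalize("NFC", kept)


def spacing_key(text):
    """Comparison key that ignores zero-width characters and whitespace."""
    return "".join(
        ch for ch in text
        if ch not in ZERO_WIDTH and not ch.isspace()
    )
```

A search index would more likely tokenize than delete spaces outright, so spacing_key is best read as a way to test whether two spellings differ only in spacing, not as an indexing strategy.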