[Bug 31135] New: Lucene tokenization is wrong for Indic languages

bugzilla-daemon Sat, 24 Sep 2011 06:57:00 -0700

https://bugzilla.wikimedia.org/show_bug.cgi?id=31135


       Web browser: ---
             Bug #: 31135
           Summary: Lucene tokenization is wrong for Indic languages
           Product: MediaWiki extensions
           Version: any
          Platform: All
        OS/Version: All
            Status: NEW
          Keywords: i18n
          Severity: normal
          Priority: Unprioritized
         Component: Lucene Search
        AssignedTo: [email protected]
        ReportedBy: [email protected]
                CC: [email protected]
    Classification: Unclassified


Lucene splits words at format control characters such as ZWJ and ZWNJ, so
words in Indic languages such as Sinhala are broken in unwanted places.

This is the log from lucened when the string ශ්‍රීලංකා (Sri Lanka, written in
Sinhala) is searched:

25959 [pool-2-thread-1] INFO  org.wikimedia.lsearch.search.SearchEngine  -
search wikidb: query=[ශ්‍රීලංකා] parsed=[custom(+(+(contents:ශ්^0.2
contents:ශ^0.1) +(contents:රීලංකා^0.2 contents:රලක^0.1)) relevance ([((P
contents:"(ශ් ශ) (රීලංකා රලක)"~100) (((P sections:"(ශ් ශ)") (P
sections:"(රීලංකා රලක)") (P sections:"(ශ් ශ) (රීලංකා රලක)"))^0.25))^2.0], ((P
alttitle:"(ශ් ශ)"^2.5) (P alttitle:"(රීලංකා රලක)"^2.5) (P alttitle:"(ශ් ශ)
(රීලංකා රලක)"~20^2.5)) ((P related:"(ශ් ශ)"^12.0) (P related:"(රීලංකා
රලක)"^12.0) (P related:"(ශ් ශ) (රීලංකා රලක)"^12.0))) (P alttitle:"ශ්
රීලංකා"~20))] hit=[0] in 250ms using IndexSearcherMul:1316871160395


ශ්‍රීලංකා is 0DC1 + 0DCA + 200D + 0DBB + 0DD3 + 0DBD + 0D82 + 0D9A + 0DCF,
or SHA + VIRAMA + ZWJ + RA + VOWEL SIGN II + LA + ANUSVARA + KA + VOWEL SIGN AA.
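
The code point sequence above can be checked with a short script. This is only
an illustrative sketch in Python, not part of lucened; the names WORD and
describe are made up for the example. Note that the official Unicode names for
Sinhala characters may differ from the generic names used above (for example,
the virama is listed under a Sinhala-specific name).

```python
import unicodedata

# The search string from this report, spelled out as escapes so the
# invisible ZWJ (U+200D) cannot be lost when copying the text.
WORD = "\u0dc1\u0dca\u200d\u0dbb\u0dd3\u0dbd\u0d82\u0d9a\u0dcf"

def describe(text):
    """Return (code point, general category, Unicode name) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]

rows = describe(WORD)
for code, category, name in rows:
    print(code, category, name)
```

The third entry is U+200D with general category Cf (format), which is the
character at which the query is being split.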

This is a single word and cannot be tokenized further, but the log shows that
it is split at the position of the ZWJ.

A solution would be to write language-specific tokenization rules in Lucene.
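
The behaviour and one possible rule can be sketched outside Lucene. The
following Python is a toy tokenizer, not Lucene's or lucened's actual code; the
names tokenize, is_word_char, JOINERS and the keep_joiners flag are
hypothetical. It assumes a tokenizer that treats only letters, marks and
digits as word characters, which reproduces the split seen in the log, and
then shows that also accepting ZWJ and ZWNJ keeps the word whole.

```python
import unicodedata

# Same string as in the report, written as escapes to preserve the ZWJ.
WORD = "\u0dc1\u0dca\u200d\u0dbb\u0dd3\u0dbd\u0d82\u0d9a\u0dcf"
JOINERS = {"\u200c", "\u200d"}  # ZWNJ, ZWJ

def is_word_char(ch, keep_joiners):
    """Letters (L*), marks (M*) and numbers (N*) are word characters;
    ZWJ/ZWNJ count only when keep_joiners is set (hypothetical rule)."""
    if unicodedata.category(ch)[0] in ("L", "M", "N"):
        return True
    return keep_joiners and ch in JOINERS

def tokenize(text, keep_joiners=False):
    """Split text into runs of word characters."""
    tokens, current = [], []
    for ch in text:
        if is_word_char(ch, keep_joiners):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens

broken = tokenize(WORD)                     # two tokens, split at the ZWJ
whole = tokenize(WORD, keep_joiners=True)   # one token, ZWJ preserved
print(len(broken), len(whole))
```

In this sketch the default mode yields the same two pieces that appear in the
parsed query (SHA + VIRAMA, then the rest), while the joiner-aware mode returns
the word unchanged. A real fix would also need to decide how to handle joiners
at the start or end of a token, which this example does not address.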

