https://bugzilla.wikimedia.org/show_bug.cgi?id=31135
Web browser: ---
Bug #: 31135
Summary: Lucene tokenization is wrong for Indic languages
Product: MediaWiki extensions
Version: any
Platform: All
OS/Version: All
Status: NEW
Keywords: i18n
Severity: normal
Priority: Unprioritized
Component: Lucene Search
AssignedTo: [email protected]
ReportedBy: [email protected]
CC: [email protected]
Classification: Unclassified
Lucene splits words at format control characters such as ZWJ and ZWNJ,
so words in Indic languages such as Sinhala are broken in unwanted places.
This is the log from lucened when the string ශ්රීලංකා (Sri Lanka, written in
Sinhala) is searched:
25959 [pool-2-thread-1] INFO org.wikimedia.lsearch.search.SearchEngine -
search wikidb: query=[ශ්රීලංකා] parsed=[custom(+(+(contents:ශ්^0.2
contents:ශ^0.1) +(contents:රීලංකා^0.2 contents:රලක^0.1)) relevance ([((P
contents:"(ශ් ශ) (රීලංකා රලක)"~100) (((P sections:"(ශ් ශ)") (P
sections:"(රීලංකා රලක)") (P sections:"(ශ් ශ) (රීලංකා රලක)"))^0.25))^2.0], ((P
alttitle:"(ශ් ශ)"^2.5) (P alttitle:"(රීලංකා රලක)"^2.5) (P alttitle:"(ශ් ශ)
(රීලංකා රලක)"~20^2.5)) ((P related:"(ශ් ශ)"^12.0) (P related:"(රීලංකා
රලක)"^12.0) (P related:"(ශ් ශ) (රීලංකා රලක)"^12.0))) (P alttitle:"ශ්
රීලංකා"~20))] hit=[0] in 250ms using IndexSearcherMul:1316871160395
ශ්රීලංකා is 0DC1 + 0DCA + 200D + 0DBB + 0DD3 + 0DBD + 0D82 + 0D9A + 0DCF
or SHA + VIRAMA + ZWJ + RA + VOWEL SIGN II + LA + ANUSVARA + KA + VOWEL SIGN AA
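The breakdown above can be checked with a short script. This is only a sketch in plain Python, not part of the search extension; the helper name `describe` is made up for illustration, and the string is built from explicit escapes so the invisible ZWJ cannot be lost in copy and paste. Note that the Unicode character database uses the Sinhala-specific names (for example AL-LAKUNA for the virama), so the printed names differ from the generic ones used above.

```python
import unicodedata

# The search string from the report, written as explicit code points.
WORD = "\u0dc1\u0dca\u200d\u0dbb\u0dd3\u0dbd\u0d82\u0d9a\u0dcf"


def describe(text):
    """Return (code point, general category, Unicode name) per character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch),
         unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


if __name__ == "__main__":
    for code_point, category, name in describe(WORD):
        print(code_point, category, name)
```

The third character is reported with general category Cf (format), which is the class of character the tokenizer appears to treat as a word boundary.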
This is a single word and must not be split further, yet the log shows that it
is tokenized at the position of the ZWJ.
The solution would be to write language-specific tokenization rules for Lucene.
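As a rough illustration of such a rule, the sketch below contrasts a tokenizer that treats ZWJ/ZWNJ as separators with one that keeps them inside a word. It is plain Python, not Lucene code, and it does not claim to reproduce how the extension's analyzer is implemented; the function names and the `keep_joiners` flag are hypothetical. It assumes a word character is any letter, mark, or number, and that joiners only matter between such characters.

```python
import unicodedata

ZWJ, ZWNJ = "\u200d", "\u200c"

# The search string from the report, written as explicit code points.
WORD = "\u0dc1\u0dca\u200d\u0dbb\u0dd3\u0dbd\u0d82\u0d9a\u0dcf"


def is_word_char(ch, keep_joiners):
    """Letters, marks and numbers are word characters; joiners are optional."""
    if ch in (ZWJ, ZWNJ):
        return keep_joiners
    return unicodedata.category(ch)[0] in ("L", "M", "N")


def tokenize(text, keep_joiners):
    """Split text into runs of word characters."""
    tokens, current = [], []
    for ch in text + " ":  # trailing space flushes the last token
        if is_word_char(ch, keep_joiners):
            current.append(ch)
        elif current:
            # Joiners at the edge of a token join nothing, so drop them.
            token = "".join(current).strip(ZWJ + ZWNJ)
            if token:
                tokens.append(token)
            current = []
    return tokens


if __name__ == "__main__":
    print(tokenize(WORD, keep_joiners=False))  # two fragments, as in the log
    print(tokenize(WORD, keep_joiners=True))   # one token
```

With `keep_joiners=False` the word splits into the same two fragments that appear in the parsed query; with `keep_joiners=True` it stays whole. A real fix would also have to apply the same rule at index time, so that indexed and queried tokens match.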