https://bugzilla.wikimedia.org/show_bug.cgi?id=26997

           Summary: CJKFilter wrongly tokenizes a mixed CJK and non-CJK
                    string.
           Product: MediaWiki extensions
           Version: any
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: Normal
         Component: Lucene Search
        AssignedTo: rain...@eunet.rs
        ReportedBy: mizuno....@gmail.com


Created attachment 8054
  --> https://bugzilla.wikimedia.org/attachment.cgi?id=8054
a patch for CJKFilter.java and its test.

With the language=ja setting,
CJKFilter wrongly tokenizes a CJK string
if the string starts with non-CJK characters.

Example:
A string "abC1C2C3", where C1, C2, and C3 are CJK characters, is tokenized into
the token stream (abC1, C1C2, C2C3).
This should be (ab, C1C2, C2C3).

This behavior causes odd snippets in search results:
the token stream (abC1, C1C2, C2C3) is recombined into the word "abC1C1C2C3",
with C1 duplicated.
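
For illustration, the expected tokenization can be sketched roughly as below.
This is not the attached patch and not the real CJKFilter code: the class name
CjkBigramSketch and its methods are hypothetical, and the CJK check is only an
approximation based on Character.UnicodeScript. The idea is to end the current
token whenever the input switches between non-CJK and CJK characters, and to
emit overlapping bigrams only inside a CJK run.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the actual CJKFilter implementation.
public class CjkBigramSketch {
    // Rough CJK test (assumption): Han, Hiragana, Katakana, Hangul.
    static boolean isCjk(int cp) {
        Character.UnicodeScript s = Character.UnicodeScript.of(cp);
        return s == Character.UnicodeScript.HAN
            || s == Character.UnicodeScript.HIRAGANA
            || s == Character.UnicodeScript.KATAKANA
            || s == Character.UnicodeScript.HANGUL;
    }

    // 0 = separator, 1 = non-CJK letter or digit, 2 = CJK character.
    static int kind(int cp) {
        if (isCjk(cp)) return 2;
        return Character.isLetterOrDigit(cp) ? 1 : 0;
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int[] cps = text.codePoints().toArray();
        int i = 0;
        while (i < cps.length) {
            int start = i;
            int k = kind(cps[i]);
            // Consume one run of characters of the same kind.
            while (i < cps.length && kind(cps[i]) == k) i++;
            int len = i - start;
            if (k == 1 || (k == 2 && len == 1)) {
                // Non-CJK word, or a lone CJK character: one token.
                tokens.add(new String(cps, start, len));
            } else if (k == 2) {
                // CJK run: overlapping bigrams that never cross the run boundary.
                for (int j = start; j + 1 < i; j++) {
                    tokens.add(new String(cps, j, 2));
                }
            }
            // k == 0: separators produce no token.
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "ab" followed by three Han characters (C1 C2 C3 in the report).
        System.out.println(tokenize("ab\u65E5\u672C\u8A9E"));
    }
}
```

With this approach "abC1C2C3" yields (ab, C1C2, C2C3), so no token mixes "ab"
with C1 and the snippet is no longer rebuilt with a duplicated character.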
