Summary: CJKFilter wrongly tokenizes a mixed CJK and non-CJK string
           Product: MediaWiki extensions
           Version: any
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: Normal
         Component: Lucene Search

Created attachment 8054
A patch for this bug and its test.

With the language=ja setting,
CJKFilter wrongly tokenizes a CJK string
if the string starts with non-CJK characters.

A string "abC1C2C3", where C1, C2 and C3 are CJK characters, is tokenized into
the token stream (abC1, C1C2, C2C3).
This should be (ab, C1C2, C2C3).

This behavior causes an odd snippet in search results.
The token stream (abC1, C1C2, C2C3) is combined into the word "abC1C1C2C3",
in which C1 appears twice.
