On 18/02/2017 07:22, Hao Wu wrote:
> Thanks. Got it to work.

Lucy's StandardTokenizer breaks up the text at the word boundaries defined in Unicode Standard Annex #29. Then we treat every Alphabetic character that doesn't have a Word_Break property as a single term. These are characters that match \p{Ideographic}, \p{Script: Hiragana}, or \p{Line_Break: Complex_Context}. This should work for Chinese, but as Peter mentioned, we don't support n-grams.
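
To illustrate, here's a minimal sketch using the Perl bindings' Analyzer split() convenience method (the sample string is arbitrary):

    use utf8;
    use Lucy::Analysis::StandardTokenizer;

    my $tokenizer = Lucy::Analysis::StandardTokenizer->new;

    # UAX #29 keeps "Hello" together as one token, while each Han
    # character comes out as its own single-character term.
    my $token_texts = $tokenizer->split('Hello 世界');
    # $token_texts: [ 'Hello', '世', '界' ]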

If you're using QueryParser, you're likely to run into problems, though. QueryParser will turn a sequence of Chinese characters into a PhraseQuery, which is obviously wrong: the query would only match that exact run of adjacent characters. A quick hack is to insert a space after every Chinese character before passing the query string to QueryParser:

    $query_string =~ s/(\p{Ideographic})/$1 /g;
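
For context, a minimal sketch of how that hack could slot into a full search, following the usual IndexSearcher/QueryParser setup from the Lucy tutorial (the index path and query string below are made up):

    use utf8;
    use Lucy::Search::IndexSearcher;
    use Lucy::Search::QueryParser;

    # Hypothetical index location, for illustration only.
    my $searcher = Lucy::Search::IndexSearcher->new( index => '/path/to/index' );
    my $qparser  = Lucy::Search::QueryParser->new( schema => $searcher->get_schema );

    my $query_string = '搜索引擎';
    $query_string =~ s/(\p{Ideographic})/$1 /g;    # now "搜 索 引 擎 "

    # Parses to a boolean combination of single-character TermQueries
    # rather than one PhraseQuery over the whole run.
    my $hits = $searcher->hits( query => $qparser->parse($query_string) );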

Nick
