On 18/02/2017 07:22, Hao Wu wrote:
> Thanks. Got it working.
Lucy's StandardTokenizer breaks up text at the word boundaries defined in
Unicode Standard Annex #29. Each Alphabetic character that has no specific
Word_Break property value is then treated as a single term; these are the
characters matching \p{Ideographic}, \p{Script: Hiragana}, or \p{Line_Break:
Complex_Context}. For Chinese this means every Han character becomes its own
token, which should work for basic searches, but as Peter mentioned, we don't
support n-grams.
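For example, something like this shows the per-character splitting (an
untested sketch; the sample string and the exact output are my own, not from
the docs):

use Lucy::Analysis::StandardTokenizer;

my $tokenizer = Lucy::Analysis::StandardTokenizer->new;
# Each Han character comes back as a separate token,
# while "engine" stays a single word.
my $tokens = $tokenizer->split("我爱引擎 engine");
# $tokens should be roughly [ '我', '爱', '引', '擎', 'engine' ]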
If you're using QueryParser, though, you're likely to run into problems.
QueryParser will turn a run of adjacent Chinese characters into a single
PhraseQuery, which is obviously wrong: it would only match documents
containing those characters in that exact order. A quick hack is to insert a
space after every Chinese character before passing the query string to
QueryParser:
# Append a space after each Han character so QueryParser
# sees them as separate terms rather than one phrase.
$query_string =~ s/\p{Ideographic}/$& /g;
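In context, that might look like this (a sketch; $searcher stands in for an
IndexSearcher you've already opened):

use Lucy::Search::QueryParser;

my $query_parser = Lucy::Search::QueryParser->new(
    schema => $searcher->get_schema,
);
# Space out the Han characters before parsing.
$query_string =~ s/\p{Ideographic}/$& /g;
my $query = $query_parser->parse($query_string);
my $hits  = $searcher->hits( query => $query );

Note that the resulting query joins the individual characters with the
parser's default boolop, so ranking will be cruder than what real word
segmentation or n-gram indexing would give you.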
Nick