On 18/02/2017 07:22, Hao Wu wrote:
> Thanks. Got it working.
Lucy's StandardTokenizer breaks up text at the word boundaries defined in
Unicode Standard Annex #29. Each Alphabetic character that has no specific
Word_Break property value is then treated as a single term; these are the
characters matching \p{Ideographic}, \p{Script: Hiragana}, or \p{Line_Break:
Complex_Context}. For Chinese this means every Han character becomes its own
token, which should work for basic searches, but as Peter mentioned, we don't
support n-grams.
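For example, something like this shows the per-character splitting (an
untested sketch; the sample string and the exact output are my own, not from
the docs):

use Lucy::Analysis::StandardTokenizer;

my $tokenizer = Lucy::Analysis::StandardTokenizer->new;
# Each Han character comes back as a separate token,
# while "engine" stays a single word.
my $tokens = $tokenizer->split("我爱引擎 engine");
# $tokens should be roughly [ '我', '爱', '引', '擎', 'engine' ]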
If you're using QueryParser, though, you're likely to run into problems.
QueryParser will turn a run of adjacent Chinese characters into a single
PhraseQuery, which is obviously wrong: it would only match documents
containing those characters in that exact order. A quick hack is to insert a
space after every Chinese character before passing the query string to
QueryParser:
# Append a space after each Han character so QueryParser
# sees them as separate terms rather than one phrase.
$query_string =~ s/\p{Ideographic}/$& /g;
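In context, that might look like this (a sketch; $searcher stands in for an
IndexSearcher you've already opened):

use Lucy::Search::QueryParser;

my $query_parser = Lucy::Search::QueryParser->new(
    schema => $searcher->get_schema,
);
# Space out the Han characters before parsing.
$query_string =~ s/\p{Ideographic}/$& /g;
my $query = $query_parser->parse($query_string);
my $hits  = $searcher->hits( query => $query );

Note that the resulting query joins the individual characters with the
parser's default boolop, so ranking will be cruder than what real word
segmentation or n-gram indexing would give you.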
Nick