Thanks. Got it working. Code pasted below in case anyone has a similar question.
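In case it helps anyone adapting this: the only contract between the two halves is that the Python side returns (word, start_offset, end_offset) tuples, which Inline::Python hands to Perl as array refs. A minimal sketch of that shape, using a hypothetical two-character stand-in tokenizer so it runs without jieba installed:

```python
# Hypothetical stand-in for jieba.tokenize: yields the same
# (word, start_offset, end_offset) tuples that the Perl code
# below consumes, but just splits the text into 2-char chunks.
def fake_tokenize(text):
    for i in range(0, len(text), 2):
        chunk = text[i:i + 2]
        yield (chunk, i, i + len(chunk))

tokens = list(fake_tokenize("永和服装"))
# Each tuple maps onto text / start_offset / end_offset in
# Lucy::Analysis::Token->new on the Perl side.
```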
package ChineseAnalyzer;

use v5.10;
use Jieba;
use Encode qw(decode_utf8);
use base qw( Lucy::Analysis::Analyzer );

sub new {
    my $self = shift->SUPER::new;
    return $self;
}

sub transform {
    my ( $self, $inversion ) = @_;
    return $inversion;
}

sub transform_text {
    my ( $self, $text ) = @_;
    my $inversion = Lucy::Analysis::Inversion->new;

    # Each token is a [word, start_offset, end_offset] triple.
    my @tokens = Jieba::jieba_tokenize( decode_utf8($text) );
    $inversion->append(
        Lucy::Analysis::Token->new(
            text         => $_->[0],
            start_offset => $_->[1],
            end_offset   => $_->[2],
        )
    ) for @tokens;

    return $inversion;
}

1;

package Jieba;

use v5.10;

sub jieba_tokenize { jieba_tokenize_python(shift) }

# TODO:
# result = jieba.tokenize(u'永和服装饰品有限公司', mode='search')
use Inline Python => <<'END_OF_PYTHON_CODE';
from jieba import tokenize

def jieba_tokenize_python(text):
    seg_list = tokenize(text, mode='search')
    return list(seg_list)
END_OF_PYTHON_CODE

1;

On Fri, Feb 17, 2017 at 6:29 PM, Peter Karman <pe...@peknet.com> wrote:
> Hao Wu wrote on 2/17/17 4:44 PM:
>
>> Hi all,
>>
>> I use the StandardTokenizer. search by English word work, but in
>> Chinese give me strange results.
>>
>> my $tokenizer = Lucy::Analysis::StandardTokenizer->new;
>> my $raw_type = Lucy::Plan::FullTextType->new(
>>     analyzer => $tokenizer,
>> );
>>
>> also, I was going to use the EasyAnalyzer (
>> https://metacpan.org/pod/distribution/Lucy/lib/Lucy/Analysis/EasyAnalyzer.pod
>> ), but chinese in not supported.
>>
>> What is the simple way to use lucy with chinese doc? Thanks.
>>
>
> There is currently no equivalent of
> https://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/cjk/CJKTokenizer.html
> within core Lucy.
>
> Furthermore, there is no automatic language detection in Lucy. You'll note
> in https://metacpan.org/pod/distribution/Lucy/lib/Lucy/Analysis/EasyAnalyzer.pod
> that the language must be explicitly specified, and that is for the
> stemming analyzer. Also, Chinese is not among the supported languages
> listed.
>
> Maybe something wrapped around
> https://metacpan.org/pod/Lingua::CJK::Tokenizer would work as a custom
> analyzer.
>
> You can see an example in the documentation here:
> https://metacpan.org/pod/Lucy::Analysis::Analyzer#new
>
> --
> Peter Karman . https://peknet.com/ . https://keybase.io/peterkarman