Thanks, I got it working.
The code is pasted below in case anyone has a similar question.
package ChineseAnalyzer;
use Jieba;
use v5.10;
use Encode qw(decode_utf8);
use base qw( Lucy::Analysis::Analyzer );

sub new {
    my $self = shift->SUPER::new;
    return $self;
}

# Pass an existing Inversion through unchanged.
sub transform {
    my ($self, $inversion) = @_;
    return $inversion;
}

# Segment raw text with jieba and build an Inversion from the result.
sub transform_text {
    my ($self, $text) = @_;
    my $inversion = Lucy::Analysis::Inversion->new;
    # Each token is [word, start, end].
    my @tokens = Jieba::jieba_tokenize(decode_utf8($text));
    $inversion->append(
        Lucy::Analysis::Token->new(
            text         => $_->[0],
            start_offset => $_->[1],
            end_offset   => $_->[2],
        )
    ) for @tokens;
    return $inversion;
}

1;
package Jieba;
use v5.10;

sub jieba_tokenize {
    jieba_tokenize_python(shift);
}

# TODO:
# result = jieba.tokenize(u'永和服装饰品有限公司', mode='search')
use Inline Python => <<'END_OF_PYTHON_CODE';
from jieba import tokenize

def jieba_tokenize_python(text):
    # Search mode yields (word, start, end) tuples.
    seg_list = tokenize(text, mode='search')
    return list(seg_list)
END_OF_PYTHON_CODE

1;
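
One thing that may be worth checking when adapting this: the analyzer passes the start and end values from jieba straight through as token offsets, so they need to be character positions in the text rather than byte positions. Below is a rough sketch of a sanity check on the Python side. The helper name bad_offsets and the sample tokens are made up for illustration; the samples are hand-written, not real jieba output.

```python
def bad_offsets(text, tokens):
    """Return the tokens whose (start, end) do not slice back to the word.

    `tokens` is an iterable of (word, start, end) tuples, with offsets
    counted in characters of `text`, not bytes.
    """
    return [(word, start, end)
            for word, start, end in tokens
            if text[start:end] != word]


text = "永和服装饰品有限公司"

# Hand-written sample tokens for illustration, not real jieba output.
sample = [("永和", 0, 2), ("服装", 2, 4), ("有限公司", 6, 10)]
print(bad_offsets(text, sample))

# The same word given with UTF-8 byte offsets instead of character
# offsets no longer slices back to itself, so it is reported.
byte_based = [("服装", 6, 12)]
print(bad_offsets(text, byte_based))
```

With jieba installed, the same check could presumably be run on real output as bad_offsets(text, tokenize(text, mode='search')); an empty list would mean every offset lines up with its word.
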
On Fri, Feb 17, 2017 at 6:29 PM, Peter Karman wrote:
> Hao Wu wrote on 2/17/17 4:44 PM:
>
>> Hi all,
>>
>> I use the StandardTokenizer. Searching for English words works, but
>> searching in Chinese gives me strange results.
>>
>> my $tokenizer = Lucy::Analysis::StandardTokenizer->new;
>> my $raw_type = Lucy::Plan::FullTextType->new(
>> analyzer => $tokenizer,
>> );
>>
>> Also, I was going to use the EasyAnalyzer
>> (https://metacpan.org/pod/distribution/Lucy/lib/Lucy/Analysis/EasyAnalyzer.pod),
>> but Chinese is not supported.
>>
>> What is a simple way to use Lucy with Chinese documents? Thanks.
>>
>
> There is currently no equivalent of
> https://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/cjk/CJKTokenizer.html
> within core Lucy.
>
> Furthermore, there is no automatic language detection in Lucy. You'll note
> in https://metacpan.org/pod/distribution/Lucy/lib/Lucy/Analysis/EasyAnalyzer.pod
> that the language must be explicitly specified, and that is for the
> stemming analyzer. Also, Chinese is not among the supported languages
> listed.
>
> Maybe something wrapped around
> https://metacpan.org/pod/Lingua::CJK::Tokenizer would work as a custom
> analyzer.
>
> You can see an example in the documentation here
> https://metacpan.org/pod/Lucy::Analysis::Analyzer#new
>
>
>
> --
> Peter Karman . https://peknet.com/ . https://keybase.io/peterkarman
>