Thanks. Got it working.

Code pasted below in case anyone has a similar question.

package ChineseAnalyzer;
use strict;
use warnings;
use v5.10;
use Encode qw(decode_utf8);
use Jieba;
use base qw( Lucy::Analysis::Analyzer );

sub new {
    my $self = shift->SUPER::new;
    return $self;
}

# Required by the Analyzer interface; tokenization happens in
# transform_text, so pass the inversion through unchanged.
sub transform {
    my ( $self, $inversion ) = @_;
    return $inversion;
}

sub transform_text {
    my ( $self, $text ) = @_;
    my $inversion = Lucy::Analysis::Inversion->new;

    # Each token is an arrayref of [ word, start_offset, end_offset ].
    my @tokens = Jieba::jieba_tokenize( decode_utf8($text) );
    $inversion->append(
        Lucy::Analysis::Token->new(
            text         => $_->[0],
            start_offset => $_->[1],
            end_offset   => $_->[2],
        )
    ) for @tokens;
    return $inversion;
}

1;



package Jieba;
use strict;
use warnings;
use v5.10;

sub jieba_tokenize {
    return jieba_tokenize_python(shift);
}

# TODO (reference example from the jieba docs):
#   result = jieba.tokenize(u'永和服装饰品有限公司', mode='search')
use Inline Python => <<'END_OF_PYTHON_CODE';
from jieba import tokenize

def jieba_tokenize_python(text):
    # Search mode yields overlapping tokens as (word, start, end) tuples.
    seg_list = tokenize(text, mode='search')
    return list(seg_list)

END_OF_PYTHON_CODE

1;
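
For anyone who wants to try this end to end, here is a minimal, untested
sketch of how the analyzer might be wired into an index. The field name
(content), the index path (/path/to/index), and the sample query are
placeholders of my own, and depending on how your documents are encoded
you may need to adjust the decode_utf8 call in transform_text above.

use strict;
use warnings;
use v5.10;

use Lucy::Plan::Schema;
use Lucy::Plan::FullTextType;
use Lucy::Index::Indexer;
use Lucy::Search::IndexSearcher;

# Run the full-text field through the custom ChineseAnalyzer.
my $schema = Lucy::Plan::Schema->new;
my $type   = Lucy::Plan::FullTextType->new(
    analyzer => ChineseAnalyzer->new,
);
$schema->spec_field( name => 'content', type => $type );

# Index a sample document ('/path/to/index' is a placeholder path).
my $indexer = Lucy::Index::Indexer->new(
    schema => $schema,
    index  => '/path/to/index',
    create => 1,
);
$indexer->add_doc( { content => '永和服装饰品有限公司' } );
$indexer->commit;

# In search mode, jieba should split that string into overlapping tokens
# such as 永和, 服装, 饰品, 有限, 公司, and 有限公司, so a query for any
# of those words should match.
my $searcher = Lucy::Search::IndexSearcher->new( index => '/path/to/index' );
my $hits     = $searcher->hits( query => '服装' );
say "matched: ", $hits->total_hits;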


On Fri, Feb 17, 2017 at 6:29 PM, Peter Karman <pe...@peknet.com> wrote:

> Hao Wu wrote on 2/17/17 4:44 PM:
>
>> Hi all,
>>
>> I use the StandardTokenizer. Searching by English words works, but
>> Chinese gives me strange results.
>>
>> my $tokenizer = Lucy::Analysis::StandardTokenizer->new;
>> my $raw_type = Lucy::Plan::FullTextType->new(
>>         analyzer => $tokenizer,
>> );
>>
>> Also, I was going to use the EasyAnalyzer (
>> https://metacpan.org/pod/distribution/Lucy/lib/Lucy/Analysis/EasyAnalyzer.pod
>> ), but Chinese is not supported.
>>
>> What is the simplest way to use Lucy with Chinese documents? Thanks.
>>
>
> There is currently no equivalent of
> https://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/cjk/CJKTokenizer.html
> within core Lucy.
>
> Furthermore, there is no automatic language detection in Lucy. You'll note
> in https://metacpan.org/pod/distribution/Lucy/lib/Lucy/Analysis/EasyAnalyzer.pod
> that the language must be explicitly specified, and that is for the
> stemming analyzer. Also, Chinese is not among the supported languages
> listed.
>
> Maybe something wrapped around https://metacpan.org/pod/Lingua::CJK::Tokenizer
> would work as a custom analyzer.
>
> You can see an example in the documentation here
> https://metacpan.org/pod/Lucy::Analysis::Analyzer#new
>
>
>
> --
> Peter Karman  .  https://peknet.com/  .  https://keybase.io/peterkarman
>
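
For the archives: the Lucene CJKTokenizer that Peter mentions works by
emitting overlapping bigrams of Han characters rather than doing real word
segmentation. A rough, untested sketch of that same bigram idea as a Lucy
custom analyzer, with no jieba or Lingua::CJK::Tokenizer dependency, might
look like the following (it ignores non-Han text, which a real analyzer
would also need to tokenize):

package BigramCJKAnalyzer;
use strict;
use warnings;
use base qw( Lucy::Analysis::Analyzer );

sub new { return shift->SUPER::new }

sub transform {
    my ( $self, $inversion ) = @_;
    return $inversion;
}

sub transform_text {
    my ( $self, $text ) = @_;
    my $inversion = Lucy::Analysis::Inversion->new;

    # Find each run of Han characters and emit overlapping bigrams,
    # in the same spirit as Lucene's CJKTokenizer.
    while ( $text =~ /\G.*?(\p{Han}+)/gs ) {
        my $run   = $1;
        my $start = pos($text) - length($run);
        if ( length($run) == 1 ) {
            # A lone character becomes a single unigram token.
            $inversion->append(
                Lucy::Analysis::Token->new(
                    text         => $run,
                    start_offset => $start,
                    end_offset   => $start + 1,
                )
            );
        }
        else {
            for my $i ( 0 .. length($run) - 2 ) {
                $inversion->append(
                    Lucy::Analysis::Token->new(
                        text         => substr( $run, $i, 2 ),
                        start_offset => $start + $i,
                        end_offset   => $start + $i + 2,
                    )
                );
            }
        }
    }
    return $inversion;
}

1;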
