Hao Wu wrote on 2/17/17 4:44 PM:
Hi all,

I use the StandardTokenizer. Searching by English words works, but
searching in Chinese gives me strange results.

my $tokenizer = Lucy::Analysis::StandardTokenizer->new;
my $raw_type  = Lucy::Plan::FullTextType->new(
    analyzer => $tokenizer,
);

I was also going to use the EasyAnalyzer (
https://metacpan.org/pod/distribution/Lucy/lib/Lucy/Analysis/EasyAnalyzer.pod
), but Chinese is not supported.

What is the simplest way to use Lucy with Chinese documents? Thanks.

There is currently no equivalent of
https://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/cjk/CJKTokenizer.html
within core Lucy.

Furthermore, there is no automatic language detection in Lucy. You'll note in https://metacpan.org/pod/distribution/Lucy/lib/Lucy/Analysis/EasyAnalyzer.pod that the language must be explicitly specified, and that is for the stemming analyzer. Also, Chinese is not among the supported languages listed.

Maybe something wrapped around https://metacpan.org/pod/Lingua::CJK::Tokenizer would work as a custom analyzer.

You can see an example of subclassing Analyzer in the documentation here:
https://metacpan.org/pod/Lucy::Analysis::Analyzer#new
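As a lighter-weight workaround that stays within core Lucy, you could also try RegexTokenizer with a pattern that emits each Han ideograph as its own token. This is a rough sketch of crude unigram indexing, not the overlapping bigrams Lucene's CJKTokenizer produces, so expect more false-positive matches; the pattern and field name here are just illustrative assumptions:

```perl
use strict;
use warnings;
use Lucy::Analysis::RegexTokenizer;
use Lucy::Plan::FullTextType;

# Each token is either a single Han ideograph or a run of ASCII
# word characters. (Deliberately not \w+ for the second branch,
# since Perl's \w also matches Han characters and would swallow
# mixed Chinese/Latin runs into one token.)
my $cjk_tokenizer = Lucy::Analysis::RegexTokenizer->new(
    pattern => '\p{Han}|[a-zA-Z0-9]+',
);

# Hypothetical field type using the unigram tokenizer.
my $cjk_type = Lucy::Plan::FullTextType->new(
    analyzer => $cjk_tokenizer,
);
```

Because both indexing and query parsing run the same analyzer, a Chinese query string gets split into the same single-character tokens, so phrase and term matches line up.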



--
Peter Karman  .  https://peknet.com/  .  https://keybase.io/peterkarman
