Hao Wu wrote on 2/17/17 4:44 PM:
Hi all,
I use the StandardTokenizer. Searching for English words works, but
searching in Chinese gives me strange results.
my $tokenizer = Lucy::Analysis::StandardTokenizer->new;
my $raw_type = Lucy::Plan::FullTextType->new(
    analyzer => $tokenizer,
);
Also, I was going to use the EasyAnalyzer
(https://metacpan.org/pod/distribution/Lucy/lib/Lucy/Analysis/EasyAnalyzer.pod),
but Chinese is not supported.
What is the simplest way to use Lucy with Chinese documents? Thanks.
There is currently no equivalent of
https://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/cjk/CJKTokenizer.html
within core Lucy.
Furthermore, there is no automatic language detection in Lucy. You'll note in
https://metacpan.org/pod/distribution/Lucy/lib/Lucy/Analysis/EasyAnalyzer.pod
that the language must be explicitly specified, and that is for the stemming
analyzer. Also, Chinese is not among the supported languages listed.
Maybe something wrapped around https://metacpan.org/pod/Lingua::CJK::Tokenizer
would work as a custom analyzer.
You can see an example of subclassing Analyzer in the documentation here:
https://metacpan.org/pod/Lucy::Analysis::Analyzer#new
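To illustrate the kind of splitting such a custom analyzer would need to do, here is a rough, untested-against-Lucy sketch in plain Perl. It emits overlapping two-character tokens (bigrams) for runs of Han characters, which is the general idea behind Lucene's CJKTokenizer, and lowercased whole words for everything else. The cjk_bigrams name is made up for this example and is not part of Lucy or Lingua::CJK::Tokenizer; wiring it into an Analyzer subclass is left to the documentation linked above.

```perl
use strict;
use warnings;
use utf8;    # this source file contains literal Chinese characters

# Hypothetical helper (not a Lucy API): split text into overlapping
# bigrams for runs of Han characters, and into lowercased whole words
# for any other word characters.
sub cjk_bigrams {
    my ($text) = @_;
    my @tokens;

    # First alternative: a run of Han characters.
    # Second alternative: a run of word characters that are not Han.
    while ( $text =~ /(\p{Han}+)|([^\W\p{Han}]+)/g ) {
        if ( defined $1 ) {
            my $run = $1;
            if ( length($run) == 1 ) {
                # A lone Han character becomes a single token.
                push @tokens, $run;
            }
            else {
                # Sliding window of width 2 over the run.
                push @tokens, substr( $run, $_, 2 )
                    for 0 .. length($run) - 2;
            }
        }
        else {
            push @tokens, lc $2;
        }
    }
    return @tokens;
}

# Example: "Lucy 全文检索" yields the tokens lucy, 全文, 文检, 检索.
my @tokens = cjk_bigrams("Lucy 全文检索");
```

The same splitting has to be applied to the query string at search time, otherwise the query terms will not line up with the indexed tokens.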
--
Peter Karman . https://peknet.com/ . https://keybase.io/peterkarman