Hello,
I see differences in CJK languages (Chinese, Japanese, Korean). Note that
segmentation (aka tokenization) for these languages is a very complex task
because
they do not use spaces to separate words. There are some techniques to work
around this, e.g. creating bigrams. And of course these techniques may be
implemented differently in Java Lucene and CLucene, which would explain the
diverging tokens you are seeing.
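For illustration, here is a minimal sketch on the Java side (assuming a recent
Lucene with the lucene-analysis-common module on the classpath; the field name
and sample string are just placeholders) that contrasts the single-character
tokens StandardAnalyzer emits for Chinese text with the overlapping bigrams
CJKAnalyzer produces:

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class CjkTokenDemo {

    // Print every token the given analyzer produces for the text.
    static void printTokens(Analyzer analyzer, String text) throws IOException {
        try (TokenStream ts = analyzer.tokenStream("field", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.print("[" + term + "] ");
            }
            ts.end();
            System.out.println();
        }
    }

    public static void main(String[] args) throws IOException {
        String text = "我是中国人";  // sample Chinese text, no spaces between words

        // StandardAnalyzer (UAX#29 rules) emits each ideograph as its own token:
        // [我] [是] [中] [国] [人]
        printTokens(new StandardAnalyzer(), text);

        // CJKAnalyzer instead emits overlapping character bigrams:
        // [我是] [是中] [中国] [国人]
        printTokens(new CJKAnalyzer(), text);
    }
}

If CLucene's StandardAnalyzer splits CJK runs by a different rule, running the
same text through both sides like this should show exactly where the token
streams diverge.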
On Mon, Jul 10, 2023, Achyuth Pramod wrote:

> Hi Developers,
>
> I am attaching the tokens generated from Java Lucene and CLucene. I am
> getting different tokens for non-Latin text using StandardAnalyzer.
> Is there a way to make CLucene generate the same tokens as Java Lucene?
>
> Thanks & Regards,
> Achyuth Pramod