date:20230714

Re: [CLucene-dev] Inquiry about CLucene's UTF-8 support

2023-07-14 Thread Kostka Bořivoj

Hello, I see differences in CJK languages (Chinese, Japanese, Korean). Note that segmentation (aka tokenization) for these languages is a very complex task because they do not use spaces to separate words. There are some techniques to work around this, e.g. creating bigrams. And of course they
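
For illustration, the bigram technique mentioned above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the tokenizer used by CLucene or Java Lucene: the names is_cjk and cjk_bigrams are hypothetical, the Unicode ranges cover only the most common CJK blocks, and non-CJK characters are simply skipped.

```python
def is_cjk(ch: str) -> bool:
    """Rough CJK test (assumption: only a few common Unicode blocks are covered)."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7AF   # Hangul syllables
    )


def cjk_bigrams(text: str) -> list[str]:
    """Split each run of CJK characters into overlapping two-character tokens.

    A run of length one is emitted as a single-character token.
    Non-CJK characters only act as run separators and are otherwise ignored.
    """
    tokens: list[str] = []
    run: list[str] = []

    def flush() -> None:
        # Emit the tokens for the current run, then reset it.
        if len(run) == 1:
            tokens.append(run[0])
        else:
            tokens.extend(a + b for a, b in zip(run, run[1:]))
        run.clear()

    for ch in text:
        if is_cjk(ch):
            run.append(ch)
        elif run:
            flush()
    if run:
        flush()
    return tokens
```

With this sketch, a three-character run such as "東京都" yields the two overlapping tokens "東京" and "京都", so a query for either word can match without a dictionary-based segmenter.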

Re: [CLucene-dev] Inquiry about CLucene's UTF-8 support

2023-07-14 Thread Achyuth Pramod

Hi Developers, I am attaching the tokens generated by Java Lucene and CLucene. I am getting different tokens for non-Latin text when using StandardAnalyzer. Is there a way to make CLucene generate the same tokens as Java Lucene? Thanks & Regards, Achyuth Pramod On Mon, Jul 10, 2023