How to search camel case words using CJKTokenizer

tiffany Fri, 03 Jun 2011 00:41:14 -0700

Hi all,

I'm using CJKTokenizerFactory to handle text that contains both Japanese
and alphabetic (Latin) words.  However, I noticed that CJKTokenizerFactory
converts alphabetic characters to lowercase, so I cannot use
WordDelimiterFilterFactory with the splitOnCaseChange property to split
camel case words.
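
To illustrate the problem, here is a minimal Python sketch (not Solr code)
of why case-based splitting stops working once the text has been
lowercased.  The function split_on_case_change is a hypothetical stand-in
for the splitOnCaseChange behavior, not the actual
WordDelimiterFilterFactory implementation, and "PowerShot" is just a sample
word.

```python
import re


def split_on_case_change(token):
    """Hypothetical stand-in for splitOnCaseChange.

    Inserts a break wherever a lowercase letter is directly followed
    by an uppercase letter, then returns the resulting parts.
    """
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", token).split()


original = "PowerShot"

# Case information is intact, so the word splits into two parts.
print(split_on_case_change(original))

# After lowercasing there is no case change left to split on,
# so the word comes back as a single part.
print(split_on_case_change(original.lower()))
```

In this sketch the first call yields two parts and the second yields one,
which is the effect I see when the tokenizer lowercases before the filter
runs.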


I switched to NGramTokenizerFactory (2-gram), but it only parses the first
1024 characters, so I cannot use NGramTokenizerFactory either.

I tried the following two settings, and both seem to work fine, but I
don't know whether they are good choices or whether there are better
solutions.

1)
        <tokenizer class="solr.CJKTokenizerFactory" />
        <filter class="solr.NGramFilterFactory" maxGramSize="2" minGramSize="2" />

2)
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter class="solr.NGramFilterFactory" maxGramSize="1" minGramSize="1" />
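
As a rough illustration of what the n-gram filter step in both settings
does, here is a Python sketch of fixed-size n-gram generation applied to
each token.  The input tokens are made-up samples, and ngram_filter is a
simplification for illustration, not the NGramFilterFactory implementation;
it does not model what the tokenizers themselves emit.

```python
def ngrams(token, n):
    """Return every contiguous substring of length n in token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def ngram_filter(tokens, n):
    """Apply fixed-size n-gram generation to each token in turn.

    Simplified stand-in for a filter with minGramSize == maxGramSize == n.
    In this sketch, tokens shorter than n produce no output.
    """
    out = []
    for token in tokens:
        out.extend(ngrams(token, n))
    return out


# Setting 1 style: 2-grams over an already lowercased token.
print(ngram_filter(["camelcase"], 2))

# Setting 2 style: 1-grams, i.e. every character becomes its own token.
print(ngram_filter(["Camel"], 1))
```

With either setting a camel case word is matched through its character
n-grams rather than through a split on the case change.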

Any advice would be appreciated.

Thank you.

Tiffany

--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-search-camel-case-words-using-CJKTokenizer-tp3018853p3018853.html
Sent from the Solr - User mailing list archive at Nabble.com.
