I have a few questions about the CJKBigram filter.

About 10% of our queries that contain Han characters are single-character 
queries. It looks like the CJKBigram filter only outputs single characters 
when there are no adjacent bigrammable characters in the input. This means we 
would have to create a separate field to index Han unigrams in order to address 
single-character queries. Is this correct?
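
To make the concern concrete, here is a toy Python sketch of the behavior as I understand it. It is not Lucene code: the names is_han and bigram_han are made up for illustration, the Han check covers only the basic CJK Unified Ideographs block, and non-Han characters are simply dropped.

```python
def is_han(ch):
    # Assumption for this sketch: only the basic CJK Unified Ideographs block.
    return "\u4e00" <= ch <= "\u9fff"


def bigram_han(text):
    """Toy model: overlapping bigrams for each run of Han characters,
    and a unigram only when a Han character has no Han neighbor."""
    tokens = []
    run = ""
    for ch in text + " ":  # trailing space acts as a sentinel to flush the last run
        if is_han(ch):
            run += ch
            continue
        if len(run) == 1:
            tokens.append(run)  # isolated character: emitted as a unigram
        else:
            # runs of two or more characters: bigrams only, no unigrams
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
        run = ""
    return tokens


indexed = bigram_han("革命歌")  # tokens produced at index time
query = bigram_han("革")        # tokens produced for a single-character query
```

Under this model the query token 革 never appears among the indexed tokens for 革命歌, which is why a separate unigram field looks necessary.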

For Japanese, the default settings form bigrams across character types. So for 
a string containing Hiragana and Han characters, bigrams containing a mixture of 
Hiragana and Han characters are formed:
いろは革命歌   =>   “いろ” “ろは” “は革” “革命” “命歌”

Is there a way to specify that you don’t want bigrams across character types?
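
For clarity, this is the output I am after, again as a toy Python sketch rather than Lucene code. The script and bigrams functions and the cross_types parameter are hypothetical names, and the script ranges are rough approximations of the Unicode blocks.

```python
def script(ch):
    # Rough script classification by Unicode block (an assumption for this sketch).
    if "\u3040" <= ch <= "\u309f":
        return "hiragana"
    if "\u30a0" <= ch <= "\u30ff":
        return "katakana"
    if "\u4e00" <= ch <= "\u9fff":
        return "han"
    return "other"


def bigrams(text, cross_types=True):
    """Overlapping bigrams; with cross_types=False, skip any pair whose
    two characters belong to different scripts."""
    out = []
    for a, b in zip(text, text[1:]):
        if cross_types or script(a) == script(b):
            out.append(a + b)
    return out


default_output = bigrams("いろは革命歌")
wanted_output = bigrams("いろは革命歌", cross_types=False)
```

With cross_types=False the mixed bigram は革 is dropped and only いろ, ろは, 革命 and 命歌 remain.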

Tom

Tom Burton-West
Digital Library Production Service
University of Michigan Library

http://www.hathitrust.org/blogs/large-scale-search
