I have a few questions about the CJKBigram filter. About 10% of our queries that contain Han characters are single-character queries. It looks like the CJKBigram filter only outputs a single character when there are no adjacent bigrammable characters in the input. This means we would have to create a separate field that indexes Han unigrams in order to handle single-character queries. Is this correct?
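To make the behavior described above concrete, here is a toy Python model of it. This is only a sketch of what I think the filter does, not Lucene's actual implementation: the function names are made up for illustration, and it assumes that only the basic Han and Hiragana blocks are bigrammable and that everything else is simply dropped.

```python
def is_bigrammable(ch):
    # Assumption: only the basic Han block (U+4E00..U+9FFF) and the
    # Hiragana block (U+3040..U+309F) are treated as bigrammable here.
    return "\u4e00" <= ch <= "\u9fff" or "\u3040" <= ch <= "\u309f"


def toy_cjk_bigrams(text):
    """Toy model: overlapping bigrams for each run of bigrammable
    characters; a run of length one is emitted as a single character."""
    tokens = []
    run = ""
    # The trailing space is a sentinel so the last run gets flushed.
    for ch in text + " ":
        if is_bigrammable(ch):
            run += ch
            continue
        if len(run) == 1:
            tokens.append(run)  # lone character: emitted as a unigram
        else:
            # Empty runs produce nothing; longer runs produce bigrams only.
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
        run = ""
    return tokens


indexed = toy_cjk_bigrams("革命歌")  # ['革命', '命歌']
query = toy_cjk_bigrams("命")        # ['命']
print(indexed, query)
print(any(tok in indexed for tok in query))  # False
```

In this model the single-character query 命 produces the token 命, which never matches the indexed bigrams 革命 and 命歌. That mismatch is why I think a separate unigram field would be needed.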
For Japanese, the default settings form bigrams across character types. So for a string containing both Hiragana and Han characters, bigrams containing a mixture of Hiragana and Han characters are formed:

いろは革命歌 => “いろ” “ろは” “は革” “革命” “命歌”

Is there a way to specify that you don’t want bigrams across character types?

Tom

Tom Burton-West
Digital Library Production Service
University of Michigan Library
http://www.hathitrust.org/blogs/large-scale-search