Hi Shawn,

I may still be missing your point. Below is an example of how the ICUTokenizer splits the text.

Now I'm beginning to wonder whether I really understand what those flags on the CJKBigramFilter do. The ICUTokenizer spits out unigrams, and the CJKBigramFilter puts them back together into bigrams.
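To make sure we are talking about the same thing, here is a toy model of what I *expected* the flags to do. It is only a sketch in Python, not the Lucene code: the function names (script_of, toy_bigrams) and the code-point ranges are my own assumptions, it only covers the basic Unicode blocks, and it simply drops characters from scripts that are not enabled instead of passing them through.

```python
def script_of(ch):
    """Very rough script lookup by code point (basic blocks only; an assumption)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "han"
    if 0xAC00 <= cp <= 0xD7A3:
        return "hangul"
    return "other"


def toy_bigrams(text, enabled):
    """Pair adjacent characters when both belong to an enabled script.

    This models my expectation only; it is not how CJKBigramFilter is written.
    """
    out = []
    for a, b in zip(text, text[1:]):
        if script_of(a) in enabled and script_of(b) in enabled:
            out.append(a + b)
    return out


# Expectation with han=true, hiragana=true: bigrams may span the two scripts.
print(toy_bigrams("いろは革命歌", {"han", "hiragana"}))
```

Under those assumptions the call prints ['いろ', 'ろは', 'は革', '革命', '命歌'], i.e. the third bigram mixes a hiragana and a han character.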
I thought that if you set han=true and hiragana=true, you would get this kind of result, where the third bigram is composed of one hiragana and one han character:

    いろは革命歌 => "いろ" "ろは" "は革" "革命" "命歌"

Hopefully the e-mail hasn't munged the output of the Solr analysis panel below.

I can see this in our query processing, where outputUnigrams=false:

    org.apache.solr.analysis.ICUTokenizerFactory {luceneMatchVersion=LUCENE_36}
    (splits into unigrams)
    term text: い | ろ | は | 革 | 命 | 歌

    org.apache.solr.analysis.CJKBigramFilterFactory {hangul=false, outputUnigrams=false, katakana=false, han=true, hiragana=true, luceneMatchVersion=LUCENE_36}
    (makes bigrams, including the middle one, which is one hiragana character and one han character)
    term text: いろ | ろは | は革 | 革命 | 命歌

It appears that if you include outputUnigrams=true (as we both do in the indexing configuration), this doesn't happen:

    org.apache.solr.analysis.CJKBigramFilterFactory {hangul=false, outputUnigrams=true, katakana=false, han=true, hiragana=true, luceneMatchVersion=LUCENE_36}
    term text: い | ろ | は | 革 | 命 | 歌
               革命 | 命歌
    type:      <HIRAGANA> | <HIRAGANA> | <HIRAGANA> | <SINGLE> | <SINGLE> | <SINGLE>
               <DOUBLE> | <DOUBLE>

I'm not sure what happens for katakana, as the ICUTokenizer doesn't convert it to unigrams and our configuration is set to katakana=false. I'll play around on the test machine when I have time.

Tom