Thanks wunder and Lance,

In the discussions of Japanese IR that I've seen in the English-language IR literature, either Hiragana is removed or strings are first segmented by character class. I'm interested in finding out more about why bigramming across classes is desirable. Based on my limited understanding of Japanese, I can see how bigramming a Han character with a Hiragana character might make sense, but what about Han and Katakana?
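To make sure we are talking about the same two options, here is a rough Python sketch of what I mean by bigramming across classes versus segmenting by character class first. It is only an illustration: the function names are made up, the Unicode ranges are simplified to the basic Han, Hiragana and Katakana blocks, and it is not taken from any particular analyzer.

```python
from itertools import groupby


def char_class(ch: str) -> str:
    """Classify a character by Unicode block (simplified, illustrative ranges)."""
    cp = ord(ch)
    if 0x4E00 <= cp <= 0x9FFF:   # CJK Unified Ideographs (basic block only)
        return "han"
    if 0x3040 <= cp <= 0x309F:   # Hiragana
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:   # Katakana (includes the long-vowel mark)
        return "katakana"
    return "other"


def bigrams(s: str) -> list[str]:
    """Overlapping character bigrams; a single character is kept as a unigram."""
    if len(s) < 2:
        return [s] if s else []
    return [s[i:i + 2] for i in range(len(s) - 1)]


def bigrams_across_classes(text: str) -> list[str]:
    """Bigram the whole string, ignoring character-class boundaries."""
    return bigrams(text)


def bigrams_within_classes(text: str) -> list[str]:
    """Split into runs of one character class, then bigram each run separately."""
    out = []
    for _, run in groupby(text, key=char_class):
        out.extend(bigrams("".join(run)))
    return out


if __name__ == "__main__":
    sample = "東京タワーに行く"
    print(bigrams_across_classes(sample))
    print(bigrams_within_classes(sample))
```

For the sample string, the first function produces tokens such as 京タ (Han plus Katakana) and に行 (Hiragana plus Han) that the second one never emits, and those cross-class tokens are the ones I am asking about.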
Lance, how did you weight the unigram vs. bigram fields for CJK? Or did you just OR them together, assuming that IDF will give the bigrams more weight?

Tom