Thanks wunder and Lance,

In the discussions I've seen of Japanese IR in the English-language IR 
literature, Hiragana is either removed, or strings are first segmented by 
character class.  I'm interested in finding out more about why bigramming 
across classes is desirable.
Based on my limited understanding of Japanese, I can see how bigramming 
a Han and a Hiragana character might make sense, but what about Han and Katakana?
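
To make sure we mean the same thing, here is a rough Python sketch of the two 
approaches as I understand them.  The function names are my own, the Unicode 
ranges only cover the basic Hiragana, Katakana and CJK Unified Ideographs 
blocks, and none of this is meant to reflect what Solr/Lucene actually does.

```python
from itertools import groupby


def char_class(ch):
    """Rough script classification by Unicode block (an assumption, not exhaustive)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:  # includes the prolonged sound mark U+30FC
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:  # basic CJK Unified Ideographs block only
        return "han"
    return "other"


def bigrams(text):
    """Overlapping character bigrams of a string."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def bigrams_across_classes(text):
    """Bigram the whole string, ignoring character-class boundaries."""
    return bigrams(text)


def bigrams_within_classes(text):
    """Segment by character class first, then bigram each run separately.

    A run of length one is kept as a unigram so it is not lost.
    """
    tokens = []
    for _, run in groupby(text, key=char_class):
        run = "".join(run)
        tokens.extend(bigrams(run) if len(run) > 1 else [run])
    return tokens


if __name__ == "__main__":
    sample = "東京タワーの"  # Han + Katakana + Hiragana
    print(bigrams_across_classes(sample))  # ['東京', '京タ', 'タワ', 'ワー', 'ーの']
    print(bigrams_within_classes(sample))  # ['東京', 'タワ', 'ワー', 'の']
```

The cross-class version produces the extra tokens 京タ (Han + Katakana) and ーの 
(Katakana + Hiragana), which are the ones I'm asking about.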

Lance, how did you weight the unigram vs. bigram fields for CJK? Or did you just 
OR them together, assuming that idf will give the bigrams more weight?

Tom
