CJKBigram filter questions: single character queries, bigrams created across script/character types

Burton-West, Tom Fri, 27 Apr 2012 10:44:30 -0700

I have a few questions about the CJKBigram filter.

About 10% of our queries that contain Han characters are single-character
queries. It looks like the CJKBigram filter only outputs single characters
when there are no adjacent bigrammable characters in the input. This means we
would have to create a separate field to index Han unigrams in order to address
single-character queries. Is this correct?
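
To make the concern concrete, here is a minimal Python sketch of the behavior
described above. It is a toy model, not Lucene's code: the function name
cjk_bigrams is hypothetical, and it assumes the input is a single run of
bigrammable characters.

```python
def cjk_bigrams(text):
    """Toy model of the behavior described above (not Lucene's implementation).

    Assumes `text` is one run of bigrammable characters. A lone character is
    emitted as a unigram; otherwise only overlapping bigrams are emitted.
    """
    if len(text) == 1:
        return [text]
    return [text[i:i + 2] for i in range(len(text) - 1)]


indexed_terms = cjk_bigrams("革命歌")   # terms produced for a document
query_terms = cjk_bigrams("革")         # terms produced for a one-character query

print(indexed_terms)                    # ['革命', '命歌']
print(query_terms)                      # ['革']
# The query unigram is not among the indexed bigrams, so it cannot match.
print(query_terms[0] in indexed_terms)  # False
```

Under these assumptions the single-character query produces a term that never
appears in the bigram-only field, which is why a separate unigram field looks
necessary.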


For Japanese, the default settings form bigrams across character types. So for
a string containing both Hiragana and Han characters, bigrams that mix
Hiragana and Han characters are formed:
いろは革命歌   =>    “いろ” “ろは” “は革” “革命” “命歌”

Is there a way to specify that you don’t want bigrams across character types?
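
For illustration, the output being asked for could be sketched in Python as
follows. This is only a model of the desired result, not a Lucene or Solr
option: script_of and bigrams_within_script are hypothetical helpers, scripts
are guessed from Unicode character names, and the input is assumed to contain
only CJK characters.

```python
import unicodedata
from itertools import groupby


def script_of(ch):
    """Rough script label derived from the Unicode character name (an assumption)."""
    name = unicodedata.name(ch, "")
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "HAN"
    if name.startswith("HIRAGANA"):
        return "HIRAGANA"
    if name.startswith("KATAKANA"):
        return "KATAKANA"
    return "OTHER"


def bigrams_within_script(text):
    """Form overlapping bigrams inside each same-script run, never across runs."""
    tokens = []
    for _, group in groupby(text, key=script_of):
        run = "".join(group)
        if len(run) == 1:
            # A lone character has no same-script neighbor: keep it as a unigram.
            tokens.append(run)
        else:
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens


print(bigrams_within_script("いろは革命歌"))
# ['いろ', 'ろは', '革命', '命歌']
```

Compared with the default output shown above, the mixed bigram “は革” is no
longer produced.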

Tom

Tom Burton-West
Digital Library Production Service
University of Michigan Library

http://www.hathitrust.org/blogs/large-scale-search
