changes SynonymFilterFactory for N-gram tokenizer
-------------------------------------------------
Key: SOLR-319
URL: https://issues.apache.org/jira/browse/SOLR-319
Project: Solr
Issue Type: Improvement
Reporter: Koji Sekiguchi
Priority: Minor
WHAT:
Currently, SynonymFilterFactory works very well with N-gram tokenizers
(CJKTokenizer, for example).
But the rules in synonyms.txt have to be written with care: each rule must
already be split into the tokens that the tokenizer produces.
For example, if I use CJKTokenizer (which works as a bi-gram tokenizer for
CJK chars) and want C1C2C3 to map to C4C5C6,
I have to write the rule as follows:
C1C2 C2C3 => C4C5 C5C6
But I want to write it as "C1C2C3=>C4C5C6". This patch allows that. It is
also helpful for sharing synonyms.txt.
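
To make the difference between the two rule forms concrete, here is a minimal Python sketch of the rewriting that otherwise has to be done by hand. It is not Solr code: `bigrams` and `expand_rule` are hypothetical helper names, and plain ASCII letters stand in for the CJK characters C1..C6.

```python
def bigrams(text):
    """Return overlapping 2-character tokens, e.g. 'abc' -> ['ab', 'bc']."""
    if len(text) < 2:
        # Assumption: a single character is kept as one token.
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


def expand_rule(rule):
    """Rewrite a compact rule 'abc=>def' into its bi-gram form."""
    left, right = (side.strip() for side in rule.split("=>", 1))
    return " ".join(bigrams(left)) + " => " + " ".join(bigrams(right))


print(expand_rule("abc=>def"))  # prints: ab bc => de ef
```

With the patch, this expansion would no longer need to be written into synonyms.txt by hand.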
HOW:
A tokenFactory attribute is added to <filter class="solr.SynonymFilterFactory"/>.
If the attribute is specified, SynonymFilterFactory uses that TokenizerFactory
to create a Tokenizer.
SynonymFilterFactory then uses the Tokenizer to split the rules in the
synonyms.txt file into tokens.
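
The idea of a pluggable tokenizer for rule parsing can be sketched in Python as follows. This only illustrates the approach described above and is not the patch itself: `parse_rule`, `whitespace_tokenize`, and `bigram_tokenize` are hypothetical names, only single "=>" rules are handled, and the whitespace default is an assumption standing in for the behaviour when tokenFactory is omitted.

```python
def whitespace_tokenize(text):
    """Assumed default: split a rule side on whitespace."""
    return text.split()


def bigram_tokenize(text):
    """Stand-in for an N-gram tokenizer: overlapping 2-character tokens."""
    tokens = []
    for word in text.split():
        if len(word) < 2:
            tokens.append(word)
        else:
            tokens.extend(word[i:i + 2] for i in range(len(word) - 1))
    return tokens


def parse_rule(rule, tokenize=whitespace_tokenize):
    """Split 'lhs=>rhs' and tokenize each side with the given tokenizer."""
    left, right = rule.split("=>", 1)
    return tokenize(left), tokenize(right)


# Without a configured tokenizer the rule sides stay whole.
print(parse_rule("abc=>def"))
# With the bi-gram tokenizer the compact rule yields bi-gram tokens.
print(parse_rule("abc=>def", tokenize=bigram_tokenize))
```

Passing the tokenizer as a parameter mirrors the role of the tokenFactory attribute: the rule file stays the same and only the configuration decides how its entries are split.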
sample-1: CJKTokenizer
<fieldtype name="text_cjk" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.CJKTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="ngram_synonym_test_ja.txt"
            ignoreCase="true" expand="true"
            tokenFactory="solr.CJKTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.CJKTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>
sample-2: NGramTokenizer
<fieldtype name="text_ngram" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2"
               maxGramSize="2"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2"
               maxGramSize="2"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="ngram_synonym_test_ngram.txt"
            ignoreCase="true" expand="true"
            tokenFactory="solr.NGramTokenizerFactory"
            minGramSize="2" maxGramSize="2"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>
backward compatibility:
Yes. If you omit the tokenFactory attribute from the
<filter class="solr.SynonymFilterFactory"/> tag, it works as usual.