[ 
https://issues.apache.org/jira/browse/SOLR-319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12529317
 ] 

Koji Sekiguchi commented on SOLR-319:
-------------------------------------

Hoss, Yonik, thank you very much for your opinion.

As for now, SynonymFilterFactory implicitly uses WhitespaceTokenizer when 
analyzing synonyms.txt file.
This works well for English and European languages, those use spaces to 
separate words.
But from standpoint of CJK users, we would like to replace the implicit 
tokenizer by an arbitrary tokenizer.
I thought that fieldtype could be specified to analyze synonyms.txt was a cool 
idea,
but it was difficult because IndexSchema hasn't been initialized at that time.

For CJK users, replacing tokenizer is enough for their purpose and fieldtype is 
overmuch...


> changes SynonymFilterFactoryto "Analyze" synonyms file
> ------------------------------------------------------
>
>                 Key: SOLR-319
>                 URL: https://issues.apache.org/jira/browse/SOLR-319
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Koji Sekiguchi
>            Priority: Minor
>         Attachments: SOLR-319.patch
>
>
> WHAT:
> Currently, SynonymFilterFactory works very well with N-gram tokenizer 
> (CJKTokenizer, for example).
> But we have to take care of the statement in synonyms.txt.
> For example, if I use CJKTokenizer (work as bi-gram for CJK chars) and want 
> C1C2C3 maps to C4C5C6,
> I have to write the rule as follows:
> C1C2 C2C3 => C4C5 C5C6
> But I want to write it "C1C2C3=>C4C5C6". This patch allows it. It is also 
> helpful for sharing synonyms.txt.
> HOW:
> tokenFactory attribute is added to <filter 
> class="solr.SynonymFilterFactory"/>.
> If the attribute is specified, SynonymFilterFactory uses the TokenizerFactory 
> to create Tokenizer.
> Then SynonymFilterFactory uses the Tokenizer to get tokens from the rules in 
> synonyms.txt file.
> sample-1: CJKTokenizer
>     <fieldtype name="text_cjk" class="solr.TextField" 
> positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.CJKTokenizerFactory"/>
>         <filter class="solr.SynonymFilterFactory" 
> synonyms="ngram_synonym_test_ja.txt"
>                       ignoreCase="true" expand="true" 
> tokenFactory="solr.CJKTokenizerFactory"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.CJKTokenizerFactory"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
>     </fieldtype>
> sample-2: NGramTokenizer
>     <fieldtype name="text_ngram" class="solr.TextField" 
> positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" 
> maxGramSize="2"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" 
> maxGramSize="2"/>
>         <filter class="solr.SynonymFilterFactory" 
> synonyms="ngram_synonym_test_ngram.txt"
>                       ignoreCase="true" expand="true"
>                       tokenFactory="solr.NGramTokenizerFactory" 
> minGramSize="2" maxGramSize="2"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
>     </fieldtype>
> backward compatibility:
> Yes. If you omit tokenFactory attribute from <filter 
> class="solr.SynonymFilterFactory"/> tag, it works as usual.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to