[jira] Commented: (SOLR-319) changes SynonymFilterFactory for N-gram tokenizer

Koji Sekiguchi (JIRA) Sun, 12 Aug 2007 19:02:07 -0700

    [ https://issues.apache.org/jira/browse/SOLR-319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12519360 ]


Koji Sekiguchi commented on SOLR-319:
-------------------------------------

In addition, this is useful for non-N-gram tokenizers for CJK users. For 
example, we use SenTokenizer, which is a popular morphological analyzer in 
Japan. It uses a Japanese dictionary to determine morpheme boundaries.

If I have the following definition in schema.xml:

<tokenizer class="solr.SenTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
        ignoreCase="true" expand="true"/>

and I want the mapping rule "C1C2C3=>C4C5". I am sure "C1C2C3" is a word, so I 
write the rule in synonyms.txt as follows:

C1C2C3=>C4C5

However, if "C1C2C3" is not in SenTokenizer's dictionary but "C1C2" and "C3" 
are, SenTokenizer will output "C1C2" and "C3". In this case, the above rule 
doesn't work, because its left-hand side is the single token "C1C2C3", which 
never appears in the token stream.
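
To make the failure concrete, here is a minimal sketch in Python. It is a toy 
only: the greedy longest-match segmenter is a stand-in I made up for a 
dictionary-based analyzer and is not how SenTokenizer works, and the letters 
A to E stand in for the characters C1 to C5.

```python
# Toy illustration only: a greedy longest-match segmenter standing in for a
# dictionary-based morphological analyzer. This is NOT SenTokenizer's algorithm.

def segment(text, dictionary):
    """Split text into the longest dictionary entries, left to right."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens


def apply_single_token_rules(tokens, rules):
    """Replace a token only when the whole token is a rule key."""
    return [rules.get(tok, tok) for tok in tokens]


# "A".."E" stand in for the characters C1..C5 from the comment.
rules = {"ABC": "DE"}            # the rule C1C2C3=>C4C5

with_word = {"ABC"}              # dictionary that knows C1C2C3
without_word = {"AB", "C"}       # dictionary that only knows C1C2 and C3

matched = apply_single_token_rules(segment("ABC", with_word), rules)
missed = apply_single_token_rules(segment("ABC", without_word), rules)
print(matched)  # ['DE']
print(missed)   # ['AB', 'C']
```

With the second dictionary the token "ABC" is never produced, so the rule is 
never applied.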

The patch solves this problem. In addition, it makes it easier to share one 
synonyms.txt file between N-gram and morphological tokenizers.
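
The idea can be sketched as follows. This is Python written for illustration, 
not the patch itself (which is Java inside Solr); the function names 
parse_rule, rewrite, bigrams and morph are hypothetical, and morph is a 
hard-coded stand-in for a morphological tokenizer that splits C1C2C3 into 
C1C2 and C3. The point is that both sides of a rule are run through the same 
tokenizer as the field, so one rule line works for either tokenizer.

```python
# Sketch of the idea only; names and behavior here are assumptions made for
# illustration and do not mirror Solr's internal classes.

def bigrams(text):
    """Overlapping character bi-grams, similar in spirit to a bi-gram tokenizer."""
    if len(text) < 2:
        return [text]
    return [text[i:i + 2] for i in range(len(text) - 1)]


def morph(text):
    """Stand-in morphological tokenizer: it only knows how to split 'ABC'."""
    return ["AB", "C"] if text == "ABC" else [text]


def parse_rule(line, tokenize):
    """Tokenize both sides of 'lhs=>rhs' with the given tokenizer."""
    lhs, rhs = (side.strip() for side in line.split("=>"))
    return tuple(tokenize(lhs)), tuple(tokenize(rhs))


def rewrite(tokens, rule):
    """Replace every occurrence of the rule's left-hand token sequence."""
    lhs, rhs = rule
    out, i = [], 0
    while i < len(tokens):
        if tuple(tokens[i:i + len(lhs)]) == lhs:
            out.extend(rhs)
            i += len(lhs)
        else:
            out.append(tokens[i])
            i += 1
    return out


# "A".."E" stand in for C1..C5; the same rule line is used for both tokenizers.
line = "ABC=>DE"

ngram_rule = parse_rule(line, bigrams)
morph_rule = parse_rule(line, morph)

print(ngram_rule)                           # (('AB', 'BC'), ('DE',))
print(morph_rule)                           # (('AB', 'C'), ('DE',))
print(rewrite(bigrams("ABC"), ngram_rule))  # ['DE']
print(rewrite(morph("ABC"), morph_rule))    # ['DE']
```

A real token filter also has to deal with positions and offsets, which this 
sketch ignores.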

> changes SynonymFilterFactory for N-gram tokenizer
> -------------------------------------------------
>
>                 Key: SOLR-319
>                 URL: https://issues.apache.org/jira/browse/SOLR-319
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Koji Sekiguchi
>            Priority: Minor
>         Attachments: SOLR-319-UTF-8.patch
>
>
> WHAT:
> Currently, SynonymFilterFactory works very well with N-gram tokenizers 
> (CJKTokenizer, for example).
> But we have to take care with the rules in synonyms.txt.
> For example, if I use CJKTokenizer (which works as a bi-gram tokenizer for 
> CJK chars) and want C1C2C3 to map to C4C5C6,
> I have to write the rule as follows:
> C1C2 C2C3 => C4C5 C5C6
> But I want to write it as "C1C2C3=>C4C5C6". This patch allows that. It is 
> also helpful for sharing synonyms.txt.
> HOW:
> A tokenFactory attribute is added to <filter 
> class="solr.SynonymFilterFactory"/>.
> If the attribute is specified, SynonymFilterFactory uses that 
> TokenizerFactory to create a Tokenizer.
> SynonymFilterFactory then uses the Tokenizer to get tokens from the rules in 
> the synonyms.txt file.
> sample-1: CJKTokenizer
>     <fieldtype name="text_cjk" class="solr.TextField" positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.CJKTokenizerFactory"/>
>         <filter class="solr.SynonymFilterFactory" synonyms="ngram_synonym_test_ja.txt"
>                 ignoreCase="true" expand="true" tokenFactory="solr.CJKTokenizerFactory"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.CJKTokenizerFactory"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
>     </fieldtype>
> sample-2: NGramTokenizer
>     <fieldtype name="text_ngram" class="solr.TextField" positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
>         <filter class="solr.SynonymFilterFactory" synonyms="ngram_synonym_test_ngram.txt"
>                 ignoreCase="true" expand="true"
>                 tokenFactory="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
>     </fieldtype>
> backward compatibility:
> Yes. If you omit the tokenFactory attribute from the <filter 
> class="solr.SynonymFilterFactory"/> tag, it works as before.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
