[ 
https://issues.apache.org/jira/browse/SOLR-822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12642181#action_12642181
 ] 

Todd Feak commented on SOLR-822:
--------------------------------

Seems like a very flexible way to solve the issue, as well as SOLR-814 and 
SOLR-815. It should also work for existing filters like LowerCase. It also has 
the potential to be faster than the filters, as it doesn't have to perform the 
same replacement multiple times when a particular character is replicated into 
multiple tokens, as in NGramTokenizer or CJKTokenizer.

I didn't look in depth at the patch (it's a good-sized patch to look through), 
but I wanted to verify at least two things. First, I assume that this only 
affects indexing and searching, not the actual stored document field contents? 
Second, is it easy to create a MappingCharFilter subclass with a hardcoded map 
built in? I don't think users should all have to recreate the same mapping 
files if we can just embed them.
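
To be clear about what I mean by a hardcoded map, here is a rough plain-Java 
sketch. The class and method names are mine, not from the patch, and a real 
subclass would extend the patch's MappingCharFilter (which, if I read it 
right, can also map multi-character sequences); this only illustrates the 
"mapping baked into the class instead of a mapping file" idea:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a Reader that normalizes characters one-to-one before
// they would reach a tokenizer, with the mapping hardcoded in the class
// rather than loaded from a mapping="..." file.
public class HardcodedMappingReader extends Reader {

    private static final Map<Character, Character> MAP =
        new HashMap<Character, Character>();
    static {
        // Example mapping: full-width Latin letters to their ASCII forms.
        MAP.put('Ａ', 'A');
        MAP.put('Ｂ', 'B');
        MAP.put('ａ', 'a');
        MAP.put('ｂ', 'b');
    }

    private final Reader in;

    public HardcodedMappingReader(Reader in) {
        this.in = in;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int n = in.read(cbuf, off, len);
        // Replace each mapped character in place; unmapped chars pass through.
        for (int i = off; i < off + n; i++) {
            Character mapped = MAP.get(cbuf[i]);
            if (mapped != null) {
                cbuf[i] = mapped;
            }
        }
        return n;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }

    // Convenience helper for trying the mapping on a String.
    public static String normalize(String s) {
        try {
            Reader r = new HardcodedMappingReader(new StringReader(s));
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

A subclass like that could then be dropped in as its own factory in the 
schema, so every user of, say, Japanese normalization gets the same table 
without shipping a mapping file around.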

However, what about Lucene? Is this something that should exist in Lucene 
first, then be expanded to Solr? Are Lucene users in need of a similar 
functionality?

> CharFilter - normalize characters before tokenizer
> --------------------------------------------------
>
>                 Key: SOLR-822
>                 URL: https://issues.apache.org/jira/browse/SOLR-822
>             Project: Solr
>          Issue Type: New Feature
>          Components: Analysis
>            Reporter: Koji Sekiguchi
>            Priority: Minor
>         Attachments: character-normalization.JPG, sample_mapping_ja.txt, 
> SOLR-822.patch, SOLR-822.patch
>
>
> A new plugin which can be placed in front of <tokenizer/>.
> {code:xml}
> <fieldType name="textCharNorm" class="solr.TextField" 
> positionIncrementGap="100" >
>   <analyzer>
>     <charFilter class="solr.MappingCharFilterFactory" 
> mapping="mapping_ja.txt" />
>     <tokenizer class="solr.MappingCJKTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" 
> words="stopwords.txt"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
> {code}
> <charFilter/> can be multiple (chained). I'll post a JPEG file to show 
> character normalization sample soon.
> MOTIVATION:
> In Japan, there are two types of tokenizers -- N-gram (CJKTokenizer) and 
> morphological analyzers.
> When we use a morphological analyzer, we need to normalize characters 
> before tokenization, because the analyzer uses a Japanese dictionary to 
> detect terms.
> I'll post a patch soon, too.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
