CharFilter - normalize characters before tokenizer
--------------------------------------------------
Key: SOLR-822
URL: https://issues.apache.org/jira/browse/SOLR-822
Project: Solr
Issue Type: New Feature
Components: Analysis
Reporter: Koji Sekiguchi
Priority: Minor
A new plugin type, <charFilter/>, which can be placed in front of <tokenizer/> to normalize characters before tokenization.
{code:xml}
<fieldType name="textCharNorm" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping_ja.txt"/>
    <tokenizer class="solr.MappingCJKTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
{code}
Multiple <charFilter/> elements can be specified (chained). I'll post a JPEG file
showing a character normalization sample soon.
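As a rough sketch of chaining, the analyzer below lists two mapping char filters ahead of the tokenizer. The field type name textCharNormChained and the second mapping file mapping_extra.txt are hypothetical names used only for illustration, and the sketch assumes the filters are applied in the order they are listed.
{code:xml}
<!-- Hypothetical example: names and the second mapping file are assumptions -->
<fieldType name="textCharNormChained" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- First normalization pass -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping_ja.txt"/>
    <!-- Second pass, assumed to run on the output of the first -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping_extra.txt"/>
    <tokenizer class="solr.MappingCJKTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
{code}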
MOTIVATION:
In Japan, two types of tokenizers are commonly used: N-gram (CJKTokenizer) and
morphological analyzers.
A morphological analyzer relies on a Japanese dictionary to detect terms, so
characters need to be normalized before tokenization. Otherwise, variant
character forms may fail to match dictionary entries.
I'll post a patch soon, too.