[ https://issues.apache.org/jira/browse/SOLR-822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12643608#action_12643608 ]
Hoss Man commented on SOLR-822:
-------------------------------

bq. Does that make sense to you?

Yes, definitely ... but still a few questions:

1) If I understand correctly: another use case beyond character normalization could be refactoring the existing HTMLStrip___Tokenizers so that people would instead use an HTMLStripCharFilter and then whatever tokenizer they like, correct?

2) Based on your explanation, shouldn't CharFilterFactory be renamed CharStreamFactory? There's no requirement that implementations produce a CharFilter, as long as they produce a CharStream, correct?

3) Should CharStream extend FilterReader?

---

One thing that worries me is the interaction between CharStreams, with their corrected positions, and Tokenizers that may not know about CharStream at all. Obviously that could just be an unsupported case (i.e., if you want to use some CharStreamFactories, you had better use a TokenizerFactory that can handle them), but I still suspect some people could easily be bitten by this.

I wonder if we could protect people from this. Perhaps a new CharStreamTokenizerFactory interface that must be implemented by any TokenizerFactory that knows about CharStreams (with a single "public TokenStream create(CharStream input)"). If a fieldType uses any CharStreamFactory, it's an initialization error unless the TokenizerFactory is also a CharStreamTokenizerFactory.

Something else to consider: it seems like a lot of future headache could be avoided if the CharStream API were committed in lucene-java, so that the Tokenizer API and all of the existing OOTB Tokenizers could know about it.
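To make the proposed initialization check concrete, here is a rough sketch. It is not taken from any patch attached to this issue. Every type below is a hypothetical stand-in declared locally (CharStream, TokenStream, CharStreamFactory, TokenizerFactory), not the real Solr or Lucene class. Only the name CharStreamTokenizerFactory and its single create(CharStream) method come from the suggestion above; FieldTypeCheck, the two demo factories, and the error message are assumptions for illustration.

```java
import java.io.Reader;
import java.util.Collections;
import java.util.List;

// Small driver showing one accepted and one rejected configuration.
final class CharStreamCheckSketch {
    public static void main(String[] args) {
        CharStreamFactory identity = input -> input; // hypothetical no-op factory
        List<CharStreamFactory> chain = Collections.singletonList(identity);

        // Accepted: the tokenizer factory declares that it understands CharStream.
        FieldTypeCheck.validate("textCharNorm", chain, new AwareTokenizerFactory());

        // Rejected: char stream factories are configured, but the tokenizer factory is unaware.
        try {
            FieldTypeCheck.validate("textCharNorm", chain, new PlainTokenizerFactory());
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}

// Hypothetical stand-ins below; these are NOT the real Solr/Lucene types.
interface CharStream {
    // Maps an offset in the filtered text back to an offset in the original text.
    int correctOffset(int currentOffset);
}

interface TokenStream { }

interface CharStreamFactory {
    CharStream create(CharStream input);
}

interface TokenizerFactory {
    TokenStream create(Reader input);
}

// The interface proposed in the comment: one method taking a CharStream.
interface CharStreamTokenizerFactory {
    TokenStream create(CharStream input);
}

// A tokenizer factory that only knows about plain Readers.
final class PlainTokenizerFactory implements TokenizerFactory {
    public TokenStream create(Reader input) { return new TokenStream() { }; }
}

// A tokenizer factory that also accepts a CharStream.
final class AwareTokenizerFactory implements TokenizerFactory, CharStreamTokenizerFactory {
    public TokenStream create(Reader input) { return new TokenStream() { }; }
    public TokenStream create(CharStream input) { return new TokenStream() { }; }
}

// Hypothetical init-time check: fail fast when a fieldType mixes
// CharStreamFactories with a tokenizer factory that cannot handle them.
final class FieldTypeCheck {
    private FieldTypeCheck() { }

    static void validate(String fieldTypeName,
                         List<CharStreamFactory> charStreamFactories,
                         TokenizerFactory tokenizerFactory) {
        boolean usesCharStreams = !charStreamFactories.isEmpty();
        if (usesCharStreams && !(tokenizerFactory instanceof CharStreamTokenizerFactory)) {
            throw new IllegalStateException("fieldType '" + fieldTypeName
                + "' configures a CharStreamFactory, but "
                + tokenizerFactory.getClass().getSimpleName()
                + " is not a CharStreamTokenizerFactory");
        }
    }
}
```

A fieldType with no CharStreamFactory passes the check with any tokenizer factory, so existing configurations would be unaffected under this sketch.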
> CharFilter - normalize characters before tokenizer
> --------------------------------------------------
>
> Key: SOLR-822
> URL: https://issues.apache.org/jira/browse/SOLR-822
> Project: Solr
> Issue Type: New Feature
> Components: Analysis
> Reporter: Koji Sekiguchi
> Priority: Minor
> Attachments: character-normalization.JPG, sample_mapping_ja.txt, SOLR-822.patch, SOLR-822.patch, SOLR-822.patch
>
>
> A new plugin which can be placed in front of <tokenizer/>.
> {code:xml}
> <fieldType name="textCharNorm" class="solr.TextField" positionIncrementGap="100" >
>   <analyzer>
>     <charFilter class="solr.MappingCharFilterFactory" mapping="mapping_ja.txt" />
>     <tokenizer class="solr.MappingCJKTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
> {code}
> <charFilter/> can be multiple (chained). I'll post a JPEG file to show character normalization sample soon.
> MOTIVATION:
> In Japan, there are two types of tokenizers -- N-gram (CJKTokenizer) and Morphological Analyzer.
> When we use morphological analyzer, because the analyzer uses Japanese dictionary to detect terms,
> we need to normalize characters before tokenization.
> I'll post a patch soon, too.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.