[ https://issues.apache.org/jira/browse/SOLR-822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12642835#action_12642835 ]
Koji Sekiguchi commented on SOLR-822:
-------------------------------------

Hoss,

Sorry about the unrelated whitespace changes in the patch. I'll remove them in the next patch.

About the CharStream, CharReader and CharFilter classes: I created CharFilterFactory

{code:java}
import java.util.Map;

public interface CharFilterFactory {
  // Receives the attributes of the <charFilter/> element.
  public void init(Map<String,String> args);
  public Map<String,String> getArgs();
  // Wraps the given stream in a CharFilter.
  public CharStream create(CharStream input);
}
{code}

instead of the ReaderWrapperFactory mentioned by Hoss. CharFilterFactory is a factory for CharFilter, which reads a CharStream and outputs a CharStream. CharStream is a Reader with an additional correctPosition method:

{code:java}
import java.io.Reader;

public abstract class CharStream extends Reader {
  // Maps a position in this stream's output back to a position in the original input.
  public abstract int correctPosition( int currentPos );
}
{code}

The method is called by CharFilters and by the Tokenizer (in this case, the Tokenizer must be CharStream "aware") to correct the start/end offsets of tokens, because a CharFilter may convert 1 char to 2 chars or the other way around. The following is a sample implementation of the method:

{code:java|title=MappingCharFilter.java}
  // Correction entries, ordered by ascending position.
  private List<PosCorrectMap> pcmList;

  public int correctPosition( int currentPos ){
    currentPos = input.correctPosition( currentPos );
    if( pcmList.isEmpty() ) return currentPos;
    // Scan backwards for the last entry at or before currentPos.
    for( int i = pcmList.size() - 1; i >= 0; i-- ){
      if( currentPos >= pcmList.get( i ).pos )
        return currentPos + pcmList.get( i ).cumulativeDiff;
    }
    return currentPos;
  }

  static class PosCorrectMap {
    int pos;
    int cumulativeDiff;

    public PosCorrectMap( int pos, int cumulativeDiff ){
      this.pos = pos;
      this.cumulativeDiff = cumulativeDiff;
    }
  }
{code}

There is another CharStream class, CharReader. It is a Reader wrapper: it takes a plain Reader and outputs a CharStream. CharReader is a concrete class and is instantiated in TokenizerChain.

Does that make sense to you?

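To make the offset arithmetic in correctPosition concrete, here is a minimal, self-contained sketch. It is not code from the SOLR-822 patch: the class name OffsetCorrectionSketch, its constructor, and the single from/to replacement rule are assumptions made for illustration, and it leaves out the delegation to a wrapped input stream. It only shows how PosCorrectMap entries could be recorded while characters are replaced, and how the backwards scan then maps a position in the filtered text back to the original text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of the SOLR-822 patch.
class OffsetCorrectionSketch {

    // Same shape as PosCorrectMap in the comment: from output position `pos`
    // onward, add `cumulativeDiff` to get the position in the original text.
    static class PosCorrectMap {
        final int pos;
        final int cumulativeDiff;

        PosCorrectMap(int pos, int cumulativeDiff) {
            this.pos = pos;
            this.cumulativeDiff = cumulativeDiff;
        }
    }

    private final List<PosCorrectMap> pcmList = new ArrayList<>();
    private final String output;

    // Replaces every occurrence of `from` with `to` and records a correction
    // entry after each replacement that changes the text length.
    OffsetCorrectionSketch(String input, String from, String to) {
        if (from.isEmpty()) {
            throw new IllegalArgumentException("from must not be empty");
        }
        StringBuilder out = new StringBuilder();
        int cumulativeDiff = 0;
        int i = 0;
        while (i < input.length()) {
            if (input.startsWith(from, i)) {
                out.append(to);
                i += from.length();
                int diff = from.length() - to.length();
                if (diff != 0) {
                    cumulativeDiff += diff;
                    pcmList.add(new PosCorrectMap(out.length(), cumulativeDiff));
                }
            } else {
                out.append(input.charAt(i));
                i++;
            }
        }
        this.output = out.toString();
    }

    String output() {
        return output;
    }

    // Same backwards scan as the sample in the comment, without the call to
    // input.correctPosition (this sketch has no wrapped stream).
    int correctPosition(int currentPos) {
        for (int i = pcmList.size() - 1; i >= 0; i--) {
            PosCorrectMap pcm = pcmList.get(i);
            if (currentPos >= pcm.pos) {
                return currentPos + pcm.cumulativeDiff;
            }
        }
        return currentPos;
    }

    public static void main(String[] args) {
        // "aa" is collapsed to "a" twice, so "z" moves from index 6 to index 4.
        OffsetCorrectionSketch s = new OffsetCorrectionSketch("xaayaaz", "aa", "a");
        System.out.println(s.output());
        // Start/end offsets of "z" in the filtered text, mapped back to the original.
        System.out.println(s.correctPosition(4) + "-" + s.correctPosition(5));
    }
}
```

In this sketch a token that a tokenizer sees at offsets 4-5 of the filtered text is reported at offsets 6-7 of the original text, which is the kind of correction a highlighter would need. A replacement that lengthens the text records a negative cumulativeDiff in the same way.
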
> CharFilter - normalize characters before tokenizer
> --------------------------------------------------
>
>                 Key: SOLR-822
>                 URL: https://issues.apache.org/jira/browse/SOLR-822
>             Project: Solr
>          Issue Type: New Feature
>          Components: Analysis
>            Reporter: Koji Sekiguchi
>            Priority: Minor
>         Attachments: character-normalization.JPG, sample_mapping_ja.txt, SOLR-822.patch, SOLR-822.patch
>
>
> A new plugin which can be placed in front of <tokenizer/>.
> {code:xml}
> <fieldType name="textCharNorm" class="solr.TextField" positionIncrementGap="100" >
>   <analyzer>
>     <charFilter class="solr.MappingCharFilterFactory" mapping="mapping_ja.txt" />
>     <tokenizer class="solr.MappingCJKTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
> {code}
> <charFilter/> can be multiple (chained). I'll post a JPEG file to show character normalization sample soon.
> MOTIVATION:
> In Japan, there are two types of tokenizers -- N-gram (CJKTokenizer) and Morphological Analyzer.
> When we use morphological analyzer, because the analyzer uses Japanese dictionary to detect terms,
> we need to normalize characters before tokenization.
> I'll post a patch soon, too.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.