[jira] Commented: (SOLR-822) CharFilter - normalize characters before tokenizer

Koji Sekiguchi (JIRA) Tue, 04 Nov 2008 05:19:08 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644955#action_12644955
 ]


Koji Sekiguchi commented on SOLR-822:
-------------------------------------

Hoss, sorry for the late reply.

{quote}
1) if i understand correctly: another use case beyond character normalization 
could be refactoring the existing HTMLStrip___Tokenizers so that instead people 
would use an HTMLStripCharFilter and then whatever tokenizer they like, correct?
{quote}

Correct.

{quote}
3) should CharStream extend FilterReader?
{quote}

I think we need all three of these classes to build the CharFilter framework: 
CharStream, CharReader and CharFilter. CharReader and CharFilter are 
subclasses of CharStream. CharStream has an abstract method, correctOffset():

{code:java}
public abstract class CharStream extends Reader {
  /**
   * called by CharFilter(s) and Tokenizer to correct token offset.
   *
   * @param currentOff current offset
   * @return corrected token offset
   */
  public abstract int correctOffset( int currentOff );
}
{code}

CharStream extends Reader instead of FilterReader because FilterReader holds a 
Reader member, which CharStream itself doesn't need. Instead, CharReader holds 
a Reader and CharFilter holds a CharStream. The role of CharReader is to wrap a 
Reader and turn it into a CharStream.

{code:java}
public final class CharReader extends CharStream {
  protected Reader input;
  public CharReader( Reader in ){
    input = in;
  }
  @Override
  public int correctOffset(int currentOff) {
    return currentOff;
  }
  :
}
{code}

CharReader is then placed at the beginning of the char-filter chain. Once we 
have a CharStream, CharFilters can be stacked on top of it to form a filter 
chain. I made correctOffset() final in CharFilter.

{code:java}
public abstract class CharFilter extends CharStream {
  protected CharStream input;
  protected CharFilter( CharStream in ){
    input = in;
  }
  protected int correctPosition( int pos ){
    return pos;
  }
  @Override
  public final int correctOffset(int currentOff) {
    return input.correctOffset( correctPosition( currentOff ) );
  }
  :
}
{code}

A subclass of CharFilter can override the correctPosition() method to correct 
the current position.
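
As a rough illustration of how the offsets compose through a chain, here is a 
minimal, self-contained sketch. It is not code from the patch: the three 
framework classes are simplified re-declarations of the ones above, the 
read()/close() delegation is an assumption (those parts are elided above), and 
SkipPrefixCharFilter and CharFilterChainDemo are hypothetical names made up for 
this example. The filter drops a fixed number of leading characters, so its 
correctPosition() adds that number back.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical demo driver: chains two filters over a CharReader.
class CharFilterChainDemo {
  static String readAll(Reader r) throws IOException {
    StringBuilder sb = new StringBuilder();
    int c;
    while ((c = r.read()) != -1) sb.append((char) c);
    return sb.toString();
  }

  public static void main(String[] args) throws IOException {
    CharStream chain = new SkipPrefixCharFilter(
        new SkipPrefixCharFilter(new CharReader(new StringReader("<<[[hello")), 2), 2);
    System.out.println(readAll(chain));        // filtered text
    System.out.println(chain.correctOffset(0)); // offset of 'h' in the original input
  }
}

// Simplified re-declaration of the classes described above.
abstract class CharStream extends Reader {
  public abstract int correctOffset(int currentOff);
}

final class CharReader extends CharStream {
  private final Reader input;
  CharReader(Reader in) { input = in; }
  @Override public int correctOffset(int currentOff) { return currentOff; }
  // Assumption: plain delegation to the wrapped Reader.
  @Override public int read(char[] cbuf, int off, int len) throws IOException {
    return input.read(cbuf, off, len);
  }
  @Override public void close() throws IOException { input.close(); }
}

abstract class CharFilter extends CharStream {
  protected final CharStream input;
  protected CharFilter(CharStream in) { input = in; }
  protected int correctPosition(int pos) { return pos; }
  // Each filter corrects its own position, then hands off to the stream below it.
  @Override public final int correctOffset(int currentOff) {
    return input.correctOffset(correctPosition(currentOff));
  }
  @Override public int read(char[] cbuf, int off, int len) throws IOException {
    return input.read(cbuf, off, len);
  }
  @Override public void close() throws IOException { input.close(); }
}

// Hypothetical filter: discards the first `skip` characters of its input.
final class SkipPrefixCharFilter extends CharFilter {
  private final int skip;
  private boolean skipped = false;
  SkipPrefixCharFilter(CharStream in, int skip) { super(in); this.skip = skip; }
  // Positions in the filtered text sit `skip` characters later in the input.
  @Override protected int correctPosition(int pos) { return pos + skip; }
  @Override public int read(char[] cbuf, int off, int len) throws IOException {
    if (!skipped) {
      for (int i = 0; i < skip; i++) {
        if (input.read() == -1) break;
      }
      skipped = true;
    }
    return input.read(cbuf, off, len);
  }
}
```

With the input "<<[[hello", each filter removes two characters, so the chain 
yields "hello" and correctOffset(0) maps back to index 4 in the original text. 
Because correctOffset() is final, a subclass only has to describe its own shift 
in correctPosition(); the walk down the chain stays in one place.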

{quote}
2) based on your explanation, shouldn't CharFilterFactory be renamed 
CharStreamFactory ? ... there's no requirement that implementations produce a 
CharFilter, as long as they produce a CharStream, correct?
{quote}

Yes, CharFilterFactory creates a CharStream, but I prefer the name 
CharFilterFactory because 1) the factory will instantiate a CharFilter (not a 
bare CharStream) and 2) it parallels TokenFilterFactory, where the return type 
of create() is TokenStream although it instantiates a TokenFilter.

{quote}
Something else to consider: it seems like a lot of future headache could be 
simplified if the CharStream API was committed in lucene-java so that the 
Tokenizer API and all of the existing OOTB Tokenizers could know about it.
{quote}

Agreed. I'll open a ticket in Lucene.

> CharFilter - normalize characters before tokenizer
> --------------------------------------------------
>
>                 Key: SOLR-822
>                 URL: https://issues.apache.org/jira/browse/SOLR-822
>             Project: Solr
>          Issue Type: New Feature
>          Components: Analysis
>            Reporter: Koji Sekiguchi
>            Priority: Minor
>         Attachments: character-normalization.JPG, sample_mapping_ja.txt, 
> SOLR-822.patch, SOLR-822.patch, SOLR-822.patch
>
>
> A new plugin which can be placed in front of <tokenizer/>.
> {code:xml}
> <fieldType name="textCharNorm" class="solr.TextField" 
> positionIncrementGap="100" >
>   <analyzer>
>     <charFilter class="solr.MappingCharFilterFactory" 
> mapping="mapping_ja.txt" />
>     <tokenizer class="solr.MappingCJKTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" 
> words="stopwords.txt"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
> {code}
> <charFilter/> can be multiple (chained). I'll post a JPEG file to show 
> character normalization sample soon.
> MOTIVATION:
> In Japan, there are two types of tokenizers -- N-gram (CJKTokenizer) and 
> Morphological Analyzer.
> When we use morphological analyzer, because the analyzer uses Japanese 
> dictionary to detect terms,
> we need to normalize characters before tokenization.
> I'll post a patch soon, too.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
