[jira] Commented: (SOLR-822) CharFilter - normalize characters before tokenizer

Hoss Man (JIRA) Wed, 29 Oct 2008 12:05:38 -0700

    [ https://issues.apache.org/jira/browse/SOLR-822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12643608#action_12643608 ]


Hoss Man commented on SOLR-822:
-------------------------------

bq. Does that make sense to you?

yes, definitely ... but still a few questions:

1) if i understand correctly: another use case beyond character normalization 
could be refactoring the existing HTMLStrip___Tokenizers so that instead people 
would use an HTMLStripCharFilter and then whatever tokenizer they like, correct?

2) based on your explanation, shouldn't CharFilterFactory be renamed 
CharStreamFactory ? ... there's no requirement that implementations produce a 
CharFilter, as long as they produce a CharStream, correct?

3) should CharStream extend FilterReader?

---

One thing that worries me is the interaction of CharStreams with their 
corrected positions and Tokenizers that may not know about CharStream at all.  
Obviously that could just be an unsupported case (i.e., if you want to use some 
CharStreamFactories, you better use a TokenizerFactory that can handle it) but 
i still suspect some people could easily be bitten by this.
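
To make the worry concrete, here is a minimal sketch of how offsets can go wrong. It uses plain strings rather than the actual CharStream or Tokenizer classes: stripTags and offsetsOf are hypothetical stand-ins for a character-level filter and for a tokenizer that knows nothing about that filter, so treat it as an illustration of the failure mode and not as code from the patch.

```java
public class OffsetMismatchSketch {

    // Stand-in for a character-level filter: drops anything that looks like a tag.
    static String stripTags(String original) {
        return original.replaceAll("<[^>]+>", "");
    }

    // Stand-in for a tokenizer that is unaware of the filtering step:
    // it reports start/end offsets relative to the text it was handed.
    static int[] offsetsOf(String text, String term) {
        int start = text.indexOf(term);
        return new int[] { start, start + term.length() };
    }

    public static void main(String[] args) {
        String original = "<b>solr</b> rocks";
        String filtered = stripTags(original);

        int[] off = offsetsOf(filtered, "rocks");

        // Against the filtered text the offsets select the term itself.
        System.out.println(filtered.substring(off[0], off[1]));
        // Against the original text the same offsets select something else.
        System.out.println(original.substring(off[0], off[1]));
    }
}
```

The first line printed is "rocks"; the second is "lr</b", which is what anything applying the uncorrected offsets to the stored original text (a highlighter, for example) would pick out.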

i wonder if we could protect people from this.  perhaps a new 
CharStreamTokenizerFactory interface that must be implemented by any 
TokenizerFactory that knows about CharStreams (with a single "public 
TokenStream create(CharStream input)" method).  if a fieldType uses any 
CharStreamFactory, it would be an initialization error unless the 
TokenizerFactory is also a CharStreamTokenizerFactory.
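
A rough sketch of that check, to show the shape of the idea. Every type below is a hypothetical stand-in declared inside the example (nothing here comes from the patch or from Solr itself), and the validate method is only an assumption about where such an init-time check might live.

```java
import java.util.Collections;
import java.util.List;

public class FieldTypeInitSketch {

    // Hypothetical stand-ins for the real types; only their shape matters here.
    interface CharStream {}
    interface TokenStream {}
    interface CharStreamFactory {}
    interface TokenizerFactory {}

    // The proposed opt-in interface with its single method.
    interface CharStreamTokenizerFactory {
        TokenStream create(CharStream input);
    }

    // A tokenizer factory that has not been taught about CharStream.
    static class PlainTokenizerFactory implements TokenizerFactory {}

    // A tokenizer factory that opts in by also implementing the new interface.
    static class AwareTokenizerFactory
            implements TokenizerFactory, CharStreamTokenizerFactory {
        public TokenStream create(CharStream input) {
            return new TokenStream() {};
        }
    }

    // Init-time check: char stream factories require a CharStream-aware tokenizer factory.
    static void validate(String fieldType,
                         List<CharStreamFactory> charStreamFactories,
                         TokenizerFactory tokenizerFactory) {
        if (!charStreamFactories.isEmpty()
                && !(tokenizerFactory instanceof CharStreamTokenizerFactory)) {
            throw new IllegalStateException("fieldType '" + fieldType
                    + "' uses a CharStreamFactory but its TokenizerFactory"
                    + " is not a CharStreamTokenizerFactory");
        }
    }

    public static void main(String[] args) {
        List<CharStreamFactory> one =
                Collections.singletonList(new CharStreamFactory() {});

        validate("aware", one, new AwareTokenizerFactory());   // passes
        validate("noCharStreams", Collections.emptyList(),
                new PlainTokenizerFactory());                   // passes

        try {
            validate("broken", one, new PlainTokenizerFactory());
        } catch (IllegalStateException e) {
            System.out.println("init error: " + e.getMessage());
        }
    }
}
```

The point of the marker-style interface is that the misconfiguration is caught when the fieldType is initialized rather than showing up later as wrong offsets.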

Something else to consider: it seems like a lot of future headache could be 
simplified if the CharStream API was committed in lucene-java so that the 
Tokenizer API and all of the existing OOTB Tokenizers could know about it.

> CharFilter - normalize characters before tokenizer
> --------------------------------------------------
>
>                 Key: SOLR-822
>                 URL: https://issues.apache.org/jira/browse/SOLR-822
>             Project: Solr
>          Issue Type: New Feature
>          Components: Analysis
>            Reporter: Koji Sekiguchi
>            Priority: Minor
>         Attachments: character-normalization.JPG, sample_mapping_ja.txt, 
> SOLR-822.patch, SOLR-822.patch, SOLR-822.patch
>
>
> A new plugin which can be placed in front of <tokenizer/>.
> {code:xml}
> <fieldType name="textCharNorm" class="solr.TextField" 
> positionIncrementGap="100" >
>   <analyzer>
>     <charFilter class="solr.MappingCharFilterFactory" 
> mapping="mapping_ja.txt" />
>     <tokenizer class="solr.MappingCJKTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" 
> words="stopwords.txt"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
> {code}
> Multiple <charFilter/> elements can be chained. I'll post a JPEG file to show 
> a character normalization sample soon.
> MOTIVATION:
> In Japan, there are two types of tokenizers -- N-gram (CJKTokenizer) and 
> Morphological Analyzer.
> When we use a morphological analyzer, we need to normalize characters 
> before tokenization, because the analyzer uses a Japanese dictionary to 
> detect terms.
> I'll post a patch soon, too.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
