[jira] Commented: (SOLR-822) CharFilter - normalize characters before tokenizer

Otis Gospodnetic (JIRA) Fri, 24 Apr 2009 09:50:53 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702434#action_12702434
 ]


Otis Gospodnetic commented on SOLR-822:
---------------------------------------

Todd's comment from Oct 23, 2008 caught my attention:

{quote}
It should also work for existing filters like LowerCase. Seems like it has the 
potential to be faster than the filters, as it doesn't have to perform the same 
replacement multiple times if a particular character is replicated into 
multiple tokens, like in NGramTokenizer or CJKTokenizer. 
{quote}

Couldn't we replace LowerCaseFilter then?  Or does LCF still have some unique 
value?  Ah, it does: it can be placed *after* something like 
WordDelimiterFilterFactory.  Lowercasing at the very beginning would erase the 
case changes WDFF splits on, making it impossible for WDFF to do its job.  
Never mind.  Leaving for posterity.
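
To make the ordering point concrete, here is a rough sketch in Python (not 
Solr code).  split_on_case_change is a hypothetical stand-in for just the 
case-change splitting that WordDelimiterFilter performs; the real filter does 
much more.

```python
import re


def split_on_case_change(token):
    # Hypothetical stand-in for WordDelimiterFilter's case-change splitting:
    # insert a break wherever a lowercase letter is followed by an uppercase one.
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", token).split()


def lowercase_first(text):
    # Lowercase before tokenizing (what a lowercasing CharFilter would do),
    # then split each whitespace token on case changes.
    return [part for tok in text.lower().split()
            for part in split_on_case_change(tok)]


def lowercase_last(text):
    # Tokenize and split on case changes first, lowercase afterwards
    # (LowerCaseFilter placed after the word-delimiter step).
    return [part.lower() for tok in text.split()
            for part in split_on_case_change(tok)]


print(lowercase_first("Canon PowerShot"))  # ['canon', 'powershot']
print(lowercase_last("Canon PowerShot"))   # ['canon', 'power', 'shot']
```

With lowercasing first, "PowerShot" has no case change left to split on, so 
the "power" and "shot" sub-tokens are never produced.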

> CharFilter - normalize characters before tokenizer
> --------------------------------------------------
>
>                 Key: SOLR-822
>                 URL: https://issues.apache.org/jira/browse/SOLR-822
>             Project: Solr
>          Issue Type: New Feature
>          Components: Analysis
>    Affects Versions: 1.3
>            Reporter: Koji Sekiguchi
>            Assignee: Koji Sekiguchi
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: character-normalization.JPG, sample_mapping_ja.txt, 
> sample_mapping_ja.txt, SOLR-822-for-1.3.patch, SOLR-822-renameMethod.patch, 
> SOLR-822.patch, SOLR-822.patch, SOLR-822.patch, SOLR-822.patch, SOLR-822.patch
>
>
> A new plugin which can be placed in front of <tokenizer/>.
> {code:xml}
> <fieldType name="textCharNorm" class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <charFilter class="solr.MappingCharFilterFactory" mapping="mapping_ja.txt"/>
>     <tokenizer class="solr.MappingCJKTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
> {code}
> Multiple <charFilter/> elements can be chained. I'll post a JPEG file showing 
> a character normalization sample soon.
> MOTIVATION:
> In Japan, there are two types of tokenizers: N-gram (CJKTokenizer) and 
> morphological analyzers.
> A morphological analyzer uses a Japanese dictionary to detect terms, so 
> characters need to be normalized before tokenization.
> I'll post a patch soon, too.
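
The idea in the description (chained character mappings applied before the 
tokenizer) can be sketched roughly as follows.  This is an illustration in 
Python, not the patch: the mapping tables and function names are made up for 
the example, and offset correction, which a real CharFilter also has to 
handle, is ignored.

```python
def apply_mapping(text, mapping):
    # Replace every character that has an entry in the mapping; keep the rest.
    return "".join(mapping.get(ch, ch) for ch in text)


def char_filter_chain(text, mappings):
    # Apply each mapping in order, like several chained <charFilter/> elements.
    for mapping in mappings:
        text = apply_mapping(text, mapping)
    return text


def bigrams(text):
    # Toy N-gram tokenizer: every pair of adjacent characters becomes a token.
    return [text[i:i + 2] for i in range(len(text) - 1)]


# Hypothetical mapping tables, for illustration only.
FULLWIDTH_TO_ASCII = {"Ａ": "A", "Ｂ": "B", "Ｃ": "C"}
TO_LOWER = {"A": "a", "B": "b", "C": "c"}

normalized = char_filter_chain("ＡＢＣ", [FULLWIDTH_TO_ASCII, TO_LOWER])
print(normalized)           # abc
print(bigrams(normalized))  # ['ab', 'bc']
```

Each input character is mapped once before tokenization, even though "b" ends 
up in two bigrams.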

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
