[jira] Commented: (SOLR-1343) HTMLStripCharFilter

Koji Sekiguchi (JIRA) Fri, 07 Aug 2009 13:45:38 -0700

    [ https://issues.apache.org/jira/browse/SOLR-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740719#action_12740719 ]


Koji Sekiguchi commented on SOLR-1343:
--------------------------------------

bq. Koji, what is the advantage of the HTMLStripCharFilter over HTMLStripReader?
Good question, Shalin :)
Since LUCENE-1466 was committed, all tokenizers can read chars from a 
CharFilter rather than a plain Reader, so I'd like to replace Readers like 
this one with CharFilters. The obvious advantages are:
# We can use an arbitrary tokenizer, e.g. CJKTokenizer.
# We can use a chain of CharFilters. For example, we can strip HTML tags and 
then normalize chars before the tokenizer runs.
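
To illustrate the two points above, here is a minimal, self-contained sketch of the chaining idea. It uses only java.io and does not touch the Lucene/Solr API: TagStrippingReader, WidthFoldingReader and whitespaceTokenize are hypothetical stand-ins for an HTML-stripping CharFilter, a char-normalizing CharFilter and a tokenizer. A real CharFilter also tracks offset corrections, which this sketch leaves out.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Illustration only: none of these classes exist in Lucene or Solr.
class CharFilterChainSketch {

    /** Hypothetical stand-in for an HTML-stripping filter: replaces each <...> tag with one space. */
    static class TagStrippingReader extends FilterReader {
        private boolean inTag = false;

        TagStrippingReader(Reader in) { super(in); }

        @Override public int read() throws IOException {
            int c;
            while ((c = in.read()) != -1) {
                if (c == '<') {
                    inTag = true;                 // start skipping tag content
                } else if (c == '>' && inTag) {
                    inTag = false;
                    return ' ';                   // keep words on both sides of a tag apart
                } else if (!inTag) {
                    return c;                     // ordinary text passes through
                }
            }
            return -1;
        }

        @Override public int read(char[] buf, int off, int len) throws IOException {
            int n = 0;
            while (n < len) {
                int c = read();
                if (c == -1) break;
                buf[off + n++] = (char) c;
            }
            return (n == 0 && len > 0) ? -1 : n;
        }
    }

    /** Hypothetical normalizing filter: folds full-width ASCII forms (U+FF01..U+FF5E) to ASCII. */
    static class WidthFoldingReader extends FilterReader {
        WidthFoldingReader(Reader in) { super(in); }

        private static int fold(int c) {
            return (c >= 0xFF01 && c <= 0xFF5E) ? c - 0xFEE0 : c;
        }

        @Override public int read() throws IOException {
            return fold(in.read());
        }

        @Override public int read(char[] buf, int off, int len) throws IOException {
            int n = in.read(buf, off, len);
            for (int i = off; i < off + n; i++) buf[i] = (char) fold(buf[i]);
            return n;
        }
    }

    /** Stand-in tokenizer: it accepts any Reader, so it does not care which filters ran before it. */
    static List<String> whitespaceTokenize(Reader in) throws IOException {
        List<String> tokens = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            if (Character.isWhitespace(c)) {
                if (cur.length() > 0) { tokens.add(cur.toString()); cur.setLength(0); }
            } else {
                cur.append((char) c);
            }
        }
        if (cur.length() > 0) tokens.add(cur.toString());
        return tokens;
    }

    /** Chain: strip tags first, then normalize chars, then tokenize. */
    static List<String> analyze(String html) {
        Reader chain = new WidthFoldingReader(new TagStrippingReader(new StringReader(html)));
        try {
            return whitespaceTokenize(chain);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(analyze("<p>\uFF21\uFF22\uFF23 <b>solr</b></p>"));
    }
}
```

Because the stripping step is just another layer in the char stream, swapping whitespaceTokenize for a different tokenizer, or adding another filter to the chain, needs no change to the stripping code.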

> HTMLStripCharFilter
> -------------------
>
>                 Key: SOLR-1343
>                 URL: https://issues.apache.org/jira/browse/SOLR-1343
>             Project: Solr
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 1.4
>            Reporter: Koji Sekiguchi
>            Assignee: Koji Sekiguchi
>            Priority: Trivial
>             Fix For: 1.4
>
>         Attachments: SOLR-1343.patch
>
>
> Introducing HTMLStripCharFilter:
> * move the HTML-stripping logic from HTMLStripReader to HTMLStripCharFilter
> * make HTMLStripReader deprecated
> * make HTMLStrip*TokenizerFactory deprecated

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
