[
https://issues.apache.org/jira/browse/SOLR-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740719#action_12740719
]
Koji Sekiguchi commented on SOLR-1343:
--------------------------------------
bq. Koji, what is the advantage of the HTMLStripCharFilter over HTMLStripReader?
Good question, Shalin :)
Now that LUCENE-1466 has been committed, all tokenizers can read chars from a
CharFilter rather than a Reader, so I'd like to replace Readers like this one
with CharFilters. The obvious advantages are:
# We can use an arbitrary tokenizer, e.g. CJKTokenizer.
# We can use a chain of CharFilters. For example, we can strip HTML tags and then
normalize characters before the tokenizer runs.
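
To illustrate the chaining idea in point 2, here is a minimal sketch that uses only java.io. It is not the Lucene CharFilter API and not the code in the patch: the class names (CharFilterChainSketch, SimpleCharFilter, TagStrippingReader, WidthFoldingReader) are hypothetical stand-ins, the tag stripper is deliberately naive (it simply drops everything between '<' and '>'), and no offset correction is attempted. It only shows that each stage wraps the previous one and that the tokenizer at the end does not need to know what ran before it.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class CharFilterChainSketch {

    /** Hypothetical base class: subclasses produce one filtered char at a time. */
    abstract static class SimpleCharFilter extends FilterReader {
        SimpleCharFilter(Reader in) { super(in); }

        /** Returns the next filtered char, or -1 at end of input. */
        abstract int next() throws IOException;

        @Override
        public int read() throws IOException { return next(); }

        @Override
        public int read(char[] buf, int off, int len) throws IOException {
            if (len == 0) return 0;
            int n = 0;
            while (n < len) {
                int c = next();
                if (c == -1) break;
                buf[off + n++] = (char) c;
            }
            return n == 0 ? -1 : n;
        }
    }

    /** Naive stand-in for HTML stripping: drops everything from '<' through '>'. */
    static class TagStrippingReader extends SimpleCharFilter {
        TagStrippingReader(Reader in) { super(in); }

        @Override
        int next() throws IOException {
            int c = in.read();
            while (c == '<') {
                // Skip to the end of the tag.
                while (c != -1 && c != '>') c = in.read();
                if (c == -1) return -1;
                c = in.read(); // first char after the tag
            }
            return c;
        }
    }

    /** Stand-in for char normalization: folds full-width ASCII (U+FF01..U+FF5E) to ASCII. */
    static class WidthFoldingReader extends SimpleCharFilter {
        WidthFoldingReader(Reader in) { super(in); }

        @Override
        int next() throws IOException {
            int c = in.read();
            if (c >= 0xFF01 && c <= 0xFF5E) c -= 0xFEE0;
            return c;
        }
    }

    /** Stand-in for an arbitrary tokenizer: splits on whitespace. */
    static List<String> tokenize(Reader reader) throws IOException {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            if (Character.isWhitespace(c)) {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append((char) c);
            }
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }

    /** Chain: strip tags first, then fold width, then tokenize. */
    public static List<String> analyze(String html) throws IOException {
        try (Reader chain = new WidthFoldingReader(
                new TagStrippingReader(new StringReader(html)))) {
            return tokenize(chain);
        }
    }

    public static void main(String[] args) throws IOException {
        // The escapes spell "Solr" in full-width characters.
        System.out.println(analyze("<p>\uFF33\uFF4F\uFF4C\uFF52 <b>1.4</b> rocks</p>"));
        // prints [Solr, 1.4, rocks]
    }
}
```

Because tokenize() only sees a Reader, swapping in a different tokenizer or adding another filter to the chain does not require touching the other stages, which is the point of both advantages above.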
> HTMLStripCharFilter
> -------------------
>
> Key: SOLR-1343
> URL: https://issues.apache.org/jira/browse/SOLR-1343
> Project: Solr
> Issue Type: Improvement
> Components: Analysis
> Affects Versions: 1.4
> Reporter: Koji Sekiguchi
> Assignee: Koji Sekiguchi
> Priority: Trivial
> Fix For: 1.4
>
> Attachments: SOLR-1343.patch
>
>
> Introducing HTMLStripCharFilter:
> * move html strip logic from HTMLStripReader to HTMLStripCharFilter
> * make HTMLStripReader deprecated
> * make HTMLStrip*TokenizerFactory deprecated