Why would you want to do that, instead of just using another tokenizer
and a LowerCaseFilter? It's more confusing, less DRY code to leave them
separate -- the LowerCaseTokenizerFactory combines them anyway because
someone decided it was such a common use case that it was worth it for
the demonstrated performance advantage. (At least I hope that's what
happened, otherwise there's no excuse for it!)
Do you know that you get a worthwhile performance benefit from what you're
doing? If not, why do it?
Jonathan
Scott Gonyea wrote:
I went for a different route:
https://issues.apache.org/jira/browse/LUCENE-2644
Scott
On Tue, Sep 14, 2010 at 11:18 AM, Robert Muir <rcm...@gmail.com> wrote:
On Tue, Sep 14, 2010 at 1:54 PM, Scott Gonyea <sc...@aitrus.org> wrote:
Hi,
I'm tweaking my schema and the LowerCaseTokenizerFactory doesn't create
tokens, based solely on lower-casing characters. Is there a way to tell it
NOT to drop non-characters? It's amazingly frustrating that the
TokenizerFactory and the FilterFactory have two entirely different modes of
behavior. If I wanted it to tokenize based on non-lower case characters...
wouldn't I use, say, LetterTokenizerFactory and tack on the
LowerCaseFilterFactory? Or any number of combinations that would otherwise
achieve that specific end-result?
I don't think you should use LowerCaseTokenizerFactory if you don't want to
divide text on non-letters; it's intended to do just that.
From the javadocs:
LowerCaseTokenizer performs the function of LetterTokenizer and
LowerCaseFilter together. It divides text at non-letters and converts them
to lower case. While it is functionally equivalent to the combination of
LetterTokenizer and LowerCaseFilter, there is a performance advantage to
doing the two tasks at once, hence this (redundant) implementation.
So... Is there a way for me to tell it to NOT split based on
non-characters?
Use a different tokenizer that doesn't split on non-characters, followed by
a LowerCaseFilter.
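To make the difference concrete, here is a minimal plain-Java sketch. It
does not use the Lucene classes; it only mimics the behavior described
above, so the class and method names (TokenizerSketch, letterLowercase,
whitespaceLowercase) are hypothetical. The first method splits at
non-letters and lowercases, as the javadoc says LowerCaseTokenizer does.
The second assumes a whitespace-based tokenizer is the "different
tokenizer" you pick; in a Solr schema that would presumably be
solr.WhitespaceTokenizerFactory followed by solr.LowerCaseFilterFactory,
but any tokenizer that keeps the characters you care about would do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; this mimics the described behavior
// and is not the Lucene implementation.
final class TokenizerSketch {

    // Mimics LowerCaseTokenizer: divide text at non-letters, lowercase the rest.
    static List<String> letterLowercase(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A non-letter ends the current token and is itself dropped.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Mimics a whitespace tokenizer followed by a lowercase filter:
    // digits and punctuation inside a token are kept.
    static List<String> whitespaceLowercase(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "Wi-Fi 802.11n";
        System.out.println(letterLowercase(input));     // [wi, fi, n]
        System.out.println(whitespaceLowercase(input)); // [wi-fi, 802.11n]
    }
}
```

The first result is what the original poster is seeing (non-letters
dropped); the second keeps them and only changes case.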
--
Robert Muir
rcm...@gmail.com