Why would you want to do that, instead of just using another tokenizer and a LowerCaseFilter? Combining the two is more confusing, less DRY code than leaving them separate -- the LowerCaseTokenizerFactory combines them anyway because someone decided it was such a common use case that it was worth it for the demonstrated performance advantage. (At least I hope that's what happened; otherwise there's no excuse for it!)

Do you know that you get a worthwhile performance benefit from what you're doing? If not, why do it?

Jonathan

Scott Gonyea wrote:
I went for a different route:

https://issues.apache.org/jira/browse/LUCENE-2644

Scott

On Tue, Sep 14, 2010 at 11:18 AM, Robert Muir <rcm...@gmail.com> wrote:

On Tue, Sep 14, 2010 at 1:54 PM, Scott Gonyea <sc...@aitrus.org> wrote:

Hi,

I'm tweaking my schema and the LowerCaseTokenizerFactory doesn't create
tokens, based solely on lower-casing characters.  Is there a way to tell
it NOT to drop non-characters?  It's amazingly frustrating that the
TokenizerFactory and the FilterFactory have two entirely different modes
of behavior.  If I wanted it to tokenize based on non-lower case
characters.... wouldn't I use, say, LetterTokenizerFactory and tack on
the LowerCaseFilterFactory?  Or any number of combinations that would
otherwise achieve that specific end-result?

I don't think you should use LowerCaseTokenizerFactory if you don't want to
divide text on non-letters; it's intended to do just that.

from the javadocs:
LowerCaseTokenizer performs the function of LetterTokenizer and
LowerCaseFilter together. It divides text at non-letters and converts them
to lower case. While it is functionally equivalent to the combination of
LetterTokenizer and LowerCaseFilter, there is a performance advantage to
doing the two tasks at once, hence this (redundant) implementation.
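
To make the javadoc's equivalence claim concrete, here is a minimal plain-Python sketch. It is not Lucene code: the function names are hypothetical, and `str.isalpha` only approximates whatever letter test the real tokenizers use. It only illustrates that "split at non-letters, then lower-case" and a single combined pass yield the same tokens.

```python
from itertools import groupby


def letter_tokenize(text):
    """Keep maximal runs of letters; everything else is a separator.

    Hypothetical stand-in for the behavior the javadoc ascribes to
    LetterTokenizer ("divides text at non-letters").
    """
    return ["".join(run) for is_letter, run in groupby(text, key=str.isalpha) if is_letter]


def lowercase_filter(tokens):
    """Hypothetical stand-in for LowerCaseFilter: lower-case each token."""
    return [token.lower() for token in tokens]


def lowercase_tokenize(text):
    """Single pass: split at non-letters, lower-casing each token as it is emitted."""
    tokens, current = [], []
    for ch in text:
        if ch.isalpha():
            current.append(ch)
        elif current:
            tokens.append("".join(current).lower())
            current = []
    if current:
        tokens.append("".join(current).lower())
    return tokens


sample = "Wi-Fi 802.11n Router"
two_step = lowercase_filter(letter_tokenize(sample))
one_step = lowercase_tokenize(sample)
print(two_step)  # ['wi', 'fi', 'n', 'router']
print(one_step)  # ['wi', 'fi', 'n', 'router']
```

Note that the hyphen, the digits, and the period all disappear in both variants, which is the "dropping non-characters" behavior complained about above.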



So... Is there a way for me to tell it to NOT split based on
non-characters?

Use a different tokenizer that doesn't split on non-characters, followed by
a LowerCaseFilter.
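
As a rough illustration of that suggestion, the plain-Python sketch below splits on whitespace only and then lower-cases each token. It is not Lucene or Solr code; the function names are hypothetical, and whitespace splitting is just one assumed choice of "a different tokenizer" (in Solr, WhitespaceTokenizerFactory plays that role).

```python
def whitespace_tokenize(text):
    """Split on runs of whitespace only, so punctuation and digits stay inside tokens."""
    return text.split()


def lowercase_filter(tokens):
    """Hypothetical stand-in for LowerCaseFilter: lower-case each token."""
    return [token.lower() for token in tokens]


sample = "Wi-Fi 802.11n Router"
tokens = lowercase_filter(whitespace_tokenize(sample))
print(tokens)  # ['wi-fi', '802.11n', 'router']
```

Here the hyphen, digits, and period survive, because the tokenizer never treats them as separators and the filter only changes case.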

--
Robert Muir
rcm...@gmail.com

