Re: LowerCaseTokenizerFactory - Tokenizer Options? Why does it behave this way?

Jonathan Rochkind Tue, 14 Sep 2010 14:24:28 -0700

Why would you want to do that, instead of just using another tokenizer and a lowercasefilter? It's more confusing less DRY code to leave them separate -- the LowerCaseTokenizerFactory combines anyway because someone decided it was such a common use case that it was worth it for the demonstrated performance advantage. (At least I hope that's what happened, otherwise there's no excuse for it!).

Do you know you get a worthwhile performance benefit for what you're doing? If not, why do it?


Jonathan

Scott Gonyea wrote:

I went for a different route:

https://issues.apache.org/jira/browse/LUCENE-2644

Scott

On Tue, Sep 14, 2010 at 11:18 AM, Robert Muir <rcm...@gmail.com> wrote:

On Tue, Sep 14, 2010 at 1:54 PM, Scott Gonyea <sc...@aitrus.org> wrote:

Hi,

I'm tweaking my schema and the LowerCaseTokenizerFactory doesn't create
tokens, based solely on lower-casing characters.  Is there a way to tell it
NOT to drop non-characters?  It's amazingly frustrating that the
TokenizerFactory and the FilterFactory have two entirely different modes of
behavior.  If I wanted it to tokenize based on non-lower case characters....
wouldn't I use, say, LetterTokenizerFactory and tack on the
LowerCaseFilterFactory?  Or any number of combinations that would otherwise
achieve that specific end-result?

I don't think you should use LowerCaseTokenizerFactory if you don't want to
divide text on non-letters; it's intended to do just that.

from the javadocs:
LowerCaseTokenizer performs the function of LetterTokenizer and
LowerCaseFilter together. It divides text at non-letters and converts them
to lower case. While it is functionally equivalent to the combination of
LetterTokenizer and LowerCaseFilter, there is a performance advantage to
doing the two tasks at once, hence this (redundant) implementation.



So... Is there a way for me to tell it to NOT split based on
non-characters?

Use a different tokenizer that doesn't split on non-characters, followed by
a LowerCaseFilter.
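
For illustration, the suggestion above might look roughly like this in schema.xml. This is only a sketch: the field type name text_ws_lower is made up, and WhitespaceTokenizerFactory is just one assumed choice of tokenizer that does not split on non-letters; pick whichever tokenizer matches how you actually want the text divided.

```xml
<!-- Hypothetical field type; the name and the tokenizer choice are assumptions -->
<fieldType name="text_ws_lower" class="solr.TextField">
  <analyzer>
    <!-- Splits on whitespace only, so digits and punctuation stay inside tokens -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Lower-cases each token without changing token boundaries -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With a chain like this, lower-casing happens as a separate filter step after tokenization, so it has no effect on where tokens are split.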

--
Robert Muir
rcm...@gmail.com
