Re: LowerCaseTokenizerFactory - Tokenizer Options? Why does it behave this way?

Robert Muir Tue, 14 Sep 2010 11:19:43 -0700

On Tue, Sep 14, 2010 at 1:54 PM, Scott Gonyea <sc...@aitrus.org> wrote:


> Hi,
>
> I'm tweaking my schema and the LowerCaseTokenizerFactory doesn't create
> tokens, based solely on lower-casing characters.  Is there a way to tell it
> NOT to drop non-characters?  It's amazingly frustrating that the
> TokenizerFactory and the FilterFactory have two entirely different modes of
> behavior.  If I wanted it to tokenize based on non-lower case
> characters....
> wouldn't I use, say, LetterTokenizerFactory and tack on the
> LowerCaseFilterFactory?  Or any number of combinations that would otherwise
> achieve that specific end-result?
>

I don't think you should use LowerCaseTokenizerFactory if you don't want to
divide text on non-letters; it's intended to do just that.

From the javadocs:
LowerCaseTokenizer performs the function of LetterTokenizer and
LowerCaseFilter together. It divides text at non-letters and converts them
to lower case. While it is functionally equivalent to the combination of
LetterTokenizer and LowerCaseFilter, there is a performance advantage to
doing the two tasks at once, hence this (redundant) implementation.
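
As a rough illustration of that equivalence, the two field types below
should produce the same tokens. This is only a sketch for schema.xml: the
field type names are made up for the example, while the factory class names
are the stock Solr ones.

```xml
<!-- Hypothetical field type names; only the factory classes are stock Solr. -->

<!-- One step: the tokenizer splits at non-letters and lowercases. -->
<fieldType name="text_letters_lc" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
  </analyzer>
</fieldType>

<!-- Two steps: same tokens, produced by a tokenizer plus a filter. -->
<fieldType name="text_letters_lc_2step" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.LetterTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Either way, input such as "Wi-Fi 802.11b" comes out as wi, fi, b: the
hyphen, space, digits and dot are all non-letters, so they act as separators
and are dropped.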



> So... Is there a way for me to tell it to NOT split based on non-characters?
>

Use a different tokenizer that doesn't split on non-letters, followed by
a LowerCaseFilter.
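
For example, something along these lines in schema.xml might do it. This is
a sketch, not a drop-in config: the field type name is made up, and
WhitespaceTokenizerFactory is just one possible choice of tokenizer
(KeywordTokenizerFactory would instead keep the whole input as one token).

```xml
<!-- Hypothetical field type name; pick whichever tokenizer fits your data. -->
<fieldType name="text_ws_lc" class="solr.TextField">
  <analyzer>
    <!-- Splits on whitespace only, so punctuation and digits are kept. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Lowercases each token without changing the token boundaries. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this chain, "Wi-Fi 802.11b" becomes wi-fi and 802.11b, where
LowerCaseTokenizerFactory would give wi, fi, b.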

-- 
Robert Muir
rcm...@gmail.com
