Re: LowerCaseTokenizerFactory - Tokenizer Options? Why does it behave this way?

Scott Gonyea Tue, 14 Sep 2010 14:20:13 -0700

I went for a different route:

https://issues.apache.org/jira/browse/LUCENE-2644


Scott

On Tue, Sep 14, 2010 at 11:18 AM, Robert Muir <rcm...@gmail.com> wrote:

> On Tue, Sep 14, 2010 at 1:54 PM, Scott Gonyea <sc...@aitrus.org> wrote:
>
> > Hi,
> >
> > I'm tweaking my schema and the LowerCaseTokenizerFactory doesn't create
> > tokens, based solely on lower-casing characters.  Is there a way to tell it
> > NOT to drop non-characters?  It's amazingly frustrating that the
> > TokenizerFactory and the FilterFactory have two entirely different modes of
> > behavior.  If I wanted it to tokenize based on non-lower case
> > characters.... wouldn't I use, say, LetterTokenizerFactory and tack on the
> > LowerCaseFilterFactory?  Or any number of combinations that would otherwise
> > achieve that specific end-result?
> >
>
> I don't think you should use LowerCaseTokenizerFactory if you don't want to
> divide text on non-letters; it's intended to do just that.
>
> from the javadocs:
> LowerCaseTokenizer performs the function of LetterTokenizer and
> LowerCaseFilter together. It divides text at non-letters and converts them
> to lower case. While it is functionally equivalent to the combination of
> LetterTokenizer and LowerCaseFilter, there is a performance advantage to
> doing the two tasks at once, hence this (redundant) implementation.
>
>
>
> > So... Is there a way for me to tell it to NOT split based on
> > non-characters?
> >
>
> Use a different tokenizer that doesn't split on non-characters, followed by
> a LowerCaseFilter.
>
> --
> Robert Muir
> rcm...@gmail.com
>
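
For anyone reading the archive later, the difference discussed in the quoted reply can be sketched in plain Java. This is only an illustration and does not use any Lucene or Solr classes: the class name TokenizeSketch and both method names are made up for this example, and whitespace splitting is just one assumed choice of "a different tokenizer". The first method mimics the behavior the javadoc describes (divide at non-letters, then lower-case); the second mimics a tokenizer that keeps non-letters, followed by a lower-casing step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not Lucene's implementation.
public class TokenizeSketch {

    // Mimics the documented LowerCaseTokenizer behavior:
    // split at every non-letter and lower-case the letters that remain.
    static List<String> splitOnNonLetters(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A non-letter ends the current token and is itself dropped.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Mimics a whitespace-based tokenizer followed by a lower-case filter:
    // digits and punctuation inside a token are kept.
    static List<String> splitOnWhitespaceThenLowerCase(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Wi-Fi 802.11b Router";
        System.out.println(splitOnNonLetters(text));
        System.out.println(splitOnWhitespaceThenLowerCase(text));
    }
}
```

With the sample input, the first method drops the hyphen, the digits, and the dot entirely, while the second keeps them inside the tokens. That is the distinction the original question was about.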
