How about patching LetterTokenizer to tokenize the way you want, which can then be combined with a LowerCaseFilter (or not) as desired? Or indeed creating a new tokenizer that does exactly what you want (but one that doesn't embed a LowerCaseFilter inside it!), instead of patching LowerCaseTokenizer, which is of dubious value. Just brainstorming.
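To illustrate the composition being suggested, here is a minimal stdlib-only sketch. The class and method names are hypothetical, not Lucene API; it only shows why keeping the splitting step separate from the case-folding step composes cleanly, since the lowercase step can be applied or dropped without touching the tokenizer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch (not Lucene API): tokenization and case folding
// as two independent, composable steps.
public class SeparatedSteps {
    // Split at non-letters, roughly what LetterTokenizer does.
    static List<String> letterTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());  // flush at a non-letter
                current.setLength(0);
            }
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }

    // The optional second step, analogous to a LowerCaseFilter.
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) out.add(t.toLowerCase());
        return out;
    }

    public static void main(String[] args) {
        List<String> raw = letterTokenize("Foo-Bar 42 baz");
        System.out.println(raw);             // [Foo, Bar, baz]
        System.out.println(lowerCase(raw));  // [foo, bar, baz]
    }
}
```

The point of the separation: a field that needs case-sensitive tokens just skips the second call, whereas a fused tokenizer forces lowercasing on everyone.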
Another way to tokenize on non-whitespace/alpha/numeric character content might be the existing PatternTokenizerFactory with a suitable regexp, as you mention. That could of course also do what LetterTokenizer does, though presumably not as efficiently. Is that what gives you an uncomfortable feeling? If it performs badly enough to matter, that's a reason for a custom tokenizer; beyond that, I'm not sure anything's undesirable about the PatternTokenizer.
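To make the regexp idea concrete, here is a stdlib-only sketch of the splitting semantics. It is not the PatternTokenizerFactory itself (that is configured in the schema, not called like this); it just shows that with a whitespace pattern, every run of non-whitespace characters survives as a token, punctuation and digits included.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: mimics the effect of a pattern-based tokenizer
// configured with a whitespace pattern such as "\s+".
public class PatternSplit {
    static List<String> tokenize(String text) {
        // Runs of non-whitespace become tokens; nothing else is dropped.
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(tokenize("C++ isn't e-mail_2 text"));
        // [C++, isn't, e-mail_2, text] -- punctuation and digits survive
    }
}
```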
Jonathan
Scott Gonyea wrote:
I'd agree with your point entirely. My attacking LowerCaseTokenizer was a result of not wanting to create yet more classes.
That said, rightfully dumping LowerCaseTokenizer would probably mean creating my own tokenizer.
I could very well be thinking about this wrong... but what if I wanted to create tokens based on non-whitespace/alpha/numeric character content? It looks like I could perhaps use the PatternTokenizer, but that didn't leave me with a comfortable feeling when I first looked into it.
Scott
On Tue, Sep 14, 2010 at 2:48 PM, Robert Muir <rcm...@gmail.com> wrote:
Jonathan, you bring up an excellent point.
I think it's worth our time to actually benchmark this LowerCaseTokenizer versus LetterTokenizer + LowerCaseFilter.
This tokenizer is quite old, and although I can understand there is no doubt it's technically faster than LetterTokenizer + LowerCaseFilter even today (as it can go through the char[] only a single time), I have my doubts that this brings any value these days...
On Tue, Sep 14, 2010 at 5:23 PM, Jonathan Rochkind <rochk...@jhu.edu> wrote:
Why would you want to do that, instead of just using another tokenizer and a LowerCaseFilter? It's more confusing, less DRY code to leave them separate -- the LowerCaseTokenizerFactory combines them anyway because someone decided it was such a common use case that it was worth it for the demonstrated performance advantage. (At least I hope that's what happened; otherwise there's no excuse for it!)
Do you know you get a worthwhile performance benefit for what you're doing? If not, why do it?
Jonathan
Scott Gonyea wrote:
I went for a different route:
https://issues.apache.org/jira/browse/LUCENE-2644
Scott
On Tue, Sep 14, 2010 at 11:18 AM, Robert Muir <rcm...@gmail.com> wrote:
On Tue, Sep 14, 2010 at 1:54 PM, Scott Gonyea <sc...@aitrus.org> wrote:
Hi,
I'm tweaking my schema, and the LowerCaseTokenizerFactory doesn't create tokens based solely on lower-casing characters. Is there a way to tell it NOT to drop non-characters? It's amazingly frustrating that the TokenizerFactory and the FilterFactory have two entirely different modes of behavior. If I wanted it to tokenize based on non-lower-case characters... wouldn't I use, say, LetterTokenizerFactory and tack on the LowerCaseFilterFactory? Or any number of combinations that would otherwise achieve that specific end result?
I don't think you should use LowerCaseTokenizerFactory if you don't want to divide text on non-letters; it's intended to do just that.
From the javadocs:
LowerCaseTokenizer performs the function of LetterTokenizer and LowerCaseFilter together. It divides text at non-letters and converts them to lower case. While it is functionally equivalent to the combination of LetterTokenizer and LowerCaseFilter, there is a performance advantage to doing the two tasks at once, hence this (redundant) implementation.
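The equivalence the javadoc describes can be sketched in plain JDK code (illustrative only, not the real Lucene implementation): the fused version folds each character to lower case during the same pass that splits at non-letters, so the input is scanned once instead of twice.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the fused behavior the javadoc describes: splitting at
// non-letters and lowercasing in a single pass over the characters.
public class FusedTokenizer {
    static List<String> lowerCaseTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                // Fold case here, in the same pass that tokenizes.
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(lowerCaseTokenize("Foo-Bar baz"));  // [foo, bar, baz]
    }
}
```

The output is identical to running a letter-only tokenizer and then lowercasing each token, which is exactly why the fused class is functionally redundant and valuable only if the single pass measurably matters.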
So... Is there a way for me to tell it to NOT split based on non-characters?
Use a different tokenizer that doesn't split on non-characters, followed by a LowerCaseFilter.
--
Robert Muir
rcm...@gmail.com