Because (just IMO, I'm not an expert here either) the basic framework in
Solr is that tokenizers tokenize, but they don't generally change bytes
inside values. What changes bytes (or adds or removes tokens from the
token stream initially created by a tokenizer, etc.) is filters. And
there's already a LowerCaseFilter to change your bytes by lower-casing
them, so if you want to lowercase your bytes, it's already there for you.

And if we're talking patches (rather than a custom thing you write for
yourself), users can already combine the tokenizing behavior they want
with the filtering behavior they want by mixing and matching tokenizers
and filters -- no need to write new code to provide options to
tokenizers, options which would have to be kept consistent with the
filters to avoid confusion and would duplicate existing filter behavior.
The "DRY" here applies to giving Solr users too many duplicative options
in their schema.xml too, not just to the underlying Java -- although
even Java code that just calls other existing Java code is still extra
Java code to be maintained.
So that's why I suggest using a LowerCaseFilter to, well, lower-case
your values, instead of building lower-casing into a tokenizer and
submitting that as a patch.
So if you need to build or change a tokenizer to get the tokenizing
behavior you want, then do that -- just don't build a LowerCaseFilter
into the tokenizer too. If what you want is basically the same as the
existing LetterTokenizer but with a few changes, then it might make
sense to add options to the LetterTokenizer. If what you want is so
different from LetterTokenizer that this is inconvenient, then maybe you
need to create a new tokenizer. But either way, there's no reason to
have a LowerCaseFilter built into a Tokenizer -- unless maybe there is a
demonstrated non-trivial performance advantage and (for a patch to Solr
rather than your own personal custom tokenizer) it's a common use case.
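To make the "create a new tokenizer" route concrete, a hypothetical
sketch (assuming the Lucene 2.9/3.0-era CharTokenizer API; the class
name and the letters-or-digits rule are my own invention):

import java.io.Reader;
import org.apache.lucene.analysis.CharTokenizer;

// Hypothetical: keep letters AND digits together in one token
// (LetterTokenizer would split on the digits). Note it does no
// case-folding -- that stays in LowerCaseFilter, later in the chain.
public class LetterOrDigitTokenizer extends CharTokenizer {
  public LetterOrDigitTokenizer(Reader in) {
    super(in);
  }

  @Override
  protected boolean isTokenChar(char c) {
    return Character.isLetterOrDigit(c);
  }
}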
That's all I was trying to suggest. Perhaps I've misunderstood what
your needs or plans are, though.
Scott Gonyea wrote:
There are a lot of reasons, with the performance hit being a notable
one -- but also because I feel that using a regex for something this
basic amounts to a lazy hack. I'm typically against regular expressions
in XML.
I'm vehemently opposed to them in cases where avoiding them should
otherwise be quite trivial. Regarding LowerCaseFilter, etc.:
My question is: Why should LowerCaseFilter be the means by which that work
is done? I fully agree with keeping things DRY, but I'm not quite sure I
agree with how that mantra is being employed. For instance, the two
tokenizer statements:
<tokenizer class="solr.WhiteSpaceTokenizer" downCase="true"/>
<tokenizer class="solr.LowerCaseLetterTokenizer"/>
can be written to utilize the same codebase, which makes things DRY and
*may* even be a bit more performant for less trivial transformations.
If nothing else, I think a "CharacterTokenizer" would be a good way to go:

<tokenizer class="solr.CharacterTokenizer" downCase="true"
    tokenizeSpecialCharacters="true" tokenizeWhiteSpace="true"
    tokenizedCharacterClasses="wd"/>
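To illustrate the "same codebase" argument, one hypothetical way such a
tokenizer could be written (Lucene 2.9/3.0-era CharTokenizer API; the
class and the option semantics are my guesses at what the XML above
intends -- none of this exists in Solr):

import java.io.Reader;
import org.apache.lucene.analysis.CharTokenizer;

// Hypothetical configurable tokenizer: one codebase, two behaviors.
public class CharacterTokenizer extends CharTokenizer {
  private final boolean downCase;

  public CharacterTokenizer(Reader in, boolean downCase) {
    super(in);
    this.downCase = downCase;
  }

  @Override
  protected boolean isTokenChar(char c) {
    // Guessed semantics: token characters are letters and digits.
    return Character.isLetterOrDigit(c);
  }

  @Override
  protected char normalize(char c) {
    // downCase="true" folds case in the same pass over the char[];
    // this is the same trick LowerCaseTokenizer uses internally.
    return downCase ? Character.toLowerCase(c) : c;
  }
}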
All that said :) I don't promote myself as an expert and I'm happy to be
shown the light / slapped across the head.
Scott
On Tue, Sep 14, 2010 at 3:10 PM, Jonathan Rochkind <rochk...@jhu.edu> wrote:
How about patching the LetterTokenizer to be capable of tokenizing how
you want, which can then be combined with a LowerCaseFilter (or not) as
desired? Or possibly creating a new tokenizer that does exactly what you
want (but one that doesn't combine an embedded LowerCaseFilter in there
too!), instead of patching the LowerCaseTokenizer, which is of dubious
value. Just brainstorming.
Another way to tokenize based on "Non-Whitespace/Alpha/Numeric
character-content" might be using the existing PatternTokenizerFactory
with a suitable regexp, as you mention. That could of course do what the
LetterTokenizer does too, but presumably not as efficiently. Is that
what gives you an uncomfortable feeling? If it performs badly enough to
matter, then that's why you'd need a custom tokenizer; other than that,
I'm not sure anything's undesirable about the PatternTokenizer.
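(To make the regexp idea concrete -- this is just the underlying
java.util.regex behavior, not the PatternTokenizerFactory configuration
itself, and the pattern is my own guess at "split on anything that isn't
a letter or digit":)

import java.util.Arrays;

// Rough illustration: split on any run of characters that is
// neither a Unicode letter nor a Unicode digit.
public class RegexSplitDemo {
  public static void main(String[] args) {
    String[] tokens = "Foo-Bar_42 baz!".split("[^\\p{L}\\p{N}]+");
    System.out.println(Arrays.toString(tokens)); // [Foo, Bar, 42, baz]
  }
}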
Jonathan
Scott Gonyea wrote:
I'd agree with your point entirely. My attacking LowerCaseTokenizer was a
result of not wanting to create yet more Classes.
That said, rightfully dumping LowerCaseTokenizer would probably have me
creating my own Tokenizer.
I could very well be thinking about this wrong... But what if I wanted to
create tokens based on Non-Whitespace/Alpha/Numeric character-content?
It looks like I could perhaps use the PatternTokenizer, but that didn't
leave me with a comfortable feeling when I first looked into it.
Scott
On Tue, Sep 14, 2010 at 2:48 PM, Robert Muir <rcm...@gmail.com> wrote:
Jonathan, you bring up an excellent point.

I think it's worth our time to actually benchmark this LowerCaseTokenizer
versus LetterTokenizer + LowerCaseFilter.

This tokenizer is quite old, and although I can understand there is no
doubt it's technically faster than LetterTokenizer + LowerCaseFilter even
today (as it can go through the char[] only a single time), I have my
doubts that this brings any value these days...
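(A rough sketch of what that benchmark might look like, assuming the
Lucene 2.9/3.0-era API -- the corpus, iteration count, and lack of any
JIT warm-up discipline are all placeholder assumptions, so treat any
numbers as indicative at best:)

import java.io.StringReader;
import org.apache.lucene.analysis.LetterTokenizer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.TokenStream;

public class LowerCaseBench {
  static final String TEXT =
      "The Quick Brown Fox Jumped Over The Lazy Dog ";
  static final int ITERATIONS = 100000;

  // Time how long it takes to fully consume the token stream.
  static long drainNanos(boolean combined) throws Exception {
    long start = System.nanoTime();
    for (int i = 0; i < ITERATIONS; i++) {
      TokenStream ts = combined
          ? new LowerCaseTokenizer(new StringReader(TEXT))
          : new LowerCaseFilter(new LetterTokenizer(new StringReader(TEXT)));
      ts.reset();
      while (ts.incrementToken()) {
        // just drain; we only care about throughput
      }
      ts.close();
    }
    return System.nanoTime() - start;
  }

  public static void main(String[] args) throws Exception {
    System.out.println("combined: " + drainNanos(true) / 1000000 + " ms");
    System.out.println("separate: " + drainNanos(false) / 1000000 + " ms");
  }
}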
On Tue, Sep 14, 2010 at 5:23 PM, Jonathan Rochkind <rochk...@jhu.edu> wrote:

Why would you want to do that, instead of just using another tokenizer
and a lowercasefilter? It's more confusing, less DRY code not to leave
them separate -- the LowerCaseTokenizerFactory combines them anyway
because someone decided it was such a common use case that it was worth
it for the demonstrated performance advantage. (At least I hope that's
what happened, otherwise there's no excuse for it!)

Do you know you get a worthwhile performance benefit for what you're
doing? If not, why do it?
Jonathan
Scott Gonyea wrote:
I went for a different route:
https://issues.apache.org/jira/browse/LUCENE-2644
Scott
On Tue, Sep 14, 2010 at 11:18 AM, Robert Muir <rcm...@gmail.com> wrote:

On Tue, Sep 14, 2010 at 1:54 PM, Scott Gonyea <sc...@aitrus.org> wrote:
Hi,

I'm tweaking my schema and the LowerCaseTokenizerFactory doesn't create
tokens based solely on lower-casing characters. Is there a way to tell
it NOT to drop non-characters? It's amazingly frustrating that the
TokenizerFactory and the FilterFactory have two entirely different modes
of behavior. If I wanted it to tokenize based on non-lower-case
characters... wouldn't I use, say, LetterTokenizerFactory and tack on
the LowerCaseFilterFactory? Or any number of combinations that would
otherwise achieve that specific end-result?
I don't think you should use LowerCaseTokenizerFactory if you don't want
to divide text on non-letters; it's intended to do just that.

From the javadocs:

LowerCaseTokenizer performs the function of LetterTokenizer and
LowerCaseFilter together. It divides text at non-letters and converts
them to lower case. While it is functionally equivalent to the
combination of LetterTokenizer and LowerCaseFilter, there is a
performance advantage to doing the two tasks at once, hence this
(redundant) implementation.
So... Is there a way for me to tell it to NOT split based on
non-characters?

Use a different tokenizer that doesn't split on non-characters, followed
by a LowerCaseFilter.
--
Robert Muir
rcm...@gmail.com
--
Robert Muir
rcm...@gmail.com