Because (just IMO, I'm not an expert here either) the basic framework in Solr is that tokenizers tokenize, but they don't generally change bytes inside values. What changes bytes (or adds or removes tokens from the token stream initially created by a tokenizer, etc.) is filters. And there's already a LowerCaseFilter to change your bytes by lower-casing them, so if you want to lowercase your bytes, it's already there for you.
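For instance (just a sketch -- the fieldType name here is made up, but the factory classes are the standard Solr ones), lower-casing via a filter chain looks something like:

<!-- "text_lower" is a hypothetical name. WhitespaceTokenizerFactory splits only
     on whitespace (keeping punctuation and digits intact), and
     LowerCaseFilterFactory then lower-cases each token it produces. -->
<fieldType name="text_lower" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Swap in whatever tokenizer gives you the token boundaries you want; the filter doesn't care.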

And, if we're talking patches (rather than a custom thing you write for yourself), users can already combine the tokenizing behavior they want with the filtering behavior they want by mixing and matching tokenizers and filters -- no need to write new code to provide options to tokenizers, options which would have to be kept consistent with the filters to avoid confusion and which would duplicate existing filter behavior. "DRY" here applies to not giving Solr users too many duplicative options in their schema.xml, not just to the underlying Java -- although even Java code that just calls other existing Java code is still extra Java code to be maintained.

So that's why I suggest using a LowerCaseFilter to, well, lower-case your values, instead of building lower-casing into a tokenizer and submitting that as a patch.

So if you need to build or change a tokenizer to get the tokenizing behavior you want, then do that -- just don't build a LowerCaseFilter into the tokenizer too. If what you want is basically the same as the existing LetterTokenizer but with a few changes, then it might make sense to add options to the LetterTokenizer. If what you want is so different from LetterTokenizer that this is inconvenient, then maybe you need to create a new tokenizer. But either way, there's no reason to have a LowerCaseFilter built into a tokenizer -- unless maybe there is a demonstrated, non-trivial performance advantage and (for a patch to Solr rather than your own personal custom tokenizer) it's a common use case.
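Concretely (a sketch based on the LowerCaseTokenizer javadocs quoted further down in this thread), these two analyzer configurations should be functionally equivalent:

<!-- One class doing both jobs: -->
<tokenizer class="solr.LowerCaseTokenizerFactory"/>

<!-- The same behavior composed from the separate pieces: -->
<tokenizer class="solr.LetterTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>

All the combined class buys you is one pass over the char[] instead of two.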

That's all I was trying to suggest. Perhaps I've misunderstood what your needs or plans are, though.

Scott Gonyea wrote:
There are a lot of reasons, with the performance hit being notable -- but also because I feel that using a regex on something this basic amounts to a lazy hack. I'm typically against regular expressions in XML, and I'm vehemently opposed to them in cases where not using them should otherwise be quite trivial. Regarding LowerCaseFilter, etc.:

My question is: Why should LowerCaseFilter be the means by which that work
is done? I fully agree with keeping things DRY, but I'm not quite sure I
agree with how that mantra is being employed.  For instance, the two
tokenizer statements:

<tokenizer class="solr.WhiteSpaceTokenizer" downCase="true">
<tokenizer class="solr.LowerCaseLetterTokenizer">

can be written to utilize the same codebase, which makes things DRY and
*may* even be a bit more performant for less-trivial transformations.

If nothing else, I think a "CharacterTokenizer" would be a good way to go.

<tokenizer class="solr.CharacterTokenizer" downCase="true"
tokenizeSpecialCharacters="true" tokenizeWhiteSpace="true"
tokenizedCharcterClasses="wd"/>


All that said :)  I don't promote myself as an expert and I'm happy to be
shown the light / slapped across the head.

Scott

On Tue, Sep 14, 2010 at 3:10 PM, Jonathan Rochkind <rochk...@jhu.edu> wrote:

How about patching the LetterTokenizer to be capable of tokenizing how you want, which can then be combined with a LowerCaseFilter (or not) as desired? Or indeed creating a new tokenizer to do exactly what you want (but one that doesn't embed a LowerCaseFilter in there too!), instead of patching the LowerCaseTokenizer, which is of dubious value. Just brainstorming.

Another way to tokenize based on "Non-Whitespace/Alpha/Numeric character-content" might be using the existing PatternTokenizerFactory with a suitable regexp, as you mention. That could of course do what the LetterTokenizer does too, but presumably not as efficiently. Is that what gives you an uncomfortable feeling? If it performs enough worse to matter, then that's why you'd need a custom tokenizer; other than that, I'm not sure anything's undesirable about the PatternTokenizer.
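For example (untested sketch -- the regexp is my guess at "keep runs of letters and digits, split on everything else"):

<!-- group="-1" (the default) treats the pattern as a delimiter, so this splits
     on any run of characters that is neither a letter nor a digit. -->
<tokenizer class="solr.PatternTokenizerFactory" pattern="[^\p{L}\p{N}]+" group="-1"/>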


Jonathan

Scott Gonyea wrote:

I'd agree with your point entirely. My attacking LowerCaseTokenizer was a
result of not wanting to create yet more classes.

That said, rightfully dumping LowerCaseTokenizer would probably have me
creating my own Tokenizer.

I could very well be thinking about this wrong...  But what if I wanted to
create tokens based on Non-Whitespace/Alpha/Numeric character-content?

It looks like I could perhaps use the PatternTokenizer, but that didn't
leave me with a comfortable feeling when I first looked into it.

Scott

On Tue, Sep 14, 2010 at 2:48 PM, Robert Muir <rcm...@gmail.com> wrote:

Jonathan, you bring up an excellent point.

I think it's worth our time to actually benchmark this LowerCaseTokenizer versus LetterTokenizer + LowerCaseFilter.

This tokenizer is quite old, and although there is no doubt it's technically faster than LetterTokenizer + LowerCaseFilter even today (as it can go through the char[] only a single time), I have my doubts that it brings any value these days...


On Tue, Sep 14, 2010 at 5:23 PM, Jonathan Rochkind <rochk...@jhu.edu> wrote:

Why would you want to do that, instead of just using another tokenizer and a lowercasefilter? It's more confusing, less DRY code to leave them separate -- the LowerCaseTokenizerFactory combines anyway because someone decided it was such a common use case that it was worth it for the demonstrated performance advantage. (At least I hope that's what happened, otherwise there's no excuse for it!)

Do you know you get a worthwhile performance benefit for what you're doing? If not, why do it?

Jonathan


Scott Gonyea wrote:

I went for a different route:

https://issues.apache.org/jira/browse/LUCENE-2644

Scott

On Tue, Sep 14, 2010 at 11:18 AM, Robert Muir <rcm...@gmail.com> wrote:

On Tue, Sep 14, 2010 at 1:54 PM, Scott Gonyea <sc...@aitrus.org> wrote:

Hi,

I'm tweaking my schema, and the LowerCaseTokenizerFactory doesn't create tokens based solely on lower-casing characters. Is there a way to tell it NOT to drop non-characters? It's amazingly frustrating that the TokenizerFactory and the FilterFactory have two entirely different modes of behavior. If I wanted it to tokenize based on non-lower-case characters... wouldn't I use, say, LetterTokenizerFactory and tack on the LowerCaseFilterFactory? Or any number of combinations that would otherwise achieve that specific end-result?

I don't think you should use LowerCaseTokenizerFactory if you don't want to divide text on non-letters; it's intended to do just that.

From the javadocs:
LowerCaseTokenizer performs the function of LetterTokenizer and LowerCaseFilter together. It divides text at non-letters and converts them to lower case. While it is functionally equivalent to the combination of LetterTokenizer and LowerCaseFilter, there is a performance advantage to doing the two tasks at once, hence this (redundant) implementation.

So... Is there a way for me to tell it to NOT split based on non-characters?

Use a different tokenizer that doesn't split on non-characters, followed by a LowerCaseFilter

--
Robert Muir
rcm...@gmail.com

--
Robert Muir
rcm...@gmail.com




