: Operationally, I was thinking a tokenizer could use the stop-word list
: (or an optional-word list) to mark tokens as optional rather than
: removing them from the token stream. DisMaxOptional would then
: generate appropriate queries with the non-optionals as the core and
: then permute the
: frequently get queried for The Doors. Articles and prepositions
: (the stuff of good stop-lists) seem to me to be in a fuzzier class --
: use 'em if you have 'em during matching, but don't kill your queries
: because of them. Hence some desire to make them in some way
: optional during
]
To: solr-user@lucene.apache.org
Sent: Wednesday, March 26, 2008 9:05:08 PM
Subject: Re: Making stop-words optional with DisMax?
We use two fields, one with and one without stopwords. The exact
field has a higher boost than the other. That works pretty well.
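
A rough sketch of how such a two-field setup might be queried through DisMax. The field names title_exact and title_stopped and the boost values are made up for illustration; qf and pf are the standard DisMax parameters, and how the handler is selected depends on your Solr version.

```python
from urllib.parse import urlencode


def dismax_params(user_query, exact_boost=2.0, stopped_boost=1.0):
    """Build DisMax request parameters that search both fields.

    title_exact keeps stop words and title_stopped removes them; both
    names are hypothetical and must match fields in your own schema.
    """
    return {
        "defType": "dismax",
        "q": user_query,
        # Query fields: the exact field gets the higher boost.
        "qf": f"title_exact^{exact_boost} title_stopped^{stopped_boost}",
        # Phrase boost on the exact field rewards full-title matches.
        "pf": f"title_exact^{exact_boost}",
    }


params = dismax_params("the doors")
print(urlencode(params))
```

The boosts are only starting points; they would need tuning against real queries.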
Thanks for the tip, wunder! We do likewise for the pf parameter of
DisMax, and that part works well -- exact matches are highly relevant
and stopped-matches less
sure, but what logic would you suggest be used to decide when to make them
optional? :)
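
As a minimal sketch of the mark-as-optional idea discussed in this thread (this is not an existing Solr feature, and the word list and function names are invented for illustration): tokens found in an optional-word list are flagged rather than dropped, and the query builder then requires only the remaining core tokens, using the Lucene "+" prefix for required clauses.

```python
# Hypothetical optional-word list; a real one would come from configuration.
OPTIONAL_WORDS = {"the", "a", "an", "of", "in", "to"}


def mark_optional(text, optional_words=OPTIONAL_WORDS):
    """Tokenize on whitespace and flag optional words instead of dropping them."""
    return [(tok, tok in optional_words) for tok in text.lower().split()]


def build_query(text, optional_words=OPTIONAL_WORDS):
    """Require the core tokens (+) and leave optional tokens as plain clauses."""
    tokens = mark_optional(text, optional_words)
    # If every token is optional (a title made only of stop words),
    # require them all so the query keeps some mandatory clauses.
    all_optional = all(opt for _, opt in tokens)
    clauses = []
    for tok, opt in tokens:
        required = all_optional or not opt
        clauses.append(("+" if required else "") + tok)
    return " ".join(clauses)


print(build_query("The Doors"))
print(build_query("Lord of the Rings"))
```

The fallback for all-optional queries is one possible answer to the "when" question; other rules (for example, query length thresholds) would fit the same structure.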
I've followed the stop-word discussion with some interest, but I've
yet to find a solution that completely satisfies our needs. I was
wondering if anyone could suggest some other options to try short of a
custom handler or building our own queries (DisMax does such a fine
job generally!).
We are
Hi Otis,
I skimmed your email. You are indexing book and music titles. Those tend to
be short. Do you really benefit from removing stop words in the first place?
I'd try keeping all the stop words and seeing if that has any negative
side-effects in your context.
Thanks for your skim
We use two fields, one with and one without stopwords. The exact
field has a higher boost than the other. That works pretty well.
It helps to have an automated relevance test when tuning the boost
(and other things). I extracted queries and clicks from the logs
for a couple of months. Not