I've followed the stop-word discussion with some interest, but I've
yet to find a solution that completely satisfies our needs.  I was
wondering if anyone could suggest some other options to try short of a
custom handler or building our own queries (DisMax does such a fine
job generally!).

We are using DisMax, and indexing media titles (books, music).  We
want our queries to be sensitive to stop-words, but not so sensitive
that we fail to match on missing or incorrect stop-words.  For
example, here is a set of queries and the desired behavior:

* it -> matches It by Stephen King (high relevance) and other titles
with it therein, e.g. Some Like It Hot (lower relevance)
* the the -> matches music by The The, other titles with the therein
at lower relevance are fine
* the sound of music -> matches The Sound of Music high relevance
* a sound of music -> still matches The Sound of Music, lower relevance is fine
* the doors -> matches music by The Doors, even though it is indexed
just as "Doors" (our data supplier drops the definite article)
* the life -> matches titles The Life with high relevance, matches
titles of just Life with lower relevance

Basically, we want direct matches (including stop-words) to be highly
relevant and we use the phrase query mechanism for that, but we also
want matches if the user mis-remembers the correct (stopped)
prepositions or inserts a few irrelevant stop-words (like articles).
We see this in the wild with non-trivial frequency -- the wrong choice
of preposition ("on mice and men") or an article in the query that our
data supplier dropped from the indexed title ("the doors" vs. "Doors").

One thing we tried is to include both a stopped version and a
non-stopped version of the title in the qf field, in the hopes that
this would retrieve all titles without stop-words and still allow us
to include pure stop-word queries ("it").  However, DisMax constructs
queries such that mixing stopped and non-stopped fields doesn't work
as one might hope, as described well here:

http://www.nabble.com/DisMax-request-handler-doesn%27t-work-with-stopwords--td11015905.html#a11112461

Since qf controls the initial set of results retrieved for DisMax, and
we don't want to use a pure stopped set of fields there (because we
won't match on "it" as a query) nor a pure non-stopped set (won't get
results for "a sound of music"), we'd seem to be out of luck unless we
can figure out a way to augment the qf coverage.
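
To make the failure mode concrete, here is a toy model in Python (not
Solr code) of how I understand DisMax to build one clause per query
term across the qf fields.  The stop-word list and the field names
title_exact and title_stopped are made up for illustration, and the
model assumes every term is required (mm=100%).

```python
# Toy model of DisMax clause construction; not Solr code.
# STOP_WORDS and both field names are assumptions for illustration.
STOP_WORDS = {"a", "an", "the", "of", "on", "it", "and"}


def analyze(text, stopped):
    """Lowercase, split on whitespace, optionally drop stop-words."""
    tokens = text.lower().split()
    return [t for t in tokens if not (stopped and t in STOP_WORDS)]


def index_doc(title):
    """Index one title into a non-stopped and a stopped field."""
    return {
        "title_exact": set(analyze(title, stopped=False)),
        "title_stopped": set(analyze(title, stopped=True)),
    }


def dismax_matches(query, doc):
    """Every query term must match in at least one qf field (mm=100%).

    A stopped field produces no clause for a stop-word, so that term
    can only be satisfied by the non-stopped field.
    """
    fields = (("title_exact", False), ("title_stopped", True))
    for term in query.lower().split():
        clause_hits = []
        for field, stopped in fields:
            analyzed = analyze(term, stopped)
            if analyzed:  # this field contributes a clause for the term
                clause_hits.append(analyzed[0] in doc[field])
        if not any(clause_hits):
            return False
    return True


sound = index_doc("The Sound of Music")
doors = index_doc("Doors")
print(dismax_matches("the sound of music", sound))  # True
print(dismax_matches("a sound of music", sound))    # False: "a" only has a title_exact clause
print(dismax_matches("the doors", doors))           # False: "the" only has a title_exact clause
print(dismax_matches("it", index_doc("It")))        # True
```

In this model the stop-word in the query becomes a required clause
that only the non-stopped field can satisfy, which is why adding the
stopped field to qf does not rescue "a sound of music" or "the doors".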

We've tried relaxing query term requirements to allow a missing word
or two in the query via mm, but recall goes up too much: the terms
that end up unmatched are often the non-stop-words, so you get a lot
of results that match primarily just on stop-words.
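
A small Python sketch of the effect, simplified to a single
non-stopped field and a plain "at least mm terms must match" rule.
The sample titles are just picked for illustration, and this ignores
scoring entirely.

```python
# Simplified "minimum should match" check against one non-stopped field.
# The titles below are sample data chosen for illustration only.

def matches_with_mm(query, title, mm):
    """True if at least `mm` of the query terms occur in the title."""
    title_terms = set(title.lower().split())
    hits = sum(1 for t in query.lower().split() if t in title_terms)
    return hits >= mm


TITLES = [
    "The Sound of Music",
    "The Sound of Silence",
    "A Piece of Music",
    "A Taste of Honey",
]

query = "a sound of music"
for mm in (4, 3, 2):
    matched = [t for t in TITLES if matches_with_mm(query, t, mm)]
    print(mm, matched)
# 4 []
# 3 ['The Sound of Music', 'A Piece of Music']
# 2 ['The Sound of Music', 'The Sound of Silence', 'A Piece of Music', 'A Taste of Honey']
```

With mm=2, "A Taste of Honey" matches on "a" and "of" alone, which is
the kind of stop-word-only hit we are trying to avoid.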

We've also considered creating a sort of equivalence class for all
stop-words (defining synonyms to map stops to some special token)
which would allow mis-remembered stop-words to be conflated, but then
something like "it" would match anything that contained any stop-word
-- again, too high on the recall.
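
Roughly, in Python, with a made-up placeholder token standing in for
whatever the synonym mapping would emit (the stop-word list is also
just an assumption):

```python
# Sketch of the "equivalence class" idea: every stop-word is replaced
# by one shared token at index and query time.
# STOP_WORDS and STOP_TOKEN are assumptions for illustration.
STOP_WORDS = {"a", "an", "the", "of", "on", "it", "and"}
STOP_TOKEN = "_stop_"


def conflate(text):
    """Lowercase, split, and map every stop-word to STOP_TOKEN."""
    return [STOP_TOKEN if t in STOP_WORDS else t
            for t in text.lower().split()]


def matches_all(query, title):
    """True if every conflated query term occurs in the conflated title."""
    title_terms = set(conflate(title))
    return all(t in title_terms for t in conflate(query))


print(matches_all("on mice and men", "Of Mice and Men"))  # True (wanted)
print(matches_all("it", "Some Like It Hot"))              # True (fine)
print(matches_all("it", "The Sound of Music"))            # True (not wanted)
print(matches_all("it", "Doors"))                         # False
```

The mis-remembered preposition is handled, but "it" now matches any
title that contains any stop-word at all.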

What I think we want is something like an "optional stop-word DisMax"
that would mark stops as optional and construct queries such that
stop-words aren't passed into fields that apply stop-word removal in
query clauses (if that makes sense).  Has anyone done anything similar
or found a better way to handle stops that exhibits the desired
behavior?
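
To show the shape of query I have in mind, here is a rough Python
sketch that builds a Lucene-style query string by hand.  This is only
an illustration, not an existing Solr feature: build_query, the field
names, and the stop-word list are all hypothetical, escaping of
special characters is omitted, and the phrase boost we already get
from DisMax is left out.

```python
# Hypothetical "optional stop-word" query builder; not a Solr API.
# STOP_WORDS and the default field names are assumptions.
STOP_WORDS = {"a", "an", "the", "of", "on", "it", "and"}


def build_query(user_query, exact_field="title_exact",
                stopped_field="title_stopped"):
    """Require non-stop-words; make stop-words optional.

    Stop-words are only ever sent to the non-stopped field, so the
    stopped field never sees a term it would strip at query time.
    """
    terms = user_query.lower().split()
    content = [t for t in terms if t not in STOP_WORDS]
    stops = [t for t in terms if t in STOP_WORDS]
    if not content:
        # Pure stop-word query ("it", "the the"): require every term
        # in the non-stopped field, since nothing else can match.
        return " ".join(f"+{exact_field}:{t}" for t in terms)
    clauses = [f"+{stopped_field}:{t}" for t in content]
    # Optional clauses: they raise the score when the stop-word is
    # present but do not exclude titles that lack it.
    clauses += [f"{exact_field}:{t}" for t in stops]
    return " ".join(clauses)


print(build_query("a sound of music"))
# +title_stopped:sound +title_stopped:music title_exact:a title_exact:of
print(build_query("the doors"))
# +title_stopped:doors title_exact:the
print(build_query("it"))
# +title_exact:it
```

Under these assumptions "the doors" would still retrieve "Doors",
"a sound of music" would still retrieve The Sound of Music, and "it"
would only match titles that really contain "it".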

Thanks in advance for any thoughts!  And, being new to Solr, apologies
if I'm confused in my reasoning somewhere.

Ron
