Re: Making stop-words optional with DisMax?

2008-03-28 Thread Chris Hostetter

: Operationally, I was thinking a tokenizer could use the stop-word list
: (or an optional-word list) to mark tokens as optional rather than
: removing them from the token stream.  DisMaxOptional would then
: generate appropriate queries with the non-optionals as the core and
: then permute the optionals around those as optional clauses.  I say
: this with no deep understanding of how DisMax does its thing, of
: course, so feel free to call me naive.

you're not naive ... the problem is just that *all* of the clauses are 
already optional (unless the term had a + or - in front of it); 
that's where the mm param comes in: it decides how many of those optional 
clauses should be mandatory.
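
A minimal Python sketch of what mm does to a set of optional clauses. It is an illustration only, not Solr code: the function names are made up, only plain integer and percentage forms of mm are handled (conditional forms are ignored), and matching is reduced to simple term membership.

```python
def min_should_match(num_optional: int, mm: str) -> int:
    """Simplified mm: supports forms like '2', '-1', '75%' and '-25%' only."""
    mm = mm.strip()
    if mm.endswith("%"):
        pct = int(mm[:-1])
        calc = int(num_optional * pct / 100)  # truncates toward zero
        required = num_optional + calc if pct < 0 else calc
    else:
        n = int(mm)
        required = num_optional + n if n < 0 else n
    # Clamp to the valid range of clause counts.
    return max(0, min(num_optional, required))


def matches(query_terms, doc_terms, mm: str) -> bool:
    """True if enough of the optional query terms occur in the document."""
    hits = sum(1 for term in query_terms if term in doc_terms)
    return hits >= min_should_match(len(query_terms), mm)


query = "on mice and men".split()
title = set("of mice and men".split())
print(matches(query, title, "100%"))  # every term required
print(matches(query, title, "75%"))   # one of four terms may be missing
```

With a strict mm the wrong preposition sinks the query; with a looser mm it survives, but so would a query missing "mice" instead of "on".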

it sounds like what you want is for a new DisMaxOptional parser to look at 
this...

on mice and men

and because it knows "on" and "and" are stop words, treat it the same as 
if the current DisMax parsed this...

on +mice and +men

which is another interesting idea, but it changes the meaning of mm 
significantly, in that dismax with a low mm would no longer be tolerant of 
misspelled (or missing) words unless they were stop words.
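
The rewrite described above can be sketched in a few lines of Python. No DisMaxOptional parser exists; the stop list and function name here are hypothetical, and whitespace splitting stands in for real analysis.

```python
STOP_WORDS = {"a", "an", "and", "of", "on", "the"}  # hypothetical stop list


def mark_non_stops_required(query: str, stop_words=STOP_WORDS) -> str:
    """Prefix every non-stop-word with '+', leaving stop words optional."""
    out = []
    for term in query.split():
        if term.lower() in stop_words:
            out.append(term)        # stop word: stays an optional clause
        else:
            out.append("+" + term)  # content word: becomes mandatory
    return " ".join(out)


print(mark_non_stops_required("on mice and men"))  # on +mice and +men
```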

my gut tells me changing dismax so that having multiple qf params result 
in multiple dismax queries would address your problem more directly.

: I think I've so internalized list advice *not* to generate multiple
: queries that that didn't readily occur to me.  :-)   One problem I
: suppose is that query might return some results but not the desired
: one (perhaps there is a title "On Men and Mice") and so I don't get to
: the second query ("mice men" once stopped) that would get me "Of Mice
: and Men".  But an improvement in cases where no results come back from
: an overspecified query, I'd agree.

...which is why multiple dismax queries as clauses in the main query 
would be good ... the results from each would be blended together.

: The other thought I've had is to just do some query analysis up front
: prior to submission -- if the query is all stops, send it to a
...
: to boost up exact matches.  I hate the analysis step which would
: probably duplicate the tokenization done by solr, but might be worth
: it.  There'd still be some problematic queries, but this may be as
: close as it'll get.

you could probably skip the external analysis by swapping the order of 
your queries and looking at the debugging output when hitting the second 
query ... if your stopworded fields don't appear in the parsed query 
structure, then it's all stopwords, so you do need your first query.
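
A rough Python sketch of that check, assuming you have already pulled the parsed query string out of the debugging output. The field names and the sample strings are invented for illustration and are only loosely shaped like a real parsed query dump.

```python
import re


def mentions_field(parsed_query: str, field: str) -> bool:
    """True if a clause on `field` appears in the parsed query string."""
    return re.search(r"\b" + re.escape(field) + r":", parsed_query) is not None


def is_all_stopwords(parsed_query: str, stopped_fields) -> bool:
    """If no stopped field survived analysis, every term was a stop word."""
    return not any(mentions_field(parsed_query, f) for f in stopped_fields)


# Made-up examples, not real Solr output.
mixed = 'title_stopped:mice title_stopped:men title_exact:"of mice and men"'
only_stops = 'title_exact:"the the"'

print(is_all_stopwords(mixed, ["title_stopped"]))
print(is_all_stopwords(only_stops, ["title_stopped"]))
```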


-Hoss



Re: Making stop-words optional with DisMax?

2008-03-27 Thread Chris Hostetter

: frequently get queried for "The Doors".  Articles and prepositions
: (the stuff of good stop-lists) seem to me to be in a fuzzier class --
: use 'em if you have 'em during matching, but don't kill your queries
: because of them.  Hence some desire to make them in some way
: optional during matching.

sure, but what logic would you suggest be used to decide when to make them 
optional?  :)

based on your problem description (which was excellent by the way ... 
questions full of details are so great, you never have to worry that you 
are misunderstanding the problem)  the best suggestion i can give is one 
that i usually discourage:  execute multiple queries.

start by hitting Solr using a qf with fields that contain stop words.  if 
you get 0 hits, then query with a qf that contains all fields that don't 
have stop words in them, (but you can leave them in pf).
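
A minimal Python sketch of the two-pass approach. The field names (title_exact, title_stopped) are hypothetical, and `search` is whatever function you use to call Solr; it is assumed to take a dict of request params and return a dict shaped like Solr's JSON response.

```python
def search_with_fallback(search, user_query):
    """Try fields that keep stop words first; fall back to stopped fields."""
    # First pass: qf limited to fields that keep stop words.
    exact = search({"q": user_query, "qf": "title_exact", "pf": "title_exact"})
    if exact["response"]["numFound"] > 0:
        return exact
    # Second pass: qf limited to fields with stop words removed; the exact
    # field stays in pf so exact phrase matches still get a boost.
    return search({
        "q": user_query,
        "qf": "title_stopped",
        "pf": "title_exact title_stopped",
    })


def fake_search(params):
    """Stand-in for a real HTTP call, for illustration only."""
    found = 0 if params["qf"] == "title_exact" else 1
    return {"response": {"numFound": found, "docs": []}, "qf_used": params["qf"]}


result = search_with_fallback(fake_search, "on mice and men")
print(result["qf_used"])
```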

In an ideal world, the DisMax handler would let you specify N qf options, 
and each one would be used to build a separate DisjunctionMaxQuery and 
then they'd all be combined into the uber BooleanQuery as optional clauses 
-- but in the absence of that, two queries is probably your best bet.
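
To make the blending idea concrete, here is a simplified Python sketch of the scoring. It treats each qf group as one DisjunctionMaxQuery (best field score plus a tie-breaker share of the rest) and sums the groups the way optional clauses of a BooleanQuery add up. Normalization and other scoring factors are ignored, and the numbers are invented.

```python
def dismax_score(field_scores, tie=0.0):
    """Max of the per-field scores, plus tie * (sum of the other scores)."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)


def blended_score(per_qf_field_scores, tie=0.0):
    """Sum the dismax score of every qf group that matched the document."""
    return sum(dismax_score(scores, tie) for scores in per_qf_field_scores)


# One document, two hypothetical qf groups: exact fields and stopped fields.
exact_group = [0.8]
stopped_group = [0.5, 0.25]
print(blended_score([exact_group, stopped_group], tie=0.0))
```

A document matching only the stopped group still scores, so it stays in the result set without a second round trip.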

(hmmm... actually qf is currently a single value param -- multiple values 
aren't supported -- so if someone wrote a patch to do something like i 
described it would be backward compatible ... anybody interested?)


-Hoss



Re: Making stop-words optional with DisMax?

2008-03-27 Thread Otis Gospodnetic
If you have "doors" in your index and a person enters "the doors", why not 
just drop stop-words at query time?
If a person searches for "music by the doors", you have "music doors" in the 
index, and the person really uses quotes to get the exact phrase, you can try it 
like Hoss said and retry without stop words if you get an inadequate response 
from the first query, or you could drop stop words from the phrase but add some 
slop to the phrase to account for gaps.
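
A small Python sketch of the second option, using the standard "phrase"~N slop syntax. The stop list is hypothetical, and allowing one position of slop per removed word is an assumption rather than a tested setting.

```python
STOP_WORDS = {"a", "an", "and", "by", "of", "on", "the"}  # hypothetical list


def stopped_phrase_with_slop(phrase: str, stop_words=STOP_WORDS) -> str:
    """Drop stop words from a phrase and add slop to cover the gaps."""
    terms = phrase.split()
    kept = [t for t in terms if t.lower() not in stop_words]
    if not kept:
        # Pure stop-word phrase: leave it untouched.
        return '"' + phrase + '"'
    slop = len(terms) - len(kept)  # assumed: one position per removed word
    query = '"' + " ".join(kept) + '"'
    return query + "~" + str(slop) if slop else query


print(stopped_phrase_with_slop("music by the doors"))  # "music doors"~2
```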

Otis 

--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch






Re: Making stop-words optional with DisMax?

2008-03-27 Thread Ronald K. Braun
> We use two fields, one with and one without stopwords. The exact
> field has a higher boost than the other. That works pretty well.

Thanks for the tip, wunder!  We are doing likewise for our pf parm of
DisMax and that part works well -- exact matches are highly relevant
and stopped-matches less so but still present in the results set.  The
main problem is getting past the qf parm such that we don't have
invisible titles (stop-words removed by the qf pipeline leaving an
empty query) or over-specified generated queries (where stop-words
turn out to be required but can't match for various reasons).

> It helps to have an automated relevance test when tuning the boost
> (and other things). I extracted queries and clicks from the logs
> for a couple of months. Not perfect, but it is hard to argue with
> 32 million clicks.

I'd say -- a dream data set.  :-)  Good idea on the relevance test --
eyeballing boost changes seems definitely prone to unexpected effects
across all of the queries one didn't think to try.  (A dark art, boost
tuning...)

Ron


Re: Making stop-words optional with DisMax?

2008-03-27 Thread Ronald K. Braun
> sure, but what logic would you suggest be used to decide when to make them
> optional?  :)

Operationally, I was thinking a tokenizer could use the stop-word list
(or an optional-word list) to mark tokens as optional rather than
removing them from the token stream.  DisMaxOptional would then
generate appropriate queries with the non-optionals as the core and
then permute the optionals around those as optional clauses.  I say
this with no deep understanding of how DisMax does its thing, of
course, so feel free to call me naive.

As to what words to put in the optionals list, the function words
(articles and prepositions) seem to be the ones that folks either omit
or confuse, so they'd be good candidates.

> start by hitting Solr using a qf with fields that contain stop words.  if
> you get 0 hits, then query with a qf that contains all fields that don't
> have stop words in them, (but you can leave them in pf).

I think I've so internalized list advice *not* to generate multiple
queries that that didn't readily occur to me.  :-)   One problem I
suppose is that query might return some results but not the desired
one (perhaps there is a title "On Men and Mice") and so I don't get to
the second query ("mice men" once stopped) that would get me "Of Mice
and Men".  But an improvement in cases where no results come back from
an overspecified query, I'd agree.

The other thought I've had is to just do some query analysis up front
prior to submission -- if the query is all stops, send it to a
separate handler that doesn't do stop-word removal in the qf
specification, otherwise if any non-stop-word exists, send it to a
handler with a qf that does remove stops and rely on the pf component
to boost up exact matches.  I hate the analysis step which would
probably duplicate the tokenization done by solr, but might be worth
it.  There'd still be some problematic queries, but this may be as
close as it'll get.
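
A minimal Python sketch of that up-front routing. The handler names and stop list are hypothetical, and plain whitespace splitting is used here, which will not exactly match Solr's own tokenization.

```python
STOP_WORDS = {"a", "an", "and", "it", "of", "on", "the"}  # hypothetical list


def choose_handler(query: str, stop_words=STOP_WORDS) -> str:
    """Route pure stop-word queries to a handler whose qf keeps stop words."""
    terms = query.lower().split()
    if terms and all(term in stop_words for term in terms):
        return "dismax_exact"    # hypothetical: qf without stop-word removal
    return "dismax_stopped"      # hypothetical: qf with stop-word removal


for q in ("the the", "it", "the doors"):
    print(q, "->", choose_handler(q))
```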

Thanks for the suggestions, Hoss!

Ron


Making stop-words optional with DisMax?

2008-03-26 Thread Ronald K. Braun
I've followed the stop-word discussion with some interest, but I've
yet to find a solution that completely satisfies our needs.  I was
wondering if anyone could suggest some other options to try short of a
custom handler or building our own queries (DisMax does such a fine
job generally!).

We are using DisMax, and indexing media titles (books, music).  We
want our queries to be sensitive to stop-words, but not so sensitive
that we fail to match on missing or incorrect stop-words.  For
example, here are a set of queries and desired behavior:

* "it" - matches "It" by Stephen King (high relevance) and other titles
with "it" therein, e.g. "Some Like It Hot" (lower relevance)
* "the the" - matches music by The The; other titles with "the" therein
at lower relevance are fine
* "the sound of music" - matches "The Sound of Music" with high relevance
* "a sound of music" - still matches "The Sound of Music"; lower relevance is fine
* "the doors" - matches music by The Doors, even though it is indexed
just as "Doors" (our data supplier drops the definite article)
* "the life" - matches titles "The Life" with high relevance, matches
titles of just "Life" with lower relevance

Basically, we want direct matches (including stop-words) to be highly
relevant and we use the phrase query mechanism for that, but we also
want matches if the user mis-remembers the correct (stopped)
prepositions or inserts a few irrelevant stop-words (like articles).
We see this in the wild with non-trivial frequency -- the wrong choice
of preposition ("on mice and men") or an article used that our data
supplier didn't include in the original version ("doors").

One thing we tried is to include both a stopped version and a
non-stopped version of the title in the qf field, in the hopes that
this would retrieve all titles without stop-words and still allow us
to include pure stop-word queries ("it").  However, DisMax constructs
queries such that mixing stopped and non-stopped fields doesn't work
as one might hope, as described well here:

http://www.nabble.com/DisMax-request-handler-doesn%27t-work-with-stopwords--td11015905.html#a2461

Since qf controls the initial set of results retrieved for DisMax, and
we don't want to use a pure stopped set of fields there (because we
won't match on "it" as a query) nor a pure non-stopped set (won't get
results for "a sound of music"), we'd seem to be out of luck unless we
can figure out a way to augment the qf coverage.

We've tried relaxing query term requirements to allow a missing word
or two in the query via mm, but recall is amped up too much since
non-stop-words tend to be dropped and you get a lot of results that
match primarily just across stop-words.

We've also considered creating a sort of equivalence class for all
stop-words (defining synonyms to map stops to some special token)
which would allow mis-remembered stop-words to be conflated, but then
something like "it" would match anything that contained any stop-word
-- again, too high on the recall.
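
A toy Python sketch of the equivalence-class idea and why it over-matches. The stop list and the placeholder token are invented, and matching is reduced to "shares any term", which is cruder than what Solr would do.

```python
STOP_WORDS = {"a", "an", "and", "it", "of", "on", "the"}  # hypothetical list
STOP_TOKEN = "_stop_"  # hypothetical special token all stop words map to


def conflate(text: str, stop_words=STOP_WORDS):
    """Replace every stop word with the same placeholder token."""
    return [STOP_TOKEN if t in stop_words else t for t in text.lower().split()]


def shares_a_term(query: str, title: str) -> bool:
    """Crude match test: any conflated term in common."""
    return bool(set(conflate(query)) & set(conflate(title)))


# The mis-remembered preposition is now harmless...
print(conflate("on mice and men") == conflate("of mice and men"))
# ...but a pure stop-word query hits any title containing a stop word.
print(shares_a_term("it", "the sound of music"))
```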

What I think we want is something like an optional stop-word DisMax
that would mark stops as optional and construct queries such that
stop-words aren't passed into fields that apply stop-word removal in
query clauses (if that makes sense).  Has anyone done anything similar
or found a better way to handle stops that exhibits the desired
behavior?

Thanks in advance for any thoughts!  And, being new to Solr, apologies
if I'm confused in my reasoning somewhere.

Ron


Re: Making stop-words optional with DisMax?

2008-03-26 Thread Ronald K. Braun
Hi Otis,

> I skimmed your email.  You are indexing book and music titles.  Those
> tend to be short.  Do you really benefit from removing stop words in
> the first place?  I'd try keeping all the stop words and seeing if
> that has any negative side-effects in your context.

Thanks for your skim and response!  We do keep all stop-words -- as
you say, makes sense since we aren't dealing with long free text
fields and because some titles are pure stops.

The negative side-effects lie in stop-words being treated with the
same importance as non-stop-words for matching purposes.  This
manifests in two ways:  1. Users occasionally get the stop-words wrong
-- say, wrong choice of preposition, which torpedoes the query since
some of the query terms aren't present in the target.  For example, "on
mice and men" may return nothing (no match for "on") even though it is
equivalent to "of mice and men" in a stopped sense.  2. Our original
indexed data doesn't always have leading articles and such.  For
example, we index on "Doors" since that is our sourced data but
frequently get queried for "The Doors".  Articles and prepositions
(the stuff of good stop-lists) seem to me to be in a fuzzier class --
use 'em if you have 'em during matching, but don't kill your queries
because of them.  Hence some desire to make them in some way
optional during matching.

Ron


Re: Making stop-words optional with DisMax?

2008-03-26 Thread Walter Underwood
We use two fields, one with and one without stopwords. The exact
field has a higher boost than the other. That works pretty well.

It helps to have an automated relevance test when tuning the boost
(and other things). I extracted queries and clicks from the logs
for a couple of months. Not perfect, but it is hard to argue with
32 million clicks.
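
One way such a test could look, sketched in Python. The message above does not say which metric was used; mean reciprocal rank over (query, clicked document) pairs is an assumption here, and `search` stands for any function that returns a ranked list of document ids for a query.

```python
def mean_reciprocal_rank(search, clicks):
    """Average of 1/rank of the clicked document; 0 when it is not returned."""
    if not clicks:
        return 0.0
    total = 0.0
    for query, clicked_id in clicks:
        ranked = search(query)
        if clicked_id in ranked:
            total += 1.0 / (ranked.index(clicked_id) + 1)
    return total / len(clicks)


def fake_search(query):
    """Stand-in for a real search call, with made-up document ids."""
    index = {"the doors": ["d1", "d2"], "it": ["x9", "it1"]}
    return index.get(query, [])


clicks = [("the doors", "d1"), ("it", "it1"), ("on mice and men", "m1")]
print(mean_reciprocal_rank(fake_search, clicks))
```

Re-running the same click set after each boost change gives a single number to compare, instead of eyeballing a handful of queries.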

wunder
