Re: Making stop-words optional with DisMax?
: Operationally, I was thinking a tokenizer could use the stop-word list
: (or an optional-word list) to mark tokens as optional rather than
: removing them from the token stream. DisMaxOptional would then
: generate appropriate queries with the non-optionals as the core and
: then permute the optionals around those as optional clauses. I say
: this with no deep understanding of how DisMax does its thing, of
: course, so feel free to call me naive.

you're not naive ... the problem is just that *all* of the clauses are already optional (unless the term had a + or - in front of it), that's where the mm param comes in, it decides how many of those optional clauses should be mandatory.

it sounds like what you want is for a new DisMaxOptional parser to look at this...

    on mice and men

and because it knows "on" and "and" are stop words, treat it the same as if the current DisMax parsed this...

    on +mice and +men

which is another interesting idea, but it changes the meaning of mm significantly, in that dismax with a low mm would no longer be tolerant of misspelled (or missing) words unless they were stop words.

my gut tells me changing dismax so that having multiple qf params result in multiple dismax queries would address your problem more directly.

: I think I've so internalized list advice *not* to generate multiple
: queries that that didn't readily occur to me. :-) One problem I
: suppose is that query might return some results but not the desired
: one (perhaps there is a title On Men and Mice) and so I don't get to
: the second query (mice men once stopped) that would get me Of Mice
: and Men. But an improvement in cases where no results come back from
: an overspecified query, I'd agree.

...which is why multiple dismax queries as clauses in the main query would be good ... the results from each would be blended together.

: The other thought I've had is to just do some query analysis up front
: prior to submission -- if the query is all stops, send it to a ...
: to boost up exact matches. I hate the analysis step which would
: probably duplicate the tokenization done by solr, but might be worth
: it. There'd still be some problematic queries, but this may be as
: close as it'll get.

you could probably skip the external analysis by swapping the order of your queries and looking at the debugging output when hitting the second query ... if your stopworded fields don't appear in the parsed query structure, then it's all stopwords, so you do need your first query.

-Hoss
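The debug-output check Hoss describes could be sketched roughly as follows. This assumes the parsed query is available as a string in which clauses look like `field:term`; the field name `title_stopped` and both sample strings are made up for illustration and are not real Solr output.

```python
import re
from typing import Iterable


def is_all_stopwords(parsed_query: str, stopped_fields: Iterable[str]) -> bool:
    """Return True when no stopped field contributed a clause.

    Assumes clauses appear in the parsed query string as `field:term`.
    """
    for field in stopped_fields:
        # Match the field name only where it starts a clause, so that
        # `title_stopped` does not match inside `alt_title_stopped`.
        pattern = r"(?<![\w.])" + re.escape(field) + r":"
        if re.search(pattern, parsed_query):
            return False
    return True


# Hypothetical parsed-query strings, invented for this example.
only_stops = 'title_exact:"the the"'
has_content = '(title_stopped:doors) title_exact:"the doors"'

print(is_all_stopwords(only_stops, ["title_stopped"]))   # True
print(is_all_stopwords(has_content, ["title_stopped"]))  # False
```

When the function returns True, the query was made up entirely of stop words, so the query against the non-stopped fields is still needed.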
Re: Making stop-words optional with DisMax?
: frequently get queried for The Doors. Articles and prepositions
: (the stuff of good stop-lists) seem to me to be in a fuzzier class --
: use 'em if you have 'em during matching, but don't kill your queries
: because of them. Hence some desire to make them in some way
: optional during matching.

sure, but what logic would you suggest be used to decide when to make them optional? :)

based on your problem description (which was excellent by the way ... questions full of details are so great, you never have to worry that you are misunderstanding the problem) the best suggestion i can give is one that i usually discourage: execute multiple queries.

start by hitting Solr using a qf with fields that contain stop words. if you get 0 hits, then query with a qf that contains all fields that don't have stop words in them (but you can leave them in pf).

In an ideal world, the DisMax handler would let you specify N qf options, and each one would be used to build a separate DisjunctionMaxQuery and then they'd all be combined into the uber BooleanQuery as optional clauses -- but in the absence of that, two queries is probably your best bet.

(hmmm... actually qf is currently a single value param -- multiple values aren't supported -- so if someone wrote a patch to do something like i described it would be backward compatible ... anybody interested?)

-Hoss
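A minimal client-side sketch of the two-query fallback described above. The field names `title_exact` and `title_stopped` are hypothetical, and `search` stands in for whatever function the application uses to send parameters to Solr and return the matching documents; selecting the DisMax handler itself is left to that function.

```python
from typing import Callable, Dict, List

# Hypothetical field names; substitute the fields from your own schema.
EXACT_QF = "title_exact"      # fields that keep stop words
STOPPED_QF = "title_stopped"  # fields that apply stop-word removal
PF = "title_exact"            # the phrase field can stay on the exact field

Params = Dict[str, str]


def dismax_params(q: str, qf: str, pf: str = PF) -> Params:
    """Build the q/qf/pf parameters for one DisMax request."""
    return {"q": q, "qf": qf, "pf": pf}


def search_with_fallback(q: str, search: Callable[[Params], List[str]]) -> List[str]:
    """Query the exact fields first; on zero hits, retry on the stopped fields."""
    docs = search(dismax_params(q, EXACT_QF))
    if docs:
        return docs
    return search(dismax_params(q, STOPPED_QF))
```

As Ron notes elsewhere in the thread, this only helps when the first query returns nothing; a first query that returns some hits, but not the desired one, never reaches the second step.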
Re: Making stop-words optional with DisMax?
If you have "doors" in your index and a person enters "the doors", why not just drop stop-words at query time? If a person searches for "music by the doors", you have "music doors" in the index, and the person really uses quotes to get the exact phrase, you can try it like Hoss said and retry without stop words if you get an inadequate response from the first query, or you could drop stop words from the phrase but add some slop to the phrase to account for gaps.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message -----
From: Ronald K. Braun [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Wednesday, March 26, 2008 9:05:08 PM
Subject: Re: Making stop-words optional with DisMax?

Hi Otis,

: I skimmed your email. You are indexing book and music titles. Those
: tend to be short. Do you really benefit from removing stop words in
: the first place? I'd try keeping all the stop words and seeing if that
: has any negative side-effects in your context.

Thanks for your skim and response! We do keep all stop-words -- as you say, makes sense since we aren't dealing with long free text fields and because some titles are pure stops. The negative side-effects lie in stop-words being treated with the same importance as non-stop-words for matching purposes. This manifests in two ways:

1. Users occasionally get the stop-words wrong -- say, wrong choice of preposition, which torpedoes the query since some of the query terms aren't present in the target. For example, "on mice and men" may return nothing (no match for "on") even though it is equivalent to "of mice and men" in a stopped sense.

2. Our original indexed data doesn't always have leading articles and such. For example, we index on Doors since that is our sourced data but frequently get queried for The Doors.

Articles and prepositions (the stuff of good stop-lists) seem to me to be in a fuzzier class -- use 'em if you have 'em during matching, but don't kill your queries because of them. Hence some desire to make them in some way optional during matching.

Ron
Re: Making stop-words optional with DisMax?
: We use two fields, one with and one without stopwords. The exact
: field has a higher boost than the other. That works pretty well.

Thanks for the tip, wunder! We are doing likewise for our pf param of DisMax and that part works well -- exact matches are highly relevant and stopped-matches less so but still present in the results set. The main problem is getting past the qf param such that we don't have invisible titles (stop-words removed by the qf pipeline leaving an empty query) or over-specified generated queries (where stop-words turn out to be required but can't match for various reasons).

: It helps to have an automated relevance test when tuning the boost
: (and other things). I extracted queries and clicks from the logs for
: a couple of months. Not perfect, but it is hard to argue with
: 32 million clicks.

I'd say -- a dream data set. :-) Good idea on the relevance test -- eyeballing boost changes seems definitely prone to unexpected effects across all of the queries one didn't think to try. (A dark art, boost tuning...)

Ron
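One way such a click-based relevance test might be scored is sketched below. The thread does not say which metric wunder used; mean reciprocal rank is an assumption chosen here for illustration, and `rank` is a hypothetical function that runs a query under the boost settings being tested and returns document ids in ranked order.

```python
from typing import Callable, Iterable, List, Tuple


def mean_reciprocal_rank(
    clicks: Iterable[Tuple[str, str]],
    rank: Callable[[str], List[str]],
) -> float:
    """Average 1/position of the clicked document over logged (query, doc id) pairs.

    A click whose document is missing from the results contributes 0.
    """
    total = 0.0
    count = 0
    for query, clicked_id in clicks:
        results = rank(query)
        count += 1
        if clicked_id in results:
            total += 1.0 / (results.index(clicked_id) + 1)
    return total / count if count else 0.0
```

Running the same logged clicks against two boost configurations and comparing the scores gives a repeatable check in place of eyeballing individual queries.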
Re: Making stop-words optional with DisMax?
: sure, but what logic would you suggest be used to decide when to make
: them optional? :)

Operationally, I was thinking a tokenizer could use the stop-word list (or an optional-word list) to mark tokens as optional rather than removing them from the token stream. DisMaxOptional would then generate appropriate queries with the non-optionals as the core and then permute the optionals around those as optional clauses. I say this with no deep understanding of how DisMax does its thing, of course, so feel free to call me naive. As to what words to put in the optionals list, the function words (articles and prepositions) seem to be the ones that folks either omit or confuse, so they'd be good candidates.

: start by hitting Solr using a qf with fields that contain stop words.
: if you get 0 hits, then query with a qf that contains all fields that
: don't have stop words in them, (but you can leave them in pf).

I think I've so internalized list advice *not* to generate multiple queries that that didn't readily occur to me. :-) One problem I suppose is that query might return some results but not the desired one (perhaps there is a title On Men and Mice) and so I don't get to the second query (mice men once stopped) that would get me Of Mice and Men. But an improvement in cases where no results come back from an overspecified query, I'd agree.

The other thought I've had is to just do some query analysis up front prior to submission -- if the query is all stops, send it to a separate handler that doesn't do stop-word removal in the qf specification, otherwise if any non-stop-word exists, send it to a handler with a qf that does remove stops and rely on the pf component to boost up exact matches. I hate the analysis step which would probably duplicate the tokenization done by solr, but might be worth it. There'd still be some problematic queries, but this may be as close as it'll get.

Thanks for the suggestions, Hoss!

Ron
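The up-front query analysis could look something like the sketch below. The stop-word list and the handler labels `exact` and `stopped` are made up for illustration, and plain whitespace splitting is a crude stand-in for Solr's own tokenization, which is exactly the duplication mentioned above.

```python
# Illustrative stop list only; a real one would mirror the list Solr uses.
STOP_WORDS = frozenset({"a", "an", "and", "it", "of", "on", "the"})


def choose_handler(query: str) -> str:
    """Route all-stop-word queries to the handler whose qf keeps stop words.

    Returns "exact" for a query made up only of stop words, otherwise
    "stopped" (the handler whose qf fields remove stop words).
    """
    tokens = query.lower().split()
    if tokens and all(token in STOP_WORDS for token in tokens):
        return "exact"
    return "stopped"


print(choose_handler("the the"))    # exact
print(choose_handler("the doors"))  # stopped
```

Queries such as "the life" still go to the stopped handler here, so ranking The Life above Life depends entirely on the pf boost.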
Making stop-words optional with DisMax?
I've followed the stop-word discussion with some interest, but I've yet to find a solution that completely satisfies our needs. I was wondering if anyone could suggest some other options to try short of a custom handler or building our own queries (DisMax does such a fine job generally!).

We are using DisMax, and indexing media titles (books, music). We want our queries to be sensitive to stop-words, but not so sensitive that we fail to match on missing or incorrect stop-words. For example, here is a set of queries and desired behavior:

* it - matches It by Stephen King (high relevance) and other titles with "it" therein, e.g. Some Like It Hot (lower relevance)
* the the - matches music by The The, other titles with "the" therein at lower relevance are fine
* the sound of music - matches The Sound of Music with high relevance
* a sound of music - still matches The Sound of Music, lower relevance is fine
* the doors - matches music by The Doors, even though it is indexed just as Doors (our data supplier drops the definite article)
* the life - matches titles The Life with high relevance, matches titles of just Life with lower relevance

Basically, we want direct matches (including stop-words) to be highly relevant and we use the phrase query mechanism for that, but we also want matches if the user mis-remembers the correct (stopped) prepositions or inserts a few irrelevant stop-words (like articles). We see this in the wild with non-trivial frequency -- the wrong choice of preposition (on mice and men) or an article used that our data supplier didn't include in the original version (doors).

One thing we tried is to include both a stopped version and a non-stopped version of the title in the qf field, in the hopes that this would retrieve all titles without stop-words and still allow us to include pure stop-word queries (it). However, DisMax constructs queries such that mixing stopped and non-stopped fields doesn't work as one might hope, as described well here:

http://www.nabble.com/DisMax-request-handler-doesn%27t-work-with-stopwords--td11015905.html#a2461

Since qf controls the initial set of results retrieved for DisMax, and we don't want to use a pure stopped set of fields there (because we won't match on "it" as a query) nor a pure non-stopped set (won't get results for "a sound of music"), we'd seem to be out of luck unless we can figure out a way to augment the qf coverage.

We've tried relaxing query term requirements to allow a missing word or two in the query via mm, but recall is amped up too much since non-stop-words tend to be dropped and you get a lot of results that match primarily just across stop-words. We've also considered creating a sort of equivalence class for all stop-words (defining synonyms to map stops to some special token) which would allow mis-remembered stop-words to be conflated, but then something like "it" would match anything that contained any stop-word -- again, too high on the recall.

What I think we want is something like an optional stop-word DisMax that would mark stops as optional and construct queries such that stop-words aren't passed into fields that apply stop-word removal in query clauses (if that makes sense). Has anyone done anything similar or found a better way to handle stops that exhibits the desired behavior?

Thanks in advance for any thoughts! And, being new to Solr, apologies if I'm confused in my reasoning somewhere.

Ron
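The mixed-field problem can be illustrated with a toy model. This is a simplification written for this example, not DisMax's actual implementation: it assumes each query term becomes one clause that may match in either field, that every clause is required (as with mm at 100%), and that a field whose analyzer removes the term contributes nothing to that term's clause. The stop list is illustrative.

```python
from typing import List

# Illustrative stop list only.
STOP_WORDS = frozenset({"a", "an", "and", "it", "of", "on", "the"})


def analyze(text: str, remove_stops: bool) -> List[str]:
    """Lowercase and split on whitespace, optionally dropping stop words."""
    tokens = text.lower().split()
    return [t for t in tokens if not (remove_stops and t in STOP_WORDS)]


def matches(query: str, title: str) -> bool:
    """Toy model: every query term must match in the exact or the stopped field."""
    exact_terms = set(analyze(title, remove_stops=False))
    stopped_terms = set(analyze(title, remove_stops=True))
    for term in analyze(query, remove_stops=False):
        in_exact = term in exact_terms
        # A stop word is removed by the stopped field's analyzer, so that
        # field adds no alternative for this term.
        in_stopped = term not in STOP_WORDS and term in stopped_terms
        if not (in_exact or in_stopped):
            return False
    return True


print(matches("the doors", "Doors"))                      # False
print(matches("a sound of music", "The Sound of Music"))  # False
print(matches("it", "It"))                                # True
```

In this model the stop word in the query can only be satisfied by the non-stopped field, so "the doors" fails against a title indexed as Doors even though the stopped field would match on "doors" alone.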
Re: Making stop-words optional with DisMax?
Hi Otis,

: I skimmed your email. You are indexing book and music titles. Those
: tend to be short. Do you really benefit from removing stop words in
: the first place? I'd try keeping all the stop words and seeing if that
: has any negative side-effects in your context.

Thanks for your skim and response! We do keep all stop-words -- as you say, makes sense since we aren't dealing with long free text fields and because some titles are pure stops. The negative side-effects lie in stop-words being treated with the same importance as non-stop-words for matching purposes. This manifests in two ways:

1. Users occasionally get the stop-words wrong -- say, wrong choice of preposition, which torpedoes the query since some of the query terms aren't present in the target. For example, "on mice and men" may return nothing (no match for "on") even though it is equivalent to "of mice and men" in a stopped sense.

2. Our original indexed data doesn't always have leading articles and such. For example, we index on Doors since that is our sourced data but frequently get queried for The Doors.

Articles and prepositions (the stuff of good stop-lists) seem to me to be in a fuzzier class -- use 'em if you have 'em during matching, but don't kill your queries because of them. Hence some desire to make them in some way optional during matching.

Ron
Re: Making stop-words optional with DisMax?
We use two fields, one with and one without stopwords. The exact field has a higher boost than the other. That works pretty well.

It helps to have an automated relevance test when tuning the boost (and other things). I extracted queries and clicks from the logs for a couple of months. Not perfect, but it is hard to argue with 32 million clicks.

wunder

On 3/26/08 6:05 PM, Ronald K. Braun [EMAIL PROTECTED] wrote:

Hi Otis,

: I skimmed your email. You are indexing book and music titles. Those
: tend to be short. Do you really benefit from removing stop words in
: the first place? I'd try keeping all the stop words and seeing if that
: has any negative side-effects in your context.

Thanks for your skim and response! We do keep all stop-words -- as you say, makes sense since we aren't dealing with long free text fields and because some titles are pure stops. The negative side-effects lie in stop-words being treated with the same importance as non-stop-words for matching purposes. This manifests in two ways:

1. Users occasionally get the stop-words wrong -- say, wrong choice of preposition, which torpedoes the query since some of the query terms aren't present in the target. For example, "on mice and men" may return nothing (no match for "on") even though it is equivalent to "of mice and men" in a stopped sense.

2. Our original indexed data doesn't always have leading articles and such. For example, we index on Doors since that is our sourced data but frequently get queried for The Doors.

Articles and prepositions (the stuff of good stop-lists) seem to me to be in a fuzzier class -- use 'em if you have 'em during matching, but don't kill your queries because of them. Hence some desire to make them in some way optional during matching.

Ron