Re: Dismax Minimum Match/Stopwords Bug

2011-05-02 Thread Chris Hostetter

: However, is there an actual fix in the 3.1 eDisMax parser which solves 
: the problem for real? Cannot find a JIRA issue for it.

edismax uses the same query structure as dismax, which means it's not 
possible to "fix" anything here ... it's how the query parsers work.

each "word" from the query string is analyzed by each field in the "qf", 
and the result is used as a query on the "word" in the field.  The 
individual clauses for each word are aggregated into a 
DisjunctionMaxQuery, and the set of DisjunctionMaxQueries are then 
combined into a BooleanQuery (with the appropriate minNrShouldMatch set)

if a "word" from the input produces no output from the analyzers of *any* 
of the fields, then the resulting DisjunctionMaxQuery is empty and 
dropped from the final BooleanQuery ... so if a "word" in the query string 
is a stop word for *every* field in the qf, there is no clause.  but if 
*any* field in the qf produces a term for it, then there is a 
DisjunctionMaxQuery for that word added to the main BooleanQuery.
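
The clause construction described above can be sketched as a toy model in 
Python.  This is not Solr code: the ANALYZERS table, the field names and the 
build_clauses helper are hypothetical, and each "analyzer" is reduced to 
lowercasing plus an optional stopword filter.

```python
# Toy model of dismax clause construction (hypothetical names, not Solr's API).
STOPWORDS = {"the", "a", "of"}

# Per-field analysis: True means the field's analyzer removes stopwords.
ANALYZERS = {
    "desc": True,       # stopwords removed
    "prodName": False,  # stopwords kept
}

def analyze(field, word):
    """Return the term the field's analyzer emits for word, or None if dropped."""
    term = word.lower()
    if ANALYZERS[field] and term in STOPWORDS:
        return None
    return term

def build_clauses(query, qf):
    """Build one clause per word (a stand-in for a DisjunctionMaxQuery).

    A word that no field in qf produces a term for yields an empty clause,
    which is dropped and therefore never counted.
    """
    clauses = []
    for word in query.split():
        per_field = {}
        for field in qf:
            term = analyze(field, word)
            if term is not None:
                per_field[field] = term
        if per_field:
            clauses.append(per_field)
    return clauses

# qf mixes a stopworded field with a non-stopworded one: "the" keeps a clause.
mixed = build_clauses("the north", ["desc", "prodName"])
# every field in qf removes stopwords: "the" produces no clause at all.
consistent = build_clauses("the north", ["desc"])
print(len(mixed), len(consistent))
```

In this model the mixed qf yields two clauses and the consistent qf yields 
one, which is the difference that later shows up in the mm calculation.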

As i've said many times: this isn't a bug, it's a fundamental point of the 
parser and the structure of the query.

The best "solution" for people who get bit by this (in my opinion) is not 
to give up on stop words -- if you want to use stop words, by all means 
use stop words.  BUT! You must use them in all the fields of your qf ... 
even fields where you think "why in god's name would i need stopwords on 
this field, those terms will never exist in this field!" ... you may know 
that, and it may be true, but it doesn't change the fact that people will 
be *querying* for stop words against those fields, and you want to ignore 
them when they do.



-Hoss


Dismax Minimum Match/Stopwords Bug

2011-04-15 Thread Jan Høydahl
A thread with this same subject from 2008/2009 is here: 
http://search-lucene.com/m/jkBgXnSsla

We're seeing customers being bitten by this "bug" now and then, and normally my 
workaround is to simply not use stopwords at all.
However, is there an actual fix in the 3.1 eDisMax parser which solves the 
problem for real? Cannot find a JIRA issue for it.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com



Re: Dismax Minimum Match/Stopwords Bug

2009-01-08 Thread Chris Hostetter

: Hmm, that makes sense to me - however I still think that even if we have mm
: set to "2" and we have "the 7449078" it should still match 7449078 in a
: productId field (it does not:
: http://zeta.zappos.com/search?department=&term=the+7449078). This seems like
: it works against the way one would reasonably expect it to - that stopwords
: shouldn't impact the counts for mm (so, "the 7449078" would count as 1 term
: for mm since "the" is a stopword).

this is back to the original "problem"...

"stopwords" is an analyzer concept; "minShouldMatch" is a 
BooleanQuery/DisMaxQueryParser concept ... if all of the analyzers for all 
of your fields agree on the list of stopwords, then q=the+7449078 will 
result in "the" getting thrown out and you'll only have one clause.  but 
if one of your fields has an analyzer that says "the" is a valid term, then 
it's a valid term and it gets a clause in the query.  if it gets a clause in 
the query, then it factors into the minShouldMatch calculation.
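
To make the minShouldMatch arithmetic concrete, here is a rough Python sketch 
of how an mm value could be turned into a required clause count.  The 
min_should_match function is hypothetical and only handles a plain integer, a 
percentage, or a single "n<spec" condition; it assumes Solr's documented 
behaviour that percentages are rounded down and that the result never exceeds 
the number of clauses.

```python
def min_should_match(spec, clause_count):
    """Toy reading of an mm spec; not Solr's implementation.

    Supports "2", "-1", "75%", "-25%" and one conditional such as "2<-1".
    """
    spec = spec.strip()
    if "<" in spec:
        threshold, rest = spec.split("<", 1)
        if clause_count <= int(threshold):
            # At or below the threshold, every clause is required.
            return clause_count
        spec = rest
    if spec.endswith("%"):
        # int() truncates toward zero, i.e. percentages round down.
        calc = int(clause_count * int(spec[:-1]) / 100)
    else:
        calc = int(spec)
    # A negative value means "all but this many" clauses.
    required = clause_count + calc if calc < 0 else calc
    return max(0, min(required, clause_count))

# "the 7449078" when some field keeps "the": 2 clauses, both required.
print(min_should_match("2", 2))
# same query when every field drops "the": 1 clause, 1 required.
print(min_should_match("2", 1))
# the original poster's setting and the 75% workaround, both with 2 clauses.
print(min_should_match("2<-1", 2), min_should_match("75%", 2))
```

Under these assumptions 75% of 2 clauses comes out as 1, which fits the 
original report that switching mm to 75% made "the north" return results.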

in that particular situation i believe the solution you want is to use the 
same stopwords you have on your other fields for your productId field as 
well, so "the" doesn't get a query clause at all ... unless you want 
q=the+7449078 to return product#7449078 if and only if it also has "the" 
in its productId field.

: We have people asking for "the north" to return results from a brand called
: "the north face" - but it doesn't, and can't, because of this mm issue.

it may not work for you right now, but that doesn't mean it can't :)  ... 
i'm not sure why it wouldn't actually.

consider a query like this...
 
   q=the north&qf=manu^2 prodName^1 desc^0.5&pf=...&mm=100%

let's say that "desc" uses stop words, but prodName and manu don't 
(because we know we have manufacturer and product names like "the north 
face"). we're going to get one DisjunctionMaxQuery for "the" (on the manu 
and prodName fields) and one DisjunctionMaxQuery for "north" (on manu, 
prodName, and desc) and that's 2 clauses on a BooleanQuery whose 
minShouldMatch is going to be 2 (because 100% of 2 is 2)  so 
now all products with "the" and "north" in their manufacturer name *OR* 
product name will match -- even if it's "the" in manu and "north" 
in prodName.  products will even match if the only place they contain 
"north" is in the description -- but only if they also contain "the" in 
manu or prodName.  if you think "that's silly, why is 'the' required, i 
want it to be a stopword!" then the solution is to make it a stopword 
*everywhere* (including manu and prodName) ... since it's not a stopword, 
it's considered significant, so it needs to match.
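
The matching behaviour in this example can be imitated with a small Python 
model.  Everything here is hypothetical (the three sample documents, the 
REMOVES_STOPWORDS table, the matches helper), analysis is reduced to 
lowercasing and whitespace splitting, and both clauses are simply required, 
as in the paragraph above.

```python
STOPWORDS = {"the"}
# Hypothetical schema mirroring the example: only desc removes stopwords.
REMOVES_STOPWORDS = {"manu": False, "prodName": False, "desc": True}

def terms(field, text):
    """Terms the field's (toy) analyzer produces for text."""
    words = text.lower().split()
    if REMOVES_STOPWORDS[field]:
        words = [w for w in words if w not in STOPWORDS]
    return set(words)

def matches(doc, query, required):
    """True if at least `required` of the query's clauses match the document.

    A word only becomes a clause on fields whose analyzer keeps it, and a
    word dropped by every field contributes no clause at all.
    """
    clauses = 0
    hits = 0
    for word in query.lower().split():
        fields = [f for f in REMOVES_STOPWORDS if terms(f, word)]
        if not fields:
            continue
        clauses += 1
        if any(word in terms(f, doc.get(f, "")) for f in fields):
            hits += 1
    return hits >= min(required, clauses)

north_face = {"manu": "The North Face", "prodName": "Denali Jacket",
              "desc": "warm fleece"}
split_fields = {"manu": "The Tent Company", "prodName": "Dome",
                "desc": "built for north winds"}
desc_only = {"manu": "Acme", "prodName": "Ridge Tent",
             "desc": "a tent for the north"}

for doc in (north_face, split_fields, desc_only):
    print(doc["manu"], matches(doc, "the north", 2))
```

The last document contains both words in its description, but "the" is never 
queried against desc, so in this model it only satisfies one of the two 
required clauses.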


-Hoss



Re: Dismax Minimum Match/Stopwords Bug

2008-12-29 Thread Matthew Runo
Hmm, that makes sense to me - however I still think that even if we
have mm set to "2" and we have "the 7449078" it should still match
7449078 in a productId field (it does not:
http://zeta.zappos.com/search?department=&term=the+7449078). This seems
like it works against the way one would reasonably expect it to - that
stopwords shouldn't impact the counts for mm (so, "the 7449078" would
count as 1 term for mm since "the" is a stopword).


Would there be a way around this? Could we possibly get it reworked?  
What would the downside to that be?


We have people asking for "the north" to return results from a brand  
called "the north face" - but it doesn't, and can't, because of this  
mm issue.


Thanks for your time helping us with this issue =)

Matthew Runo
Software Engineer, Zappos.com
mr...@zappos.com - 702-943-7833

On Dec 20, 2008, at 10:45 AM, Chris Hostetter wrote:



: Would this mean that, for example, if we wanted to search productId (long)
: we'd need to make a field type that had stopwords in it rather than simply
: using (long)?

not really ... that's kind of a special usecase.  if someone searches for
a productId that's usually *all* they search for (1 "chunk" of input from
the query parser) so it's mandatory and produces a clause across all
fields.  It doesn't matter if the other fields have stopwords -- even if
the productId happens to be a stop word, that just means it doesn't
produce a clause on those "stop worded" fields, but it will on your
productId field.

The only case where you might get into trouble is if someone searches for
"the 123456" ... now you have two chunks of input, so the mm param
comes into play.  you have no stopwords on your productId field so both
"the" and "123456" produce clauses, but "the" isn't going to be found in
your productId field, and because of stopwords it doesn't exist in the
other fields at all ... so you don't match anything.

FWIW: if i remember right if you want to put numeric fields in the qf, i
think you need *all* of them to be numeric and all of your input needs to
be numeric, or you get exceptions from the FieldType (not the dismax
parser) when people search for normal words.   i always copyField
productId into a productId_str field for purposes like this.


-Hoss





Re: Dismax Minimum Match/Stopwords Bug

2008-12-20 Thread Chris Hostetter

: Would this mean that, for example, if we wanted to search productId (long)
: we'd need to make a field type that had stopwords in it rather than simply
: using (long)?

not really ... that's kind of a special usecase.  if someone searches for 
a productId that's usually *all* they search for (1 "chunk" of input from 
the query parser) so it's mandatory and produces a clause across all 
fields.  It doesn't matter if the other fields have stopwords -- even if 
the productId happens to be a stop word, that just means it doesn't 
produce a clause on those "stop worded" fields, but it will on your 
productId field.

The only case where you might get into trouble is if someone searches for 
"the 123456" ... now you have two chunks of input, so the mm param 
comes into play.  you have no stopwords on your productId field so both 
"the" and "123456" produce clauses, but "the" isn't going to be found in 
your productId field, and because of stopwords it doesn't exist in the 
other fields at all ... so you don't match anything.

FWIW: if i remember right if you want to put numeric fields in the qf, i 
think you need *all* of them to be numeric and all of your input needs to 
be numeric, or you get exceptions from the FieldType (not the dismax 
parser) when people search for normal words.   i always copyField 
productId into a productId_str field for purposes like this.


-Hoss



Re: Dismax Minimum Match/Stopwords Bug

2008-12-15 Thread Matthew Runo
Would this mean that, for example, if we wanted to search productId  
(long) we'd need to make a field type that had stopwords in it rather  
than simply using (long)?


Thanks for your time!

Matthew Runo
Software Engineer, Zappos.com
mr...@zappos.com - 702-943-7833

On Dec 12, 2008, at 11:56 PM, Chris Hostetter wrote:



: I have discovered some weirdness with our Minimum Match functionality.
: Essentially it comes up with absolutely no results on certain queries.
: Basically, searches with 2 words and 1 being "the" don't have a return
: result.  From what we can gather the minimum match criteria is making it
: such that if there are 2 words then both are required.  Unfortunately, the

you haven't mentioned what qf you're using, and you only listed one field
type, which includes stopwords -- but i suspect your qf contains at least
one field that *doesn't* remove stopwords.

this is in fact an unfortunate aspect of the way dismax works --
each "chunk" of text recognized by the query parser is passed to each
analyzer for each field.  Any chunk that produces a query for a field
becomes a DisjunctionMaxQuery, and is included in the "mm" count -- even
if that "chunk" is a stopword in every other field (and produces no query)

so you have to either be consistent with your stopwords across all fields,
or make your mm really small.  searching for "dismax stopwords" turns this
up...

http://www.nabble.com/Re%3A-DisMax-request-handler-doesn%27t-work-with-stopwords--p11016770.html

...if i'm wrong about your situation (some fields in the qf with stopwords
and some fields without) then please post all of the params you are using
(not just mm) and the full parsedquery_tostring from when debugQuery=true
is turned on.




-Hoss




Re: Dismax Minimum Match/Stopwords Bug

2008-12-12 Thread Chris Hostetter

: I have discovered some weirdness with our Minimum Match functionality.
: Essentially it comes up with absolutely no results on certain queries.
: Basically, searches with 2 words and 1 being "the" don't have a return
: result.  From what we can gather the minimum match criteria is making it
: such that if there are 2 words then both are required.  Unfortunately, the

you haven't mentioned what qf you're using, and you only listed one field 
type, which includes stopwords -- but i suspect your qf contains at least 
one field that *doesn't* remove stopwords.

this is in fact an unfortunate aspect of the way dismax works -- 
each "chunk" of text recognized by the query parser is passed to each 
analyzer for each field.  Any chunk that produces a query for a field 
becomes a DisjunctionMaxQuery, and is included in the "mm" count -- even 
if that "chunk" is a stopword in every other field (and produces no query)

so you have to either be consistent with your stopwords across all fields, 
or make your mm really small.  searching for "dismax stopwords" turns this 
up...

http://www.nabble.com/Re%3A-DisMax-request-handler-doesn%27t-work-with-stopwords--p11016770.html

...if i'm wrong about your situation (some fields in the qf with stopwords 
and some fields without) then please post all of the params you are using 
(not just mm) and the full parsedquery_tostring from when debugQuery=true 
is turned on.
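
For anyone collecting that debug output, a request along these lines would 
do it.  This is only a sketch in Python: the host, port and /select path are 
assumptions about a default local install, the qf field names are made up, 
and selecting the parser with defType=dismax assumes a Solr version that 
supports that parameter.

```python
from urllib.parse import urlencode

# All values are illustrative; substitute your own handler, fields and mm.
params = {
    "defType": "dismax",
    "q": "the north",
    "qf": "manu prodName desc",
    "mm": "2<-1",
    "debugQuery": "true",
}

# urlencode escapes the space in q and the "<" in the mm value.
url = "http://localhost:8983/solr/select?" + urlencode(params)
print(url)
```

Fetching that URL and reading the parsed query in the debug section of the 
response shows which words actually produced clauses on which fields.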




-Hoss


Dismax Minimum Match/Stopwords Bug

2008-12-11 Thread Jeff Newburn
I have discovered some weirdness with our Minimum Match functionality.
Essentially it comes up with absolutely no results on certain queries.
Basically, searches with 2 words and 1 being "the" don't have a return
result.  From what we can gather the minimum match criteria is making it
such that if there are 2 words then both are required.  Unfortunately, the
stopwords are pulled resulting in "the" being removed and then solr is
requiring 2 words when only 1 exists to match on.  Is there a way around
this?  I really need it to either require only non-stopwords or not filter
out stopwords.  We know stopwords are causing the issue because taking out
the stopwords fixes the problem.  Also, we can change mm setting to 75% and
fix the problem.

Example:
Brand: The North Face
Search: the north (returns no results)

Our config is basically:
MM: <str name="mm">2&lt;-1</str>
FieldType: