Re: edismax parser ignores mm parameter when tokenizer splits tokens (hyphenated words, WDF splitting etc)

2012-07-02 Thread Tom Burton-West
Opened a JIRA issue: https://issues.apache.org/jira/browse/SOLR-3589, which
also lists a couple other related mailing list posts.




On Thu, Jun 28, 2012 at 12:18 PM, Tom Burton-West tburt...@umich.edu wrote:

 Hello,

 My previous e-mail with a CJK example has received no replies.  I
 verified that this problem also occurs for English.  For example, in the
 case of the word fire-fly, the ICUTokenizer and the WordDelimiterFilter
 both split this into two tokens, fire and fly.

 With an edismax query and a must match of 2 (q={!edismax mm=2}), if the
 words are entered separately as [fire fly], the edismax parser honors the
 mm parameter and does the equivalent of a Boolean AND query.  However, if
 the words are entered as a hyphenated word [fire-fly], the tokenizer splits
 it into two tokens, fire and fly, and the edismax parser does the
 equivalent of a Boolean OR query.

 I'm not sure I understand the output of the debugQuery, but judging by the
 number of hits returned it appears that edismax is not honoring the mm
 parameter. Am I missing something, or is this a bug?

  I'd like to file a JIRA issue, but want to find out if I am missing
 something here.

 Details of several queries are appended below.

 Tom Burton-West

 edismax query mm=2   query with hyphenated word [fire-fly]

 <lst name="debug">
 <str name="rawquerystring">{!edismax mm=2}fire-fly</str>
 <str name="querystring">{!edismax mm=2}fire-fly</str>
 <str name="parsedquery">+DisjunctionMaxQuery(((ocr:fire ocr:fly)))</str>
 <str name="parsedquery_toString">+((ocr:fire ocr:fly))</str>


 Entered as separate words [fire fly]  numFound=184962
  edismax mm=2
 <lst name="debug">
 <str name="rawquerystring">{!edismax mm=2}fire fly</str>
 <str name="querystring">{!edismax mm=2}fire fly</str>
 <str name="parsedquery">
 +((DisjunctionMaxQuery((ocr:fire)) DisjunctionMaxQuery((ocr:fly)))~2)
 </str>


 Regular Boolean AND query:   [fire AND fly] numFound=184962
 <str name="rawquerystring">fire AND fly</str>
 <str name="querystring">fire AND fly</str>
 <str name="parsedquery">+ocr:fire +ocr:fly</str>
 <str name="parsedquery_toString">+ocr:fire +ocr:fly</str>

 Regular Boolean OR query:   [fire OR fly] numFound=366047
 <lst name="debug">
 <str name="rawquerystring">fire OR fly</str>
 <str name="querystring">fire OR fly</str>
 <str name="parsedquery">ocr:fire ocr:fly</str>
 <str name="parsedquery_toString">ocr:fire ocr:fly</str>



edismax parser ignores mm parameter when tokenizer splits tokens (hyphenated words, WDF splitting etc)

2012-06-28 Thread Tom Burton-West
Hello,

My previous e-mail with a CJK example has received no replies.  I verified
that this problem also occurs for English.  For example, in the case of the
word fire-fly, the ICUTokenizer and the WordDelimiterFilter both split
this into two tokens, fire and fly.

With an edismax query and a must match of 2 (q={!edismax mm=2}), if the
words are entered separately as [fire fly], the edismax parser honors the
mm parameter and does the equivalent of a Boolean AND query.  However, if
the words are entered as a hyphenated word [fire-fly], the tokenizer splits
it into two tokens, fire and fly, and the edismax parser does the
equivalent of a Boolean OR query.
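
For anyone who wants to reproduce the comparison, here is a minimal sketch
that builds the two request URLs with debugQuery turned on.  The host, port
and path in SOLR_SELECT are hypothetical placeholders, and the sketch assumes
the default search field is already configured (ocr in the output below).

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; adjust host, port, and core to your setup.
SOLR_SELECT = "http://localhost:8983/solr/select"


def debug_url(user_query, mm=2):
    """Build a select URL that runs user_query through edismax with the given mm."""
    params = {
        "q": "{!edismax mm=%d}%s" % (mm, user_query),
        "debugQuery": "true",  # ask Solr to include the parsed query in the response
        "rows": 0,             # only numFound and the debug section are of interest
    }
    return SOLR_SELECT + "?" + urlencode(params)


# One URL per variant: separate words vs. the hyphenated word.
for text in ("fire fly", "fire-fly"):
    print(debug_url(text))
```

Comparing numFound and the parsedquery entry of the two responses should
show whether mm was applied.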

I'm not sure I understand the output of the debugQuery, but judging by the
number of hits returned it appears that edismax is not honoring the mm
parameter. Am I missing something, or is this a bug?

 I'd like to file a JIRA issue, but want to find out if I am missing
something here.

Details of several queries are appended below.

Tom Burton-West

edismax query mm=2   query with hyphenated word [fire-fly]

<lst name="debug">
<str name="rawquerystring">{!edismax mm=2}fire-fly</str>
<str name="querystring">{!edismax mm=2}fire-fly</str>
<str name="parsedquery">+DisjunctionMaxQuery(((ocr:fire ocr:fly)))</str>
<str name="parsedquery_toString">+((ocr:fire ocr:fly))</str>


Entered as separate words [fire fly]  numFound=184962
 edismax mm=2
<lst name="debug">
<str name="rawquerystring">{!edismax mm=2}fire fly</str>
<str name="querystring">{!edismax mm=2}fire fly</str>
<str name="parsedquery">
+((DisjunctionMaxQuery((ocr:fire)) DisjunctionMaxQuery((ocr:fly)))~2)
</str>


Regular Boolean AND query:   [fire AND fly] numFound=184962
<str name="rawquerystring">fire AND fly</str>
<str name="querystring">fire AND fly</str>
<str name="parsedquery">+ocr:fire +ocr:fly</str>
<str name="parsedquery_toString">+ocr:fire +ocr:fly</str>

Regular Boolean OR query:   [fire OR fly] numFound=366047
<lst name="debug">
<str name="rawquerystring">fire OR fly</str>
<str name="querystring">fire OR fly</str>
<str name="parsedquery">ocr:fire ocr:fly</str>
<str name="parsedquery_toString">ocr:fire ocr:fly</str>
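
One possible reading of the parsedquery strings above, sketched as a toy
model: [fire fly] yields two top-level DisjunctionMaxQuery clauses with ~2,
while [fire-fly] yields a single clause containing both terms and no ~2.
The sketch assumes that mm counts top-level clauses and is capped at their
number.  That is an interpretation of the debug output, not a description
of the edismax code; the function and variable names are made up for
illustration.

```python
# Toy model of the two parsedquery strings shown above.
# Each top-level entry stands for one DisjunctionMaxQuery clause;
# the inner list holds the terms inside that clause.
separate_words = [["ocr:fire"], ["ocr:fly"]]   # from [fire fly]
hyphenated = [["ocr:fire", "ocr:fly"]]         # from [fire-fly]


def matches(top_level_clauses, doc_terms, mm):
    """Return True if at least mm top-level clauses match the document.

    A clause matches when any of its terms occurs in the document.
    Assumption: mm is capped at the number of top-level clauses.
    """
    required = min(mm, len(top_level_clauses))
    hits = sum(
        1 for clause in top_level_clauses
        if any(term in doc_terms for term in clause)
    )
    return hits >= required


# A document that contains only one of the two terms.
doc_with_fire_only = {"ocr:fire"}

print(matches(separate_words, doc_with_fire_only, mm=2))  # False: behaves like AND
print(matches(hyphenated, doc_with_fire_only, mm=2))      # True: behaves like OR
```

Under that assumption a document containing only fire is rejected for
[fire fly] but accepted for [fire-fly], which would be consistent with the
hit counts reported above.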