Hi,

Apologies if you are receiving this a second time; I am having a tough time
with the mail server.

I take a user-entered query as is and run it with the dismax query handler.
The document fields are filled from structured data, where different fields
hold different attributes such as number of beds, number of baths, city name
etc. A sample user query would look like "3 bed homes in new york". I would
like this to match against city:new york and beds:3 beds. When I use the
dismax handler with boosts and the tie parameter, I do not always get the
most relevant top 10 results. Several factors seem to be in play: first, it
does not recognize the presence of sub-phrases, and second, it cannot
ignore unwanted matches in unwanted fields.
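
To make "sub-phrases" concrete, here is a rough sketch in plain Python (not
Solr or Lucene code) that lists every contiguous run of words in a user
query. The function name sub_phrases and the min_len argument are made up
for illustration only.

```python
def sub_phrases(query, min_len=1):
    """Return all contiguous word sub-phrases of the query, longest first."""
    words = query.lower().split()
    phrases = []
    # Walk from the full query down to runs of min_len words.
    for n in range(len(words), min_len - 1, -1):
        for start in range(len(words) - n + 1):
            phrases.append(" ".join(words[start:start + n]))
    return phrases


if __name__ == "__main__":
    # Sub-phrases of at least two words for the sample query.
    for phrase in sub_phrases("3 bed homes in new york", min_len=2):
        print(phrase)
```

The candidates I care about here are "3 bed" and "new york"; the rest are
the noise that the handler would have to score low or ignore.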

What are your thoughts on having one more request handler like dismax, but
one that uses a sub-phrase query instead of a dismax query?
It would also provide the parameters below, on a per-field basis, to help
customize the behavior of the request handler and give more flexibility in
different scenarios:
phraseBoost - how much better a 3-word sub-phrase match is than a 2-word
sub-phrase match.
useOnlyMaxMatch - if many sub-phrases match in the field, only the best
score is used.
ignoreDuplicates - if a field has duplicate matches, pick only one match for
scoring.
matchOnlyOneField - if a match is found in the first field, remove the
matched terms while querying the other fields. For example, for me a city
match is more important than a match in other fields. So I do not want the
"new" in "new york" to match all the other fields and skew the results,
which is what I am seeing with dismax, irrespective of the high boosts.
ignoreSomeLuceneScorefactors - ignore the Lucene tf, idf, query norm or any
other such criteria that are not needed for this field, since they are not
really important if I want exact matches only. They also seem to play a big
role in my not being able to get the most relevant top 10 results.
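
The behaviour I have in mind for these parameters could be sketched roughly
as below. This is plain Python over whitespace tokens, not a Solr handler;
the function names, the default phrase_boost of 2.0 and the scoring formula
(phrase_boost raised to the number of words minus one) are all assumptions
made for illustration. Lucene's tf, idf and norms are simply left out.

```python
def count_occurrences(phrase, tokens):
    """Count how often the word list `phrase` occurs contiguously in `tokens`."""
    n = len(phrase)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase)


def field_score(query, field_text, phrase_boost=2.0,
                use_only_max_match=True, ignore_duplicates=True):
    """Score one field using only the query sub-phrases it contains."""
    q = query.lower().split()
    f = field_text.lower().split()
    scores = []
    for n in range(1, len(q) + 1):
        for i in range(len(q) - n + 1):
            hits = count_occurrences(q[i:i + n], f)
            if hits:
                if ignore_duplicates:
                    hits = 1  # ignoreDuplicates: count a repeated match once
                # phraseBoost: each extra word multiplies the score
                scores.append(hits * phrase_boost ** (n - 1))
    if not scores:
        return 0.0
    # useOnlyMaxMatch: keep the best sub-phrase instead of summing all
    return max(scores) if use_only_max_match else sum(scores)


def longest_match(q, f):
    """Return (start, length) of the longest sub-phrase of q found in f."""
    for n in range(len(q), 0, -1):
        for i in range(len(q) - n + 1):
            if count_occurrences(q[i:i + n], f):
                return i, n
    return None


def score_document(query, fields_in_priority, phrase_boost=2.0):
    """matchOnlyOneField: words matched in an earlier field are removed
    before the later fields are scored. Removing words can make formerly
    separated words adjacent; a real handler would need to guard against
    that."""
    remaining = query.lower().split()
    scores = {}
    for name, text in fields_in_priority:
        scores[name] = field_score(" ".join(remaining), text, phrase_boost)
        match = longest_match(remaining, text.lower().split())
        if match:
            start, length = match
            remaining = remaining[:start] + remaining[start + length:]
    return scores
```

With the fields given as [("city", "new york"), ("description", "new
kitchen")], the city field consumes "new york", so the "new" in the
description no longer contributes to the score.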

I see this handler being useful in the use cases below -
a) the data is mostly exact, in that I am not trying to search free text
like mails, reviews, articles, web pages etc.
b) numbers and their binding are important
c) exact phrase or sub-phrase matches are more important than rankings
derived from tf, idf, query norm etc.
d) need to make sure that in some cases some fields affect the scoring and
in others they don't. I found this was the most difficult task: telling the
noise matches apart from the required ones for my use case.

Your thoughts and suggestions on alternatives are welcome.

I have also posted a question on sub-phrase matching to lucene-user; that
question is not about having a Solr handler with additional features like
sub-phrase matching for user-entered queries.

Thanks
Preetam
