: Agreed, but best match is not ONLY about keywords. Here is where the
: system developer can provide extra intelligence by doing query
: re-writing.

I finally got a chance to read through the URL (disclaimer: I do not have
"a basic working knowledge of Oracle Text, such as the operators used in
query expressions.")

At its core, what is being described here can easily be done with a custom
request handler that takes in a multivalue "q" param and executes the
queries in order until it finds some matches ... careful math when dealing
with start/rows and the number of results from each query makes it easy to
ensure that you can seamlessly return results from any/all queries in the
order described (although you'd have to do something funky with the raw
score values if you actually wanted to return them to the client)
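To make the start/rows math concrete, here is a rough sketch in Python
rather than as a real Solr request handler. Everything in it is
hypothetical: search() stands in for whatever executes one query and
returns (total_hits, docs), and it assumes the queries don't return the
same doc twice (a real handler would need to de-duplicate).

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical callback: (query, start, rows) -> (total_hits, docs_on_that_page)
SearchFn = Callable[[str, int, int], Tuple[int, List[str]]]


def cascade_page(queries: Sequence[str], search: SearchFn,
                 start: int, rows: int) -> List[str]:
    """Return one page from the concatenation of each query's result list."""
    page: List[str] = []
    skip = start  # docs still to skip before the requested page begins
    for q in queries:
        need = rows - len(page)
        if need <= 0:
            break
        total, docs = search(q, skip, need)
        page.extend(docs)
        # Any part of the offset this query could not absorb carries over
        # to the next query in the list.
        skip = max(0, skip - total)
    return page


def make_fake_search(index: Dict[str, List[str]]) -> SearchFn:
    """In-memory stand-in for a real search call, for illustration only."""
    def search(q: str, start: int, rows: int) -> Tuple[int, List[str]]:
        hits = index.get(q, [])
        return len(hits), hits[start:start + rows]
    return search


fake_index = {
    "exact phrase": ["d1", "d2", "d3"],
    "fuzzy phrase": ["d4", "d5"],
    "any word": ["d6"],
}
search = make_fake_search(fake_index)
order = ["exact phrase", "fuzzy phrase", "any word"]
second_page = cascade_page(order, search, start=2, rows=3)
```

Note that this only handles ordering and paging; it does nothing about
making the scores from different queries comparable.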

In general though, I agree with Walter ... this seems like a very naive
approach.  At a very low conceptual level, the DisMaxRequestHandler does
what the early counter example in the link talks about...

>>  select book_id from books
>>      where contains (author, '(michel crichton) OR (?michel ?crichton)
>>      OR (michel OR crichton) OR (?michel OR ?crichton)

the problem is that the two criticisms of this approach (which may be valid
in Oracle Text matching) don't really apply in Solr/Lucene...

>>   1.  From the user's point of view, hits which are a poor match will be
>> mixed in with hits which are a good match. The user wants to see good
>> matches displayed first.

"poor" hits won't score as high as "good" hits -- boost
values can be assigned for the various pieces of the DisMax query so that
exact phrase matches can be weighted better than individual word matches,
coordFactors will ensure that docs only matching a few words don't score
as well as docs matching all of the words, etc...
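As a rough illustration, the parameters for such a request might be built
like this. The qf and pf params are the DisMax "query fields" and "phrase
fields" options; the host, the field names, the boost values and the use of
qt=dismax to pick the handler are all assumptions that depend on your own
schema and solrconfig.xml.

```python
from urllib.parse import urlencode

# Hypothetical field names and boost values, for illustration only.
params = {
    "qt": "dismax",            # assumes a handler registered under this name
    "q": "michel crichton",
    "qf": "author^2 title",    # individual word matches, author weighted higher
    "pf": "author^10",         # extra boost when the whole phrase matches
}

# Hypothetical host and path; adjust to your own install.
url = "http://localhost:8983/solr/select?" + urlencode(params)
```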

>>   2. From the system's point of view, the search is inefficient. Even if
>> there were plenty of hits for exactly "Michel Crichton", it would still
>> have to do all the work of the fuzzy expansions and fetch data for all the
>> rows which satisfy the query.

My problem with this claim is the assumption that once you find lots of
hits for "Michel Crichton" you don't need to keep looking for "Michel" or
"Crichton" ... by this logic, many docs that contain the exact phrase
"Michel Crichton" (and are roughly the same length) will get the same
score, and the query will stop there ... the benefit of looking for
*everything* as a single query is that the scores can become more fine
grained -- docs with 1 exact match that *also* contain things like "Mr
Crichton" several dozen times will score higher than docs with just that
one exact match (consider an article about "Michel Crichton" in which his
full name appears only once vs an article listing popular authors, in
which "Michel Crichton" appears exactly once)

: Why do you say this? The rank is still provided by the search engine
: BASED ON THE QUERY submitted and it does consider natural language
: text. It's just leaving the order of execution in the hands of the
: developer who knows better what the system should return for some
: specific cases.

evaluating each of the query parts in isolation and then aggregating the
results doesn't take into account the *cumulative* value of the parts ...
it's like averaging the ages of people in each city, then averaging those
averages for each state and calling that the average age per state -- it's
a much less accurate representation of reality than averaging the ages of
everyone in the state all at once.
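A quick worked example of the averaging analogy, with made-up city names
and ages (not real data):

```python
from statistics import mean

# Made-up numbers: one tiny city of old people, one larger city of young people.
ages_by_city = {
    "Smallville": [80, 90],
    "Metropolis": [20, 30, 20, 30, 20, 30, 20, 30],
}

# Average each city in isolation, then average those averages.
city_means = [mean(ages) for ages in ages_by_city.values()]
mean_of_means = mean(city_means)

# Average everyone in the state all at once.
all_ages = [age for ages in ages_by_city.values() for age in ages]
true_mean = mean(all_ages)

print(mean_of_means, true_mean)  # 55 37
```

The two-person city counts as much as the eight-person city in the first
number, which is how the isolated approach loses information.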



-Hoss
