On 8/30/06, Andrew May (JIRA) <[EMAIL PROTECTED]> wrote:
I've spent a bit of time trying to understand Gradient formatting and how 
QueryScorer is used.

Cool, Thanks for sharing your investigation.

In other words, I now agree with Mike that we should not support Gradient formatting. 
Perhaps we still want to retain the hl.formatter= parameter in case we have any other 
values than "simple" in the future - and keep hl.simple.pre and hl.simple.post 
as they are.

+1


As for the QueryScorer, I think it makes sense to support all three ways it can 
be constructed:

The concept of scoring may be confusing to someone w/o knowledge of
how Lucene highlighting works (I want to just highlight the parts that
matched darn it! ;-)  In our final documentation, we should describe
the effect of the parameters, not their implementation (like which
constructor is called).

1) hl.scoring=simple (the default) - constructed with Query only. May highlight 
some matches from terms that were searched in other fields, but allows you to 
highlight fields other than the ones searched.
2) hl.scoring=field - constructed with Query and fieldName. Only highlights 
terms matched in this field by the query.
3) hl.scoring=fieldidx - constructed with Query, fieldName and IndexReader. I 
think the selection of the best fragment(s) will be improved because the terms 
will be weighted according to their frequency in the index - but this has to be 
more costly as it calls IndexReader.docFreq for each term.

So, is there a better way to describe the differences here in a way
that will be durable (as the highlighter implementation changes), be
applicable to different highlighter formatters, and be easier to explain
to someone who knows nothing about Lucene?  I'm not saying there
is... I'm just exploring the possibility.

1) an option to specify that all words in the query should be
highlighted on all selected fields.
2) an option to specify that words should be highlighted only if the
query matched the specific field
 Question:  would the phrase query "spider man" cause highlighting of
"the spider bit the man"?
3) when finding best matches, score rarer terms higher than common terms

Regarding docFreq... unless the number of terms is large, this
shouldn't have much of a performance impact at all.
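
A rough sketch of why the cost tracks the number of query terms: weighting needs one document-frequency lookup per term, regardless of how much text is highlighted. The plain Map below stands in for the index (it is not IndexReader), the class and method names are hypothetical, and the IDF-style formula is an assumption for illustration rather than a statement of what the highlighter computes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only: IDF-style term weights from document frequencies.
final class IdfWeightSketch {

    // Assumed IDF-style formula: the fewer documents contain a term,
    // the larger its weight.
    static double weight(int docFreq, int numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    // Performs one document-frequency lookup per query term. The docFreqs map
    // is a stand-in for the index; terms missing from it count as docFreq 0.
    static Map<String, Double> weights(String[] queryTerms,
                                       Map<String, Integer> docFreqs,
                                       int numDocs) {
        Map<String, Double> result = new LinkedHashMap<>();
        for (String term : queryTerms) {
            int df = docFreqs.getOrDefault(term, 0);
            result.put(term, weight(df, numDocs));
        }
        return result;
    }
}
```

With 1000 documents, a term found in 9 of them gets a noticeably higher weight than one found in 499, so fragments containing the rarer term would rank higher when picking the best matches.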

Thanks for your continued work on this Andrew!  It's often easy/quick
to hack up something for your own use, but much more difficult to
create something valuable to everyone.

-Yonik
