On 8/30/06, Andrew May (JIRA) <[EMAIL PROTECTED]> wrote:
I've spent a bit of time trying to understand Gradient formatting and how QueryScorer is used.
Cool, thanks for sharing your investigation.
In other words, I now agree with Mike that we should not support Gradient formatting. Perhaps we still want to retain the hl.formatter= parameter in case we have any values other than "simple" in the future, and keep hl.simple.pre and hl.simple.post as they are.
+1
As for the QueryScorer, I think it makes sense to support all three ways it can be constructed:
The concept of scoring may be confusing to someone without knowledge of how Lucene highlighting works (I want to just highlight the parts that matched, darn it! ;-) In our final documentation, we should describe the effect of the parameters, not their implementation (like which constructor is called).
1) hl.scoring=simple (the default) - construct with Query only. May have some matches from other terms, but allows you to highlight different fields to the ones searched.
2) hl.scoring=field - constructed with Query and fieldName. Only highlights terms matched in this field by the query.
3) hl.scoring=fieldidx - constructed with Query, fieldName and IndexReader. I think the selection of the best fragment(s) will be improved because the terms will be weighted according to their frequency in the index - but this has to be more costly as it calls IndexReader.docFreq for each term.
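
The three proposed settings could be modeled roughly as follows. This is only a minimal sketch, not Solr or Lucene code: the enum HlScoring and its methods parse, restrictsToField and termWeight are hypothetical names, and the idf-style formula ln(numDocs / (docFreq + 1)) + 1 is an assumption about how rarer terms might be weighted when index statistics are available.

```java
import java.util.Locale;

/** Hypothetical model of the proposed hl.scoring values; not an actual Solr class. */
enum HlScoring {
    SIMPLE,    // scorer built from the Query only
    FIELD,     // scorer built from the Query and a field name
    FIELDIDX;  // scorer built from the Query, a field name and index statistics

    /** Maps the parameter value to a mode; a missing or empty value means the default, SIMPLE. */
    static HlScoring parse(String value) {
        if (value == null || value.trim().isEmpty()) {
            return SIMPLE;
        }
        // Throws IllegalArgumentException for unknown values.
        return valueOf(value.trim().toUpperCase(Locale.ROOT));
    }

    /** True when only terms the query matched in the highlighted field count. */
    boolean restrictsToField() {
        return this != SIMPLE;
    }

    /**
     * Weight of one query term. Only FIELDIDX looks at document frequency;
     * the idf-style formula used here is an assumption made for illustration.
     */
    double termWeight(double boost, int docFreq, int numDocs) {
        if (this != FIELDIDX) {
            return boost;
        }
        double idf = Math.log((double) numDocs / (docFreq + 1)) + 1.0;
        return boost * idf;
    }
}
```

With these assumptions, HlScoring.parse("fieldidx").termWeight(1.0, 1, 1000) is larger than the same call with a docFreq of 500, which is the "rarer terms score higher" effect, at the price of one document-frequency lookup per term.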
So, is there a better way to describe the differences here, one that will be durable (as the highlighter implementation changes), applicable to different highlighter formatters, and easier to explain to someone who knows nothing about Lucene? I'm not saying there is... I'm just exploring the possibility.

1) an option to specify that all words in the query should be highlighted on all selected fields.
2) an option to specify that words should be highlighted only if the query matched the specific field.
   Question: would the phrase query "spider man" cause highlighting of "the spider bit the man"?
3) when finding best matches, score rarer terms higher than common terms.

Regarding docFreq... unless the number of terms is large, this shouldn't have much of a performance impact at all.

Thanks for your continued work on this, Andrew! It's often easy/quick to hack up something for your own use, but much more difficult to create something valuable to everyone.

-Yonik
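
On the "spider man" question: if highlighting is decided purely per term, with no check that the words are adjacent, then both words would be marked. Whether the real highlighter behaves this way is exactly what is being asked; the sketch below only illustrates the per-term assumption. TermHighlightSketch and its highlight method are hypothetical and use plain whitespace splitting rather than a Lucene analyzer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical illustration of purely term-based highlighting; not the Lucene highlighter. */
final class TermHighlightSketch {

    /** Wraps every token of text that equals one of the phrase's words, ignoring word order and adjacency. */
    static String highlight(String text, String phrase, String pre, String post) {
        Set<String> terms = new HashSet<>(
                Arrays.asList(phrase.toLowerCase(Locale.ROOT).trim().split("\\s+")));
        StringBuilder out = new StringBuilder();
        for (String token : text.split(" ")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            if (terms.contains(token.toLowerCase(Locale.ROOT))) {
                out.append(pre).append(token).append(post);
            } else {
                out.append(token);
            }
        }
        return out.toString();
    }
}
```

Calling highlight("the spider bit the man", "spider man", "<em>", "</em>") under this assumption yields "the <em>spider</em> bit the <em>man</em>". Avoiding that would require the scorer to take term positions into account.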