An observation: df (document frequency), and the IDF derived from it, is a key driver of the whole relevancy framework on which stock Lucene is based. There is no question about its significant value. But that means we can't blindly "correlate" relevancy between collections, in large part because document scores are so heavily driven by df, which is specific to the corpus of each collection.

My modest proposal: as valuable as df-based relevancy is, offer an easy-to-use "switch" to drop back to a pure tf-based relevancy score (primarily tf, though it can include other factors, as long as they are limited to the contents of the document itself) to sidestep these corpus-dependent scores. In other words, the score of a document would depend only on the contents of the document itself, not on the corpus. Yes, that's a major loss of relevance, but the benefits for operations in a multi-corpus, distributed world can be substantial.
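
To make the corpus dependence concrete, here is a minimal sketch in plain Java, not Lucene code. It assumes a simplified score of tf * idf, with tf as the square root of the term count and idf as 1 + ln(numDocs / (docFreq + 1)), which only approximates the classic formula. The class name, method names, and corpus numbers are all made up for illustration.

```java
// Simplified illustration only: a one-term score, not the real Lucene scoring code.
class TfOnlyScoringSketch {

    // Assumed tf component: square root of the raw term count in the document.
    static double tf(int termCount) {
        return Math.sqrt(termCount);
    }

    // Assumed idf component: depends on corpus statistics, not on the document.
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Corpus-dependent score: the same document scores differently per collection.
    static double tfIdfScore(int termCount, long docFreq, long numDocs) {
        return tf(termCount) * idf(docFreq, numDocs);
    }

    // The proposed "switch": score from the document's own contents only.
    static double tfOnlyScore(int termCount) {
        return tf(termCount);
    }

    public static void main(String[] args) {
        int termCount = 4; // the query term occurs 4 times in the document

        // Hypothetical collections: the term is rare in A and common in B.
        double inA = tfIdfScore(termCount, 10, 1000);
        double inB = tfIdfScore(termCount, 500, 1000);

        System.out.println("tf-idf in collection A: " + inA);
        System.out.println("tf-idf in collection B: " + inB);
        System.out.println("tf-only in either:      " + tfOnlyScore(termCount));
    }
}
```

The tf-idf scores for the identical document differ between the two collections, while the tf-only score is the same wherever the document lives, which is what makes it safe to merge across collections.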

Yes, you can do this yourself by plugging in your own custom "similarity" class, but it should be offered as a much easier-to-use "switch" in Lucene itself (and Solr too!).

The alternative is to have some mechanism to define and work with a "super-corpus" or "super-collection" that integrates the df for multiple corpora. But df is calculated or updated for the overall corpus, so a cross-corpus df would require recalculating df for all terms in the index whenever the multi-corpus structure changes. That can work in some cases, but not for things like distributed searches in Solr. It might be a superior solution, but it might not be as easy or as performant as a simple non-df similarity approach.
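
A rough sketch of what "integrating the df" could look like, again in plain Java rather than Lucene code. It assumes the collections hold disjoint documents, so that per-collection docFreq and numDocs can simply be summed for each term. The `TermStats` holder, the method names, and the numbers are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: merging per-collection term statistics into super-corpus statistics.
class GlobalDfSketch {

    // Hypothetical holder for one term's statistics in one collection.
    static final class TermStats {
        final long docFreq; // documents containing the term
        final long numDocs; // total documents in the collection

        TermStats(long docFreq, long numDocs) {
            this.docFreq = docFreq;
            this.numDocs = numDocs;
        }
    }

    // Sum the statistics across collections (valid only if no document is shared).
    static TermStats merge(List<TermStats> perCollection) {
        long docFreq = 0;
        long numDocs = 0;
        for (TermStats s : perCollection) {
            docFreq += s.docFreq;
            numDocs += s.numDocs;
        }
        return new TermStats(docFreq, numDocs);
    }

    // Same assumed idf shape as a classic tf-idf scheme, fed with the merged statistics.
    static double idf(TermStats s) {
        return 1.0 + Math.log((double) s.numDocs / (s.docFreq + 1));
    }

    public static void main(String[] args) {
        List<TermStats> collections = Arrays.asList(
                new TermStats(10, 1000),   // term is rare here
                new TermStats(500, 1000)); // term is common here

        TermStats global = merge(collections);
        System.out.println("global docFreq = " + global.docFreq);
        System.out.println("global numDocs = " + global.numDocs);
        System.out.println("global idf     = " + idf(global));
    }
}
```

Every collection would then score with the same global idf, but the merge has to be redone for every term whenever a collection is added, removed, or updated, which is the cost described above.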

It might also be nice for apps to offer users pure-tf scoring if it provides faster search results, and then the user could click on a "refine results" button to re-do the search with the more expensive cross-corpus df-based scoring.

Thoughts?

-- Jack Krupansky

-----Original Message----- From: Baldwin, David
Sent: Friday, September 5, 2014 8:05 PM
To: java-user@lucene.apache.org
Subject: How to properly correlate relevance in a search across multiple collections

I have a project with multiple collections, possibly dozens at times. A single result set needs to be generated by applying the same search criteria to each collection directory and then merging all the sub-searches into a single result set with correlated relevance.

Does anyone have good experience with this, and could they share some tidbits or info I may not have run across yet?

-David


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

