Re: How to properly correlate relevance in a search across multiple collections

Jack Krupansky Sat, 06 Sep 2014 11:11:33 -0700

An observation: df (document frequency), and the IDF derived from it, is a key driver of the whole relevancy framework on which stock Lucene is based. There is no question about its significant value. But... that means that we can't blindly "correlate" relevancy between "collections", in large part because the document scores are so heavily driven by df, which is distinctly based on the specific corpus of each collection.

My modest proposal: As valuable as df-based relevancy is, offer an easy-to-use "switch" to drop back to a pure tf-based relevancy score (primarily tf; it can include other factors, as long as they are limited to the contents of the document itself) to sidestep these corpus-dependent scores. In other words, the score of the document would depend only on the contents of the document itself, not the corpus. Yes, that's a major loss of relevance, but the benefits for operations in a multi-corpus, distributed world can be substantial.

Yes, you can do this yourself by just plugging in your own custom "similarity" class, but it should be offered as a much easier-to-use "switch" for Lucene itself (and Solr too!)
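To make the corpus dependence concrete, here is a rough sketch in plain Java, not the Lucene API. It assumes the classic formulas tf = sqrt(freq) and idf = 1 + ln(numDocs / (docFreq + 1)), and it simplifies the score to tf * idf, leaving out norms, query normalization, and boosts. The class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; plain Java, not part of Lucene.
final class TfOnlyScoringSketch {

    // Classic-style tf: square root of the term's frequency in the document.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Classic-style idf: depends on corpus-wide counts, not on the document.
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (double) (docFreq + 1));
    }

    // Count occurrences of a term in an already-tokenized document.
    static int termFreq(List<String> docTokens, String term) {
        int n = 0;
        for (String t : docTokens) {
            if (t.equals(term)) {
                n++;
            }
        }
        return n;
    }

    // Simplified tf * idf score (assumption: norms and boosts are ignored).
    static double tfIdfScore(List<String> doc, String term, long docFreq, long numDocs) {
        return tf(termFreq(doc, term)) * idf(docFreq, numDocs);
    }

    // The "switch": score from the document contents only, no corpus statistics.
    static double tfOnlyScore(List<String> doc, String term) {
        return tf(termFreq(doc, term));
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("lucene", "search", "lucene");
        // Same document, two collections with different df for the term.
        System.out.println("tf-idf, collection A: " + tfIdfScore(doc, "lucene", 10, 1000));
        System.out.println("tf-idf, collection B: " + tfIdfScore(doc, "lucene", 500, 1000));
        System.out.println("tf-only, either one:  " + tfOnlyScore(doc, "lucene"));
    }
}
```

The same document gets two different tf-idf scores depending on which collection it sits in, while the tf-only score is identical in both, which is what makes it directly comparable across collections.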

The alternative is to have some mechanism to define and work with a "super-corpus" or "super-collection" that integrates the df for multiple corpora, but... df is calculated or updated for the overall corpus, so a cross-corpus df would require recalculating df for all terms in the index whenever the multi-corpus structure changes, which can work in some cases, but not for things like distributed searches for Solr. That might be a superior solution, but might not be as easy or as performant as a simple non-df similarity approach.
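A rough sketch of what integrating df across collections could look like, again in plain Java rather than any Lucene or Solr API. It assumes each collection can report its document count and per-term df, and that no document lives in more than one collection (otherwise the sums overcount). All names here are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; plain Java, not part of Lucene or Solr.
final class GlobalDfSketch {

    // Per-collection statistics: document count and per-term document frequency.
    static final class CollectionStats {
        final long numDocs;
        final Map<String, Long> docFreq;

        CollectionStats(long numDocs, Map<String, Long> docFreq) {
            this.numDocs = numDocs;
            this.docFreq = docFreq;
        }
    }

    // Build "super-collection" statistics by summing counts.
    // Assumption: the collections are disjoint.
    static CollectionStats merge(List<CollectionStats> parts) {
        long total = 0;
        Map<String, Long> df = new HashMap<>();
        for (CollectionStats p : parts) {
            total += p.numDocs;
            for (Map.Entry<String, Long> e : p.docFreq.entrySet()) {
                df.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return new CollectionStats(total, df);
    }

    // Classic-style idf computed from whichever statistics are passed in.
    static double idf(CollectionStats stats, String term) {
        long df = stats.docFreq.getOrDefault(term, 0L);
        return 1.0 + Math.log((double) stats.numDocs / (double) (df + 1));
    }

    public static void main(String[] args) {
        Map<String, Long> dfA = new HashMap<>();
        dfA.put("lucene", 10L);
        Map<String, Long> dfB = new HashMap<>();
        dfB.put("lucene", 500L);
        CollectionStats a = new CollectionStats(1000, dfA);
        CollectionStats b = new CollectionStats(1000, dfB);
        CollectionStats all = merge(Arrays.asList(a, b));
        System.out.println("idf in A:      " + idf(a, "lucene"));
        System.out.println("idf in B:      " + idf(b, "lucene"));
        System.out.println("idf in merged: " + idf(all, "lucene"));
    }
}
```

Scoring every collection against the merged statistics gives comparable scores, but the merge has to be redone whenever a collection is added, removed, or changed, which is the cost I mentioned above.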

It might also be nice for apps to offer users pure-tf scoring if it provides faster search results, and then the user could click on a "refine results" button to re-do the search with the more expensive cross-corpus df-based scoring.


Thoughts?

-- Jack Krupansky

-----Original Message-----
From: Baldwin, David

Sent: Friday, September 5, 2014 8:05 PM
To: java-user@lucene.apache.org

Subject: How to properly correlate relevance in a search across multiple collections

I have a project where there are multiple collections (could be dozens at times), and a single result set needs to be generated by applying the same search criteria to each collection directory and then correlating all the sub-searches into a single result set with correlated relevance.

Does anyone have any good experience with this and could they share some tidbits or info I may not have run across yet?


-David


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org

For additional commands, e-mail: java-user-h...@lucene.apache.org

