An observation: df (document frequency), and the IDF derived from it, is a
key driver of the whole relevancy framework on which stock Lucene is based.
There is no question about its significant value. But... that means that we
can't blindly "correlate" relevancy between "collections", in large part
because document scores are so heavily driven by df, which is specific to
the corpus of each collection.
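To make the corpus dependence concrete, here is a rough sketch (plain Java, no Lucene dependency) of the classic TF-IDF pieces: tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), and a single-term score of tf * idf^2. The class name, the collection sizes, and the df values are made up for illustration, and norms, boosts, coord, and queryNorm are left out.

```java
// Illustrative only: a stripped-down model of classic TF-IDF scoring,
// not the actual Lucene Similarity implementation.
class IdfDrift {
    // Term-frequency factor: square root of the in-document frequency.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Inverse document frequency; depends only on corpus-level statistics.
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Simplified single-term score: tf * idf^2 (norms and boosts omitted).
    static double score(int freq, long docFreq, long numDocs) {
        double idf = idf(docFreq, numDocs);
        return tf(freq) * idf * idf;
    }

    public static void main(String[] args) {
        int freq = 4; // the same document content in both cases
        // Hypothetical statistics: the term is rare in A, common in B.
        double inA = score(freq, 9, 1_000);
        double inB = score(freq, 499, 1_000);
        System.out.printf("collection A: %.3f%n", inA);
        System.out.printf("collection B: %.3f%n", inB);
    }
}
```

The identical document scores much higher in the collection where the term happens to be rare, so the two numbers can't be compared directly.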
My modest proposal: As valuable as df-based relevancy is, offer an
easy-to-use "switch" to drop back to a pure tf-based relevancy score
(primarily tf, though it can include other factors as long as they are
limited to the contents of the document itself) to sidestep these
corpus-dependent scores. In other words, the score of a document would
depend only on the contents of the document itself, not the corpus. Yes,
that's a major loss of relevance, but the benefits for operations in a
multi-corpus, distributed world can be substantial.
Yes, you can do this yourself by just plugging in your own custom
"similarity" class, but it should be offered as a much easier-to-use
"switch" for Lucene itself (and Solr too!).
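A minimal sketch of what such a switch could amount to, again in plain Java rather than against the real Lucene API. The class name `SwitchableScorer` and the `useIdf` flag are hypothetical. In Lucene itself, the do-it-yourself version would presumably be a `TFIDFSimilarity` subclass whose `idf(long docFreq, long numDocs)` returns a constant 1.

```java
// Hypothetical model of a scorer with an IDF on/off switch.
// Not a Lucene class; it only mirrors the shape of the classic formula.
class SwitchableScorer {
    private final boolean useIdf; // the proposed "switch"

    SwitchableScorer(boolean useIdf) {
        this.useIdf = useIdf;
    }

    // With the switch off, idf collapses to a constant, so corpus
    // statistics no longer influence the score at all.
    double idf(long docFreq, long numDocs) {
        if (!useIdf) {
            return 1.0;
        }
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Simplified single-term score: sqrt(freq) * idf^2.
    double score(int freq, long docFreq, long numDocs) {
        double idf = idf(docFreq, numDocs);
        return Math.sqrt(freq) * idf * idf;
    }

    public static void main(String[] args) {
        SwitchableScorer pureTf = new SwitchableScorer(false);
        // Same document, two collections with very different df values.
        System.out.println(pureTf.score(4, 9, 1_000));
        System.out.println(pureTf.score(4, 499, 1_000));
    }
}
```

With the switch off, both calls return the same value, which is the whole point: scores from different collections become directly comparable, at the cost of no longer favoring rare terms.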
The alternative is to have some mechanism to define and work with a
"super-corpus" or "super-collection" that integrates the df for multiple
corpora, but... df is calculated or updated for the overall corpus, so a
cross-corpus df would require recalculating df for all terms in the index
whenever the multi-corpus structure changes, which can work in some cases,
but not for things like distributed searches for Solr. That might be a
superior solution, but might not be as easy or as performant as a simple
non-df similarity approach.
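For comparison, the "super-corpus" idea boils down to something like the following sketch: sum the document counts and the per-term df across collections, then compute idf from the merged numbers. All names here are hypothetical, and it assumes the collections are disjoint (no document lives in more than one of them); otherwise simple summing would over-count.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of merging per-collection statistics into a
// "super-corpus". Assumes collections do not share documents.
class SuperCorpusStats {
    // Statistics for one collection: document count plus df per term.
    static final class CollectionStats {
        final long numDocs;
        final Map<String, Long> docFreq;

        CollectionStats(long numDocs, Map<String, Long> docFreq) {
            this.numDocs = numDocs;
            this.docFreq = docFreq;
        }
    }

    // Sum document counts and per-term df across all collections.
    static CollectionStats merge(List<CollectionStats> parts) {
        long numDocs = 0;
        Map<String, Long> df = new HashMap<>();
        for (CollectionStats part : parts) {
            numDocs += part.numDocs;
            for (Map.Entry<String, Long> e : part.docFreq.entrySet()) {
                df.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return new CollectionStats(numDocs, df);
    }

    // Classic-style idf computed from whichever statistics are passed in.
    static double idf(CollectionStats stats, String term) {
        long df = stats.docFreq.getOrDefault(term, 0L);
        return 1.0 + Math.log((double) stats.numDocs / (df + 1));
    }
}
```

The merge itself is cheap; the catch is that it has to be redone (or kept up to date) whenever any member collection changes or the set of collections changes, and every searcher has to score against the merged numbers rather than its local ones.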
It might also be nice for apps to offer users pure-tf scoring if it provides
faster search results, and then the user could click on a "refine results"
button to re-do the search with the more expensive cross-corpus df-based
scoring.
Thoughts?
-- Jack Krupansky
-----Original Message-----
From: Baldwin, David
Sent: Friday, September 5, 2014 8:05 PM
To: java-user@lucene.apache.org
Subject: How to properly correlate relevance in a search across multiple
collections
I have a project where there are multiple collections (could be dozens at
times) and a single result set needs to be generated by applying the same
search criteria to each collection directory and then correlating all the
sub-searches into a single result set with correlated relevance.
Does anyone have any good experience with this, and could they share some
tidbits or info I may not have run across yet?
-David
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org