I've searched the Web and the Lucene threads a little for a BM25 similarity
(or any one "better" than tf/idf), but I can't tell whether it has been
implemented or when it will be (LUCENE-965).

One of the difficulties with BM25 itself seems to lie in its need for the
average document length (length as a number of tokens) as a parameter, which
is not one of the readily available statistics (idf, tf, numDocs are). It
seems possible within Solr to keep some statistics per field fairly easily:
if we kept the sum of the number of tokens per field up to date (hooking
add/delete/... operations), then since we know the number of documents, the
average would be easy to compute, and so would BM25.
Since Solr already uses its "own" files (schema/config/multicore) and a
SolrCore is always involved in all operations, book-keeping and serializing
those statistics should not be an issue (at least a smaller issue than
changing the Lucene index serialization or generalizing payloads, imho...).
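
To make the book-keeping idea concrete, here is a minimal sketch in plain
Java. The class and method names (FieldLengthStats, onAdd, onDelete) are
hypothetical and do not exist in Solr or Lucene; it only shows the running
sum and the derived average, not how it would be hooked into a SolrCore or
serialized.

```java
// Hypothetical per-field statistics holder; not an existing Solr/Lucene class.
final class FieldLengthStats {
    private long totalTokens; // sum of token counts over live documents
    private long numDocs;     // number of live documents counted so far

    // To be called when a document with `tokenCount` tokens in this field is added.
    synchronized void onAdd(int tokenCount) {
        totalTokens += tokenCount;
        numDocs++;
    }

    // To be called when such a document is deleted.
    // Assumption: the caller can still find out the deleted document's token count.
    synchronized void onDelete(int tokenCount) {
        totalTokens -= tokenCount;
        numDocs--;
    }

    // Average field length in tokens; 0 when no document has been counted.
    synchronized double averageLength() {
        return numDocs == 0 ? 0.0 : (double) totalTokens / numDocs;
    }
}
```

One open point in this sketch is the delete hook: it assumes the token count
of the deleted document is still known at that time, which may require
storing it somewhere.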

On the Similarity itself, my current understanding is that the only methods
that would need to change from DefaultSimilarity are idf() and tf(); can
anyone confirm this? I noticed that there is already a SolrSimilarity class
that is not used but could certainly be harnessed to carry the statistics
around and/or allow per-field similarity instantiation (adding a
"similarity" in the Field configuration).
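
For reference, this is roughly what the two pieces would compute, written as
a standalone sketch in plain Java rather than as a Similarity subclass. The
class name Bm25Sketch, the constants k1 = 1.2 and b = 0.75 (commonly quoted
defaults) and the classic Robertson/Sparck Jones form of the idf are
assumptions on my side, not something taken from LUCENE-965.

```java
// Standalone BM25 sketch; hypothetical class, not part of Lucene or Solr.
final class Bm25Sketch {
    static final double K1 = 1.2;  // assumed term-frequency saturation parameter
    static final double B = 0.75;  // assumed length-normalization parameter

    // Classic BM25 idf; it can become negative when docFreq > numDocs / 2.
    static double idf(long docFreq, long numDocs) {
        return Math.log((numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Saturating tf component; unlike a plain tf it needs the document
    // length and the average document length (both in tokens).
    static double tf(double freq, double docLength, double avgDocLength) {
        double norm = K1 * (1.0 - B + B * docLength / avgDocLength);
        return freq * (K1 + 1.0) / (freq + norm);
    }

    // Contribution of a single query term to a document's score.
    static double termScore(double freq, double docLength, double avgDocLength,
                            long docFreq, long numDocs) {
        return idf(docFreq, numDocs) * tf(freq, docLength, avgDocLength);
    }
}
```

Note that the tf part here takes the document length as an argument, whereas
DefaultSimilarity.tf(float freq) only receives the frequency as far as I can
see, so the length would have to reach the Similarity some other way.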

Does it make (any) sense to try implementing this within Solr, or should I
just forget about it?
As a more general note, does it make sense to use Solr as a "research"
playground for similarities instead of Lucene? Or is this the "wrong" level
(i.e. Lucene being a better one)?


-- 
View this message in context: 
http://www.nabble.com/BM25---field-configurable-similarity---tp15901233p15901233.html
Sent from the Solr - Dev mailing list archive at Nabble.com.
