I've been searching the Web and the Lucene threads for a BM25 similarity (or any "better" one than tf/idf), but I can't tell whether it has been implemented or when it will be (LUCENE-965).
One of the difficulties with BM25 itself seems to lie in its need for the average document length (length as a number of tokens) as a parameter, which is not one of the statistics readily available (idf, tf and numDocs are). It seems possible within Solr to keep some statistics per field fairly easily: if we kept the sum of the number of tokens per field up to date (by hooking the add/delete/... operations), then, since we know the number of documents, the average would be easy to compute, and so would BM25.

Since Solr already uses its "own" files (schema/config/multicore) and a SolrCore is always involved in all operations, book-keeping and serializing those statistics should not be an issue (at least a smaller issue than changing the Lucene index serialization or generalizing payloads, imho...).

On the Similarity itself, my current understanding is that the only methods that would need to change from DefaultSimilarity are idf() and tf(); can anyone confirm this? I noticed that there is already a SolrSimilarity class that is not used, but it could certainly be harnessed to carry the statistics around and/or allow per-field similarity instantiation (adding a "similarity" attribute to the field configuration).

Does it make (any) sense to try implementing this within Solr, or should I just forget about it?

As a more general note, does it make sense to use Solr as a "research" playground for similarities instead of Lucene? Or is this the "wrong" level (i.e. Lucene being the better one)?
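To make the idea concrete, here is a minimal sketch in plain Java of the book-keeping and the scoring it would enable. Everything in it is an assumption for illustration: FieldLengthStats and Bm25Sketch are hypothetical names, not existing Solr or Lucene classes; k1 = 1.2 and b = 0.75 are just commonly quoted BM25 values; and the idf shown is one common BM25 variant, not the one in DefaultSimilarity.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-field book-keeping; not an existing Solr or Lucene class. */
class FieldLengthStats {
    private final Map<String, Long> totalTokens = new HashMap<>();
    private final Map<String, Long> docCounts = new HashMap<>();

    /** Would be called from an "add" hook with the field's token count. */
    void onAdd(String field, int tokenCount) {
        totalTokens.merge(field, (long) tokenCount, Long::sum);
        docCounts.merge(field, 1L, Long::sum);
    }

    /** Would be called from a "delete" hook; assumes the deleted
     *  document's token count is known at that point. */
    void onDelete(String field, int tokenCount) {
        totalTokens.merge(field, (long) -tokenCount, Long::sum);
        docCounts.merge(field, -1L, Long::sum);
    }

    /** Average number of tokens per document for the field, 0 if empty. */
    double averageLength(String field) {
        long docs = docCounts.getOrDefault(field, 0L);
        if (docs <= 0) {
            return 0.0;
        }
        return (double) totalTokens.getOrDefault(field, 0L) / docs;
    }
}

/** Hypothetical BM25 term scoring using the statistics above. */
class Bm25Sketch {
    static final double K1 = 1.2;  // assumed term-frequency saturation
    static final double B = 0.75;  // assumed length-normalization strength

    /** One common BM25 idf variant (always non-negative). */
    static double idf(long docFreq, long numDocs) {
        return Math.log(1.0 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** Score contribution of one term in one document. */
    static double termScore(double tf, long docFreq, long numDocs,
                            int docLength, double avgLength) {
        if (avgLength <= 0.0) {
            return 0.0; // no statistics yet
        }
        // Length normalization: longer-than-average documents are penalized.
        double norm = K1 * (1.0 - B + B * docLength / avgLength);
        return idf(docFreq, numDocs) * tf * (K1 + 1.0) / (tf + norm);
    }
}
```

In this sketch the term-frequency part depends on the document's own length as well as on the average, so it cannot be computed from the raw term frequency alone; how that maps onto the existing Similarity methods is exactly the open question above.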
