Hi Markus and Wunder, I'm missing the original context, but I don't think BM25 will solve this particular problem.

The k1 parameter sets how quickly the contribution of tf to the score falls off with increasing tf. It would be helpful for making sure really long documents don't get too high a score, but I don't think it would help for very short documents without messing up its original design purpose.

For BM25, if you want to turn off length normalization, you set "b" to 0. However, I don't think that will do what you want, since turning off normalization will mean that the score for "new york, new york" will be higher than the score for "new york": without normalization, the tf in "new york new york" is twice that in "new york".

I think the earlier suggestion to "override tfidfsimilarity and emit 1f in tf()" is probably the best way to eliminate the use of tf counts, assuming that is really what you want.

Tom

On Tue, Apr 1, 2014 at 4:17 PM, Walter Underwood <wun...@wunderwood.org> wrote:

> Thanks! We'll try that out and report back. I keep forgetting that I want
> to try BM25, so this is a good excuse.
>
> wunder
>
> On Apr 1, 2014, at 12:30 PM, Markus Jelsma <markus.jel...@openindex.io>
> wrote:
>
> > Also, if I remember correctly, k1 set to zero for bm25 automatically
> > omits norms in the calculation. So that's easy to play with without
> > reindexing.
> >
> > Markus Jelsma <markus.jel...@openindex.io> wrote: Yes, override
> > tfidfsimilarity and emit 1f in tf(). You can also use bm25 with k1 set to
> > zero in your schema.
> >
> > Walter Underwood <wun...@wunderwood.org> wrote: And here is another
> > peculiarity of short text fields.
> >
> > The movie "New York, New York" should not be twice as relevant for the
> > query "new york". Is there a way to use a binary term frequency rather than
> > a count?
> >
> > wunder
> > --
> > Walter Underwood
> > wun...@wunderwood.org
>
> --
> Walter Underwood
> wun...@wunderwood.org
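
To make the k1 and b discussion above concrete, here is a rough Python sketch of the textbook BM25 term-frequency factor (IDF left out). It is only an illustration, not Lucene code: the function name bm25_tf_factor, the token counts, and the average field length of 3 are all made up for the example, and I am assuming Lucene's BM25Similarity follows the same general formula.

```python
def bm25_tf_factor(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Textbook BM25 term-frequency factor, without the IDF part.

    Hypothetical helper for illustration; not a Lucene or Solr API.
    """
    if tf == 0:
        return 0.0
    # Length normalization: equals 1.0 when b == 0 (normalization off).
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # k1 controls how fast the factor saturates as tf grows.
    return tf * (k1 + 1.0) / (tf + k1 * length_norm)


# Assumed toy data: "new york" has 2 tokens (tf=1 for "new"),
# "new york new york" has 4 tokens (tf=2). avg_len is an assumption.
avg_len = 3.0

# b = 0: length normalization off, tf still counts.
short_no_norm = bm25_tf_factor(1, 2, avg_len, b=0.0)
long_no_norm = bm25_tf_factor(2, 4, avg_len, b=0.0)

# k1 = 0: the tf count and the length norm both drop out.
short_binary = bm25_tf_factor(1, 2, avg_len, k1=0.0)
long_binary = bm25_tf_factor(2, 4, avg_len, k1=0.0)

print(short_no_norm, long_no_norm)
print(short_binary, long_binary)
```

With b = 0 the factor is 1.0 for tf = 1 and 1.375 for tf = 2, so the repeated title still scores higher. With k1 = 0 the factor is 1.0 for any tf > 0, which is effectively a binary tf, but it also gives up the tf saturation that k1 was designed to control.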