My problem is that maxDoc() and docCount() both include deleted documents in their values. Because of merging etc., those numbers can differ per replica (or at least that is what I'm seeing). I need a value that is consistent across replicas. I see the comment mentions not using IndexReader.numDocs(), but there doesn't seem to be a way to get hold of the IndexReader within a Similarity implementation (only TermStatistics and CollectionStatistics are passed in, and neither contains a reference to the reader).
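
To make the discrepancy concrete, here is a minimal sketch of the arithmetic. It assumes the classic idf formula used by Lucene 4.x DefaultSimilarity, 1 + ln(docCount / (docFreq + 1)). The class name IdfReplicaDrift and all document counts are hypothetical, chosen only for illustration.

```java
// Sketch only: reproduces the classic idf formula to show how replicas can drift.
// All counts below are made up; they are not measurements from a real index.
class IdfReplicaDrift {

    // idf = 1 + ln(docCount / (docFreq + 1))
    static float idf(long docFreq, long docCount) {
        return (float) (Math.log(docCount / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        // Assumed equal on both replicas for simplicity; in practice docFreq
        // can also count deleted documents until segments are merged.
        long docFreq = 1_000;

        long maxDocReplicaA = 1_000_000; // replica A: deletes already merged away
        long maxDocReplicaB = 1_050_000; // replica B: still carries 50,000 deleted docs

        // Same term, same live documents, different idf because maxDoc differs.
        System.out.println("replica A idf = " + idf(docFreq, maxDocReplicaA));
        System.out.println("replica B idf = " + idf(docFreq, maxDocReplicaB));
    }
}
```

The two printed values differ even though both replicas hold the same live documents, which is why the same query/doc pair can score differently depending on which replica answers.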

I am contemplating just using a static value for the "number of docs", as this won't change dramatically very often.

steve

On Wed, Mar 12, 2014 at 11:18 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:

> Hi Steve - it seems most similarities use CollectionStatistics.maxDoc() in
> idfExplain but there's also a docCount(). We use docCount in all our custom
> similarities, also because it allows you to have multiple languages in one
> index where one is much larger than the other. The small language will have
> very high IDF scores using maxDoc but they are proportional enough using
> docCount(). Using docCount() also fixes SolrCloud ranking problems, unless
> one of your replicas becomes inconsistent ;)
>
> https://lucene.apache.org/core/4_7_0/core/org/apache/lucene/search/CollectionStatistics.html#docCount%28%29
>
> > -----Original message-----
> > From: Steven Bower <smb-apa...@alcyon.net>
> > Sent: Wednesday 12th March 2014 16:08
> > To: solr-user <solr-user@lucene.apache.org>
> > Subject: IDF maxDocs / numDocs
> >
> > I am noticing the maxDocs between replicas is consistently different and
> > that in the idf calculation it is used, which causes idf scores for the same
> > query/doc between replicas to be different. Obviously an optimize can
> > normalize the maxDocs scores, but that is only temporary. Is there a way
> > to have idf use numDocs instead (as it should be consistent across
> > replicas)?
> >
> > thanks,
> >
> > steve