My problem is that maxDoc() and docCount() both include deleted documents
in their values. Because of merging etc., those numbers can differ per
replica (or at least that is what I'm seeing). I need a value that is
consistent across replicas. I see the comment mentions not using
IndexReader.numDocs(), but there doesn't seem to be a way to get hold of
the IndexReader within a Similarity implementation (only TermStatistics and
CollectionStatistics are passed in, and neither contains a reference to the
reader).

I am contemplating just using a static value for the "number of docs", as
it won't change dramatically very often.
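
To illustrate the effect, here is a rough sketch in plain Java (no Lucene
dependency). It assumes the classic idf formula 1 + ln(numDocs / (docFreq + 1)),
which is what I understand DefaultSimilarity to use in 4.x. The class name,
the FIXED_DOC_COUNT constant, and all the numbers are hypothetical and only
there for illustration. It also assumes docFreq is the same on both replicas,
which may not hold in practice.

```java
// Sketch only: shows how idf drifts when the "number of docs" input differs
// per replica, and how a fixed value removes that source of drift.
class IdfSketch {
    // Hypothetical hand-picked collection size (an assumption, not a Lucene setting).
    static final long FIXED_DOC_COUNT = 1_000_000L;

    // Classic TF-IDF style idf: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        long docFreq = 500L;               // assumed identical on both replicas
        long maxDocReplicaA = 1_050_000L;  // made-up value, includes deleted docs
        long maxDocReplicaB = 1_020_000L;  // made-up value, includes deleted docs

        // Different maxDoc per replica gives different idf for the same term.
        System.out.println("replica A idf: " + idf(docFreq, maxDocReplicaA));
        System.out.println("replica B idf: " + idf(docFreq, maxDocReplicaB));

        // A fixed value gives the same idf wherever the query is executed.
        System.out.println("fixed idf:     " + idf(docFreq, FIXED_DOC_COUNT));
    }
}
```

In a real Similarity the fixed value would stand in for
collectionStats.maxDoc() inside idfExplain. The trade-off is that the constant
has to be revisited by hand if the index grows or shrinks substantially.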

steve


On Wed, Mar 12, 2014 at 11:18 AM, Markus Jelsma
<markus.jel...@openindex.io> wrote:

> Hi Steve - it seems most similarities use CollectionStatistics.maxDoc() in
> idfExplain but there's also a docCount(). We use docCount in all our custom
> similarities, also because it allows you to have multiple languages in one
> index where one is much larger than the other. The small language will have
> very high IDF scores using maxDoc but they are proportional enough using
> docCount(). Using docCount() also fixes SolrCloud ranking problems, unless
> one of your replicas becomes inconsistent ;)
>
>
> https://lucene.apache.org/core/4_7_0/core/org/apache/lucene/search/CollectionStatistics.html#docCount%28%29
>
>
>
> -----Original message-----
> > From:Steven Bower <smb-apa...@alcyon.net>
> > Sent: Wednesday 12th March 2014 16:08
> > To: solr-user <solr-user@lucene.apache.org>
> > Subject: IDF maxDocs / numDocs
> >
> > I am noticing that maxDocs is consistently different between replicas, and
> > that it is used in the idf calculation, which causes idf scores for the
> > same query/doc to differ between replicas. Obviously an optimize can
> > normalize the maxDocs values, but that is only temporary. Is there a way
> > to have idf use numDocs instead (as it should be consistent across
> > replicas)?
> >
> > thanks,
> >
> > steve
> >
>
