Hi Markus,
won't the problem still be present across shards without distributed IDF? If documents are distributed unevenly, each shard will compute a different IDF for the same term (and field). As for the performance penalty that Walter highlighted, I definitely see room for contribution there, but I am not sure anyone is looking into it right now.
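
To make the skew concrete, here is a minimal Python sketch that recomputes the IDF values from the explain output quoted further down, using the formula printed there. It assumes the two documents live on different shards and that each (n, N) pair is the local statistics of the hosting shard. The shard names are made up, and the "global" figure simply pools these two shards for illustration (the real collection has 4 shards), so it is not what ExactStatsCache would actually return.

```python
import math


def bm25_idf(n: int, N: int) -> float:
    """IDF as printed in the explain output: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


# (n, N) for stemmed_data.timenote.narratives:remedi, copied from the explain output.
# Assumption: each pair is the local term statistics of the shard hosting that document.
# The shard names are hypothetical.
shard_stats = {
    "shard_of_doc1": (52, 125),
    "shard_of_doc2": (60, 115),
}

# Without distributed IDF, each shard scores with its own local statistics.
local_idf = {name: bm25_idf(n, N) for name, (n, N) in shard_stats.items()}

# Illustration only: pool the statistics of these two shards, as if they were
# the whole collection, to get a single IDF shared by both documents.
pooled_n = sum(n for n, _ in shard_stats.values())
pooled_N = sum(N for _, N in shard_stats.values())
pooled_idf = bm25_idf(pooled_n, pooled_N)

if __name__ == "__main__":
    for name, idf in local_idf.items():
        print(f"{name}: local idf = {idf:.6f}")
    print(f"pooled idf (two shards only) = {pooled_idf:.6f}")
```

The two local values reproduce the idf lines in the explain output, which is consistent with each shard scoring against its own statistics.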

Cheers
--------------------------
Alessandro Benedetti
Apache Lucene/Solr Committer
Director, R&D Software Engineer, Search Consultant

www.sease.io


On Mon, 6 Dec 2021 at 12:07, Markus Jelsma <markus.jel...@openindex.io> wrote:

> Hello Sjoerd,
>
> ExactStatsCache indeed works fine when replicas of the same shard do not
> share identical term stats, but it comes with some overhead. If you can,
> upgrade to at least 7.x and change the default NRT replica types to TLOG.
> You then no longer need to use ExactStatsCache because replicas will be
> identical.
>
> Regards,
> Markus
>
> On Mon, 6 Dec 2021 at 12:09, Alessandro Benedetti <a.benede...@sease.io> wrote:
>
> > Good to know you solved it!
> > Yes, Distributed IDF is definitely a problem in case you have skewed
> > document distributions.
> >
> > Cheers
> > --------------------------
> > Alessandro Benedetti
> > Apache Lucene/Solr Committer
> > Director, R&D Software Engineer, Search Consultant
> >
> > www.sease.io
> >
> >
> > On Sun, 5 Dec 2021 at 17:19, Sjoerd Smeets <ssme...@gmail.com> wrote:
> >
> > > Found it!
> > >
> > > I had to enable the ExactStatsCache.
> > >
> > > Found a description over here. Thanks for pointing me in the right
> > > direction.
> > >
> > > https://solr.pl/en/2019/05/20/distributed-idf/
> > >
> > >
> > > On Sun, Dec 5, 2021 at 11:09 AM Sjoerd Smeets <ssme...@gmail.com> wrote:
> > >
> > >> Hi Alessandro,
> > >>
> > >> Thanks for your reply! Yes, the documents are in the same result list, and
> > >> I'm not doing any indexing at the moment and executed a commit just to be
> > >> sure. Still the same result. It is an environment with 4 shards. Perhaps
> > >> that plays a factor?
> > >>
> > >> Thanks,
> > >> Sjoerd
> > >>
> > >> On Sun, Dec 5, 2021 at 11:02 AM Alessandro Benedetti <a.benede...@sease.io> wrote:
> > >>
> > >>> It seems like the underlying index changed.
> > >>> Are those two documents in the same result set?
> > >>> Is it just one query?
> > >>> It's definitely curious: even if a commit happened, search results are
> > >>> consistent within one searcher.
> > >>>
> > >>>
> > >>> On Sun, 5 Dec 2021, 16:28 Sjoerd Smeets, <ssme...@gmail.com> wrote:
> > >>>
> > >>>> Hi all,
> > >>>>
> > >>>> I'm debugging the relevancy scores of my query and I see the following for
> > >>>> two document hits. My question is, why is the idf score not the same for
> > >>>> both documents? This is Solr 6.6.
> > >>>>
> > >>>> Any guidance would be much appreciated.
> > >>>>
> > >>>> Thanks!
> > >>>>
> > >>>> *Doc1*
> > >>>> "71d72354eea23b9eae934ab616e8ce38de69d760": "
> > >>>> 104.994415 = sum of:
> > >>>>   104.994415 = sum of:
> > >>>>     82.89969 = weight(stemmed_data.timenote.narratives:remedi in 22470) [SchemaSimilarity], result of:
> > >>>>       82.89969 = score(freq=9.0), computed as boost * idf * tf from:
> > >>>>         100.0 = boost
> > >>>>         0.87546873 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
> > >>>>           *52 = n, number of documents containing term*
> > >>>>           *125 = N, total number of documents with field*
> > >>>>         0.9469177 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
> > >>>>           9.0 = freq, occurrences of term within document
> > >>>>           1.2 = k1, term saturation parameter
> > >>>>           0.75 = b, length normalization parameter
> > >>>>           12312.0 = dl, length of field (approximate)
> > >>>>           54179.03 = avgdl, average length of field
> > >>>>     22.09473 = weight(stemmed_data.timenote.matters:remedi in 22470) [SchemaSimilarity], result of:
> > >>>>       22.09473 = score(freq=4.0), computed as boost * idf * tf from:
> > >>>>         10.0 = boost
> > >>>>         2.4308395 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
> > >>>>           *9 = n, number of documents containing term*
> > >>>>           *107 = N, total number of documents with field*
> > >>>>         0.9089341 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
> > >>>>           4.0 = freq, occurrences of term within document
> > >>>>           1.2 = k1, term saturation parameter
> > >>>>           0.75 = b, length normalization parameter
> > >>>>           5656.0 = dl, length of field (approximate)
> > >>>>           50520.543 = avgdl, average length of field
> > >>>>   0.0 = FunctionQuery(int(s_integer_search.previews)), product of:
> > >>>>     0.0 = int(s_integer_search.previews)=0
> > >>>>     1.0 = boost
> > >>>>   0.0 = FunctionQuery(int(s_integer_search.downloads)), product of:
> > >>>>     0.0 = int(s_integer_search.downloads)=0
> > >>>>     1.0 = boost
> > >>>> "
> > >>>>
> > >>>> *Doc2*
> > >>>> "80302a1ecc44d1e556970ab96c25b1fd3328a854": "
> > >>>> 84.61461 = sum of:
> > >>>>   84.61461 = sum of:
> > >>>>     64.68881 = weight(stemmed_data.timenote.narratives:remedi in 0) [SchemaSimilarity], result of:
> > >>>>       64.68881 = score(freq=493.0), computed as boost * idf * tf from:
> > >>>>         100.0 = boost
> > >>>>         0.65094686 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
> > >>>>           *60 = n, number of documents containing term*
> > >>>>           *115 = N, total number of documents with field*
> > >>>>         0.99376476 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
> > >>>>           493.0 = freq, occurrences of term within document
> > >>>>           1.2 = k1, term saturation parameter
> > >>>>           0.75 = b, length normalization parameter
> > >>>>           229400.0 = dl, length of field (approximate)
> > >>>>           73913.91 = avgdl, average length of field
> > >>>>     19.9258 = weight(stemmed_data.timenote.matters:remedi in 0) [SchemaSimilarity], result of:
> > >>>>       19.9258 = score(freq=340.0), computed as boost * idf * tf from:
> > >>>>         10.0 = boost
> > >>>>         2.0024805 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
> > >>>>           *13 = n, number of documents containing term*
> > >>>>           *99 = N, total number of documents with field*
> > >>>>         0.99505585 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
> > >>>>           340.0 = freq, occurrences of term within document
> > >>>>           1.2 = k1, term saturation parameter
> > >>>>           0.75 = b, length normalization parameter
> > >>>>           147480.0 = dl, length of field (approximate)
> > >>>>           95534.95 = avgdl, average length of field
> > >>>>   0.0 = FunctionQuery(int(s_integer_search.previews)), product of:
> > >>>>     0.0 = int(s_integer_search.previews)=0
> > >>>>     1.0 = boost
> > >>>>   0.0 = FunctionQuery(int(s_integer_search.downloads)), product of:
> > >>>>     0.0 = int(s_integer_search.downloads)=0
> > >>>>     1.0 = boost
> > >>>> "
> > >>>>
> > >>>
> > >