Re: Relevancy debugging - idf score

Markus Jelsma Mon, 06 Dec 2021 04:07:12 -0800

Hello Sjoerd,

ExactStatsCache indeed works fine when replicas of the same shard do not
share identical term stats, but it comes with some overhead. If you can,
upgrade to at least 7.x and change the default NRT replica types to TLOG.
You then no longer need to use ExactStatsCache because replicas will be
identical.


Regards,
Markus

On Mon, 6 Dec 2021 at 12:09, Alessandro Benedetti <
a.benede...@sease.io> wrote:

> Good to know you solved it!
> Yes, distributed IDF is definitely a problem when documents are
> distributed unevenly across shards.
>
> Cheers
> --------------------------
> Alessandro Benedetti
> Apache Lucene/Solr Committer
> Director, R&D Software Engineer, Search Consultant
>
> www.sease.io
>
>
> On Sun, 5 Dec 2021 at 17:19, Sjoerd Smeets <ssme...@gmail.com> wrote:
>
> > Found it!
> >
> > I had to enable the ExactStatsCache.
> >
> > Found a description over here. Thanks for pointing me in the right
> > direction.
> >
> > https://solr.pl/en/2019/05/20/distributed-idf/
> >
> >
> > On Sun, Dec 5, 2021 at 11:09 AM Sjoerd Smeets <ssme...@gmail.com> wrote:
> >
> >> Hi Alessandro,
> >>
> >> Thanks for your reply! Yes, the documents are in the same result list. I'm
> >> not doing any indexing at the moment, and I executed a commit just to be
> >> sure. Still the same result. It is an environment with 4 shards. Perhaps
> >> that is a factor?
> >>
> >> Thanks,
> >> Sjoerd
> >>
> >> On Sun, Dec 5, 2021 at 11:02 AM Alessandro Benedetti <
> >> a.benede...@sease.io> wrote:
> >>
> >>> It seems like the underlying index changed.
> >>> Are those two documents in the same result set?
> >>> Is it just one query?
> >>> It's definitely curious: even if a commit happened, search results are
> >>> consistent within one searcher.
> >>>
> >>>
> >>> On Sun, 5 Dec 2021, 16:28 Sjoerd Smeets, <ssme...@gmail.com> wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> I'm debugging the relevancy scores of my query and I see the following for
> >>>> two document hits. My question is: why is the idf score not the same for
> >>>> both documents? This is Solr 6.6.
> >>>>
> >>>> Any guidance would be much appreciated.
> >>>>
> >>>> Thanks!
> >>>>
> >>>> *Doc1*
> >>>> "71d72354eea23b9eae934ab616e8ce38de69d760": "
> >>>> 104.994415 = sum of:
> >>>>   104.994415 = sum of:
> >>>>     82.89969 = weight(stemmed_data.timenote.narratives:remedi in 22470) [SchemaSimilarity], result of:
> >>>>       82.89969 = score(freq=9.0), computed as boost * idf * tf from:
> >>>>         100.0 = boost
> >>>>         0.87546873 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
> >>>>           *52 = n, number of documents containing term*
> >>>>           *125 = N, total number of documents with field*
> >>>>         0.9469177 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
> >>>>           9.0 = freq, occurrences of term within document
> >>>>           1.2 = k1, term saturation parameter
> >>>>           0.75 = b, length normalization parameter
> >>>>           12312.0 = dl, length of field (approximate)
> >>>>           54179.03 = avgdl, average length of field
> >>>>     22.09473 = weight(stemmed_data.timenote.matters:remedi in 22470) [SchemaSimilarity], result of:
> >>>>       22.09473 = score(freq=4.0), computed as boost * idf * tf from:
> >>>>         10.0 = boost
> >>>>         2.4308395 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
> >>>>           *9 = n, number of documents containing term*
> >>>>           *107 = N, total number of documents with field*
> >>>>         0.9089341 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
> >>>>           4.0 = freq, occurrences of term within document
> >>>>           1.2 = k1, term saturation parameter
> >>>>           0.75 = b, length normalization parameter
> >>>>           5656.0 = dl, length of field (approximate)
> >>>>           50520.543 = avgdl, average length of field
> >>>>   0.0 = FunctionQuery(int(s_integer_search.previews)), product of:
> >>>>     0.0 = int(s_integer_search.previews)=0
> >>>>     1.0 = boost
> >>>>   0.0 = FunctionQuery(int(s_integer_search.downloads)), product of:
> >>>>     0.0 = int(s_integer_search.downloads)=0
> >>>>     1.0 = boost
> >>>> "
> >>>>
> >>>> *Doc2*
> >>>> "80302a1ecc44d1e556970ab96c25b1fd3328a854": "
> >>>> 84.61461 = sum of:
> >>>>   84.61461 = sum of:
> >>>>     64.68881 = weight(stemmed_data.timenote.narratives:remedi in 0) [SchemaSimilarity], result of:
> >>>>       64.68881 = score(freq=493.0), computed as boost * idf * tf from:
> >>>>         100.0 = boost
> >>>>         0.65094686 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
> >>>>           *60 = n, number of documents containing term*
> >>>>           *115 = N, total number of documents with field*
> >>>>         0.99376476 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
> >>>>           493.0 = freq, occurrences of term within document
> >>>>           1.2 = k1, term saturation parameter
> >>>>           0.75 = b, length normalization parameter
> >>>>           229400.0 = dl, length of field (approximate)
> >>>>           73913.91 = avgdl, average length of field
> >>>>     19.9258 = weight(stemmed_data.timenote.matters:remedi in 0) [SchemaSimilarity], result of:
> >>>>       19.9258 = score(freq=340.0), computed as boost * idf * tf from:
> >>>>         10.0 = boost
> >>>>         2.0024805 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
> >>>>           *13 = n, number of documents containing term*
> >>>>           *99 = N, total number of documents with field*
> >>>>         0.99505585 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
> >>>>           340.0 = freq, occurrences of term within document
> >>>>           1.2 = k1, term saturation parameter
> >>>>           0.75 = b, length normalization parameter
> >>>>           147480.0 = dl, length of field (approximate)
> >>>>           95534.95 = avgdl, average length of field
> >>>>   0.0 = FunctionQuery(int(s_integer_search.previews)), product of:
> >>>>     0.0 = int(s_integer_search.previews)=0
> >>>>     1.0 = boost
> >>>>   0.0 = FunctionQuery(int(s_integer_search.downloads)), product of:
> >>>>     0.0 = int(s_integer_search.downloads)=0
> >>>>     1.0 = boost
> >>>> "
> >>>>
> >>>
>
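
For anyone reading the explain output quoted above: the differing idf values follow directly from the differing n and N that each document's explain reports. The sketch below recomputes them with the two formulas printed in that output. The function names are made up for illustration and are not part of Solr or Lucene; the inputs are copied from the narratives field of Doc1 and Doc2. Solr works in single precision, so the results should land close to the reported numbers rather than match them digit for digit.

```python
import math


def bm25_idf(n, N):
    """idf as printed in the explain output: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def bm25_tf(freq, dl, avgdl, k1=1.2, b=0.75):
    """tf as printed in the explain output:
    freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


# Term statistics for stemmed_data.timenote.narratives:remedi,
# copied from the explain output of each document.
doc1_idf = bm25_idf(n=52, N=125)
doc2_idf = bm25_idf(n=60, N=115)

# Doc1's tf and the resulting clause score (boost * idf * tf, boost = 100).
doc1_tf = bm25_tf(freq=9.0, dl=12312.0, avgdl=54179.03)
doc1_score = 100.0 * doc1_idf * doc1_tf

print("Doc1 idf:", doc1_idf)
print("Doc2 idf:", doc2_idf)
print("Doc1 tf:", doc1_tf)
print("Doc1 narratives score:", doc1_score)
```

The same term in the same field gets a different idf only because the (n, N) pairs differ, which is consistent with each document having been scored against different local term statistics.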
