Re: Relevancy debugging - idf score

Alessandro Benedetti Mon, 06 Dec 2021 03:09:18 -0800

Good to know you solved it!
Yes, distributed IDF is definitely a problem when you have skewed
document distributions across shards.


Cheers
--------------------------
Alessandro Benedetti
Apache Lucene/Solr Committer
Director, R&D Software Engineer, Search Consultant

www.sease.io


On Sun, 5 Dec 2021 at 17:19, Sjoerd Smeets <ssme...@gmail.com> wrote:

> Found it!
>
> I had to enable the ExactStatsCache.
>
> Found a description over here. Thanks for pointing me in the right
> direction.
>
> https://solr.pl/en/2019/05/20/distributed-idf/
>
>
> On Sun, Dec 5, 2021 at 11:09 AM Sjoerd Smeets <ssme...@gmail.com> wrote:
>
>> Hi Alessandro,
>>
>> Thanks for your reply! Yes, the documents are in the same result list and
>> I'm not doing any indexing at the moment and executed a commit just to be
>> sure. Still the same result. It is an environment with 4 shards. Perhaps
>> that plays a factor?
>>
>> Thanks,
>> Sjoerd
>>
>> On Sun, Dec 5, 2021 at 11:02 AM Alessandro Benedetti <
>> a.benede...@sease.io> wrote:
>>
>>> It seems like the underlying index changed.
>>> Are those two documents in the same result set?
>>> Is it just one query?
>>> It's definitely curious: even if a commit happened, search results are
>>> consistent within one searcher.
>>>
>>>
>>> On Sun, 5 Dec 2021, 16:28 Sjoerd Smeets, <ssme...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'm debugging the relevancy scores of my query and I see the following for
>>>> two document hits. My question is, why is the idf score not the same for
>>>> both documents? This is Solr 6.6.
>>>>
>>>> Any guidance would be much appreciated.
>>>>
>>>> Thanks!
>>>>
>>>> *Doc1*
>>>> "71d72354eea23b9eae934ab616e8ce38de69d760": "
>>>> 104.994415 = sum of:
>>>>   104.994415 = sum of:
>>>>     82.89969 = weight(stemmed_data.timenote.narratives:remedi in 22470) [SchemaSimilarity], result of:
>>>>       82.89969 = score(freq=9.0), computed as boost * idf * tf from:
>>>>         100.0 = boost
>>>>         0.87546873 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
>>>>           *52 = n, number of documents containing term*
>>>>           *125 = N, total number of documents with field*
>>>>         0.9469177 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
>>>>           9.0 = freq, occurrences of term within document
>>>>           1.2 = k1, term saturation parameter
>>>>           0.75 = b, length normalization parameter
>>>>           12312.0 = dl, length of field (approximate)
>>>>           54179.03 = avgdl, average length of field
>>>>     22.09473 = weight(stemmed_data.timenote.matters:remedi in 22470) [SchemaSimilarity], result of:
>>>>       22.09473 = score(freq=4.0), computed as boost * idf * tf from:
>>>>         10.0 = boost
>>>>         2.4308395 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
>>>>           *9 = n, number of documents containing term*
>>>>           *107 = N, total number of documents with field*
>>>>         0.9089341 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
>>>>           4.0 = freq, occurrences of term within document
>>>>           1.2 = k1, term saturation parameter
>>>>           0.75 = b, length normalization parameter
>>>>           5656.0 = dl, length of field (approximate)
>>>>           50520.543 = avgdl, average length of field
>>>>   0.0 = FunctionQuery(int(s_integer_search.previews)), product of:
>>>>     0.0 = int(s_integer_search.previews)=0
>>>>     1.0 = boost
>>>>   0.0 = FunctionQuery(int(s_integer_search.downloads)), product of:
>>>>     0.0 = int(s_integer_search.downloads)=0
>>>>     1.0 = boost
>>>> "
>>>>
>>>> *Doc2*
>>>> "80302a1ecc44d1e556970ab96c25b1fd3328a854": "
>>>> 84.61461 = sum of:
>>>>   84.61461 = sum of:
>>>>     64.68881 = weight(stemmed_data.timenote.narratives:remedi in 0) [SchemaSimilarity], result of:
>>>>       64.68881 = score(freq=493.0), computed as boost * idf * tf from:
>>>>         100.0 = boost
>>>>         0.65094686 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
>>>>           *60 = n, number of documents containing term*
>>>>           *115 = N, total number of documents with field*
>>>>         0.99376476 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
>>>>           493.0 = freq, occurrences of term within document
>>>>           1.2 = k1, term saturation parameter
>>>>           0.75 = b, length normalization parameter
>>>>           229400.0 = dl, length of field (approximate)
>>>>           73913.91 = avgdl, average length of field
>>>>     19.9258 = weight(stemmed_data.timenote.matters:remedi in 0) [SchemaSimilarity], result of:
>>>>       19.9258 = score(freq=340.0), computed as boost * idf * tf from:
>>>>         10.0 = boost
>>>>         2.0024805 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
>>>>           *13 = n, number of documents containing term*
>>>>           *99 = N, total number of documents with field*
>>>>         0.99505585 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
>>>>           340.0 = freq, occurrences of term within document
>>>>           1.2 = k1, term saturation parameter
>>>>           0.75 = b, length normalization parameter
>>>>           147480.0 = dl, length of field (approximate)
>>>>           95534.95 = avgdl, average length of field
>>>>   0.0 = FunctionQuery(int(s_integer_search.previews)), product of:
>>>>     0.0 = int(s_integer_search.previews)=0
>>>>     1.0 = boost
>>>>   0.0 = FunctionQuery(int(s_integer_search.downloads)), product of:
>>>>     0.0 = int(s_integer_search.downloads)=0
>>>>     1.0 = boost
>>>> "
>>>>
>>>
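For anyone reading this thread later: the differing idf values can be reproduced by hand from the n and N values in the explain output above, which suggests each document was scored with the statistics of its own shard. The Python sketch below recomputes them with the formulas printed in the explain output. The helper names (bm25_idf, bm25_tf) are made up for illustration, and the "pooled" figure is a hypothetical that sums only the two shards visible in this thread; a real global statistic would cover all 4 shards, whose numbers are not shown here.

```python
import math


def bm25_idf(n, N):
    """idf as printed in the explain output: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def bm25_tf(freq, dl, avgdl, k1=1.2, b=0.75):
    """tf as printed in the explain output: freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


# (n, N) for stemmed_data.timenote.narratives:remedi, copied from the two explain outputs.
doc1_n, doc1_N = 52, 125
doc2_n, doc2_N = 60, 115

idf_doc1 = bm25_idf(doc1_n, doc1_N)
idf_doc2 = bm25_idf(doc2_n, doc2_N)

# Hypothetical illustration only: pool the two visible shards' statistics.
# This is NOT what Solr would report for the full 4-shard collection.
idf_pooled = bm25_idf(doc1_n + doc2_n, doc1_N + doc2_N)

# Full clause score for Doc1: boost * idf * tf, using the values from the explain output.
score_doc1 = 100.0 * idf_doc1 * bm25_tf(freq=9.0, dl=12312.0, avgdl=54179.03)

print(f"idf doc1 (local stats): {idf_doc1:.6f}")
print(f"idf doc2 (local stats): {idf_doc2:.6f}")
print(f"idf pooled (two shards, hypothetical): {idf_pooled:.6f}")
print(f"doc1 narratives clause score: {score_doc1:.4f}")
```

With shared statistics both documents would get the same idf for the same term and field, so only tf would differ between them. As far as I know, the stats cache is selected in solrconfig.xml with a statsCache element whose class attribute is org.apache.solr.search.stats.ExactStatsCache; please check the reference guide for your Solr version before relying on that.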
