Re: Relevancy debugging - idf score

Alessandro Benedetti Tue, 07 Dec 2021 01:52:23 -0800

Hi Markus,
won't the problem still be present across shards without distributed IDF?
You may have skewed shards, and then each of them will have a different IDF
for the same term (and field).
Regarding the performance penalty that Walter highlighted, I definitely
see room for contribution, but I am not sure anyone is looking into
that right now.
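
To make the skew concrete, here is a minimal sketch (plain Python, not Solr
code) that plugs the n and N values from the two explain outputs quoted below
into the idf formula printed there. It assumes the two documents live on
different shards, which the differing statistics suggest but the thread does
not state outright. The "combined" figure only adds up these two sets of
statistics for illustration; the real collection has 4 shards, so it is not
the true global idf. The function and variable names are mine.

```python
import math


def bm25_idf(n: int, N: int) -> float:
    """idf as printed in the explain output: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


# (n, N) for stemmed_data.timenote.narratives:remedi, copied from the explain output.
# Assumption: Doc1 and Doc2 were scored on different shards.
stats_doc1 = (52, 125)
stats_doc2 = (60, 115)

idf_doc1 = bm25_idf(*stats_doc1)
idf_doc2 = bm25_idf(*stats_doc2)

# Illustration only: add up the statistics of these two shards, which is roughly
# what a distributed stats cache does across all shards of a collection.
n_combined = stats_doc1[0] + stats_doc2[0]
N_combined = stats_doc1[1] + stats_doc2[1]
idf_combined = bm25_idf(n_combined, N_combined)

print(f"idf with Doc1's local stats: {idf_doc1:.6f}")
print(f"idf with Doc2's local stats: {idf_doc2:.6f}")
print(f"idf with combined stats:     {idf_combined:.6f}")
```

The first two values reproduce the idf lines in the explain output
(0.87546873 and 0.65094686), so the difference comes purely from the
statistics each document was scored against, not from the documents
themselves.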


Cheers
--------------------------
Alessandro Benedetti
Apache Lucene/Solr Committer
Director, R&D Software Engineer, Search Consultant

www.sease.io


On Mon, 6 Dec 2021 at 12:07, Markus Jelsma <markus.jel...@openindex.io>
wrote:

> Hello Sjoerd,
>
> ExactStatsCache indeed works fine when replicas of the same shard do not
> share identical term stats, but it comes with some overhead. If you can,
> upgrade to at least 7.x and switch from the default NRT replica type to TLOG.
> You then no longer need to use ExactStatsCache because replicas will be
> identical.
>
> Regards,
> Markus
>
> On Mon, 6 Dec 2021 at 12:09, Alessandro Benedetti <
> a.benede...@sease.io> wrote:
>
> > Good to know you solved it!
> > Yes, distributed IDF is definitely a problem if you have skewed
> > document distributions.
> >
> > Cheers
> > --------------------------
> > Alessandro Benedetti
> > Apache Lucene/Solr Committer
> > Director, R&D Software Engineer, Search Consultant
> >
> > www.sease.io
> >
> >
> > On Sun, 5 Dec 2021 at 17:19, Sjoerd Smeets <ssme...@gmail.com> wrote:
> >
> > > Found it!
> > >
> > > I had to enable the ExactStatsCache.
> > >
> > > I found a description here. Thanks for pointing me in the right
> > > direction.
> > >
> > > https://solr.pl/en/2019/05/20/distributed-idf/
> > >
> > >
> > > On Sun, Dec 5, 2021 at 11:09 AM Sjoerd Smeets <ssme...@gmail.com> wrote:
> > >
> > >> Hi Alessandro,
> > >>
> > >> Thanks for your reply! Yes, the documents are in the same result list. I'm
> > >> not doing any indexing at the moment, and I executed a commit just to be
> > >> sure. Still the same result. It is an environment with 4 shards. Perhaps
> > >> that plays a role?
> > >>
> > >> Thanks,
> > >> Sjoerd
> > >>
> > >> On Sun, Dec 5, 2021 at 11:02 AM Alessandro Benedetti <
> > >> a.benede...@sease.io> wrote:
> > >>
> > >>> It seems like the underlying index changed.
> > >>> Are those two documents in the same result set?
> > >>> Is it just one query?
> > >>> It's definitely curious: even if a commit happened, search results are
> > >>> consistent within one searcher.
> > >>>
> > >>>
> > >>> On Sun, 5 Dec 2021, 16:28 Sjoerd Smeets, <ssme...@gmail.com> wrote:
> > >>>
> > >>>> Hi all,
> > >>>>
> > >>>> I'm debugging the relevancy scores of my query and I see the following
> > >>>> for two document hits. My question is: why is the idf score not the same
> > >>>> for both documents? This is Solr 6.6.
> > >>>>
> > >>>> Any guidance would be much appreciated.
> > >>>>
> > >>>> Thanks!
> > >>>>
> > >>>> *Doc1*
> > >>>> "71d72354eea23b9eae934ab616e8ce38de69d760": "
> > >>>> 104.994415 = sum of:
> > >>>>   104.994415 = sum of:
> > >>>>     82.89969 = weight(stemmed_data.timenote.narratives:remedi in 22470) [SchemaSimilarity], result of:
> > >>>>       82.89969 = score(freq=9.0), computed as boost * idf * tf from:
> > >>>>         100.0 = boost
> > >>>>         0.87546873 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
> > >>>>           *52 = n, number of documents containing term*
> > >>>>           *125 = N, total number of documents with field*
> > >>>>         0.9469177 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
> > >>>>           9.0 = freq, occurrences of term within document
> > >>>>           1.2 = k1, term saturation parameter
> > >>>>           0.75 = b, length normalization parameter
> > >>>>           12312.0 = dl, length of field (approximate)
> > >>>>           54179.03 = avgdl, average length of field
> > >>>>     22.09473 = weight(stemmed_data.timenote.matters:remedi in 22470) [SchemaSimilarity], result of:
> > >>>>       22.09473 = score(freq=4.0), computed as boost * idf * tf from:
> > >>>>         10.0 = boost
> > >>>>         2.4308395 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
> > >>>>           *9 = n, number of documents containing term*
> > >>>>           *107 = N, total number of documents with field*
> > >>>>         0.9089341 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
> > >>>>           4.0 = freq, occurrences of term within document
> > >>>>           1.2 = k1, term saturation parameter
> > >>>>           0.75 = b, length normalization parameter
> > >>>>           5656.0 = dl, length of field (approximate)
> > >>>>           50520.543 = avgdl, average length of field
> > >>>>   0.0 = FunctionQuery(int(s_integer_search.previews)), product of:
> > >>>>     0.0 = int(s_integer_search.previews)=0
> > >>>>     1.0 = boost
> > >>>>   0.0 = FunctionQuery(int(s_integer_search.downloads)), product of:
> > >>>>     0.0 = int(s_integer_search.downloads)=0
> > >>>>     1.0 = boost
> > >>>> "
> > >>>>
> > >>>> *Doc2*
> > >>>> "80302a1ecc44d1e556970ab96c25b1fd3328a854": "
> > >>>> 84.61461 = sum of:
> > >>>>   84.61461 = sum of:
> > >>>>     64.68881 = weight(stemmed_data.timenote.narratives:remedi in 0) [SchemaSimilarity], result of:
> > >>>>       64.68881 = score(freq=493.0), computed as boost * idf * tf from:
> > >>>>         100.0 = boost
> > >>>>         0.65094686 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
> > >>>>           *60 = n, number of documents containing term*
> > >>>>           *115 = N, total number of documents with field*
> > >>>>         0.99376476 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
> > >>>>           493.0 = freq, occurrences of term within document
> > >>>>           1.2 = k1, term saturation parameter
> > >>>>           0.75 = b, length normalization parameter
> > >>>>           229400.0 = dl, length of field (approximate)
> > >>>>           73913.91 = avgdl, average length of field
> > >>>>     19.9258 = weight(stemmed_data.timenote.matters:remedi in 0) [SchemaSimilarity], result of:
> > >>>>       19.9258 = score(freq=340.0), computed as boost * idf * tf from:
> > >>>>         10.0 = boost
> > >>>>         2.0024805 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
> > >>>>           *13 = n, number of documents containing term*
> > >>>>           *99 = N, total number of documents with field*
> > >>>>         0.99505585 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
> > >>>>           340.0 = freq, occurrences of term within document
> > >>>>           1.2 = k1, term saturation parameter
> > >>>>           0.75 = b, length normalization parameter
> > >>>>           147480.0 = dl, length of field (approximate)
> > >>>>           95534.95 = avgdl, average length of field
> > >>>>   0.0 = FunctionQuery(int(s_integer_search.previews)), product of:
> > >>>>     0.0 = int(s_integer_search.previews)=0
> > >>>>     1.0 = boost
> > >>>>   0.0 = FunctionQuery(int(s_integer_search.downloads)), product of:
> > >>>>     0.0 = int(s_integer_search.downloads)=0
> > >>>>     1.0 = boost
> > >>>> "
> > >>>>
> > >>>
> >
>
