[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity
[ https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704452#comment-16704452 ]

Adrien Grand commented on LUCENE-8563:
--------------------------------------

I created a Solr blocker issue as Jan suggested: SOLR-13025.

> Remove k1+1 from the numerator of BM25Similarity
> ------------------------------------------------
>
>                 Key: LUCENE-8563
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8563
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> Our current implementation of BM25 does
> {code:java}
> boost * IDF * (k1+1) * tf / (tf + norm)
> {code}
> As (k1+1) is a constant, it is the same for every term and doesn't modify
> ordering. It is often omitted and I found out that the "The Probabilistic
> Relevance Framework: BM25 and Beyond" paper by Robertson (BM25's author) and
> Zaragoza even describes adding (k1+1) to the numerator as a variant whose
> benefit is to be more comparable with Robertson/Sparck-Jones weighting, which
> we don't care about.
> {quote}A common variant is to add a (k1 + 1) component to the
> numerator of the saturation function. This is the same for all
> terms, and therefore does not affect the ranking produced.
> The reason for including it was to make the final formula
> more compatible with the RSJ weight used on its own
> {quote}
> Should we remove it from BM25Similarity as well?
> A side-effect that I'm interested in is that integrating other score
> contributions (e.g. via oal.document.FeatureField) would be a bit easier to
> reason about. For instance a weight of 3 in FeatureField#newSaturationQuery
> would have a similar impact as a term whose IDF is 3 (and thus docFreq ~= 5%)
> rather than a term whose IDF is 3/(k1 + 1).

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
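The closing paragraph of the issue description can be checked with plain arithmetic. The sketch below is not Lucene code: it assumes that BM25's IDF is computed as ln(1 + (N - n + 0.5) / (n + 0.5)) and that a saturation query scores as weight * S / (S + pivot). The class name, method names and sample numbers are made up for illustration.

```java
/** Sketch: compares the ceiling of a saturation feature with the ceiling of a BM25 term. */
final class FeatureWeightVsIdfDemo {

    /** Assumed BM25 IDF: ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)). */
    static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** Assumed saturation function: weight * s / (s + pivot); approaches weight as s grows. */
    static double saturation(double weight, double featureValue, double pivot) {
        return weight * featureValue / (featureValue + pivot);
    }

    /** Term score without (k1 + 1): idf * tf / (tf + norm); approaches idf as tf grows. */
    static double simplifiedTerm(double idf, double tf, double norm) {
        return idf * tf / (tf + norm);
    }

    /** Term score with (k1 + 1): approaches idf * (k1 + 1) as tf grows. */
    static double legacyTerm(double idf, double k1, double tf, double norm) {
        return idf * (k1 + 1) * tf / (tf + norm);
    }

    public static void main(String[] args) {
        // Hypothetical index: a term that occurs in 5% of one million documents.
        double termIdf = idf(50_000, 1_000_000);
        System.out.println("idf at docFreq = 5%: " + termIdf);

        double k1 = 1.2;   // hypothetical parameter value
        double big = 1e9;  // large input to approximate each ceiling
        System.out.println("feature ceiling (weight 3): " + saturation(3.0, big, 1.0));
        System.out.println("simplified term ceiling:    " + simplifiedTerm(termIdf, big, 1.0));
        System.out.println("legacy term ceiling:        " + legacyTerm(termIdf, k1, big, 1.0));
    }
}
```

Under these assumptions a feature weight of 3 and a term with IDF 3 share the same ceiling once (k1 + 1) is gone, whereas the legacy formula inflates the term's ceiling by (k1 + 1).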
[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity
[ https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704438#comment-16704438 ]

ASF subversion and git services commented on LUCENE-8563:
---------------------------------------------------------

Commit cf016f8987e804bcd858a2a414eacdf1b3c54cf5 in lucene-solr's branch refs/heads/master from javanna
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=cf016f8 ]

LUCENE-8563: Remove k1+1 constant factor from BM25 formula numerator.

Signed-off-by: Adrien Grand
[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity
[ https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16703382#comment-16703382 ]

Luca Cavanna commented on LUCENE-8563:
--------------------------------------

I updated the PR according to the latest comments, and deprecated the newly introduced similarity like Robert suggested.
[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity
[ https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16703055#comment-16703055 ]

Robert Muir commented on LUCENE-8563:
-------------------------------------

Please deprecate the crazy legacy one too, so it can be eventually removed.
[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity
[ https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16703007#comment-16703007 ]

Adrien Grand commented on LUCENE-8563:
--------------------------------------

My gut feeling is that this change is going to go unnoticed by the vast majority of users, as ordering is preserved. So I'd rather not require changes on their end to use this simpler implementation of BM25 and just document the change in runtime behavior in the release notes. I'm happy with keeping Solr on the current scoring formula and opening a follow-up issue to discuss how to handle the migration.

[~lucacavanna] Based on Jan's comments, let's configure Solr's BM25SimilarityFactory and SchemaSimilarityFactory to use the LegacyBM25Similarity that you added?
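The claim that ordering is preserved can be checked with a small standalone sketch. It is not Lucene code: the per-document tf and norm values are invented, norm is taken as a given input, and a single k1 is assumed for every term in the query.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Sketch: a constant (k1 + 1) factor rescales BM25 scores without reordering documents. */
final class Bm25ConstantFactorDemo {

    /** Term score as in the issue description: boost * idf * (k1 + 1) * tf / (tf + norm). */
    static double withFactor(double boost, double idf, double k1, double tf, double norm) {
        return boost * idf * (k1 + 1) * tf / (tf + norm);
    }

    /** The same expression with (k1 + 1) removed. */
    static double withoutFactor(double boost, double idf, double tf, double norm) {
        return boost * idf * tf / (tf + norm);
    }

    /** Returns document indices sorted by descending score. */
    static Integer[] rank(double[] scores) {
        Integer[] order = new Integer[scores.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -scores[i]));
        return order;
    }

    public static void main(String[] args) {
        double k1 = 1.2;                           // hypothetical parameter value
        double[] idf = {2.0, 0.7};                 // hypothetical two-term query
        double[][] tf = {{1, 4}, {3, 0}, {2, 2}};  // tf[doc][term], invented
        double[] norm = {1.0, 1.6, 0.8};           // per-document normalization, invented

        double[] with = new double[tf.length];
        double[] without = new double[tf.length];
        for (int d = 0; d < tf.length; d++) {
            for (int t = 0; t < idf.length; t++) {
                with[d] += withFactor(1.0, idf[t], k1, tf[d][t], norm[d]);
                without[d] += withoutFactor(1.0, idf[t], tf[d][t], norm[d]);
            }
        }
        System.out.println("order with (k1+1):    " + Arrays.toString(rank(with)));
        System.out.println("order without (k1+1): " + Arrays.toString(rank(without)));
    }
}
```

Every document's total is multiplied by the same (k1 + 1), so both orderings match; only code that compares absolute score values (thresholds, or sums with non-BM25 clauses) would see a difference.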
[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity
[ https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702966#comment-16702966 ]

Jan Høydahl commented on LUCENE-8563:
-------------------------------------

I think it would be a far better approach to create a new Similarity with a distinct name (NewBM25Similarity, CleanBM25Similarity, SimplifiedBM25Similarity or similar) for this, so Lucene users can explicitly make an informed choice, instead of changing the implementation of the existing class. Then this issue would not need to touch any Solr code whatsoever.

If for some reason that is not possible, I think this is a classic example of a use case for a luceneMatchVersion conditional in Solr. If so, please create a new 8.0 *blocker* SOLR Jira issue about completing the Solr side of things.
[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity
[ https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702856#comment-16702856 ]

Adrien Grand commented on LUCENE-8563:
--------------------------------------

Thanks [~lucacavanna], this looks good to me.

[~softwaredoug] [~janhoy] Regarding Solr, would you rather always use this new BM25Similarity, or only if the luceneMatchVersion is greater than or equal to 8? Given that Luca is adding a way to get the old scores as well, it should be easy to pick the right one depending on the luceneMatchVersion like Hoss did in SOLR-8261.
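The version-dependent choice discussed above could look roughly like the following. This is only a sketch of the decision rule, not Solr's actual factory code: the enum, the method names and the plain integer major version are all hypothetical stand-ins.

```java
/** Sketch: pick the scoring variant from a (hypothetical) luceneMatchVersion major number. */
final class SimilarityChoiceSketch {

    /** The two variants under discussion. */
    enum Formula { LEGACY_WITH_K1_PLUS_1, SIMPLIFIED }

    /** Configurations declaring a match version below 8 keep the old scores. */
    static Formula choose(int luceneMatchMajorVersion) {
        return luceneMatchMajorVersion >= 8 ? Formula.SIMPLIFIED : Formula.LEGACY_WITH_K1_PLUS_1;
    }

    /** Multiplier applied to the boost so that the chosen variant produces its scores. */
    static double boostMultiplier(Formula formula, double k1) {
        return formula == Formula.LEGACY_WITH_K1_PLUS_1 ? 1 + k1 : 1.0;
    }

    public static void main(String[] args) {
        double k1 = 1.2; // hypothetical parameter value
        for (int major = 7; major <= 8; major++) {
            Formula formula = choose(major);
            System.out.println("matchVersion " + major + " -> " + formula
                + ", boost multiplier " + boostMultiplier(formula, k1));
        }
    }
}
```

Keying the choice on the declared match version would let existing configurations keep their absolute scores while new ones get the simplified formula.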
[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity
[ https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702294#comment-16702294 ]

Luca Cavanna commented on LUCENE-8563:
--------------------------------------

I opened [https://github.com/apache/lucene-solr/pull/511] .
[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity
[ https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688474#comment-16688474 ]

Elizabeth Haubert commented on LUCENE-8563:
-------------------------------------------

+1 if this gets us closer to BM25F.

I saw the previous paper, but did not understand that the BM25 with (k1+1) was the non-standard version. Would it be worth adding a note [here|https://lucene.apache.org/solr/guide/7_1/other-schema-elements.html] referencing the in-use algorithm?
[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity
[ https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688455#comment-16688455 ]

Michael Gibney commented on LUCENE-8563:
----------------------------------------

I see; +1 as well. Seeing the main practical motivation for the change as being "comparable scores across queries", this would I think also improve (unboosted) score comparability (relevant for dismax) across different fields configured with different similarities and different k1 (TF saturation rate). So this might ultimately _help_ significantly in cases that paradoxically have the bumpiest migration path ...
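The comparability point about fields with different k1 can be made concrete by looking at the ceiling each formula approaches as tf grows. The sketch below is standalone arithmetic, not Lucene code; the two k1 values and the IDF are invented.

```java
/** Sketch: per-term score ceilings for two fields configured with different k1 values. */
final class TermScoreCeilingDemo {

    /** Ceiling of idf * (k1 + 1) * tf / (tf + norm) as tf grows without bound. */
    static double ceilingWithFactor(double idf, double k1) {
        return idf * (k1 + 1);
    }

    /** Ceiling of idf * tf / (tf + norm) as tf grows without bound; independent of k1. */
    static double ceilingWithoutFactor(double idf) {
        return idf;
    }

    public static void main(String[] args) {
        double idf = 3.0;              // same hypothetical term rarity in both fields
        double[] k1Values = {1.2, 2.0}; // hypothetical per-field settings
        for (double k1 : k1Values) {
            System.out.println("k1=" + k1
                + " ceiling with (k1+1): " + ceilingWithFactor(idf, k1)
                + ", without: " + ceilingWithoutFactor(idf));
        }
    }
}
```

With the factor, the field with the larger k1 can contribute more for an equally rare term; without it, both fields share the same ceiling and k1 only controls how fast that ceiling is approached.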
[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity
[ https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688424#comment-16688424 ]

Doug Turnbull commented on LUCENE-8563:
---------------------------------------

Ah... I assumed "Adrien has his performance hat on", which probably colored my perception of the issue.

Ah yeah, my mistake, I see that now. I think your strategy makes sense now and helps with scoring comparability across queries. :+1: to your approach with the LegacyBM25 implementation then!
[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity
[ https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688388#comment-16688388 ]

Adrien Grand commented on LUCENE-8563:
--------------------------------------

My goal is not to make things faster; I don't think it would change anything, since this multiplication is only done once for every document anyway. My goal is rather to simplify (one less factor in the formula, one less factor in the explanation) and also to align with recent descriptions of BM25 by its original author himself: if you look at the 2009 [paper|http://www.staff.city.ac.uk/~sb317/papers/foundations_bm25_review.pdf] that I mentioned in the description, it doesn't put (k1+1) in the numerator and says that there is a "common variant" of BM25 that does.
[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity
[ https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688382#comment-16688382 ]

Doug Turnbull commented on LUCENE-8563:
---------------------------------------

Thanks [~jpountz] - My feeling is that if Lucene has something called "BM25 Similarity", it should match the traditional definition of BM25, and shouldn't be deprecated. But if we want to create a faster version, and make it default, I think that would be great. Or if you want to call the current (what you call legacy) "ClassicBM25Similarity" instead of legacy... I just don't feel it should be deprecated.

As an IR person, I would be surprised if I was new to Lucene, looked up BM25 and it wasn't actually BM25...
[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity
[ https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688101#comment-16688101 ]

Adrien Grand commented on LUCENE-8563:
--------------------------------------

If keeping a way to get the old scores is the main concern, we could add a similarity that looks like this to lucene/misc and mention it in the upgrade notes:
{code:java}
import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.CollectionStatistics;
import org.apache.lucene.search.TermStatistics;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.search.similarities.Similarity;

/** Produces the pre-change scores by delegating to BM25Similarity with a scaled boost. */
public class LegacyBM25Similarity extends Similarity {

  private final BM25Similarity bm25Similarity;

  public LegacyBM25Similarity() {
    bm25Similarity = new BM25Similarity();
  }

  public LegacyBM25Similarity(float k1, float b) {
    bm25Similarity = new BM25Similarity(k1, b);
  }

  @Override
  public long computeNorm(FieldInvertState state) {
    // Norms are unaffected by the change, so delegate as-is.
    return bm25Similarity.computeNorm(state);
  }

  @Override
  public SimScorer scorer(float boost, CollectionStatistics collectionStats, TermStatistics... termStats) {
    // Multiplying the boost by (1 + k1) puts the removed constant back into the numerator.
    return bm25Similarity.scorer(boost * (1 + bm25Similarity.getK1()), collectionStats, termStats);
  }
}
{code}
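The boost trick used by the proposed LegacyBM25Similarity can be verified without Lucene on the classpath. The sketch below only reproduces the arithmetic from the issue description; the class and method names and the sample values are made up.

```java
/** Sketch: the old score equals the simplified score computed with boost * (1 + k1). */
final class LegacyBoostEquivalenceDemo {

    /** Simplified term score: boost * idf * tf / (tf + norm). */
    static double simplified(double boost, double idf, double tf, double norm) {
        return boost * idf * tf / (tf + norm);
    }

    /** Pre-change term score: boost * idf * (k1 + 1) * tf / (tf + norm). */
    static double legacy(double boost, double idf, double k1, double tf, double norm) {
        return boost * idf * (k1 + 1) * tf / (tf + norm);
    }

    /** Mirrors the wrapper: pass boost * (1 + k1) into the simplified formula. */
    static double legacyViaBoost(double boost, double idf, double k1, double tf, double norm) {
        return simplified(boost * (1 + k1), idf, tf, norm);
    }

    public static void main(String[] args) {
        double k1 = 1.2, idf = 2.5, norm = 0.9; // hypothetical values
        for (double tf = 1; tf <= 4; tf++) {
            System.out.printf("tf=%.0f legacy=%.6f viaBoost=%.6f%n",
                tf, legacy(1.0, idf, k1, tf, norm), legacyViaBoost(1.0, idf, k1, tf, norm));
        }
    }
}
```

Because the constant only ever multiplies the boost, the wrapper needs no scoring logic of its own, which is why the delegating class in the comment above stays so small.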
[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity
[ https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688082#comment-16688082 ]

Jan Høydahl commented on LUCENE-8563:

+1 to Doug's suggestion. Add the new Similarity and keep the old for the lifetime of 8.x so people have a graceful way to migrate if needed.
[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity
[ https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687922#comment-16687922 ]

Robert Muir commented on LUCENE-8563:

No, we shouldn't clutter up BM25Similarity because some users have bad behavior. If they did the wrong thing and rely on the exact absolute magnitude of the old similarity, well that's why the mechanism is extensible.
[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity
[ https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687440#comment-16687440 ]

Michael Gibney commented on LUCENE-8563:

[~jpountz], thanks for pointing out the work on BM25F. I'm interested to take a closer look at that.

"Users could multiply their per-field boosts by (k1+1)?" ... thanks, yes! That should work in a pinch, though I was so focused on the Similarity that I missed the possibility of scaling it externally in this way.

Having k1's presence in the numerator be configurable (either as an extra boolean parameter to the (modified) existing BM25Similarity, or something along the lines of what [~softwaredoug] suggests) would make sense to me, regardless of the benefits of the change (performance or otherwise).
[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity
[ https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687268#comment-16687268 ]

Doug Turnbull commented on LUCENE-8563:

I feel perhaps one way forward is to create a second (default?) similarity (FastBM25Similarity? ConstantCeilingBM25Similarity?) and leave the current BM25 similarity in place as an optional similarity to configure.

There may be existing practices around tuning BM25 similarity in many places where writing a similarity plugin is not an option.
[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity
[ https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687260#comment-16687260 ]

Adrien Grand commented on LUCENE-8563:

bq. "assuming a single similarity" – is this something that we want to assume?

We can't indeed, even though this is the most common case. That said, if you are searching multiple fields at once today, then I'm afraid that relevance isn't very good anyway, as we don't support something like BM25F (LUCENE-8216) to merge index and document statistics (BlendedTermQuery merges index statistics, but not norms and term frequencies). By the way, BM25F doesn't allow configuring the value of k1 on a per-field basis; only b may have different per-field values.

bq. I'm sure this change would be appropriate for some scenarios, but it's a fundamental change that could in some cases have significant downstream consequences, with no easy way (as far as I can tell) to maintain existing behavior.

Users could multiply their per-field boosts by (k1+1)?
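The boost-rescaling idea in the comment above can be checked with plain arithmetic. The following is a minimal sketch, not Lucene code: the class and method names are hypothetical, and norm is passed in as an already-computed number. It suggests that multiplying a field's boost by (k1+1) under the proposed formula reproduces the score of the current formula for that field.

```java
/** Hypothetical illustration only; none of these names exist in Lucene. */
final class BoostRescale {

  /** Current formula from the issue: boost * IDF * (k1+1) * tf / (tf + norm). */
  static double oldScore(double boost, double idf, double k1, double tf, double norm) {
    return boost * idf * (k1 + 1) * tf / (tf + norm);
  }

  /** Proposed formula: the same expression without the (k1+1) factor. */
  static double newScore(double boost, double idf, double tf, double norm) {
    return boost * idf * tf / (tf + norm);
  }

  public static void main(String[] args) {
    // Assumed sample values for one field.
    double boost = 2.0, idf = 3.0, k1 = 1.2, tf = 4.0, norm = 1.5;

    double before = oldScore(boost, idf, k1, tf, norm);
    // User-side workaround: fold (k1+1) into the per-field boost.
    double after = newScore(boost * (k1 + 1), idf, tf, norm);

    System.out.println("old formula:               " + before);
    System.out.println("new formula, scaled boost: " + after);
  }
}
```

Because each field can be rescaled with its own k1, the same trick would apply when different fields use different k1 values.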
[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity
[ https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686877#comment-16686877 ]

Michael Gibney commented on LUCENE-8563:

"assuming a single similarity" -- is this something that we want to assume? If every field similarity uses the same k1 param, then sure, relative ordering among fields is maintained. But if we're using these scores outside of the context of single-similarity, and intend to preserve the ability to adjust the k1 param, it's worth noting that this change fundamentally alters the effect of the k1 param on absolute scores (and thus also on relative scores across similarities).

Namely, removing k1 from the numerator places a hard cap on the score, regardless of TF or k1 setting. The concept of saturation is preserved, but with no numerator k1, saturation is implemented strictly by depressing scores (with respect to the hard cap, by varying amounts according to TF) as k1 increases. The model with k1 in the numerator strikes me as being more flexible, both depressing scores for lower TF _and increasing_ scores for higher TF, around an inflection point determined by length norms and the value of b.

I'm sure this change would be appropriate for some scenarios, but it's a fundamental change that could in some cases have significant downstream consequences, with no easy way (as far as I can tell) to maintain existing behavior.
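The "hard cap" described above can be made concrete with a small numeric sketch. It is only an illustration under simplifying assumptions: boost and IDF are fixed to 1, and the document length equals the average length so that norm reduces to k1. The class and method names are hypothetical. In this simplified setting the crossover point sits at tf = 1.

```java
/** Hypothetical illustration only; assumes boost = IDF = 1 and len == avgLen (norm == k1). */
final class SaturationCeiling {

  /** Saturation with (k1+1) in the numerator; approaches k1+1 as tf grows. */
  static double withFactor(double tf, double k1) {
    return (k1 + 1) * tf / (tf + k1);
  }

  /** Saturation without (k1+1); approaches 1 as tf grows, whatever k1 is. */
  static double withoutFactor(double tf, double k1) {
    return tf / (tf + k1);
  }

  public static void main(String[] args) {
    double[] k1Values = {1.2, 2.0};
    double[] tfValues = {1, 5, 1000};
    for (double k1 : k1Values) {
      for (double tf : tfValues) {
        System.out.printf("k1=%.1f tf=%6.0f  with=%.4f  without=%.4f%n",
            k1, tf, withFactor(tf, k1), withoutFactor(tf, k1));
      }
    }
  }
}
```

Under these assumptions, raising k1 lowers every value in the "without" column, while in the "with" column it raises the values for tf above 1 and leaves tf = 1 unchanged.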
[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity
[ https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686633#comment-16686633 ]

Adrien Grand commented on LUCENE-8563:

That would be great [~lucacavanna]. I suspect most of the work is going to be about fixing tests that rely on absolute score values.
[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity
[ https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686370#comment-16686370 ]

Luca Cavanna commented on LUCENE-8563:

Hi folks, I would like to work on this issue.
[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity
[ https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684107#comment-16684107 ]

Adrien Grand commented on LUCENE-8563:

Agreed [~softwaredoug], I was assuming a single similarity. This would also change ordering if other fields use different similarities.
[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity
[ https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684091#comment-16684091 ]

Doug Turnbull commented on LUCENE-8563:

For the sake of this discussion, here's a desmos graph with BM25 with/without k1 in the numerator:

https://www.desmos.com/calculator/cklb27fcn9
[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity
[ https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684080#comment-16684080 ]

Doug Turnbull commented on LUCENE-8563:

It would modify ordering when dealing with multiple fields. Consider one field with a different k1 than another because the impact of term frequency is calibrated differently. If one calibrates one field to saturate term freq faster, and another slower, then ordering would be impacted.

Additionally, currently k1=0 is the only way to disable term frequency without also disabling positions.
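The multi-field concern above can be illustrated with a toy case. This is a hedged sketch, not Lucene code: the field names "title" and "body", the k1 values, and the term frequencies are all made up, boost and IDF are fixed to 1, and each document is assumed to have average length so that norm reduces to k1.

```java
/** Hypothetical illustration only; assumes boost = IDF = 1 and len == avgLen (norm == k1). */
final class MultiFieldOrdering {

  /** Per-field score with (k1+1) in the numerator. */
  static double withFactor(double tf, double k1) {
    return (k1 + 1) * tf / (tf + k1);
  }

  /** Per-field score with the (k1+1) factor removed. */
  static double withoutFactor(double tf, double k1) {
    return tf / (tf + k1);
  }

  public static void main(String[] args) {
    // doc1 matches once in "title", a field calibrated to saturate quickly (k1 = 0.2).
    double titleK1 = 0.2, doc1Tf = 1;
    // doc2 matches three times in "body", a field calibrated to saturate slowly (k1 = 2.0).
    double bodyK1 = 2.0, doc2Tf = 3;

    System.out.println("with (k1+1):    doc1=" + withFactor(doc1Tf, titleK1)
        + " doc2=" + withFactor(doc2Tf, bodyK1));
    System.out.println("without (k1+1): doc1=" + withoutFactor(doc1Tf, titleK1)
        + " doc2=" + withoutFactor(doc2Tf, bodyK1));
  }
}
```

With these assumed values, doc2 scores higher than doc1 when the factor is present and lower once it is removed, because each field is divided by a different constant.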
[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity
[ https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684051#comment-16684051 ]

Adrien Grand commented on LUCENE-8563:

bq. There will be cases where this affects relative scoring and ranking

I don't think this is correct. All scores would be divided by the same constant, so ordering would be preserved.
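The argument above can be written out as a short derivation. This is a sketch under the stated assumption that every term of the query is scored with the same k1 (a single similarity); the subscripts "old" and "new" are labels introduced here for the current and proposed formulas.

```latex
\begin{align*}
\mathrm{score}_{\mathrm{old}}(q, d)
  &= \sum_{t \in q} \mathrm{boost} \cdot \mathrm{IDF}(t) \cdot
     \frac{(k_1 + 1)\,\mathrm{tf}_{t,d}}{\mathrm{tf}_{t,d} + \mathrm{norm}_d} \\
  &= (k_1 + 1) \sum_{t \in q} \mathrm{boost} \cdot \mathrm{IDF}(t) \cdot
     \frac{\mathrm{tf}_{t,d}}{\mathrm{tf}_{t,d} + \mathrm{norm}_d}
   = (k_1 + 1)\,\mathrm{score}_{\mathrm{new}}(q, d)
\end{align*}
```

Since k1 + 1 > 0 and does not depend on the document, score_old(q, d1) > score_old(q, d2) holds exactly when score_new(q, d1) > score_new(q, d2). The factoring step no longer works if different terms or fields use different k1 values.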
[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity
[ https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683976#comment-16683976 ]

Elizabeth Haubert commented on LUCENE-8563:

The boost*IDF is not particularly important; this is about the handling of the TF component relative to the norms. Pull that out as

{code:java}
(tf + tf*k1) / (tf + k1*length_norms)
{code}

Removing it only from the numerator produces

{code:java}
tf / (tf + k1*length_norms)
{code}

At a minimum, that will need a new empirical default for k1. Changing k1 in the numerator is the knob to adjust the ratio of tf and norms.

In the case where document length does not follow standard models, it can be helpful to damp down b. This is not the standard use case, but is not unusual, either. At the extreme, b=0, this component reduces to

{code:java}
(tf * (k1 + 1)) / (tf + k1)
{code}

Removing the (k1 + 1) from the numerator only produces

{code:java}
tf / (tf + k1)
{code}

There will be cases where this affects relative scoring and ranking, and I don't understand the statement that it doesn't modify ordering. If there is a need to remove it in the normal case, then perhaps the numerator and denominator should be split into two distinct constants.
[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity
[ https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683868#comment-16683868 ]

Robert Muir commented on LUCENE-8563:

+1 to nuke it. Currently the explain() goes out of its way to try to separate out this scaling factor to make it easier to see. It's unnecessary.
[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity
[ https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683858#comment-16683858 ]

Adrien Grand commented on LUCENE-8563:

[~ehaubert] The change I'm suggesting would divide every BM25 score by (k1+1), which doesn't affect ranking. Setting k1 to 0 would have the undesirable side effect of disabling the impact of term frequency and document length. The formula that I wrote was a bit simplified: {{norm}} actually depends on {{k1}}. With {{norm}} expanded, it looks like this:
{code:java}
boost * IDF * (k1+1) * tf / (tf + k1 * (1 - b + b * len / avgLen))
{code}
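The two claims in the comment above can be checked numerically. What follows is a minimal sketch, not the BM25Similarity code: the class {{Bm25Sketch}}, its method names, and all input values (boost, IDF, tf, document lengths, k1 = 1.2, b = 0.75) are assumptions chosen for illustration. It evaluates the expanded formula with and without the (k1+1) factor, then repeats the comparison with k1 set to 0.

```java
// Illustrative sketch only. Bm25Sketch and its method names are hypothetical;
// this is not the BM25Similarity source.
final class Bm25Sketch {

    // Expanded norm from the comment above: k1 * (1 - b + b * len / avgLen).
    static double norm(double k1, double b, double len, double avgLen) {
        return k1 * (1 - b + b * len / avgLen);
    }

    // Current formula: boost * IDF * (k1+1) * tf / (tf + norm).
    static double withFactor(double boost, double idf, double k1, double b,
                             double tf, double len, double avgLen) {
        return boost * idf * (k1 + 1) * tf / (tf + norm(k1, b, len, avgLen));
    }

    // Proposed formula: the same expression without (k1+1).
    static double withoutFactor(double boost, double idf, double k1, double b,
                                double tf, double len, double avgLen) {
        return boost * idf * tf / (tf + norm(k1, b, len, avgLen));
    }

    public static void main(String[] args) {
        double boost = 1.0, idf = 2.0, b = 0.75, avgLen = 100.0; // made-up inputs
        double[][] docs = { {1, 50}, {3, 400}, {2, 100} };       // {tf, len} pairs
        for (double k1 : new double[] {1.2, 0.0}) {
            System.out.println("k1 = " + k1);
            for (double[] d : docs) {
                double with = withFactor(boost, idf, k1, b, d[0], d[1], avgLen);
                double without = withoutFactor(boost, idf, k1, b, d[0], d[1], avgLen);
                System.out.printf("  tf=%.0f len=%.0f with=%.4f without=%.4f ratio=%.4f%n",
                        d[0], d[1], with, without, with / without);
            }
        }
    }
}
```

For a fixed k1, the ratio column is the same constant (k1+1) on every row, so the two variants order documents identically. With k1 = 0 the norm vanishes and every document gets the same score, boost * IDF, whatever its tf or length, which is the side effect described above.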
[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity
[ https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683815#comment-16683815 ]

Elizabeth Haubert commented on LUCENE-8563:

Mathematically, it changes the ratio of
{code:java}
tf * idf / (tf + norm)
{code}
which determines the relative importance of the norms parameter. It seems like that should affect ranking, at least for low values of tf. Why not just set the parameter to 0 for the cases you are looking at?