[jira] [Commented] (LUCENE-7368) Remove queryNorm
[ https://issues.apache.org/jira/browse/LUCENE-7368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17114084#comment-17114084 ]

Dumitru Daniliuc commented on LUCENE-7368:

You are right: our custom Similarity implementation did not override {{queryNorm()}}, so it defaulted to {{Similarity.queryNorm()}}, which used to always return 1.0f. Thanks for your explanation and for helping us debug this!

> Remove queryNorm
>
> Key: LUCENE-7368
> URL: https://issues.apache.org/jira/browse/LUCENE-7368
> Project: Lucene - Core
> Issue Type: Sub-task
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Major
> Fix For: 7.0
> Attachments: LUCENE-7368.patch
>
> Splitting LUCENE-7347 into smaller tasks.

--
This message was sent by Atlassian Jira (v8.3.4#803005)

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-7368) Remove queryNorm
[ https://issues.apache.org/jira/browse/LUCENE-7368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17114021#comment-17114021 ]

Adrien Grand commented on LUCENE-7368:

It looks to me like the problem is in 6.6.6, not 7.7.2. Seeing queryNorm=1 suggests that your custom similarity incompletely implements query normalization. See for instance what the same explanation looks like with ClassicSimilarity: the IDF factor of the queryWeight gets cancelled by the queryNorm, and only fieldWeight retains an IDF factor.
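The cancellation described above can be contrasted with the queryNorm=1 case in a few lines of plain Java. This is only a sketch under stated assumptions: it does not use any Lucene classes, the names {{QueryNormContrast}}, {{QueryNorm}}, {{ALWAYS_ONE}}, {{INVERSE_SQRT}} and {{termScore}} are hypothetical, and the score is modelled as queryWeight * fieldWeight for a single-term query, as in the 6.x explain output.

```java
// Illustrative arithmetic only; none of these names are Lucene API.
final class QueryNormContrast {

    /** Hypothetical stand-in for a similarity's query normalization. */
    interface QueryNorm {
        float of(float sumOfSquaredWeights);
    }

    /** A normalization that always returns 1, so nothing is cancelled. */
    static final QueryNorm ALWAYS_ONE = v -> 1.0f;

    /** A 1/sqrt(v) normalization, as assumed here for ClassicSimilarity. */
    static final QueryNorm INVERSE_SQRT = v -> (float) (1.0 / Math.sqrt(v));

    /** Single-term score modelled as queryWeight * fieldWeight. */
    static float termScore(QueryNorm queryNorm, float boost, float idf,
                           float tf, float fieldNorm) {
        float rawQueryWeight = boost * idf;
        float norm = queryNorm.of(rawQueryWeight * rawQueryWeight);
        float queryWeight = norm * rawQueryWeight; // close to 1 with INVERSE_SQRT
        float fieldWeight = tf * idf * fieldNorm;  // the IDF factor that remains
        return queryWeight * fieldWeight;
    }

    public static void main(String[] args) {
        float boost = 1.96f;
        float idf = 17.315535f;
        System.out.println("queryNorm == 1:         "
                + termScore(ALWAYS_ONE, boost, idf, 1.0f, 1.0f));
        System.out.println("queryNorm == 1/sqrt(v): "
                + termScore(INVERSE_SQRT, boost, idf, 1.0f, 1.0f));
    }
}
```

With the always-one normalization the score keeps boost * IDF^2; with the inverse square root the query weight collapses to roughly 1 and a single IDF factor survives through fieldWeight.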
[jira] [Commented] (LUCENE-7368) Remove queryNorm
[ https://issues.apache.org/jira/browse/LUCENE-7368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17113558#comment-17113558 ]

Dumitru Daniliuc commented on LUCENE-7368:

[~jpountz], thanks for looking into this! Here's the old explain message:

{noformat}
202743.53 = , product of:
  587.6624 = sum of:
    587.6624 = sum of:
      587.6624 = sum of:
        587.6624 = weight(username:barackobama in 0) [UserSimilarityProvider], result of:
          587.6624 = score(doc=0,freq=1.0), product of:
            33.93845 = queryWeight, product of:
              1.96 = boost
              17.315535 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
                1.0 = docFreq
                2.4365572E7 = docCount
              1.0 = queryNorm
            17.315535 = fieldWeight in 0, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              17.315535 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
                1.0 = docFreq
                2.4365572E7 = docCount
              1.0 = fieldNorm(doc=0)
  345.0 =
{noformat}

And here's the new one:

{noformat}
11708.552 = , product of:
  33.93783 = sum of:
    33.93783 = sum of:
      33.93783 = sum of:
        33.93783 = weight(username:barackobama in 0) [UserSimilarityProvider], result of:
          33.93783 = score(doc=0,freq=1.0), product of:
            1.96 = boost
            17.31522 = fieldWeight in 0, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              17.31522 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
                1.0 = docFreq
                2.4357912E7 = docCount
              1.0 = fieldNorm(doc=0)
  345.0 =
{noformat}

I'll take a look at the IndexSearcher methods you mentioned and see if we missed anything in our code (it's possible we override some of this behavior, and did not make the appropriate changes).
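As a sanity check on the two explain outputs above, the reported products can be recomputed by hand. The sketch below is plain arithmetic, not Lucene code; the class and method names are hypothetical, the input values are copied from the explain messages, and the unlabeled 345.0 factor is treated simply as an outer multiplier because the output does not say what it is.

```java
// Recomputes the numbers shown in the explain outputs; names are illustrative.
final class ExplainArithmeticCheck {

    /** Old (6.6.6) shape: queryWeight * fieldWeight, with queryNorm reported as 1.0. */
    static float oldTermScore(float boost, float idf, float tf, float fieldNorm) {
        float queryWeight = boost * idf * 1.0f;   // boost * idf * queryNorm
        float fieldWeight = tf * idf * fieldNorm; // second idf factor
        return queryWeight * fieldWeight;
    }

    /** New (7.x) shape: boost * fieldWeight, so idf appears once. */
    static float newTermScore(float boost, float idf, float tf, float fieldNorm) {
        float fieldWeight = tf * idf * fieldNorm;
        return boost * fieldWeight;
    }

    public static void main(String[] args) {
        float outerFactor = 345.0f; // unlabeled factor from the explain output
        float oldScore = oldTermScore(1.96f, 17.315535f, 1.0f, 1.0f);
        float newScore = newTermScore(1.96f, 17.31522f, 1.0f, 1.0f);
        System.out.println("old term score: " + oldScore
                + ", old total: " + oldScore * outerFactor);
        System.out.println("new term score: " + newScore
                + ", new total: " + newScore * outerFactor);
        System.out.println("old/new ratio:  " + oldScore / newScore);
    }
}
```

The ratio between the old and new term scores comes out close to the IDF value, which is consistent with exactly one IDF factor having dropped out between the two versions.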
[jira] [Commented] (LUCENE-7368) Remove queryNorm
[ https://issues.apache.org/jira/browse/LUCENE-7368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17113528#comment-17113528 ]

Adrien Grand commented on LUCENE-7368:

[~ddaniliuc] It was intentional. This second IDF factor was only used for the normalization logic; the IDF would not be squared in the final score. See how IndexSearcher#createNormalizedWeight works: [https://github.com/apache/lucene-solr/blob/branch_6x/lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java#L732-L742]. Here is what would happen for a TermQuery and ClassicSimilarity:
- The term weight is initially computed as {{boost * IDF^2}} as you noted.
- {{float v = weight.getValueForNormalization(); // v == boost^2 * IDF^2}}
- {{float norm = getSimilarity(needsScores).queryNorm(v); // norm == 1/sqrt(v) == 1/(boost * IDF)}}
- {{weight.normalize(norm, 1.0f); // value == norm * boost * IDF^2 == IDF}}

Can you share the output of {{IndexSearcher#explain}} before and after the change?
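The four steps above can be replayed with plain floats. The following is a minimal sketch rather than Lucene code: the class and method names are made up for illustration, the {{1/sqrt(v)}} formula is taken from the comment in the steps above, and boost 1.96 with IDF 17.315535 are just example inputs.

```java
// Replays the normalization steps with plain arithmetic; not Lucene API.
final class NormalizationWalkthrough {

    /** Step 1: initial term weight, boost * idf^2 (queryNorm is 1 at this point). */
    static float initialValue(float boost, float idf) {
        float queryWeight = 1.0f * boost * idf;
        return queryWeight * idf;
    }

    /** Step 2: value used for normalization, the squared query weight. */
    static float valueForNormalization(float boost, float idf) {
        float queryWeight = boost * idf;
        return queryWeight * queryWeight; // boost^2 * idf^2
    }

    /** Step 3: norm == 1/sqrt(v), per the comment in the steps above. */
    static float queryNorm(float v) {
        return (float) (1.0 / Math.sqrt(v));
    }

    /** Step 4: normalized value, norm * boost * idf^2. */
    static float normalizedValue(float boost, float idf) {
        float norm = queryNorm(valueForNormalization(boost, idf));
        return norm * initialValue(boost, idf);
    }

    public static void main(String[] args) {
        float boost = 1.96f;
        float idf = 17.315535f;
        System.out.println("initial value:    " + initialValue(boost, idf));
        System.out.println("v:                " + valueForNormalization(boost, idf));
        System.out.println("norm:             " + queryNorm(valueForNormalization(boost, idf)));
        System.out.println("normalized value: " + normalizedValue(boost, idf));
    }
}
```

The normalized value should land very close to the IDF itself, so both the boost and one IDF factor are cancelled by the norm in this single-term case.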
[jira] [Commented] (LUCENE-7368) Remove queryNorm
[ https://issues.apache.org/jira/browse/LUCENE-7368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17104642#comment-17104642 ]

Dumitru Daniliuc commented on LUCENE-7368:

[~jpountz], I was wondering if you could help with a question about this patch. The javadocs for TFIDFSimilarity say that the IDF factor is squared in the final score. And before this patch (Lucene 6.6.6), it looks like it was:

{noformat}
private final class TFIDFSimScorer extends SimScorer {
  ...
  TFIDFSimScorer(IDFStats stats, NumericDocValues norms) throws IOException {
    this.stats = stats;
    this.weightValue = stats.value; // <--- stats.value = queryNorm * boost * idf.getValue() * idf.getValue()
    this.norms = norms;
  }
}

private static class IDFStats extends SimWeight {
  ...
  public IDFStats(String field, Explanation idf) {
    // TODO: Validate?
    this.field = field;
    this.idf = idf;
    normalize(1f, 1f);
  }
  ...
  @Override
  public void normalize(float queryNorm, float boost) {
    this.boost = boost;
    this.queryNorm = queryNorm;
    queryWeight = queryNorm * boost * idf.getValue();
    value = queryWeight * idf.getValue(); // idf for document
  }
}
{noformat}

After this patch though (Lucene 7.0.0 and beyond), it looks like we lost an IDF factor in this code:

{noformat}
private final class TFIDFSimScorer extends SimScorer {
  TFIDFSimScorer(IDFStats stats, NumericDocValues norms, float[] normTable) throws IOException {
    this.stats = stats;
    this.weightValue = stats.queryWeight; // <--- stats.queryWeight = boost * idf.getValue()
    this.norms = norms;
    this.normTable = normTable;
  }
}

static class IDFStats extends SimWeight {
  ...
  public IDFStats(String field, float boost, Explanation idf, float[] normTable) {
    // TODO: Validate?
    this.field = field;
    this.idf = idf;
    this.boost = boost;
    this.queryWeight = boost * idf.getValue();
    this.normTable = normTable;
  }
}
{noformat}

Was this change intentional? If so, I was wondering if you could point us to the location where the second IDF factor is supposed to come from now.
For a bit more context: we've been running on an old Lucene version for a while, and we're now working on getting to the latest Lucene version, one major version at a time. When upgrading from Lucene 6.6.6 to 7.0.0, we noticed that the scores for our results lost an IDF factor, and this patch seems relevant. Thanks!