[jira] [Commented] (LUCENE-7368) Remove queryNorm

2020-05-22 Thread Dumitru Daniliuc (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-7368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17114084#comment-17114084
 ] 

Dumitru Daniliuc commented on LUCENE-7368:
--

You are right: our custom Similarity implementation did not override 
{{queryNorm()}}, so it fell back to {{Similarity.queryNorm()}}, which always 
returned 1.0f. Thanks for your explanation and for helping us debug this!
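
To make the effect concrete, here is a minimal arithmetic sketch of how the normalization factor changes a single-term weight. It is not Lucene code: the class and method names are made up for illustration, the 1/sqrt(sumOfSquaredWeights) form is what ClassicSimilarity used in 6.x as far as I recall, and treating the sum of squared weights as (boost * idf)^2 is a simplifying assumption for a single-term query.

```java
// Illustrative only: hypothetical names, not part of the Lucene API.
final class QueryNormSketch {

    // The default behaviour described above: no normalization at all.
    static float constantQueryNorm(float sumOfSquaredWeights) {
        return 1.0f;
    }

    // Assumed ClassicSimilarity-style normalization: 1 / sqrt(sumOfSquaredWeights).
    static float inverseSqrtQueryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    // 6.x-style term weight: idf enters once on the query side and once on the document side.
    static float termWeight(float queryNorm, float boost, float idf) {
        return queryNorm * boost * idf * idf;
    }

    public static void main(String[] args) {
        float boost = 1.96f;
        float idf = 17.315535f;
        // Simplifying assumption: a single term, so the sum has one entry.
        float sumOfSquares = (boost * idf) * (boost * idf);

        // With a constant queryNorm of 1.0 the weight keeps both idf factors.
        System.out.println(termWeight(constantQueryNorm(sumOfSquares), boost, idf));
        // With 1/sqrt normalization one (boost * idf) factor cancels, leaving roughly idf.
        System.out.println(termWeight(inverseSqrtQueryNorm(sumOfSquares), boost, idf));
    }
}
```

Under these assumptions, a constant queryNorm of 1.0 leaves the idf squared in the weight, while the 1/sqrt normalization cancels one of the two factors.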

> Remove queryNorm
> 
>
> Key: LUCENE-7368
> URL: https://issues.apache.org/jira/browse/LUCENE-7368
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Major
> Fix For: 7.0
>
> Attachments: LUCENE-7368.patch
>
>
> Splitting LUCENE-7347 into smaller tasks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-7368) Remove queryNorm

2020-05-21 Thread Dumitru Daniliuc (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-7368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17113558#comment-17113558
 ] 

Dumitru Daniliuc edited comment on LUCENE-7368 at 5/21/20, 9:55 PM:


[~jpountz], thanks for looking into this! Here's the old explain message 
(Lucene 6.6.6):
{noformat}
202743.53 = , product of:
  587.6624 = sum of:
    587.6624 = sum of:
      587.6624 = sum of:
        587.6624 = weight(username:barackobama in 0) [UserSimilarityProvider], result of:
          587.6624 = score(doc=0,freq=1.0), product of:
            33.93845 = queryWeight, product of:
              1.96 = boost
              17.315535 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
                1.0 = docFreq
                2.4365572E7 = docCount
              1.0 = queryNorm
            17.315535 = fieldWeight in 0, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              17.315535 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
                1.0 = docFreq
                2.4365572E7 = docCount
              1.0 = fieldNorm(doc=0)
  345.0 = 
{noformat}

And here's the new one (Lucene 7.7.2):
{noformat}
11708.552 = , product of:
  33.93783 = sum of:
    33.93783 = sum of:
      33.93783 = sum of:
        33.93783 = weight(username:barackobama in 0) [UserSimilarityProvider], result of:
          33.93783 = score(doc=0,freq=1.0), product of:
            1.96 = boost
            17.31522 = fieldWeight in 0, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              17.31522 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
                1.0 = docFreq
                2.4357912E7 = docCount
              1.0 = fieldNorm(doc=0)
  345.0 = 
{noformat}

I'll take a look at the methods you mentioned.
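
As a sanity check on the two explain trees, the numbers can be recomputed by hand. The sketch below is plain arithmetic, not Lucene code: the class and method names are hypothetical, and it uses double precision, so the last digits may differ slightly from the float values Lucene prints.

```java
// Illustrative only: recomputes the values shown in the explain output above.
final class ExplainCheck {

    // idf as printed in the explain output: log((docCount+1)/(docFreq+1)) + 1
    static double idf(long docFreq, long docCount) {
        return Math.log((docCount + 1.0) / (docFreq + 1.0)) + 1.0;
    }

    // 6.6.6 tree: queryWeight (boost * idf * queryNorm) times fieldWeight (tf * idf * fieldNorm)
    static double oldScore(double boost, double idf, double queryNorm, double tf, double fieldNorm) {
        return (boost * idf * queryNorm) * (tf * idf * fieldNorm);
    }

    // 7.7.2 tree: boost times fieldWeight (tf * idf * fieldNorm)
    static double newScore(double boost, double idf, double tf, double fieldNorm) {
        return boost * (tf * idf * fieldNorm);
    }

    public static void main(String[] args) {
        double oldIdf = idf(1, 24_365_572L); // docFreq and docCount from the 6.6.6 tree
        double newIdf = idf(1, 24_357_912L); // docFreq and docCount from the 7.7.2 tree
        System.out.printf("old: idf=%.5f score=%.4f%n", oldIdf, oldScore(1.96, oldIdf, 1.0, 1.0, 1.0));
        System.out.printf("new: idf=%.5f score=%.4f%n", newIdf, newScore(1.96, newIdf, 1.0, 1.0));
    }
}
```

The two term scores differ by a factor of roughly 17.3, which is one idf; the small remainder comes from the slightly different docCount in the two indexes.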





[jira] [Comment Edited] (LUCENE-7368) Remove queryNorm

2020-05-21 Thread Dumitru Daniliuc (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-7368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17113558#comment-17113558
 ] 

Dumitru Daniliuc edited comment on LUCENE-7368 at 5/21/20, 9:31 PM:


[~jpountz], thanks for looking into this! Here's the old explain message:
{noformat}
202743.53 = , product of:
  587.6624 = sum of:
    587.6624 = sum of:
      587.6624 = sum of:
        587.6624 = weight(username:barackobama in 0) [UserSimilarityProvider], result of:
          587.6624 = score(doc=0,freq=1.0), product of:
            33.93845 = queryWeight, product of:
              1.96 = boost
              17.315535 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
                1.0 = docFreq
                2.4365572E7 = docCount
              1.0 = queryNorm
            17.315535 = fieldWeight in 0, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              17.315535 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
                1.0 = docFreq
                2.4365572E7 = docCount
              1.0 = fieldNorm(doc=0)
  345.0 = 
{noformat}

And here's the new one:
{noformat}
11708.552 = , product of:
  33.93783 = sum of:
    33.93783 = sum of:
      33.93783 = sum of:
        33.93783 = weight(username:barackobama in 0) [UserSimilarityProvider], result of:
          33.93783 = score(doc=0,freq=1.0), product of:
            1.96 = boost
            17.31522 = fieldWeight in 0, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              17.31522 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
                1.0 = docFreq
                2.4357912E7 = docCount
              1.0 = fieldNorm(doc=0)
  345.0 = 
{noformat}

I'll take a look at the methods you mentioned.





[jira] [Commented] (LUCENE-7368) Remove queryNorm

2020-05-21 Thread Dumitru Daniliuc (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-7368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17113558#comment-17113558
 ] 

Dumitru Daniliuc commented on LUCENE-7368:
--

[~jpountz], thanks for looking into this! Here's the old explain message:
{noformat}
202743.53 = , product of:
  587.6624 = sum of:
    587.6624 = sum of:
      587.6624 = sum of:
        587.6624 = weight(username:barackobama in 0) [UserSimilarityProvider], result of:
          587.6624 = score(doc=0,freq=1.0), product of:
            33.93845 = queryWeight, product of:
              1.96 = boost
              17.315535 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
                1.0 = docFreq
                2.4365572E7 = docCount
              1.0 = queryNorm
            17.315535 = fieldWeight in 0, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              17.315535 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
                1.0 = docFreq
                2.4365572E7 = docCount
              1.0 = fieldNorm(doc=0)
  345.0 = 
{noformat}

And here's the new one:
{noformat}
11708.552 = , product of:
  33.93783 = sum of:
    33.93783 = sum of:
      33.93783 = sum of:
        33.93783 = weight(username:barackobama in 0) [UserSimilarityProvider], result of:
          33.93783 = score(doc=0,freq=1.0), product of:
            1.96 = boost
            17.31522 = fieldWeight in 0, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              17.31522 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
                1.0 = docFreq
                2.4357912E7 = docCount
              1.0 = fieldNorm(doc=0)
  345.0 = 
{noformat}

I'll take a look at the IndexSearcher methods you mentioned and see if we 
missed anything in our code (it's possible we override some of this behavior, 
and did not make the appropriate changes).




[jira] [Commented] (LUCENE-7368) Remove queryNorm

2020-05-11 Thread Dumitru Daniliuc (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-7368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17104642#comment-17104642
 ] 

Dumitru Daniliuc commented on LUCENE-7368:
--

[~jpountz], I was wondering if you could help with a question about this patch. 
The javadocs for TFIDFSimilarity say that the IDF factor is squared in the 
final score. And before this patch (Lucene 6.6.6), it looks like it was:
{noformat}
  private final class TFIDFSimScorer extends SimScorer {
    ...
    TFIDFSimScorer(IDFStats stats, NumericDocValues norms) throws IOException {
      this.stats = stats;
      this.weightValue = stats.value;  // <--- stats.value = queryNorm * boost * idf.getValue() * idf.getValue()
      this.norms = norms;
    }
  }

  private static class IDFStats extends SimWeight {
    ...
    public IDFStats(String field, Explanation idf) {
      // TODO: Validate?
      this.field = field;
      this.idf = idf;
      normalize(1f, 1f);
    }
    ...
    @Override
    public void normalize(float queryNorm, float boost) {
      this.boost = boost;
      this.queryNorm = queryNorm;
      queryWeight = queryNorm * boost * idf.getValue();
      value = queryWeight * idf.getValue(); // idf for document
    }
  }
{noformat}

After this patch though (Lucene 7.0.0 and beyond), it looks like we lost an IDF 
factor in this code:
{noformat}
  private final class TFIDFSimScorer extends SimScorer {

    TFIDFSimScorer(IDFStats stats, NumericDocValues norms, float[] normTable) throws IOException {
      this.stats = stats;
      this.weightValue = stats.queryWeight;  // <--- stats.queryWeight = boost * idf.getValue()
      this.norms = norms;
      this.normTable = normTable;
    }
  }

  static class IDFStats extends SimWeight {
    ...
    public IDFStats(String field, float boost, Explanation idf, float[] normTable) {
      // TODO: Validate?
      this.field = field;
      this.idf = idf;
      this.boost = boost;
      this.queryWeight = boost * idf.getValue();
      this.normTable = normTable;
    }
  }
{noformat}

Was this change intentional? If so, I was wondering if you could point us to 
the location where the second IDF factor is supposed to come from now.
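
To isolate the difference, here is a stripped-down sketch of the two weight computations quoted above. These are simplified stand-ins with hypothetical names, not the Lucene classes, and the input values are made up so the result is easy to check by hand.

```java
// Illustrative only: simplified stand-ins for the two IDFStats variants.
final class IdfWeights {

    // 6.6.6 style: normalize() multiplies idf in twice (query side, then document side).
    static float weight666(float queryNorm, float boost, float idf) {
        float queryWeight = queryNorm * boost * idf;
        return queryWeight * idf;
    }

    // 7.0.0 style: the constructor multiplies idf in once.
    static float weight700(float boost, float idf) {
        return boost * idf;
    }

    public static void main(String[] args) {
        float boost = 2f; // made-up value
        float idf = 3f;   // made-up value
        System.out.println(weight666(1f, boost, idf)); // prints 18.0
        System.out.println(weight700(boost, idf));     // prints 6.0
    }
}
```

With queryNorm fixed at 1, the two weights differ by exactly one idf factor, which is the gap we see in our scores.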

For a bit more context: we've been running an old Lucene version for a 
while and are now upgrading to the latest one, one major version at a time. 
When going from Lucene 6.6.6 to 7.0.0, we noticed that the scores for our 
results lost an IDF factor, and this patch seems relevant.

Thanks!
