[GitHub] [lucene] jtibshirani commented on a diff in pull request #910: LUCENE-10582: Fix merging of CollectionStatistics in CombinedFieldQuery

2022-05-24 Thread GitBox


jtibshirani commented on code in PR #910:
URL: https://github.com/apache/lucene/pull/910#discussion_r881050412


##
lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestCombinedFieldQuery.java:
##
@@ -589,4 +589,97 @@ public SimScorer scorer(
   return new BM25Similarity().scorer(boost, collectionStats, termStats);
 }
   }
+
+  public void testOverrideCollectionStatistics() throws IOException {
+Directory dir = newDirectory();
+IndexWriterConfig iwc = new IndexWriterConfig();
+Similarity similarity = randomCompatibleSimilarity();
+iwc.setSimilarity(similarity);
+RandomIndexWriter w = new RandomIndexWriter(random(), dir, iwc);
+
+int numMatch = atLeast(10);
+for (int i = 0; i < numMatch; i++) {
+  Document doc = new Document();
+  if (random().nextBoolean()) {
+doc.add(new TextField("a", "baz", Store.NO));
+doc.add(new TextField("b", "baz", Store.NO));
+for (int k = 0; k < 2; k++) {
+  doc.add(new TextField("ab", "baz", Store.NO));
+}
+w.addDocument(doc);
+doc.clear();
+  }
+  int freqA = random().nextInt(5) + 1;
+  for (int j = 0; j < freqA; j++) {
+doc.add(new TextField("a", "foo", Store.NO));
+  }
+  int freqB = random().nextInt(5) + 1;
+  for (int j = 0; j < freqB; j++) {
+doc.add(new TextField("b", "foo", Store.NO));
+  }
+  int freqAB = freqA + freqB;
+  for (int j = 0; j < freqAB; j++) {
+doc.add(new TextField("ab", "foo", Store.NO));
+  }
+  w.addDocument(doc);
+}
+
+IndexReader reader = w.getReader();
+
+int extraMaxDoc = randomIntBetween(0, 10);
+int extraDocCount = randomIntBetween(0, extraMaxDoc);
+
+int extraSumDocFreqA = extraDocCount + randomIntBetween(0, 10);

Review Comment:
   I think it'd make more sense to have a single `sumDocFreq` here. `sumDocFreq` 
counts the unique term-document pairs in a field, so we can't just add the 
values across different fields. In fact, `CombinedFieldQuery` takes the 
maximum of the per-field `sumDocFreq` values.
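
   To illustrate the point about not summing `sumDocFreq`, here is a minimal, Lucene-free sketch of merging per-field statistics. `MergeStatsSketch`, `FieldStats`, and `merge` are hypothetical names, not Lucene classes. Only the "take the maximum of `sumDocFreq`" rule comes from the comment above; how the other three statistics are combined (maximum for `maxDoc` and `docCount`, unweighted sum for `sumTotalTermFreq`) is an assumption made for the example.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: a stand-in for merging per-field collection statistics. */
final class MergeStatsSketch {

  /** Hypothetical holder mirroring the four numeric collection statistics. */
  static final class FieldStats {
    final long maxDoc;
    final long docCount;
    final long sumTotalTermFreq;
    final long sumDocFreq;

    FieldStats(long maxDoc, long docCount, long sumTotalTermFreq, long sumDocFreq) {
      this.maxDoc = maxDoc;
      this.docCount = docCount;
      this.sumTotalTermFreq = sumTotalTermFreq;
      this.sumDocFreq = sumDocFreq;
    }
  }

  /** Merges per-field statistics into one set for a combined pseudo-field. */
  static FieldStats merge(List<FieldStats> perField) {
    long maxDoc = 0;
    long docCount = 0;
    long sumTotalTermFreq = 0;
    long sumDocFreq = 0;
    for (FieldStats s : perField) {
      // Assumption: document-level counts are bounded by the largest field.
      maxDoc = Math.max(maxDoc, s.maxDoc);
      docCount = Math.max(docCount, s.docCount);
      // From the review comment: take the maximum, do not add across fields.
      sumDocFreq = Math.max(sumDocFreq, s.sumDocFreq);
      // Assumption: term occurrences add up (field weights are ignored here).
      sumTotalTermFreq += s.sumTotalTermFreq;
    }
    return new FieldStats(maxDoc, docCount, sumTotalTermFreq, sumDocFreq);
  }
}
```

   With two fields whose `sumDocFreq` values are 20 and 25, the merged value in this sketch is 25 rather than 45, which is why a test that tracks a separate extra `sumDocFreq` per field and adds them would not match.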



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org





[GitHub] [lucene] jtibshirani commented on a diff in pull request #910: LUCENE-10582: Fix merging of CollectionStatistics in CombinedFieldQuery

2022-05-20 Thread GitBox


jtibshirani commented on code in PR #910:
URL: https://github.com/apache/lucene/pull/910#discussion_r878568768


##
lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestCombinedFieldQuery.java:
##
@@ -589,4 +589,52 @@ public SimScorer scorer(
   return new BM25Similarity().scorer(boost, collectionStats, termStats);
 }
   }
+
+  public void testDistributedCollectionStatistics() throws IOException {
+Directory dir = newDirectory();
+IndexWriterConfig iwc = new IndexWriterConfig();
+iwc.setSimilarity(randomCompatibleSimilarity());
+RandomIndexWriter w = new RandomIndexWriter(random(), dir, iwc);
+
+String queryString = "foo";
+
+Document doc0 = new Document();
+doc0.add(new TextField("f", "foo", Store.NO));
+doc0.add(new TextField("g", "foo baz", Store.NO));
+w.addDocument(doc0);
+
+IndexReader reader = w.getReader();
+IndexSearcher searcher =
+new IndexSearcher(reader) {
+  @Override
+  public CollectionStatistics collectionStatistics(String field) throws IOException {
+CollectionStatistics shardStatistics = super.collectionStatistics(field);
+int extraMaxDoc = randomIntBetween(0, 10);
+int extraDocCount = randomIntBetween(0, extraMaxDoc);
+int extraSumDocFreq = extraDocCount + randomIntBetween(0, 10);
+int extraSumTotalTermFreq = extraSumDocFreq + randomIntBetween(0, 10);
+CollectionStatistics globalStatistics =
+new CollectionStatistics(
+field,
+shardStatistics.maxDoc() + extraMaxDoc,
+shardStatistics.docCount() + extraDocCount,
+shardStatistics.sumTotalTermFreq() + extraSumTotalTermFreq,
+shardStatistics.sumDocFreq() + extraSumDocFreq);
+return globalStatistics;
+  }
+};
+searcher.setSimilarity(new BM25Similarity());
+CombinedFieldQuery query =
+new CombinedFieldQuery.Builder()
+.addField("f")
+.addField("g")
+.addTerm(new BytesRef(queryString))
+.build();
+// just check that search does not fail
+searcher.search(query, 10);

Review Comment:
   It'd be nice to assert something stronger here, to check that 
`CombinedFieldQuery` still works as expected when collection stats are 
overridden. Maybe we could compare the output of two query strategies like we 
do in `testCopyField`.
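
   As a rough idea of what the stronger assertion could check, here is a dependency-free sketch that compares two ranked result lists for the same doc ids and near-equal scores. `HitComparisonSketch`, `Hit`, and `firstMismatch` are hypothetical names for illustration, and the tolerance is an assumed parameter. In the actual test the two lists would come from the two query strategies, and the comparison would presumably go through Lucene's own test utilities rather than this stand-in.

```java
/** Illustrative only: compares two ranked hit lists position by position. */
final class HitComparisonSketch {

  /** Hypothetical (doc id, score) pair standing in for a search hit. */
  static final class Hit {
    final int doc;
    final float score;

    Hit(int doc, float score) {
      this.doc = doc;
      this.score = score;
    }
  }

  /**
   * Returns null when both lists have the same length, the same doc id at
   * every rank, and scores within the given tolerance; otherwise returns a
   * short description of the first difference found.
   */
  static String firstMismatch(Hit[] expected, Hit[] actual, float tolerance) {
    if (expected.length != actual.length) {
      return "hit count differs: " + expected.length + " vs " + actual.length;
    }
    for (int i = 0; i < expected.length; i++) {
      if (expected[i].doc != actual[i].doc) {
        return "rank " + i + ": doc " + expected[i].doc + " vs " + actual[i].doc;
      }
      if (Math.abs(expected[i].score - actual[i].score) > tolerance) {
        return "rank " + i + ": score " + expected[i].score + " vs " + actual[i].score;
      }
    }
    return null;
  }
}
```

   A test built this way would fail with a readable message at the first rank where the two strategies disagree, instead of only checking that the search completes.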



##
lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestCombinedFieldQuery.java:
##
@@ -589,4 +589,52 @@ public SimScorer scorer(
   return new BM25Similarity().scorer(boost, collectionStats, termStats);
 }
   }
+
+  public void testDistributedCollectionStatistics() throws IOException {
+Directory dir = newDirectory();
+IndexWriterConfig iwc = new IndexWriterConfig();
+iwc.setSimilarity(randomCompatibleSimilarity());
+RandomIndexWriter w = new RandomIndexWriter(random(), dir, iwc);
+
+String queryString = "foo";
+
+Document doc0 = new Document();
+doc0.add(new TextField("f", "foo", Store.NO));
+doc0.add(new TextField("g", "foo baz", Store.NO));
+w.addDocument(doc0);
+
+IndexReader reader = w.getReader();
+IndexSearcher searcher =
+new IndexSearcher(reader) {
+  @Override
+  public CollectionStatistics collectionStatistics(String field) throws IOException {
+CollectionStatistics shardStatistics = super.collectionStatistics(field);
+int extraMaxDoc = randomIntBetween(0, 10);
+int extraDocCount = randomIntBetween(0, extraMaxDoc);
+int extraSumDocFreq = extraDocCount + randomIntBetween(0, 10);
+int extraSumTotalTermFreq = extraSumDocFreq + randomIntBetween(0, 10);
+CollectionStatistics globalStatistics =
+new CollectionStatistics(
+field,
+shardStatistics.maxDoc() + extraMaxDoc,
+shardStatistics.docCount() + extraDocCount,
+shardStatistics.sumTotalTermFreq() + extraSumTotalTermFreq,
+shardStatistics.sumDocFreq() + extraSumDocFreq);
+return globalStatistics;
+  }
+};
+searcher.setSimilarity(new BM25Similarity());

Review Comment:
   It's unusual to search with a different similarity than the one used during 
indexing -- I think we could remove this line.



##
lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestCombinedFieldQuery.java:
##
@@ -589,4 +589,52 @@ public SimScorer scorer(
   return new BM25Similarity().scorer(boost, collectionStats, termStats);
 }
   }
+
+  public void testDistributedCollectionStatistics() throws IOException {

Review Comment:
   Small comment, maybe we could call this `testOverrideCollectionStatistics`? 
Lucene doesn't really have a native concept of "distributed