jtibshirani commented on code in PR #910:
URL: https://github.com/apache/lucene/pull/910#discussion_r878568768
##
lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestCombinedFieldQuery.java:
##
@@ -589,4 +589,52 @@ public SimScorer scorer(
return new BM25Similarity().scorer(boost, collectionStats, termStats);
}
}
+
+ public void testDistributedCollectionStatistics() throws IOException {
+Directory dir = newDirectory();
+IndexWriterConfig iwc = new IndexWriterConfig();
+iwc.setSimilarity(randomCompatibleSimilarity());
+RandomIndexWriter w = new RandomIndexWriter(random(), dir, iwc);
+
+String queryString = "foo";
+
+Document doc0 = new Document();
+doc0.add(new TextField("f", "foo", Store.NO));
+doc0.add(new TextField("g", "foo baz", Store.NO));
+w.addDocument(doc0);
+
+IndexReader reader = w.getReader();
+IndexSearcher searcher =
+new IndexSearcher(reader) {
+ @Override
+ public CollectionStatistics collectionStatistics(String field)
throws IOException {
+CollectionStatistics shardStatistics = super.collectionStatistics(field);
+int extraMaxDoc = randomIntBetween(0, 10);
+int extraDocCount = randomIntBetween(0, extraMaxDoc);
+int extraSumDocFreq = extraDocCount + randomIntBetween(0, 10);
+int extraSumTotalTermFreq = extraSumDocFreq + randomIntBetween(0, 10);
+CollectionStatistics globalStatistics =
+new CollectionStatistics(
+field,
+shardStatistics.maxDoc() + extraMaxDoc,
+shardStatistics.docCount() + extraDocCount,
+shardStatistics.sumTotalTermFreq() + extraSumTotalTermFreq,
+shardStatistics.sumDocFreq() + extraSumDocFreq);
+return globalStatistics;
+ }
+};
+searcher.setSimilarity(new BM25Similarity());
+CombinedFieldQuery query =
+new CombinedFieldQuery.Builder()
+.addField("f")
+.addField("g")
+.addTerm(new BytesRef(queryString))
+.build();
+// just check that search does not fail
+searcher.search(query, 10);
Review Comment:
It'd be nice to assert something stronger here, to check that
`CombinedFieldQuery` still works as expected when collection stats are
overridden. Maybe we could compare the output of two query strategies, as we
do in `testCopyField`.
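To make the suggestion concrete, here is a rough sketch of the kind of comparison I have in mind, written in plain Java so it runs on its own. `Hit` and `sameHits` are hypothetical stand-ins invented for illustration, not Lucene classes or test-framework helpers; in the real test the two lists would come from running the two query strategies against the same searcher.

```java
import java.util.Arrays;
import java.util.List;

public class HitComparison {
  /** Hypothetical stand-in for a scored hit (doc ID plus score); not a Lucene class. */
  static final class Hit {
    final int doc;
    final float score;

    Hit(int doc, float score) {
      this.doc = doc;
      this.score = score;
    }
  }

  /**
   * Returns true when both hit lists contain the same doc IDs in the same order and each pair of
   * scores differs by at most delta.
   */
  static boolean sameHits(List<Hit> expected, List<Hit> actual, float delta) {
    if (expected.size() != actual.size()) {
      return false;
    }
    for (int i = 0; i < expected.size(); i++) {
      Hit e = expected.get(i);
      Hit a = actual.get(i);
      if (e.doc != a.doc || Math.abs(e.score - a.score) > delta) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // Made-up values for illustration; a real test would collect these from two searches.
    List<Hit> firstStrategy = Arrays.asList(new Hit(0, 0.42f), new Hit(3, 0.17f));
    List<Hit> secondStrategy = Arrays.asList(new Hit(0, 0.42f), new Hit(3, 0.17f));
    if (!sameHits(firstStrategy, secondStrategy, 1e-5f)) {
      throw new AssertionError("the two strategies returned different hits");
    }
    System.out.println("hit lists match");
  }
}
```

The point is only that the test would fail if overriding the collection stats made the two strategies diverge, instead of passing whenever the search does not throw.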
##
lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestCombinedFieldQuery.java:
##
@@ -589,4 +589,52 @@ public SimScorer scorer(
return new BM25Similarity().scorer(boost, collectionStats, termStats);
}
}
+
+ public void testDistributedCollectionStatistics() throws IOException {
+Directory dir = newDirectory();
+IndexWriterConfig iwc = new IndexWriterConfig();
+iwc.setSimilarity(randomCompatibleSimilarity());
+RandomIndexWriter w = new RandomIndexWriter(random(), dir, iwc);
+
+String queryString = "foo";
+
+Document doc0 = new Document();
+doc0.add(new TextField("f", "foo", Store.NO));
+doc0.add(new TextField("g", "foo baz", Store.NO));
+w.addDocument(doc0);
+
+IndexReader reader = w.getReader();
+IndexSearcher searcher =
+new IndexSearcher(reader) {
+ @Override
+ public CollectionStatistics collectionStatistics(String field) throws IOException {
+CollectionStatistics shardStatistics = super.collectionStatistics(field);
+int extraMaxDoc = randomIntBetween(0, 10);
+int extraDocCount = randomIntBetween(0, extraMaxDoc);
+int extraSumDocFreq = extraDocCount + randomIntBetween(0, 10);
+int extraSumTotalTermFreq = extraSumDocFreq + randomIntBetween(0, 10);
+CollectionStatistics globalStatistics =
+new CollectionStatistics(
+field,
+shardStatistics.maxDoc() + extraMaxDoc,
+shardStatistics.docCount() + extraDocCount,
+shardStatistics.sumTotalTermFreq() + extraSumTotalTermFreq,
+shardStatistics.sumDocFreq() + extraSumDocFreq);
+return globalStatistics;
+ }
+};
+searcher.setSimilarity(new BM25Similarity());
Review Comment:
It's unusual to search with a different similarity than was used during
indexing -- I think we could remove this line.
##
lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestCombinedFieldQuery.java:
##
@@ -589,4 +589,52 @@ public SimScorer scorer(
return new BM25Similarity().scorer(boost, collectionStats, termStats);
}
}
+
+ public void testDistributedCollectionStatistics() throws IOException {
Review Comment:
Small comment: maybe we could call this `testOverrideCollectionStatistics`?
Lucene doesn't really have a native concept of "distributed