Re: [PR] Make FSTPostingFormat to build FST off-heap [lucene]

2024-03-14 Thread via GitHub


github-actions[bot] commented on PR #12980:
URL: https://github.com/apache/lucene/pull/12980#issuecomment-1998682382

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] TestIDVersionPostingsFormat failure [lucene]

2024-03-14 Thread via GitHub


benwtrent commented on issue #13127:
URL: https://github.com/apache/lucene/issues/13127#issuecomment-1998480248

   More investigation is needed. The only other method that updates 
`DWDQ#nextSeqNo` is `DWDQ#skipSequenceNumbers(long)`.
   
   The only place that `DWDQ#skipSequenceNumbers(long)` is called is within 
`DW#lockAndAbortAll` which is synchronized on `DW` which disallows 
`DW#resetDeleteQueue` or `DW#getNextSequenceNumber` from messing with the queue.
   
   I am not sure we have a race condition or just actually bad logic somewhere 
in these paths. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] TestIDVersionPostingsFormat failure [lucene]

2024-03-14 Thread via GitHub


benwtrent commented on issue #13127:
URL: https://github.com/apache/lucene/issues/13127#issuecomment-1998402170

   Well, that race-condition wasn't the cause. I have seen another failure.
   ```
   ./gradlew test --tests TestIDVersionPostingsFormat.testGlobalVersions 
-Dtests.seed=DEC45C861B1BCFEE -Dtests.locale=ar-YE -Dtests.timezone=Etc/GMT-8 
-Dtests.asserts=true -Dtests.file.encoding=UTF-8
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-14 Thread via GitHub


benwtrent merged PR #13124:
URL: https://github.com/apache/lucene/pull/13124


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Make the HitQueue size more appropriate for KNN exact search [lucene]

2024-03-14 Thread via GitHub


benwtrent commented on code in PR #13184:
URL: https://github.com/apache/lucene/pull/13184#discussion_r1525341872


##
lucene/join/src/java/org/apache/lucene/search/join/DiversifyingChildrenFloatKnnVectorQuery.java:
##
@@ -98,7 +98,8 @@ protected TopDocs exactSearch(LeafReaderContext context, 
DocIdSetIterator accept
 parentBitSet,
 query,
 fi.getVectorSimilarityFunction());
-HitQueue queue = new HitQueue(k, true);
+final int queueSize = Math.min(k, Math.toIntExact(acceptIterator.cost()));

Review Comment:
   I don't know a better way, but, since this diversifies over parent doc ids, 
its possible that the hitqueue is still much smaller than 
`acceptIterator.cost()` as `acceptIterator.cost()` is the iterator over CHILD 
docs (e.g. passage vector docs). I think any further optimization (e.g. 
counting the number of relevant parents) would add undo overhead.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Change BP reordering logic to help support document blocks later on. [lucene]

2024-03-14 Thread via GitHub


rishabhmaurya commented on code in PR #13123:
URL: https://github.com/apache/lucene/pull/13123#discussion_r1525325740


##
lucene/misc/src/java/org/apache/lucene/misc/index/BPIndexReorderer.java:
##
@@ -341,116 +344,94 @@ protected void compute() {
  */
 private boolean shuffle(
 ForwardIndex forwardIndex,
-IntsRef left,
-IntsRef right,
+IntsRef docIDs,
+int midPoint,
 int[] leftDocFreqs,
 int[] rightDocFreqs,
-float[] gains,
+float[] biases,
 int iter)
 throws IOException {
-  assert left.ints == right.ints;
-  assert left.offset + left.length == right.offset;
 
-  // Computing gains is typically a bottleneck, because each iteration 
needs to iterate over all
-  // postings to recompute gains, and the total number of postings is 
usually one order of
+  // Computing biases is typically a bottleneck, because each iteration 
needs to iterate over
+  // all postings to recompute biases, and the total number of postings is 
usually one order of
   // magnitude or more larger than the number of docs. So we try to 
parallelize it.
-  ComputeGainsTask leftGainsTask =
-  new ComputeGainsTask(
-  left.ints,
-  gains,
-  left.offset,
-  left.offset + left.length,
+  new ComputeBiasTask(
+  docIDs.ints,
+  biases,
+  docIDs.offset,
+  docIDs.offset + docIDs.length,
   leftDocFreqs,
   rightDocFreqs,
   threadLocal,
-  depth);
-  ComputeGainsTask rightGainsTask =
-  new ComputeGainsTask(
-  right.ints,
-  gains,
-  right.offset,
-  right.offset + right.length,
-  rightDocFreqs,
-  leftDocFreqs,
-  threadLocal,
-  depth);
-  if (shouldFork(docIDs.length, docIDs.ints.length)) {
-invokeAll(leftGainsTask, rightGainsTask);
-  } else {
-leftGainsTask.compute();
-rightGainsTask.compute();
+  depth)
+  .compute();
+
+  float maxLeftBias = Float.NEGATIVE_INFINITY;
+  for (int i = docIDs.offset; i < midPoint; ++i) {
+maxLeftBias = Math.max(maxLeftBias, biases[i]);
+  }
+  float minRightBias = Float.POSITIVE_INFINITY;
+  for (int i = midPoint, end = docIDs.offset + docIDs.length; i < end; 
++i) {
+minRightBias = Math.min(minRightBias, biases[i]);
+  }
+  float gain = maxLeftBias - minRightBias;
+  // This uses the simulated annealing proposed by Mackenzie et al in 
"Tradeoff Options for
+  // Bipartite Graph Partitioning" by comparing the gain of swapping the 
doc from the left side
+  // that is most attracted to the right and the doc from the right side 
that is most attracted
+  // to the left against `iter` rather than zero.
+  if (gain <= iter) {
+return false;
   }
 
-  class ByDescendingGainSorter extends IntroSorter {
+  new IntroSelector() {
 
 int pivotDoc;
-float pivotGain;
+float pivotBias;
 
 @Override
 protected void setPivot(int i) {
-  pivotDoc = left.ints[i];
-  pivotGain = gains[i];
+  pivotDoc = docIDs.ints[i];
+  pivotBias = biases[i];
 }
 
 @Override
 protected int comparePivot(int j) {
-  // Compare in reverse order to get a descending sort
-  int cmp = Float.compare(gains[j], pivotGain);
+  int cmp = Float.compare(pivotBias, biases[j]);

Review Comment:
   @jpountz done https://github.com/apache/lucene/pull/13186. I wasn't sure if 
its worthy a PR, thus started discussion here. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Avoid iterations: cooling using simulated annealing [lucene]

2024-03-14 Thread via GitHub


rishabhmaurya opened a new pull request, #13186:
URL: https://github.com/apache/lucene/pull/13186

   ### Description
   As described in the paper (Tradeoff Options for Bipartite Graph 
Partitioning), simulated annealing-type mechanism be employed to reduce number 
of swaps with each iteration. If projected advantage swapping 2 docs across 
partitions isn't as appealing, in this case less than `iter` bits, then swap 
can be avoided. 
   Swapping documents is a heavy operation as it requires recomputing biases 
for documents which are affected by the swap i.e. have common terms with other 
documents in the partition. 
   Currently, its only employed once before starting the shuffle operation, 
where we check against the max gain possible for a given iteration and if its 
at least `iter` bits apart. This condition can be checked with each swap when 
performing quick select, if its worthy.  
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-14 Thread via GitHub


zhaih commented on code in PR #13124:
URL: https://github.com/apache/lucene/pull/13124#discussion_r1525307991


##
lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java:
##
@@ -902,12 +932,52 @@ private static String getSegmentName(MergePolicy.OneMerge 
merge) {
   }
 
   static {
-TestSecrets.setConcurrentMergeSchedulerAccess(
-new ConcurrentMergeSchedulerAccess() {
-  @Override
-  public void setSuppressExceptions(ConcurrentMergeScheduler cms) {
-cms.setSuppressExceptions();
-  }
-});
+
TestSecrets.setConcurrentMergeSchedulerAccess(ConcurrentMergeScheduler::setSuppressExceptions);
+  }
+
+  private class CachedExecutor implements Executor {

Review Comment:
   Can we have some simple javadoc explaining what this executor is doing? E.g. 
will use a spare thread if it is within configured max thread and will run 
directly on caller thread if we have no extra thread available?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Support disabling IndexSearcher.maxClauseCount with a value of -1 [lucene]

2024-03-14 Thread via GitHub


dsmiley commented on PR #13178:
URL: https://github.com/apache/lucene/pull/13178#issuecomment-1998018450

   Problem: clarity in being able to turn it off; a -1 value would clarify the 
configuration intent.  It has become even less useful, not that I quite wanted 
it in the first place in the past either.  In the production clusters I work 
with, there are other limits (query parsing level enforced) and they are 
adequate.  I could use a bunch of 9's and add a comment in the configuration 
but shouldn't Lucene let you turn it off?  Why argue against a simple `if` 
condition for a check the user doesn't want to enforce.  I understand your 
point that it probably costs pathetically little compared to executing the 
query anyway.  However note the sad `TermInSetQuery.visit()` implementation.
   
   Additionally/alternatively, making getNumClausesCheckVisitor protected and 
non-static could be useful.
   
   Additionally/alternatively, remove this limit.  Keep a sample QueryVisitor 
impl with instructional javadocs to show how trivial it is for a user to layer 
this on if they so choose (subclass `rewrite`)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new VectorScorer interface to vector value iterators [lucene]

2024-03-14 Thread via GitHub


benwtrent commented on code in PR #13181:
URL: https://github.com/apache/lucene/pull/13181#discussion_r1525282411


##
lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java:
##
@@ -190,12 +190,13 @@ protected TopDocs exactSearch(LeafReaderContext context, 
DocIdSetIterator accept
 if (vectorScorer == null) {
   return NO_RESULTS;
 }
+DocIdSetIterator vectorIterator = vectorScorer.iterator();
 HitQueue queue = new HitQueue(k, true);
 ScoreDoc topDoc = queue.top();
 int doc;
 while ((doc = acceptIterator.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
-  boolean advanced = vectorScorer.advanceExact(doc);
-  assert advanced;
+  vectorIterator.advance(doc);
+  assert vectorIterator.docID() == doc;

Review Comment:
   The iterator already does a filter on the field existing. But I agree, this 
interaction needs improved. Especially as adding the filter doesn't really seem 
necessary and could unnecessarily thrash any filter clause caching that exists. 
   
   The iterator query that is actually created upstream of this code:
   ```java
   if (filter != null) {
 BooleanQuery booleanQuery =
 new BooleanQuery.Builder()
 .add(filter, BooleanClause.Occur.FILTER)
 .add(new FieldExistsQuery(field), BooleanClause.Occur.FILTER)
 .build();
 Query rewritten = indexSearcher.rewrite(booleanQuery);
 filterWeight = indexSearcher.createWeight(rewritten, 
ScoreMode.COMPLETE_NO_SCORES, 1f);
   } else {
 filterWeight = null;
   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Avoid recalculating the norm of the target vector when using cosine metric [lucene]

2024-03-14 Thread via GitHub


benwtrent commented on issue #13185:
URL: https://github.com/apache/lucene/issues/13185#issuecomment-1997999092

   I would rather not change anything related to this enumeration until we 
figure out: https://github.com/apache/lucene/issues/13182 
   
   As an aside, I think cosine as a metric is fairly useless. Folks using 
Lucene should just normalize everything and use dot product, or use 
max-inner-product.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-14 Thread via GitHub


jpountz commented on code in PR #13124:
URL: https://github.com/apache/lucene/pull/13124#discussion_r1525270702


##
lucene/core/src/java/org/apache/lucene/index/SegmentMerger.java:
##
@@ -130,19 +135,31 @@ MergeState merge() throws IOException {
 IOContext.READ,
 segmentWriteState.segmentSuffix);
 
+TaskExecutor taskExecutor = new 
TaskExecutor(mergeState.intraMergeTaskExecutor);
+List> mergingTasks = new ArrayList<>();
+
 if (mergeState.mergeFieldInfos.hasNorms()) {
   mergeWithLogging(this::mergeNorms, segmentWriteState, segmentReadState, 
"norms", numMerged);
 }
 
 mergeWithLogging(this::mergeTerms, segmentWriteState, segmentReadState, 
"postings", numMerged);

Review Comment:
   Presumably, the worst-case scenario is not much worse than if you have many 
concurrent merges, so I think it's fine.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-14 Thread via GitHub


benwtrent commented on code in PR #13124:
URL: https://github.com/apache/lucene/pull/13124#discussion_r1525266457


##
lucene/core/src/java/org/apache/lucene/index/SegmentMerger.java:
##
@@ -130,19 +135,31 @@ MergeState merge() throws IOException {
 IOContext.READ,
 segmentWriteState.segmentSuffix);
 
+TaskExecutor taskExecutor = new 
TaskExecutor(mergeState.intraMergeTaskExecutor);
+List> mergingTasks = new ArrayList<>();
+
 if (mergeState.mergeFieldInfos.hasNorms()) {
   mergeWithLogging(this::mergeNorms, segmentWriteState, segmentReadState, 
"norms", numMerged);
 }
 
 mergeWithLogging(this::mergeTerms, segmentWriteState, segmentReadState, 
"postings", numMerged);

Review Comment:
   OK, I added a commit that does this. The candidate merge time is now:
   
   `forceMerge took 4965 ms`
   vs. baseline (i re-ran to make sure)
   `forceMerge took 15783 ms`
   
   So, 3x faster? Seems pretty good. The major downside is that now memory 
usage on merge might increase as we are potentially doing all this merge 
activity at the same time. If we are cool with that, this seems like a nice 
speed improvement.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[I] Avoid recalculating the norm of the target vector when using cosine metric [lucene]

2024-03-14 Thread via GitHub


bugmakerr opened a new issue, #13185:
URL: https://github.com/apache/lucene/issues/13185

   ### Description
   
   Currently, in the KNN retrieval process, we use 
`VectorSimilarityFunction#compare` to calculate the score between the target 
vector and the current vector. This method requires recalculating the norm of 
the target vector each time. To avoid this repetition, we can pass the square 
norm of the target vector to score method. I suggest modifying the relevant 
interface as follows:
   ```
   @functionalInterface
   public interface ByteVectorScorer {
   float score(byte[] vector);
   }
   
   @functionalInterface
   public interface FloatVectorScorer {
   float score(float[] vector);
   }
   
   public enum VectorSimilarityFunction {
   
   EUCLIDEAN {
   @Override
   public ByteVectorScorer getVectorScorer(byte[] target) {
   return vector -> 1 / (1f + squareDistance(target, vector));
   }
   
   @Override
   public FloatVectorScorer getVectorScorer(float[] target) {
   return vector -> 1 / (1 + squareDistance(target, vector));
   }
   },
   
   COSINE {
   @Override
   public ByteVectorScorer getVectorScorer(byte[] target) {
   int squareNorm = dotProduct(target, target);
   return vector -> (1 + cosine(target, vector, squareNorm)) / 2;
   }
   
   @Override
   public FloatVectorScorer getVectorScorer(float[] target) {
   double squareNorm = dotProduct(target, target);
   return vector -> Math.max((1 + cosine(target, vector, 
squareNorm)) / 2, 0);
   }
   };
   
   public abstract ByteVectorScorer getVectorScorer(byte[] target);
   
   public abstract FloatVectorScorer getVectorScorer(float[] target);
   }
   ```
   
   Any thoughts?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-14 Thread via GitHub


jpountz commented on code in PR #13124:
URL: https://github.com/apache/lucene/pull/13124#discussion_r1525217090


##
lucene/core/src/java/org/apache/lucene/index/SegmentMerger.java:
##
@@ -130,19 +135,31 @@ MergeState merge() throws IOException {
 IOContext.READ,
 segmentWriteState.segmentSuffix);
 
+TaskExecutor taskExecutor = new 
TaskExecutor(mergeState.intraMergeTaskExecutor);
+List> mergingTasks = new ArrayList<>();
+
 if (mergeState.mergeFieldInfos.hasNorms()) {
   mergeWithLogging(this::mergeNorms, segmentWriteState, segmentReadState, 
"norms", numMerged);
 }
 
 mergeWithLogging(this::mergeTerms, segmentWriteState, segmentReadState, 
"postings", numMerged);

Review Comment:
   Can we have a parallel task that handles norms + terms so that the order is 
respected?



##
lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java:
##
@@ -902,12 +932,57 @@ private static String getSegmentName(MergePolicy.OneMerge 
merge) {
   }
 
   static {
-TestSecrets.setConcurrentMergeSchedulerAccess(
-new ConcurrentMergeSchedulerAccess() {
-  @Override
-  public void setSuppressExceptions(ConcurrentMergeScheduler cms) {
-cms.setSuppressExceptions();
+
TestSecrets.setConcurrentMergeSchedulerAccess(ConcurrentMergeScheduler::setSuppressExceptions);
+  }
+
+  private class ScaledExecutor implements Executor {
+
+private final AtomicInteger activeCount = new AtomicInteger(0);
+private final ThreadPoolExecutor executor;
+
+public ScaledExecutor() {
+  this.executor =
+  new ThreadPoolExecutor(0, 1024, 1L, TimeUnit.MINUTES, new 
SynchronousQueue<>());
+}
+
+void shutdown() {
+  executor.shutdown();
+}
+
+@Override
+public void execute(Runnable command) {
+  assert mergeThreads.contains(Thread.currentThread()) : "caller is not a 
merge thread";

Review Comment:
   I'd expect this assertion to no longer be valid, since SegmentMerger may 
fork into this executor for vectors, and then the task for vectors may want to 
further fork into a separate thread? So the caller may be either a MergeThread, 
or a thread from the wrapper thread pool?



##
lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java:
##
@@ -910,4 +936,55 @@ public void setSuppressExceptions(ConcurrentMergeScheduler 
cms) {
   }
 });
   }
+
+  private class ScaledExecutor implements Executor {

Review Comment:
   nit: should it be called `CachedExecutor` now?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new VectorScorer interface to vector value iterators [lucene]

2024-03-14 Thread via GitHub


mccullocht commented on code in PR #13181:
URL: https://github.com/apache/lucene/pull/13181#discussion_r1525162554


##
lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java:
##
@@ -190,12 +190,13 @@ protected TopDocs exactSearch(LeafReaderContext context, 
DocIdSetIterator accept
 if (vectorScorer == null) {
   return NO_RESULTS;
 }
+DocIdSetIterator vectorIterator = vectorScorer.iterator();
 HitQueue queue = new HitQueue(k, true);
 ScoreDoc topDoc = queue.top();
 int doc;
 while ((doc = acceptIterator.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
-  boolean advanced = vectorScorer.advanceExact(doc);
-  assert advanced;
+  vectorIterator.advance(doc);
+  assert vectorIterator.docID() == doc;

Review Comment:
   skip the document if it doesn't match? i'm guessing it would be an easily 
predicted branch and it's possible that the filter query could yield a document 
that doesn't have a vector.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Make the HitQueue size more appropriate for KNN exact search [lucene]

2024-03-14 Thread via GitHub


bugmakerr opened a new pull request, #13184:
URL: https://github.com/apache/lucene/pull/13184

   ### Description
   Currently, when performing KNN exact search, we consistently set the 
HitQueue size to `k`. However, there may be instances where the number of 
candidates is actually lower than `k`.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] gh-13147: use dense bit-encoding for frequent terms [lucene]

2024-03-14 Thread via GitHub


msokolov commented on PR #13153:
URL: https://github.com/apache/lucene/pull/13153#issuecomment-1997699021

   It seems to especially make phrase and span queries slower? Possibly the 
decoding of positions is not good? I could try restricting to fields without 
positions (which seem likely to be the ones that would benefit anyway)
   
   ```
   TaskQPS baseline  StdDevQPS 
my_modified_version  StdDevPct diff p-value 
[87/1978]
 HighPhrase4.62  (4.1%)4.45  
(3.9%)   -3.7% ( -11% -4%) 0.004
  MedPhrase   67.01  (1.9%)   65.37  
(2.0%)   -2.4% (  -6% -1%) 0.000
 IntNRQ   15.92 (13.8%)   15.53 
(13.2%)   -2.4% ( -25% -   28%) 0.569
   HighSloppyPhrase3.60  (2.4%)3.52  
(2.8%)   -2.3% (  -7% -2%) 0.005
  LowPhrase9.44  (2.5%)9.23  
(2.4%)   -2.2% (  -6% -2%) 0.004
   AndHighHighDayTaxoFacets2.92  (2.3%)2.88  
(3.5%)   -1.5% (  -7% -4%) 0.108
LowSloppyPhrase   26.88  (2.5%)   26.48  
(2.5%)   -1.5% (  -6% -3%) 0.065
   
MedSloppyPhrase   18.10  (3.2%)   17.85  
(3.5%)   -1.4% (  -7% -5%) 0.191
   
  HighTermDayOfYearSort  156.73  (2.2%)  154.64  
(2.8%)   -1.3% (  -6% -3%) 0.096
   
 TermDTSort   69.58  (2.6%)   68.83  
(2.9%)   -1.1% (  -6% -4%) 0.216
   
AndHighMedDayTaxoFacets   15.06  (0.9%)   14.91  
(1.6%)   -1.0% (  -3% -1%) 0.015
   
MedSpanNear   31.29  (1.5%)   30.97  
(1.1%)   -1.0% (  -3% -1%) 0.017
   
   HighIntervalsOrdered3.92  (4.2%)3.88  
(4.5%)   -1.0% (  -9% -8%) 0.465
   
LowIntervalsOrdered1.42  (3.0%)1.41  
(3.0%)   -0.9% (  -6% -5%) 0.335   
LowSpanNear   14.23  (1.3%)   14.10  
(0.8%)   -0.9% (  -2% -1%) 0.009   
   OrNotHighMed  207.12  (5.3%)  205.50  
(5.8%)   -0.8% ( -11% -   10%) 0.655
  OrNotHighHigh  126.12  (5.2%)  125.14  
(5.8%)   -0.8% ( -11% -   10%) 0.655 
  HighTermTitleSort   79.33  (2.5%)   78.82  
(2.0%)   -0.6% (  -5% -4%) 0.376  
   BrowseDateTaxoFacets2.73  (5.7%)2.71  
(6.9%)   -0.6% ( -12% -   12%) 0.752
BrowseRandomLabelTaxoFacets2.11 (11.5%)2.10 
(11.9%)   -0.6% ( -21% -   25%) 0.868
MedIntervalsOrdered3.17  (3.9%)3.15  
(4.4%)   -0.5% (  -8% -8%) 0.694
   HighSpanNear3.15  (3.1%)3.14  
(2.2%)   -0.5% (  -5% -4%) 0.576
  OrHighLow  247.95  (2.2%)  246.79  
(1.9%)   -0.5% (  -4% -3%) 0.467
 AndHighLow  396.47  (2.2%)  394.68  
(1.5%)   -0.4% (  -4% -3%) 0.460
 Fuzzy1   65.30  (1.5%)   65.01  
(1.8%)   -0.4% (  -3% -2%) 0.402
  BrowseDayOfYearTaxoFacets2.77  (5.4%)2.76  
(6.2%)   -0.4% ( -11% -   11%) 0.817
  OrHighMed  142.70  (2.1%)  142.13  
(2.0%)   -0.4% (  -4% -3%) 0.543
   OrHighNotMed  279.04  (5.4%)  277.96  
(6.2%)   -0.4% ( -11% -   11%) 0.832  
  BrowseMonthSSDVFacets2.38  (2.7%)2.37  
(2.9%)   -0.4% (  -5% -5%) 0.663
 Fuzzy2   50.49  (1.3%)   50.32  
(1.4%)   -0.3% (  -3% -2%) 0.420
   
 OrHighHigh   21.69  (2.4%)   21.62  
(3.6%)   -0.3% (  -6% -5%) 0.746
Respell   32.27  (2.4%)   32.19  
(2.2%)   -0.2% (  -4% -4%) 0.761
  BrowseMonthTaxoFacets2.87  (1.4%)2.86  
(1.2%)   -0.2% (  -2% -2%) 0.654
   
   PKLookup  142.74  (2.4%)  142.50  
(1.9%)   -0.2% (  -4% -4%) 0.809 
 AndHighMed   53.54  (2.3%)   53.45  
(2.2%)   -0.2% (  -4% 

Re: [PR] Change BP reordering logic to help support document blocks later on. [lucene]

2024-03-14 Thread via GitHub


jpountz commented on code in PR #13123:
URL: https://github.com/apache/lucene/pull/13123#discussion_r1525040811


##
lucene/misc/src/java/org/apache/lucene/misc/index/BPIndexReorderer.java:
##
@@ -341,116 +344,94 @@ protected void compute() {
  */
 private boolean shuffle(
 ForwardIndex forwardIndex,
-IntsRef left,
-IntsRef right,
+IntsRef docIDs,
+int midPoint,
 int[] leftDocFreqs,
 int[] rightDocFreqs,
-float[] gains,
+float[] biases,
 int iter)
 throws IOException {
-  assert left.ints == right.ints;
-  assert left.offset + left.length == right.offset;
 
-  // Computing gains is typically a bottleneck, because each iteration 
needs to iterate over all
-  // postings to recompute gains, and the total number of postings is 
usually one order of
+  // Computing biases is typically a bottleneck, because each iteration 
needs to iterate over
+  // all postings to recompute biases, and the total number of postings is 
usually one order of
   // magnitude or more larger than the number of docs. So we try to 
parallelize it.
-  ComputeGainsTask leftGainsTask =
-  new ComputeGainsTask(
-  left.ints,
-  gains,
-  left.offset,
-  left.offset + left.length,
+  new ComputeBiasTask(
+  docIDs.ints,
+  biases,
+  docIDs.offset,
+  docIDs.offset + docIDs.length,
   leftDocFreqs,
   rightDocFreqs,
   threadLocal,
-  depth);
-  ComputeGainsTask rightGainsTask =
-  new ComputeGainsTask(
-  right.ints,
-  gains,
-  right.offset,
-  right.offset + right.length,
-  rightDocFreqs,
-  leftDocFreqs,
-  threadLocal,
-  depth);
-  if (shouldFork(docIDs.length, docIDs.ints.length)) {
-invokeAll(leftGainsTask, rightGainsTask);
-  } else {
-leftGainsTask.compute();
-rightGainsTask.compute();
+  depth)
+  .compute();
+
+  float maxLeftBias = Float.NEGATIVE_INFINITY;
+  for (int i = docIDs.offset; i < midPoint; ++i) {
+maxLeftBias = Math.max(maxLeftBias, biases[i]);
+  }
+  float minRightBias = Float.POSITIVE_INFINITY;
+  for (int i = midPoint, end = docIDs.offset + docIDs.length; i < end; 
++i) {
+minRightBias = Math.min(minRightBias, biases[i]);
+  }
+  float gain = maxLeftBias - minRightBias;
+  // This uses the simulated annealing proposed by Mackenzie et al in 
"Tradeoff Options for
+  // Bipartite Graph Partitioning" by comparing the gain of swapping the 
doc from the left side
+  // that is most attracted to the right and the doc from the right side 
that is most attracted
+  // to the left against `iter` rather than zero.
+  if (gain <= iter) {
+return false;
   }
 
-  class ByDescendingGainSorter extends IntroSorter {
+  new IntroSelector() {
 
 int pivotDoc;
-float pivotGain;
+float pivotBias;
 
 @Override
 protected void setPivot(int i) {
-  pivotDoc = left.ints[i];
-  pivotGain = gains[i];
+  pivotDoc = docIDs.ints[i];
+  pivotBias = biases[i];
 }
 
 @Override
 protected int comparePivot(int j) {
-  // Compare in reverse order to get a descending sort
-  int cmp = Float.compare(gains[j], pivotGain);
+  int cmp = Float.compare(pivotBias, biases[j]);

Review Comment:
   @rishabhmaurya Yes! Please feel free to open a PR and let's discuss there!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] gh-13147: use dense bit-encoding for frequent terms [lucene]

2024-03-14 Thread via GitHub


msokolov commented on PR #13153:
URL: https://github.com/apache/lucene/pull/13153#issuecomment-1997659513

   right that last change seemed promising! But on luceneutil tasks it didn't 
show much impact, basically a regression to teh mean compared to the first 
revision I tested; possibly a slight negative trend overall. I'll paste the 
results here. I was considering creating a more targeted benchmark to 
exaggerate the impact using a field and query type where we would expect to see 
a benefit, if only as a debugging tool, but I haven't had a chance to do that 
yet.  Another idea I had was to try to help branch prediction by using a 
specialized BlockDecoder or so rather than embedding if statements in every 
advance/nextDoc call. I'm also not sure whether writing full longs versus 
truncating the final long in the bitset and writing only as many bytes as 
needed (added in the recent revision) might be making some difference? Seems 
unlikely.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Made DocIdsWriter use DISI when reading documents with an IntersectVisitor [lucene]

2024-03-14 Thread via GitHub


jpountz commented on PR #13149:
URL: https://github.com/apache/lucene/pull/13149#issuecomment-1997568632

   ++ on progress over perfection
   
   That said, I wonder if this change is legal: `DocIdSetIterator` must return 
doc IDs in order, but it looks like it wouldn't always be the case with your 
change?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Reduce duplication in taxonomy facets; always do counts [lucene]

2024-03-14 Thread via GitHub


stefanvodita commented on PR #12966:
URL: https://github.com/apache/lucene/pull/12966#issuecomment-1997515955

   @gsmiller - I know you may not have time to review, but I want to at least 
notify you, since this is a big change and you've been very invovled in this 
area of the code.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Replace Collections.synchronizedSet() with ConcurrentHashMap.newKeySet() [lucene]

2024-03-14 Thread via GitHub


uschindler commented on PR #13142:
URL: https://github.com/apache/lucene/pull/13142#issuecomment-1997212215

   Hi,
   I am fine to apply the "safe" changes where iteration or an atomic 
add+remove is not required.
   But all others should be reverted and external synchronization using 
synchronizedSet should be used.
   Let's only do this for simple sets where we never iterate and where we only 
check and add single entries from different threads.
   Uwe


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Replace Collections.synchronizedSet() with ConcurrentHashMap.newKeySet() [lucene]

2024-03-14 Thread via GitHub


uschindler commented on code in PR #13142:
URL: https://github.com/apache/lucene/pull/13142#discussion_r1524663419


##
lucene/replicator/src/java/org/apache/lucene/replicator/nrt/PrimaryNode.java:
##
@@ -158,10 +158,7 @@ public long getPrimaryGen() {
*/
   public boolean flushAndRefresh() throws IOException {
 message("top: now flushAndRefresh");
-Set completedMergeFiles;

Review Comment:
   I agree, better keep old code.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Replace Collections.synchronizedSet() with ConcurrentHashMap.newKeySet() [lucene]

2024-03-14 Thread via GitHub


uschindler commented on code in PR #13142:
URL: https://github.com/apache/lucene/pull/13142#discussion_r1524661584


##
lucene/core/src/java/org/apache/lucene/store/TrackingDirectoryWrapper.java:
##
@@ -61,10 +61,8 @@ public void copyFrom(Directory from, String src, String 
dest, IOContext context)
   @Override
   public void rename(String source, String dest) throws IOException {
 in.rename(source, dest);
-synchronized (createdFileNames) {
-  createdFileNames.add(dest);
-  createdFileNames.remove(source);
-}
+createdFileNames.add(dest);
+createdFileNames.remove(source);

Review Comment:
   I have the same problem. We need to avoid that. Unless there's an add+remove 
atomic call to ConcurrentSet we need to synchronize.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Replace Collections.synchronizedSet() with ConcurrentHashMap.newKeySet() [lucene]

2024-03-14 Thread via GitHub


benwtrent commented on code in PR #13142:
URL: https://github.com/apache/lucene/pull/13142#discussion_r1524632573


##
lucene/core/src/java/org/apache/lucene/store/TrackingDirectoryWrapper.java:
##
@@ -61,10 +61,8 @@ public void copyFrom(Directory from, String src, String 
dest, IOContext context)
   @Override
   public void rename(String source, String dest) throws IOException {
 in.rename(source, dest);
-synchronized (createdFileNames) {
-  createdFileNames.add(dest);
-  createdFileNames.remove(source);
-}
+createdFileNames.add(dest);
+createdFileNames.remove(source);

Review Comment:
   This means that another thread can mutate `createdFileNames` between `add` 
and `remove`. Are we sure this doesn't add any weird race conditions?



##
lucene/replicator/src/java/org/apache/lucene/replicator/nrt/ReplicaNode.java:
##
@@ -625,19 +624,17 @@ public void run(CopyJob job) {
 for (String fileName : curNRTCopy.getFileNamesToCopy()) {
   assert lastCommitFiles.contains(fileName) == false
   : "fileName=" + fileName + " is in lastCommitFiles and is being 
copied?";
-  synchronized (mergeCopyJobs) {
-for (CopyJob mergeJob : mergeCopyJobs) {
-  if (mergeJob.getFileNames().contains(fileName)) {
-// TODO: we could maybe transferAndCancel here?  except CopyJob 
can't transferAndCancel
-// more than one currently
-message(
-"top: now cancel merge copy job="
-+ mergeJob
-+ ": file "
-+ fileName
-+ " is now being copied via NRT point");
-mergeJob.cancel("newNRTPoint is copying over the same file", null);
-  }
+  for (CopyJob mergeJob : mergeCopyJobs) {
+if (mergeJob.getFileNames().contains(fileName)) {

Review Comment:
   Since this no longer locks on `mergeCopyJobs`, this means during the 
iteration, the collection can be mutated. Are we sure this is OK?



##
lucene/replicator/src/java/org/apache/lucene/replicator/nrt/ReplicaNode.java:
##
@@ -625,19 +624,17 @@ public void run(CopyJob job) {
 for (String fileName : curNRTCopy.getFileNamesToCopy()) {
   assert lastCommitFiles.contains(fileName) == false
   : "fileName=" + fileName + " is in lastCommitFiles and is being 
copied?";
-  synchronized (mergeCopyJobs) {
-for (CopyJob mergeJob : mergeCopyJobs) {
-  if (mergeJob.getFileNames().contains(fileName)) {
-// TODO: we could maybe transferAndCancel here?  except CopyJob 
can't transferAndCancel
-// more than one currently
-message(
-"top: now cancel merge copy job="
-+ mergeJob
-+ ": file "
-+ fileName
-+ " is now being copied via NRT point");
-mergeJob.cancel("newNRTPoint is copying over the same file", null);
-  }
+  for (CopyJob mergeJob : mergeCopyJobs) {
+if (mergeJob.getFileNames().contains(fileName)) {

Review Comment:
   `mergeCopyJobs` is protected, implying its use with sub-classes. You can see 
this in `SimpleReplicaNode`, which takes action while 
`synchronized(mergeCopyJobs)`. 
   
   A quick search on github shows that other repositories take similar action. 
   
   Are we sure this doesn't add a race condition to library user's code that 
sub-class the `ReplicaNode` class & synchronize on `mergeCopyJobs`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Fix TestIndexWriter.testDeleteUnusedFiles failure on Windows 11 [lucene]

2024-03-14 Thread via GitHub


vsop-479 opened a new pull request, #13183:
URL: https://github.com/apache/lucene/pull/13183

   Fix https://github.com/apache/lucene/issues/12524
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Replace Collections.synchronizedSet() with ConcurrentHashMap.newKeySet() [lucene]

2024-03-14 Thread via GitHub


dweiss commented on PR #13142:
URL: https://github.com/apache/lucene/pull/13142#issuecomment-1996843277

   I'm out of office this week. If anybody can pick this up, please do. 
Otherwise I'll return to it next wek.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove halt() call in TestSimpleServer (part of TestStressNRTReplication [lucene]

2024-03-14 Thread via GitHub


dweiss commented on PR #13177:
URL: https://github.com/apache/lucene/pull/13177#issuecomment-1996840513

   I think it's in the method's documentation (current()) that the returned 
handle can't be used to stop yourself. Indeed, I tried it too. ;)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org