Re: [PR] Make FSTPostingFormat to build FST off-heap [lucene]
github-actions[bot] commented on PR #12980: URL: https://github.com/apache/lucene/pull/12980#issuecomment-1998682382

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!
Re: [I] TestIDVersionPostingsFormat failure [lucene]
benwtrent commented on issue #13127: URL: https://github.com/apache/lucene/issues/13127#issuecomment-1998480248

More investigation is needed. The only other method that updates `DWDQ#nextSeqNo` is `DWDQ#skipSequenceNumbers(long)`. The only place that `DWDQ#skipSequenceNumbers(long)` is called is within `DW#lockAndAbortAll`, which is synchronized on `DW`, which prevents `DW#resetDeleteQueue` or `DW#getNextSequenceNumber` from messing with the queue. I am not sure whether we have a race condition or simply bad logic somewhere in these paths.
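Below is a minimal sketch of the locking relationship being described, with hypothetical class shapes standing in for `DocumentsWriter` (DW) and `DocumentsWriterDeleteQueue` (DWDQ); the real Lucene classes differ, this only illustrates why a shared monitor would rule out interleaving:

```java
// Sketch only: if all three DW methods synchronize on the same DW monitor,
// lockAndAbortAll() cannot interleave with resetDeleteQueue() or
// getNextSequenceNumber(), which is the reasoning in the comment above.
class DW {
  private DWDQ deleteQueue = new DWDQ();

  synchronized void lockAndAbortAll() {
    // holds the DW monitor, so the two methods below cannot run concurrently
    deleteQueue.skipSequenceNumbers(64);
  }

  synchronized void resetDeleteQueue() {
    deleteQueue = new DWDQ();
  }

  synchronized long getNextSequenceNumber() {
    return deleteQueue.next();
  }
}

class DWDQ {
  private long nextSeqNo;

  void skipSequenceNumbers(long jump) {
    nextSeqNo += jump;
  }

  long next() {
    return nextSeqNo++;
  }
}
```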
Re: [I] TestIDVersionPostingsFormat failure [lucene]
benwtrent commented on issue #13127: URL: https://github.com/apache/lucene/issues/13127#issuecomment-1998402170

Well, that race condition wasn't the cause. I have seen another failure.

```
./gradlew test --tests TestIDVersionPostingsFormat.testGlobalVersions -Dtests.seed=DEC45C861B1BCFEE -Dtests.locale=ar-YE -Dtests.timezone=Etc/GMT-8 -Dtests.asserts=true -Dtests.file.encoding=UTF-8
```
Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]
benwtrent merged PR #13124: URL: https://github.com/apache/lucene/pull/13124
Re: [PR] Make the HitQueue size more appropriate for KNN exact search [lucene]
benwtrent commented on code in PR #13184: URL: https://github.com/apache/lucene/pull/13184#discussion_r1525341872

## lucene/join/src/java/org/apache/lucene/search/join/DiversifyingChildrenFloatKnnVectorQuery.java:

```
@@ -98,7 +98,8 @@ protected TopDocs exactSearch(LeafReaderContext context, DocIdSetIterator accept
             parentBitSet,
             query,
             fi.getVectorSimilarityFunction());
-    HitQueue queue = new HitQueue(k, true);
+    final int queueSize = Math.min(k, Math.toIntExact(acceptIterator.cost()));
```

Review Comment: I don't know a better way, but, since this diversifies over parent doc ids, it's possible that the hit queue is still much smaller than `acceptIterator.cost()`, as `acceptIterator.cost()` is the iterator over CHILD docs (e.g. passage vector docs). I think any further optimization (e.g. counting the number of relevant parents) would add undue overhead.
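A toy illustration of the sizing concern, with hypothetical numbers (the method name `sizeQueue` is made up for this sketch):

```java
// Sketch: acceptIterator.cost() counts CHILD docs (e.g. passage vectors),
// but the queue collects at most one hit per PARENT doc, so cost() is only
// an upper bound on the useful queue size.
static int sizeQueue(int k, long childCost) {
  return Math.min(k, Math.toIntExact(childCost));
}

// e.g. 1,000 parents x 3 passages each: sizeQueue(10_000, 3_000) == 3_000,
// while the true upper bound is 1,000 parents; counting parents up front
// would add the "undue overhead" mentioned above.
```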
Re: [PR] Change BP reordering logic to help support document blocks later on. [lucene]
rishabhmaurya commented on code in PR #13123: URL: https://github.com/apache/lucene/pull/13123#discussion_r1525325740

## lucene/misc/src/java/org/apache/lucene/misc/index/BPIndexReorderer.java:

```
@@ -341,116 +344,94 @@ protected void compute() {
    */
   private boolean shuffle(
       ForwardIndex forwardIndex,
-      IntsRef left,
-      IntsRef right,
+      IntsRef docIDs,
+      int midPoint,
       int[] leftDocFreqs,
       int[] rightDocFreqs,
-      float[] gains,
+      float[] biases,
       int iter)
       throws IOException {
-    assert left.ints == right.ints;
-    assert left.offset + left.length == right.offset;
-    // Computing gains is typically a bottleneck, because each iteration needs to iterate over all
-    // postings to recompute gains, and the total number of postings is usually one order of
+    // Computing biases is typically a bottleneck, because each iteration needs to iterate over
+    // all postings to recompute biases, and the total number of postings is usually one order of
     // magnitude or more larger than the number of docs. So we try to parallelize it.
-    ComputeGainsTask leftGainsTask =
-        new ComputeGainsTask(
-            left.ints,
-            gains,
-            left.offset,
-            left.offset + left.length,
+    new ComputeBiasTask(
+            docIDs.ints,
+            biases,
+            docIDs.offset,
+            docIDs.offset + docIDs.length,
             leftDocFreqs,
             rightDocFreqs,
             threadLocal,
-            depth);
-    ComputeGainsTask rightGainsTask =
-        new ComputeGainsTask(
-            right.ints,
-            gains,
-            right.offset,
-            right.offset + right.length,
-            rightDocFreqs,
-            leftDocFreqs,
-            threadLocal,
-            depth);
-    if (shouldFork(docIDs.length, docIDs.ints.length)) {
-      invokeAll(leftGainsTask, rightGainsTask);
-    } else {
-      leftGainsTask.compute();
-      rightGainsTask.compute();
+            depth)
+        .compute();
+
+    float maxLeftBias = Float.NEGATIVE_INFINITY;
+    for (int i = docIDs.offset; i < midPoint; ++i) {
+      maxLeftBias = Math.max(maxLeftBias, biases[i]);
+    }
+    float minRightBias = Float.POSITIVE_INFINITY;
+    for (int i = midPoint, end = docIDs.offset + docIDs.length; i < end; ++i) {
+      minRightBias = Math.min(minRightBias, biases[i]);
+    }
+    float gain = maxLeftBias - minRightBias;
+    // This uses the simulated annealing proposed by Mackenzie et al in "Tradeoff Options for
+    // Bipartite Graph Partitioning" by comparing the gain of swapping the doc from the left side
+    // that is most attracted to the right and the doc from the right side that is most attracted
+    // to the left against `iter` rather than zero.
+    if (gain <= iter) {
+      return false;
     }

-    class ByDescendingGainSorter extends IntroSorter {
+    new IntroSelector() {
       int pivotDoc;
-      float pivotGain;
+      float pivotBias;

       @Override
       protected void setPivot(int i) {
-        pivotDoc = left.ints[i];
-        pivotGain = gains[i];
+        pivotDoc = docIDs.ints[i];
+        pivotBias = biases[i];
       }

       @Override
       protected int comparePivot(int j) {
-        // Compare in reverse order to get a descending sort
-        int cmp = Float.compare(gains[j], pivotGain);
+        int cmp = Float.compare(pivotBias, biases[j]);
```

Review Comment: @jpountz done https://github.com/apache/lucene/pull/13186. I wasn't sure if it was worth a PR, so I started the discussion here.
[PR] Avoid iterations: cooling using simulated annealing [lucene]
rishabhmaurya opened a new pull request, #13186: URL: https://github.com/apache/lucene/pull/13186

### Description

As described in the paper "Tradeoff Options for Bipartite Graph Partitioning", a simulated annealing-type mechanism can be employed to reduce the number of swaps with each iteration. If the projected advantage of swapping 2 docs across partitions isn't appealing enough (in this case, less than `iter` bits), then the swap can be avoided. Swapping documents is a heavy operation, as it requires recomputing biases for documents affected by the swap, i.e. those that have common terms with other documents in the partition.

Currently, this check is only employed once, before starting the shuffle operation, where we compare the max gain possible for a given iteration against `iter` bits. This condition could also be checked with each swap when performing quick select, if it proves worthwhile.
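A minimal sketch of the acceptance rule described above, assuming a per-swap gain measured in bits and the current iteration number; the names are illustrative, not the PR's actual code:

```java
// Sketch only: simulated-annealing style swap acceptance. Early iterations
// (small iter) accept most beneficial swaps; later iterations demand a
// progressively larger projected gain, so swap activity "cools" over time.
static boolean shouldSwap(float leftBias, float rightBias, int iter) {
  float projectedGain = leftBias - rightBias; // projected saving, in bits
  return projectedGain > iter;                // threshold grows each iteration
}
```

This mirrors the `if (gain <= iter) return false;` check in the shuffle diff above, applied per candidate swap rather than once per iteration.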
Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]
zhaih commented on code in PR #13124: URL: https://github.com/apache/lucene/pull/13124#discussion_r1525307991

## lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java:

```
@@ -902,12 +932,52 @@ private static String getSegmentName(MergePolicy.OneMerge merge) {
   }

   static {
-    TestSecrets.setConcurrentMergeSchedulerAccess(
-        new ConcurrentMergeSchedulerAccess() {
-          @Override
-          public void setSuppressExceptions(ConcurrentMergeScheduler cms) {
-            cms.setSuppressExceptions();
-          }
-        });
+    TestSecrets.setConcurrentMergeSchedulerAccess(ConcurrentMergeScheduler::setSuppressExceptions);
+  }
+
+  private class CachedExecutor implements Executor {
```

Review Comment: Can we have some simple javadoc explaining what this executor is doing? E.g. it will use a spare thread if we are within the configured max thread count, and will run directly on the caller thread if no extra thread is available?
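One possible shape for the requested javadoc, assuming the behavior described in the comment; the wording below is a suggestion, not the merged text:

```java
/**
 * An {@link Executor} that grows up to the configured maximum thread count.
 * If a spare thread is available within that budget, the task is handed off
 * to it; otherwise the task runs directly on the calling merge thread, so
 * work always makes progress without unbounded thread creation.
 */
private class CachedExecutor implements Executor {
  // ...
}
```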
Re: [PR] Support disabling IndexSearcher.maxClauseCount with a value of -1 [lucene]
dsmiley commented on PR #13178: URL: https://github.com/apache/lucene/pull/13178#issuecomment-1998018450

Problem: clarity in being able to turn it off; a -1 value would make the configuration intent explicit. The limit has become even less useful, not that I quite wanted it in the first place. In the production clusters I work with, there are other limits (enforced at the query-parsing level) and they are adequate. I could use a bunch of 9's and add a comment in the configuration, but shouldn't Lucene let you turn it off? Why argue against a simple `if` condition for a check the user doesn't want to enforce? I understand your point that it probably costs pathetically little compared to executing the query anyway. However, note the sad `TermInSetQuery.visit()` implementation.

Additionally/alternatively, making getNumClausesCheckVisitor protected and non-static could be useful.

Additionally/alternatively, remove this limit. Keep a sample QueryVisitor impl with instructional javadocs to show how trivial it is for a user to layer this on if they so choose (subclass `rewrite`).
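A rough sketch of the "layer it on yourself" option, assuming a subclass of `IndexSearcher` that counts leaf queries during `rewrite`; `ClauseBudgetSearcher` is a hypothetical name and this is an illustration of the idea, not a tested implementation:

```java
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryVisitor;

// Sketch: enforce a clause budget outside Lucene's built-in limit by
// counting leaf queries with a QueryVisitor after rewrite.
class ClauseBudgetSearcher extends IndexSearcher {
  private final int maxClauses;

  ClauseBudgetSearcher(IndexReader reader, int maxClauses) {
    super(reader);
    this.maxClauses = maxClauses;
  }

  @Override
  public Query rewrite(Query original) throws IOException {
    Query rewritten = super.rewrite(original);
    int[] count = new int[1]; // mutable counter visible to the anonymous visitor
    rewritten.visit(
        new QueryVisitor() {
          @Override
          public void visitLeaf(Query query) {
            if (++count[0] > maxClauses) {
              throw new IllegalStateException("too many clauses: > " + maxClauses);
            }
          }
        });
    return rewritten;
  }
}
```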
Re: [PR] Add new VectorScorer interface to vector value iterators [lucene]
benwtrent commented on code in PR #13181: URL: https://github.com/apache/lucene/pull/13181#discussion_r1525282411

## lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java:

```
@@ -190,12 +190,13 @@ protected TopDocs exactSearch(LeafReaderContext context, DocIdSetIterator accept
     if (vectorScorer == null) {
       return NO_RESULTS;
     }
+    DocIdSetIterator vectorIterator = vectorScorer.iterator();
     HitQueue queue = new HitQueue(k, true);
     ScoreDoc topDoc = queue.top();
     int doc;
     while ((doc = acceptIterator.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
-      boolean advanced = vectorScorer.advanceExact(doc);
-      assert advanced;
+      vectorIterator.advance(doc);
+      assert vectorIterator.docID() == doc;
```

Review Comment: The iterator already does a filter on the field existing. But I agree, this interaction needs improvement. Especially as adding the filter doesn't really seem necessary and could unnecessarily thrash any filter clause caching that exists.

The iterator query that is actually created upstream of this code:

```java
if (filter != null) {
  BooleanQuery booleanQuery =
      new BooleanQuery.Builder()
          .add(filter, BooleanClause.Occur.FILTER)
          .add(new FieldExistsQuery(field), BooleanClause.Occur.FILTER)
          .build();
  Query rewritten = indexSearcher.rewrite(booleanQuery);
  filterWeight = indexSearcher.createWeight(rewritten, ScoreMode.COMPLETE_NO_SCORES, 1f);
} else {
  filterWeight = null;
}
```
Re: [I] Avoid recalculating the norm of the target vector when using cosine metric [lucene]
benwtrent commented on issue #13185: URL: https://github.com/apache/lucene/issues/13185#issuecomment-1997999092

I would rather not change anything related to this enumeration until we figure out: https://github.com/apache/lucene/issues/13182

As an aside, I think cosine as a metric is fairly useless. Folks using Lucene should just normalize everything and use dot product, or use max-inner-product.
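For reference, normalizing vectors to unit length up front (so that dot-product scores order the same way cosine would) is a few lines; a minimal sketch, assuming non-zero float vectors normalized once at index time and once per query:

```java
// Sketch: L2-normalize in place so DOT_PRODUCT scores match COSINE ordering.
static void l2Normalize(float[] v) {
  double norm = 0;
  for (float x : v) {
    norm += x * x;
  }
  double inv = 1.0 / Math.sqrt(norm); // assumes a non-zero vector
  for (int i = 0; i < v.length; i++) {
    v[i] *= (float) inv;
  }
}
```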
Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]
jpountz commented on code in PR #13124: URL: https://github.com/apache/lucene/pull/13124#discussion_r1525270702

## lucene/core/src/java/org/apache/lucene/index/SegmentMerger.java:

```
@@ -130,19 +135,31 @@ MergeState merge() throws IOException {
             IOContext.READ,
             segmentWriteState.segmentSuffix);

+    TaskExecutor taskExecutor = new TaskExecutor(mergeState.intraMergeTaskExecutor);
+    List<Callable<Void>> mergingTasks = new ArrayList<>();
+
     if (mergeState.mergeFieldInfos.hasNorms()) {
       mergeWithLogging(this::mergeNorms, segmentWriteState, segmentReadState, "norms", numMerged);
     }

     mergeWithLogging(this::mergeTerms, segmentWriteState, segmentReadState, "postings", numMerged);
```

Review Comment: Presumably, the worst-case scenario is not much worse than if you have many concurrent merges, so I think it's fine.
Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]
benwtrent commented on code in PR #13124: URL: https://github.com/apache/lucene/pull/13124#discussion_r1525266457

## lucene/core/src/java/org/apache/lucene/index/SegmentMerger.java:

```
@@ -130,19 +135,31 @@ MergeState merge() throws IOException {
             IOContext.READ,
             segmentWriteState.segmentSuffix);

+    TaskExecutor taskExecutor = new TaskExecutor(mergeState.intraMergeTaskExecutor);
+    List<Callable<Void>> mergingTasks = new ArrayList<>();
+
     if (mergeState.mergeFieldInfos.hasNorms()) {
       mergeWithLogging(this::mergeNorms, segmentWriteState, segmentReadState, "norms", numMerged);
     }

     mergeWithLogging(this::mergeTerms, segmentWriteState, segmentReadState, "postings", numMerged);
```

Review Comment: OK, I added a commit that does this. The candidate merge time is now:

`forceMerge took 4965 ms`

vs. baseline (I re-ran to make sure):

`forceMerge took 15783 ms`

So, 3x faster? Seems pretty good. The major downside is that memory usage on merge might now increase, as we are potentially doing all this merge activity at the same time. If we are cool with that, this seems like a nice speed improvement.
[I] Avoid recalculating the norm of the target vector when using cosine metric [lucene]
bugmakerr opened a new issue, #13185: URL: https://github.com/apache/lucene/issues/13185

### Description

Currently, in the KNN retrieval process, we use `VectorSimilarityFunction#compare` to calculate the score between the target vector and the current vector. This method requires recalculating the norm of the target vector each time. To avoid this repetition, we can pass the square norm of the target vector to the score method. I suggest modifying the relevant interface as follows:

```java
@FunctionalInterface
public interface ByteVectorScorer {
  float score(byte[] vector);
}

@FunctionalInterface
public interface FloatVectorScorer {
  float score(float[] vector);
}

public enum VectorSimilarityFunction {
  EUCLIDEAN {
    @Override
    public ByteVectorScorer getVectorScorer(byte[] target) {
      return vector -> 1 / (1f + squareDistance(target, vector));
    }

    @Override
    public FloatVectorScorer getVectorScorer(float[] target) {
      return vector -> 1 / (1 + squareDistance(target, vector));
    }
  },

  COSINE {
    @Override
    public ByteVectorScorer getVectorScorer(byte[] target) {
      int squareNorm = dotProduct(target, target);
      return vector -> (1 + cosine(target, vector, squareNorm)) / 2;
    }

    @Override
    public FloatVectorScorer getVectorScorer(float[] target) {
      double squareNorm = dotProduct(target, target);
      return vector -> Math.max((1 + cosine(target, vector, squareNorm)) / 2, 0);
    }
  };

  public abstract ByteVectorScorer getVectorScorer(byte[] target);

  public abstract FloatVectorScorer getVectorScorer(float[] target);
}
```

Any thoughts?
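A hypothetical usage of the proposed interface, to make the intent concrete; `getVectorScorer` does not exist in Lucene today, it is the API this issue proposes:

```java
// Sketch of the proposed call pattern: the square norm of `target` is
// computed once inside getVectorScorer, then reused for every candidate.
static float[] scoreAll(float[] target, float[][] candidates) {
  FloatVectorScorer scorer = VectorSimilarityFunction.COSINE.getVectorScorer(target);
  float[] scores = new float[candidates.length];
  for (int i = 0; i < candidates.length; i++) {
    scores[i] = scorer.score(candidates[i]); // no per-call norm recomputation
  }
  return scores;
}
```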
Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]
jpountz commented on code in PR #13124: URL: https://github.com/apache/lucene/pull/13124#discussion_r1525217090

## lucene/core/src/java/org/apache/lucene/index/SegmentMerger.java:

```
@@ -130,19 +135,31 @@ MergeState merge() throws IOException {
             IOContext.READ,
             segmentWriteState.segmentSuffix);

+    TaskExecutor taskExecutor = new TaskExecutor(mergeState.intraMergeTaskExecutor);
+    List<Callable<Void>> mergingTasks = new ArrayList<>();
+
     if (mergeState.mergeFieldInfos.hasNorms()) {
       mergeWithLogging(this::mergeNorms, segmentWriteState, segmentReadState, "norms", numMerged);
     }

     mergeWithLogging(this::mergeTerms, segmentWriteState, segmentReadState, "postings", numMerged);
```

Review Comment: Can we have a parallel task that handles norms + terms so that the order is respected?

## lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java:

```
@@ -902,12 +932,57 @@ private static String getSegmentName(MergePolicy.OneMerge merge) {
   }

   static {
-    TestSecrets.setConcurrentMergeSchedulerAccess(
-        new ConcurrentMergeSchedulerAccess() {
-          @Override
-          public void setSuppressExceptions(ConcurrentMergeScheduler cms) {
-            cms.setSuppressExceptions();
+    TestSecrets.setConcurrentMergeSchedulerAccess(ConcurrentMergeScheduler::setSuppressExceptions);
+  }
+
+  private class ScaledExecutor implements Executor {
+
+    private final AtomicInteger activeCount = new AtomicInteger(0);
+    private final ThreadPoolExecutor executor;
+
+    public ScaledExecutor() {
+      this.executor =
+          new ThreadPoolExecutor(0, 1024, 1L, TimeUnit.MINUTES, new SynchronousQueue<>());
+    }
+
+    void shutdown() {
+      executor.shutdown();
+    }
+
+    @Override
+    public void execute(Runnable command) {
+      assert mergeThreads.contains(Thread.currentThread()) : "caller is not a merge thread";
```

Review Comment: I'd expect this assertion to no longer be valid, since SegmentMerger may fork into this executor for vectors, and then the task for vectors may want to further fork into a separate thread? So the caller may be either a MergeThread, or a thread from the wrapper thread pool?

## lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java:

```
@@ -910,4 +936,55 @@ public void setSuppressExceptions(ConcurrentMergeScheduler cms) {
         }
       });
   }
+
+  private class ScaledExecutor implements Executor {
```

Review Comment: nit: should it be called `CachedExecutor` now?
Re: [PR] Add new VectorScorer interface to vector value iterators [lucene]
mccullocht commented on code in PR #13181: URL: https://github.com/apache/lucene/pull/13181#discussion_r1525162554

## lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java:

```
@@ -190,12 +190,13 @@ protected TopDocs exactSearch(LeafReaderContext context, DocIdSetIterator accept
     if (vectorScorer == null) {
       return NO_RESULTS;
     }
+    DocIdSetIterator vectorIterator = vectorScorer.iterator();
     HitQueue queue = new HitQueue(k, true);
     ScoreDoc topDoc = queue.top();
     int doc;
     while ((doc = acceptIterator.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
-      boolean advanced = vectorScorer.advanceExact(doc);
-      assert advanced;
+      vectorIterator.advance(doc);
+      assert vectorIterator.docID() == doc;
```

Review Comment: Skip the document if it doesn't match? I'm guessing it would be an easily predicted branch, and it's possible that the filter query could yield a document that doesn't have a vector.
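A minimal sketch of the skip-if-missing variant suggested here, assuming the `acceptIterator`/`vectorIterator` pair from the diff above; illustrative only, not the PR's code:

```java
// Sketch only: tolerate filter hits that lack a vector instead of asserting.
int doc;
while ((doc = acceptIterator.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
  if (vectorIterator.docID() < doc) {
    vectorIterator.advance(doc); // advance() lands on the first doc >= target
  }
  if (vectorIterator.docID() != doc) {
    continue; // no vector for this doc; likely an easily predicted branch
  }
  // ... score `doc` as before
}
```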
[PR] Make the HitQueue size more appropriate for KNN exact search [lucene]
bugmakerr opened a new pull request, #13184: URL: https://github.com/apache/lucene/pull/13184

### Description

Currently, when performing KNN exact search, we consistently set the HitQueue size to `k`. However, there may be instances where the number of candidates is actually lower than `k`.
Re: [PR] gh-13147: use dense bit-encoding for frequent terms [lucene]
msokolov commented on PR #13153: URL: https://github.com/apache/lucene/pull/13153#issuecomment-1997699021

It seems to especially make phrase and span queries slower? Possibly the decoding of positions is not good? I could try restricting to fields without positions (which seem likely to be the ones that would benefit anyway)

```
                           Task    QPS baseline    StdDev    QPS my_modified_version    StdDev    Pct diff    p-value
                     HighPhrase      4.62  (4.1%)      4.45  (3.9%)   -3.7% ( -11% -   4%)  0.004
                      MedPhrase     67.01  (1.9%)     65.37  (2.0%)   -2.4% (  -6% -   1%)  0.000
                         IntNRQ     15.92 (13.8%)     15.53 (13.2%)   -2.4% ( -25% -  28%)  0.569
               HighSloppyPhrase      3.60  (2.4%)      3.52  (2.8%)   -2.3% (  -7% -   2%)  0.005
                      LowPhrase      9.44  (2.5%)      9.23  (2.4%)   -2.2% (  -6% -   2%)  0.004
       AndHighHighDayTaxoFacets      2.92  (2.3%)      2.88  (3.5%)   -1.5% (  -7% -   4%)  0.108
                LowSloppyPhrase     26.88  (2.5%)     26.48  (2.5%)   -1.5% (  -6% -   3%)  0.065
                MedSloppyPhrase     18.10  (3.2%)     17.85  (3.5%)   -1.4% (  -7% -   5%)  0.191
          HighTermDayOfYearSort    156.73  (2.2%)    154.64  (2.8%)   -1.3% (  -6% -   3%)  0.096
                     TermDTSort     69.58  (2.6%)     68.83  (2.9%)   -1.1% (  -6% -   4%)  0.216
        AndHighMedDayTaxoFacets     15.06  (0.9%)     14.91  (1.6%)   -1.0% (  -3% -   1%)  0.015
                    MedSpanNear     31.29  (1.5%)     30.97  (1.1%)   -1.0% (  -3% -   1%)  0.017
           HighIntervalsOrdered      3.92  (4.2%)      3.88  (4.5%)   -1.0% (  -9% -   8%)  0.465
            LowIntervalsOrdered      1.42  (3.0%)      1.41  (3.0%)   -0.9% (  -6% -   5%)  0.335
                    LowSpanNear     14.23  (1.3%)     14.10  (0.8%)   -0.9% (  -2% -   1%)  0.009
                   OrNotHighMed    207.12  (5.3%)    205.50  (5.8%)   -0.8% ( -11% -  10%)  0.655
                  OrNotHighHigh    126.12  (5.2%)    125.14  (5.8%)   -0.8% ( -11% -  10%)  0.655
              HighTermTitleSort     79.33  (2.5%)     78.82  (2.0%)   -0.6% (  -5% -   4%)  0.376
           BrowseDateTaxoFacets      2.73  (5.7%)      2.71  (6.9%)   -0.6% ( -12% -  12%)  0.752
    BrowseRandomLabelTaxoFacets      2.11 (11.5%)      2.10 (11.9%)   -0.6% ( -21% -  25%)  0.868
            MedIntervalsOrdered      3.17  (3.9%)      3.15  (4.4%)   -0.5% (  -8% -   8%)  0.694
                   HighSpanNear      3.15  (3.1%)      3.14  (2.2%)   -0.5% (  -5% -   4%)  0.576
                      OrHighLow    247.95  (2.2%)    246.79  (1.9%)   -0.5% (  -4% -   3%)  0.467
                     AndHighLow    396.47  (2.2%)    394.68  (1.5%)   -0.4% (  -4% -   3%)  0.460
                         Fuzzy1     65.30  (1.5%)     65.01  (1.8%)   -0.4% (  -3% -   2%)  0.402
      BrowseDayOfYearTaxoFacets      2.77  (5.4%)      2.76  (6.2%)   -0.4% ( -11% -  11%)  0.817
                      OrHighMed    142.70  (2.1%)    142.13  (2.0%)   -0.4% (  -4% -   3%)  0.543
                   OrHighNotMed    279.04  (5.4%)    277.96  (6.2%)   -0.4% ( -11% -  11%)  0.832
          BrowseMonthSSDVFacets      2.38  (2.7%)      2.37  (2.9%)   -0.4% (  -5% -   5%)  0.663
                         Fuzzy2     50.49  (1.3%)     50.32  (1.4%)   -0.3% (  -3% -   2%)  0.420
                     OrHighHigh     21.69  (2.4%)     21.62  (3.6%)   -0.3% (  -6% -   5%)  0.746
                        Respell     32.27  (2.4%)     32.19  (2.2%)   -0.2% (  -4% -   4%)  0.761
          BrowseMonthTaxoFacets      2.87  (1.4%)      2.86  (1.2%)   -0.2% (  -2% -   2%)  0.654
                       PKLookup    142.74  (2.4%)    142.50  (1.9%)   -0.2% (  -4% -   4%)  0.809
                     AndHighMed     53.54  (2.3%)     53.45  (2.2%)   -0.2% (  -4%
```
Re: [PR] Change BP reordering logic to help support document blocks later on. [lucene]
jpountz commented on code in PR #13123: URL: https://github.com/apache/lucene/pull/13123#discussion_r1525040811

## lucene/misc/src/java/org/apache/lucene/misc/index/BPIndexReorderer.java:

```
@@ -341,116 +344,94 @@ protected void compute() {
    */
   private boolean shuffle(
       ForwardIndex forwardIndex,
-      IntsRef left,
-      IntsRef right,
+      IntsRef docIDs,
+      int midPoint,
       int[] leftDocFreqs,
       int[] rightDocFreqs,
-      float[] gains,
+      float[] biases,
       int iter)
       throws IOException {
-    assert left.ints == right.ints;
-    assert left.offset + left.length == right.offset;
-    // Computing gains is typically a bottleneck, because each iteration needs to iterate over all
-    // postings to recompute gains, and the total number of postings is usually one order of
+    // Computing biases is typically a bottleneck, because each iteration needs to iterate over
+    // all postings to recompute biases, and the total number of postings is usually one order of
     // magnitude or more larger than the number of docs. So we try to parallelize it.
-    ComputeGainsTask leftGainsTask =
-        new ComputeGainsTask(
-            left.ints,
-            gains,
-            left.offset,
-            left.offset + left.length,
+    new ComputeBiasTask(
+            docIDs.ints,
+            biases,
+            docIDs.offset,
+            docIDs.offset + docIDs.length,
             leftDocFreqs,
             rightDocFreqs,
             threadLocal,
-            depth);
-    ComputeGainsTask rightGainsTask =
-        new ComputeGainsTask(
-            right.ints,
-            gains,
-            right.offset,
-            right.offset + right.length,
-            rightDocFreqs,
-            leftDocFreqs,
-            threadLocal,
-            depth);
-    if (shouldFork(docIDs.length, docIDs.ints.length)) {
-      invokeAll(leftGainsTask, rightGainsTask);
-    } else {
-      leftGainsTask.compute();
-      rightGainsTask.compute();
+            depth)
+        .compute();
+
+    float maxLeftBias = Float.NEGATIVE_INFINITY;
+    for (int i = docIDs.offset; i < midPoint; ++i) {
+      maxLeftBias = Math.max(maxLeftBias, biases[i]);
+    }
+    float minRightBias = Float.POSITIVE_INFINITY;
+    for (int i = midPoint, end = docIDs.offset + docIDs.length; i < end; ++i) {
+      minRightBias = Math.min(minRightBias, biases[i]);
+    }
+    float gain = maxLeftBias - minRightBias;
+    // This uses the simulated annealing proposed by Mackenzie et al in "Tradeoff Options for
+    // Bipartite Graph Partitioning" by comparing the gain of swapping the doc from the left side
+    // that is most attracted to the right and the doc from the right side that is most attracted
+    // to the left against `iter` rather than zero.
+    if (gain <= iter) {
+      return false;
     }

-    class ByDescendingGainSorter extends IntroSorter {
+    new IntroSelector() {
       int pivotDoc;
-      float pivotGain;
+      float pivotBias;

       @Override
       protected void setPivot(int i) {
-        pivotDoc = left.ints[i];
-        pivotGain = gains[i];
+        pivotDoc = docIDs.ints[i];
+        pivotBias = biases[i];
       }

       @Override
       protected int comparePivot(int j) {
-        // Compare in reverse order to get a descending sort
-        int cmp = Float.compare(gains[j], pivotGain);
+        int cmp = Float.compare(pivotBias, biases[j]);
```

Review Comment: @rishabhmaurya Yes! Please feel free to open a PR and let's discuss there!
Re: [PR] gh-13147: use dense bit-encoding for frequent terms [lucene]
msokolov commented on PR #13153: URL: https://github.com/apache/lucene/pull/13153#issuecomment-1997659513

Right, that last change seemed promising! But on luceneutil tasks it didn't show much impact: basically a regression to the mean compared to the first revision I tested, possibly a slight negative trend overall. I'll paste the results here. I was considering creating a more targeted benchmark to exaggerate the impact, using a field and query type where we would expect to see a benefit, if only as a debugging tool, but I haven't had a chance to do that yet. Another idea I had was to try to help branch prediction by using a specialized BlockDecoder or so, rather than embedding if statements in every advance/nextDoc call. I'm also not sure whether writing full longs versus truncating the final long in the bitset and writing only as many bytes as needed (added in the recent revision) might be making some difference? Seems unlikely.
Re: [PR] Made DocIdsWriter use DISI when reading documents with an IntersectVisitor [lucene]
jpountz commented on PR #13149: URL: https://github.com/apache/lucene/pull/13149#issuecomment-1997568632

++ on progress over perfection.

That said, I wonder if this change is legal: `DocIdSetIterator` must return doc IDs in order, but it looks like that wouldn't always be the case with your change?
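The contract in question, restated as a checked invariant; a sketch assuming a consumer that validates ordering while draining an iterator:

```java
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;

// Sketch: DocIdSetIterator consumers may rely on strictly increasing doc IDs.
final class OrderCheck {
  static void drainInOrder(DocIdSetIterator it) throws IOException {
    int prev = -1;
    for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
      if (doc <= prev) {
        throw new IllegalStateException("doc IDs out of order: " + prev + " then " + doc);
      }
      prev = doc;
    }
  }
}
```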
Re: [PR] Reduce duplication in taxonomy facets; always do counts [lucene]
stefanvodita commented on PR #12966: URL: https://github.com/apache/lucene/pull/12966#issuecomment-1997515955

@gsmiller - I know you may not have time to review, but I want to at least notify you, since this is a big change and you've been very involved in this area of the code.
Re: [PR] Replace Collections.synchronizedSet() with ConcurrentHashMap.newKeySet() [lucene]
uschindler commented on PR #13142: URL: https://github.com/apache/lucene/pull/13142#issuecomment-1997212215

Hi, I am fine with applying the "safe" changes where iteration or an atomic add+remove is not required. But all others should be reverted, and external synchronization using synchronizedSet should be used. Let's only do this for simple sets where we never iterate and where we only check and add single entries from different threads.

Uwe
Re: [PR] Replace Collections.synchronizedSet() with ConcurrentHashMap.newKeySet() [lucene]
uschindler commented on code in PR #13142: URL: https://github.com/apache/lucene/pull/13142#discussion_r1524663419

## lucene/replicator/src/java/org/apache/lucene/replicator/nrt/PrimaryNode.java:

```
@@ -158,10 +158,7 @@ public long getPrimaryGen() {
    */
   public boolean flushAndRefresh() throws IOException {
     message("top: now flushAndRefresh");
-    Set<String> completedMergeFiles;
```

Review Comment: I agree, better to keep the old code.
Re: [PR] Replace Collections.synchronizedSet() with ConcurrentHashMap.newKeySet() [lucene]
uschindler commented on code in PR #13142: URL: https://github.com/apache/lucene/pull/13142#discussion_r1524661584

## lucene/core/src/java/org/apache/lucene/store/TrackingDirectoryWrapper.java:

```
@@ -61,10 +61,8 @@ public void copyFrom(Directory from, String src, String dest, IOContext context)
   @Override
   public void rename(String source, String dest) throws IOException {
     in.rename(source, dest);
-    synchronized (createdFileNames) {
-      createdFileNames.add(dest);
-      createdFileNames.remove(source);
-    }
+    createdFileNames.add(dest);
+    createdFileNames.remove(source);
```

Review Comment: I have the same problem. We need to avoid that. Unless there's an atomic add+remove call on a concurrent set, we need to synchronize.
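A sketch of the distinction being drawn in this thread, assuming a set of file names shared across threads: the synchronized wrapper lets callers make compound operations atomic by holding the set's monitor, while `ConcurrentHashMap.newKeySet()` only makes each individual call atomic. The file names here are made up:

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class AtomicityDemo {
  // With the synchronized wrapper, compound operations can be made atomic by
  // holding the set's monitor; no thread observes the intermediate state.
  static void renameSynchronized(Set<String> files, String source, String dest) {
    synchronized (files) {
      files.add(dest);
      files.remove(source);
    }
  }

  public static void main(String[] args) {
    Set<String> syncSet = Collections.synchronizedSet(new HashSet<>());
    renameSynchronized(syncSet, "old.si", "new.si");

    // With ConcurrentHashMap.newKeySet(), each call is atomic on its own, but
    // another thread may observe the set between the add and the remove.
    Set<String> concurrentSet = ConcurrentHashMap.newKeySet();
    concurrentSet.add("new.si");
    concurrentSet.remove("old.si");
  }
}
```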
Re: [PR] Replace Collections.synchronizedSet() with ConcurrentHashMap.newKeySet() [lucene]
benwtrent commented on code in PR #13142: URL: https://github.com/apache/lucene/pull/13142#discussion_r1524632573

## lucene/core/src/java/org/apache/lucene/store/TrackingDirectoryWrapper.java:

```
@@ -61,10 +61,8 @@ public void copyFrom(Directory from, String src, String dest, IOContext context)
   @Override
   public void rename(String source, String dest) throws IOException {
     in.rename(source, dest);
-    synchronized (createdFileNames) {
-      createdFileNames.add(dest);
-      createdFileNames.remove(source);
-    }
+    createdFileNames.add(dest);
+    createdFileNames.remove(source);
```

Review Comment: This means that another thread can mutate `createdFileNames` between `add` and `remove`. Are we sure this doesn't add any weird race conditions?

## lucene/replicator/src/java/org/apache/lucene/replicator/nrt/ReplicaNode.java:

```
@@ -625,19 +624,17 @@ public void run(CopyJob job) {
       for (String fileName : curNRTCopy.getFileNamesToCopy()) {
         assert lastCommitFiles.contains(fileName) == false
             : "fileName=" + fileName + " is in lastCommitFiles and is being copied?";
-        synchronized (mergeCopyJobs) {
-          for (CopyJob mergeJob : mergeCopyJobs) {
-            if (mergeJob.getFileNames().contains(fileName)) {
-              // TODO: we could maybe transferAndCancel here? except CopyJob can't transferAndCancel
-              // more than one currently
-              message(
-                  "top: now cancel merge copy job="
-                      + mergeJob
-                      + ": file "
-                      + fileName
-                      + " is now being copied via NRT point");
-              mergeJob.cancel("newNRTPoint is copying over the same file", null);
-            }
+        for (CopyJob mergeJob : mergeCopyJobs) {
+          if (mergeJob.getFileNames().contains(fileName)) {
```

Review Comment: Since this no longer locks on `mergeCopyJobs`, the collection can be mutated during the iteration. Are we sure this is OK?

## lucene/replicator/src/java/org/apache/lucene/replicator/nrt/ReplicaNode.java (same hunk as above):

Review Comment: `mergeCopyJobs` is protected, implying its use with subclasses. You can see this in `SimpleReplicaNode`, which takes action while `synchronized(mergeCopyJobs)`. A quick search on GitHub shows that other repositories take similar action. Are we sure this doesn't add a race condition to library users' code that subclasses `ReplicaNode` and synchronizes on `mergeCopyJobs`?
[PR] Fix TestIndexWriter.testDeleteUnusedFiles failure on Windows 11 [lucene]
vsop-479 opened a new pull request, #13183: URL: https://github.com/apache/lucene/pull/13183

Fix https://github.com/apache/lucene/issues/12524
Re: [PR] Replace Collections.synchronizedSet() with ConcurrentHashMap.newKeySet() [lucene]
dweiss commented on PR #13142: URL: https://github.com/apache/lucene/pull/13142#issuecomment-1996843277

I'm out of office this week. If anybody can pick this up, please do. Otherwise I'll return to it next week.
Re: [PR] Remove halt() call in TestSimpleServer (part of TestStressNRTReplication [lucene]
dweiss commented on PR #13177: URL: https://github.com/apache/lucene/pull/13177#issuecomment-1996840513

I think it's in the documentation of the `current()` method that the returned handle can't be used to stop your own process. Indeed, I tried it too. ;)
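For context, a small sketch of the JDK behavior being described; per the `ProcessHandle` javadoc, `destroy()` and `destroyForcibly()` throw `IllegalStateException` for the current process:

```java
// Sketch: the current process cannot destroy itself via ProcessHandle.
public class SelfDestroyDemo {
  public static void main(String[] args) {
    ProcessHandle self = ProcessHandle.current();
    try {
      self.destroy(); // throws: destroy of the current process is disallowed
    } catch (IllegalStateException expected) {
      System.err.println("as documented: " + expected.getMessage());
    }
  }
}
```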