Re: [PR] Removing thread sleep calls from TestIndexWriter.testThreadInterruptDeadlock and TestDirectoryReader.testStressTryIncRef [lucene]
iamsanjay commented on code in PR #13037: URL: https://github.com/apache/lucene/pull/13037#discussion_r1473877541

## lucene/core/src/test/org/apache/lucene/index/TestDirectoryReader.java:

@@ -1005,55 +1010,59 @@ public void testTryIncRef() throws IOException {
     dir.close();
   }

-  public void testStressTryIncRef() throws IOException, InterruptedException {
+  public void testStressTryIncRef() throws Exception {
     Directory dir = newDirectory();
     IndexWriter writer = new IndexWriter(dir, newIndexWriterConfig(new MockAnalyzer(random())));
-    writer.addDocument(new Document());
-    writer.commit();
-    DirectoryReader r = DirectoryReader.open(dir);
-    int numThreads = atLeast(2);
+    final ScheduledExecutorService closeReaderBeforeJoiningThreads =
+        Executors.newSingleThreadScheduledExecutor(new NamedThreadFactory("testStressTryIncRef"));
-    IncThread[] threads = new IncThread[numThreads];
-    for (int i = 0; i < threads.length; i++) {
-      threads[i] = new IncThread(r, random());
-      threads[i].start();
-    }
-    Thread.sleep(100);
-
-    assertTrue(r.tryIncRef());
-    r.decRef();
-    r.close();
-
-    for (IncThread thread : threads) {
-      thread.join();
-      assertNull(thread.failed);
-    }
-    assertFalse(r.tryIncRef());
-    writer.close();
-    dir.close();
-  }
-
-  static class IncThread extends Thread {
-    final IndexReader toInc;
-    final Random random;
-    Throwable failed;
-
-    IncThread(IndexReader toInc, Random random) {
-      this.toInc = toInc;
-      this.random = random;
-    }

Review Comment: I refactored the `testStressTryIncRef` test a bit further: removed the `IncThread` class and converted it into a `Runnable`, and with that removed the call to `random()`, which we no longer use. Also introduced a try-with-resources statement. I would love to get some feedback on it!

I observed that we usually use `Thread.sleep`, in many cases, where we want to delay the execution of some lines of code in the main thread. For instance, closing the IndexReader in the case above before stress-testing `tryIncRef()` and `decRef()` by running two threads simultaneously. To replace `Thread.sleep`, I inspected the code for which we want the delayed execution and then scheduled that block via `ScheduledExecutorService`. Having said that, I am not sure whether we want to pivot in that direction.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
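The pattern described above can be sketched as follows. This is a hypothetical, standalone illustration (not the actual Lucene test code): instead of `Thread.sleep(100)` in the main thread, the delayed block (here, flipping a flag that stands in for "close the reader") is scheduled on a single-threaded `ScheduledExecutorService`.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch: replace Thread.sleep in the main thread by scheduling the delayed
// block on a ScheduledExecutorService, so worker threads keep running.
public class DelayedCloseSketch {
  static boolean runOnce() throws Exception {
    AtomicBoolean closed = new AtomicBoolean(false);
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    // The block we previously delayed with Thread.sleep now runs on the scheduler.
    scheduler.schedule(() -> closed.set(true), 50, TimeUnit.MILLISECONDS);
    scheduler.shutdown(); // existing delayed tasks still execute by default
    scheduler.awaitTermination(5, TimeUnit.SECONDS);
    return closed.get();
  }

  public static void main(String[] args) throws Exception {
    System.out.println(runOnce());
  }
}
```

Note that `ScheduledThreadPoolExecutor` by default still executes already-scheduled delayed tasks after `shutdown()`, which is what makes awaiting termination a reliable join point here.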
Re: [PR] Move synonym map off-heap for SynonymGraphFilter [lucene]
msfroh commented on PR #13054: URL: https://github.com/apache/lucene/pull/13054#issuecomment-1920420714 I did some rough benchmarks using the large synonym file attached to https://issues.apache.org/jira/browse/LUCENE-3233 The benchmark code and input is at https://github.com/msfroh/lucene/commit/174f98a91e8709ee66dc2ed84b5c0b54e1a10635

|Attempt|On-heap load|Off-heap load|Off-heap reload|On-heap process|Off-heap process|Off-heap reload process|
|---|---|---|---|---|---|---|
|1|1146.022381|1117.004359|4.099065|569.120851|656.430684|613.475144|
|2|1079.578922|1060.926854|1.761465|456.203168|655.596275|622.534246|
|3|1035.911388|1076.611629|1.750233|579.41094|655.955431|614.788388|
|4|1037.825728|1085.513933|2.074129|696.390519|688.664985|613.266972|
|5|1017.489384|1008.209808|1.717748|485.510526|620.800148|620.708538|
|6|1014.641653|1024.412669|1.740371|483.617261|619.696259|619.910897|
|7|1027.691397|1045.129567|1.727786|670.49456|622.48759|616.303549|
|8|984.005971|1009.265777|1.736832|513.543926|615.448442|613.06279|
|9|1027.841112|1027.057453|1.732985|486.502644|622.535269|620.285635|
|10|981.689573|1074.613506|1.71059|707.810107|613.417977|624.34832|
|11|1026.165712|1065.3181|1.689407|479.610417|621.454353|616.183786|
|12|994.949905|1046.898091|1.730394|498.938696|612.279425|619.965444|
|13|1035.144288|1043.119169|1.739726|472.821155|619.267425|613.029508|
|14|996.056368|1017.663948|1.699742|692.135015|619.725163|620.454352|
|15|1046.605644|1018.287866|1.713526|470.391592|619.723699|612.068366|
|16|1007.579733|1042.062818|1.70251|508.481346|619.481298|619.178419|
|17|1038.166702|1054.039165|1.683814|485.439337|620.901934|616.017789|
|18|1000.900448|1058.492139|1.7267|515.185816|622.204031|627.560895|
|19|1236.416447|1080.877889|1.643654|434.73928|624.825435|625.622426|
|20|997.663619|1038.478411|1.657257|497.232157|623.337627|620.943519|
|**Mean**|1036.617319|1049.699158|1.8518967|535.1789657|628.7116725|618.4854492|
|**Stddev**|59.71799264|28.44516049|0.535792004|86.95026923|19.55324941|4.52695571|

So, it looks like the time to load synonyms is mostly unchanged (1050ms versus 1037ms), and loading "pre-compiled" synonyms is super-duper fast. We do seem to take a 17.5% hit on processing time. (629ms versus 535ms.) I might try profiling to see where that time is being spent. If it's doing FST operations, I'll assume it's a cost of doing business. If it's spent loading the also off-heap output words, I might consider moving those (optionally?) back on heap.
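As a quick sanity check on the percentage quoted above, the processing-time regression can be recomputed from the table's mean values (a standalone sketch, not part of the PR):

```java
// Recomputes the quoted ~17.5% processing-time hit from the mean values in
// the table above (535.18ms on-heap process vs 628.71ms off-heap process).
public class SlowdownCheck {
  static double percentChange(double base, double candidate) {
    return (candidate / base - 1.0) * 100.0;
  }

  public static void main(String[] args) {
    System.out.printf("%.1f%%%n", percentChange(535.1789657, 628.7116725));
  }
}
```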
Re: [I] Explore moving HNSW's NeighborQueue to a radix heap [LUCENE-10383] [lucene]
angadp commented on issue #11419: URL: https://github.com/apache/lucene/issues/11419#issuecomment-1920390020 I see what you were saying earlier now, Ben! So when we add the candidates we add them using the `outOfOrder` method, which is non-monotonic, and then try to find the least diverse candidate. Hence we don't use a priority queue for maintaining the candidates. The nearest neighbor tracking is monotonic. Beginner questions here: What was the decision behind adding these candidates `outOfOrder`? Can we not try to modify the method to do it `inOrder` and then try to find the least diverse candidate to remove?
Re: [I] Stack overflow fix for Java 1.8? [lucene]
AngledLuffa commented on issue #13064: URL: https://github.com/apache/lucene/issues/13064#issuecomment-1919930097 Yeah, it's probably about time to move on from 8. Will need to give a heads up to our userbase, though
Re: [I] Stack overflow fix for Java 1.8? [lucene]
AngledLuffa closed issue #13064: Stack overflow fix for Java 1.8? URL: https://github.com/apache/lucene/issues/13064
Re: [I] Stack overflow fix for Java 1.8? [lucene]
dweiss commented on issue #13064: URL: https://github.com/apache/lucene/issues/13064#issuecomment-1919922375 Your best bet would be to bump the minimum requirements to a more recent Java version and use a more recent Lucene release. There are benefits reaching beyond just that particular bug - new features, new codecs, faster searches. Lucene 8 does occasionally receive patches in the joint Lucene/Solr repository, here: https://github.com/apache/lucene-solr/commits/branch_8_11/ I don't think there'll be a lot of interest to backport larger things to that branch though.
[I] Stack overflow fix for Java 1.8? [lucene]
AngledLuffa opened a new issue, #13064: URL: https://github.com/apache/lucene/issues/13064 Hi, I am the primary maintainer of Stanford CoreNLP. We use the Lucene libraries for various things in our software. I found that there's a fix for a stack overflow error in recent versions of Lucene: https://github.com/apache/lucene/pull/12462 This was in response to the following issue a user raised on our software: https://github.com/stanfordnlp/CoreNLP/issues/1408 However, when I went to update the Lucene libraries we depend on, it turns out the latest releases target Java 11. I don't suppose there's a release of Lucene which targets 8? We have kept our software compatible with Java 8 for a while, although we're aware that eventually we'll want to move on to newer versions. Thanks in advance
Re: [PR] Fix knn vector visit limit fence post error [lucene]
benwtrent commented on PR #13058: URL: https://github.com/apache/lucene/pull/13058#issuecomment-1919842218 @jpountz good point. I can update the query limit to adjust instead of allowing the visit limit to increase above :).
Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]
mayya-sharipova commented on PR #12962: URL: https://github.com/apache/lucene/pull/12962#issuecomment-1919701631 I've re-run the tests with the latest changes on this PR (candidate) and the main branch (baseline). I have also done experiments using the Cohere dataset, as seen below:

- for the 10M docs dataset, the speedups with the proposed approach are 1.7-2.5x.
- for the 10M docs dataset, where k+fanout = 1000, QPS is close to the QPS of a single segment, while recall is better.

## Cohere/wikipedia-22-12-en-embeddings

- [Cohere/wikipedia-22-12-en-embeddings](https://huggingface.co/datasets/Cohere/wikipedia-22-12-en-embeddings) dataset
- 768 dims
- interval to synchronize with global queue: 255 visited docs

### 1M vectors

k=10, fanout=90

| |Avg visited nodes | QPS| Recall|
| :--- |---: | ---: |---: |
| Baseline Single segment | | |0.880|
| Baseline 8 segments concurrent | 13927| 815|0.974|
| Candidate2_with_queue | 12670| 859| 0.964|

k=100, fanout=900

| |Avg visited nodes | QPS| Recall|
| :--- |---: | ---: |---: |
| Baseline Single segment | | |0.964|
| Baseline 8 segments concurrent | 81824| 126|0.997|
| Candidate2_with_queue | 62085| 165|0.995|

### 10M vectors

k=10, fanout=90

| |Avg visited nodes | QPS| Recall|
| :--- |---: | ---: |---: |
| Baseline Single segment | | |0.929|
| Baseline 19 segments concurrent |37656| 271|0.951|
| Candidate2_with_queue | 21921| 443|0.927|

k=100, fanout=900

| |Avg visited nodes | QPS| Recall|
| :--- |---: | ---: |---: |
| Baseline Single segment | | |0.950|
| Baseline 19 segments concurrent |229945|44|0.990|
| Candidate2_with_queue | 101970| 91|0.984|
Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]
tveasey commented on PR #12962: URL: https://github.com/apache/lucene/pull/12962#issuecomment-1919650070 > @jimczi @tveasey I've addressed your comments. Are we ok to merge as it is now. I'm happy
Re: [PR] Fix knn vector visit limit fence post error [lucene]
jpountz commented on PR #13058: URL: https://github.com/apache/lucene/pull/13058#issuecomment-1919616921 I'm not sure about this one. Intuitively, if I configure a limit on the number of visited nodes, I may consider it a bug if we end up visiting more nodes than this limit. Maybe we should instead fix the queries to set a limit of `cost+1`? (plus logic to protect from overflows)
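A minimal sketch of what this suggestion could look like (illustrative names, not the actual Lucene API): the query passes `cost + 1` as the visit limit, saturating to guard against overflow.

```java
// Illustrative sketch of "set a limit of cost+1, plus logic to protect from
// overflows". Assumes cost is non-negative; saturates at Integer.MAX_VALUE.
public class VisitLimitSketch {
  static int saturatingLimit(long cost) {
    if (cost >= Integer.MAX_VALUE - 1L) {
      return Integer.MAX_VALUE; // cost + 1 would overflow an int
    }
    return (int) (cost + 1);
  }

  public static void main(String[] args) {
    System.out.println(saturatingLimit(5));                 // 6
    System.out.println(saturatingLimit(Integer.MAX_VALUE)); // 2147483647
  }
}
```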
Re: [PR] Optimize counts on two clause term disjunctions [lucene]
jpountz commented on code in PR #13036: URL: https://github.com/apache/lucene/pull/13036#discussion_r1473219731

## lucene/core/src/test/org/apache/lucene/search/TestBooleanQuery.java:

@@ -962,6 +962,46 @@ public void testDisjunctionMatchesCount() throws IOException {
     dir.close();
   }

+  public void testTwoClauseTermDisjunctionCountOptimization() throws Exception {
+    List<String[]> docContent =
+        Arrays.asList(
+            new String[] {"A", "B"},
+            new String[] {"A"},
+            new String[] {},
+            new String[] {"A", "B", "C"},
+            new String[] {"B"},
+            new String[] {"B", "C"});
+
+    try (Directory dir = newDirectory()) {
+      try (IndexWriter w =
+          new IndexWriter(dir, newIndexWriterConfig().setMergePolicy(newLogMergePolicy()))) {
+
+        for (String[] values : docContent) {
+          Document doc = new Document();
+          for (String value : values) {
+            doc.add(new StringField("foo", value, Field.Store.NO));
+          }
+          w.addDocument(doc);
+        }
+        w.forceMerge(1);
+      }
+
+      try (IndexReader reader = DirectoryReader.open(dir)) {
+        IndexSearcher searcher = newSearcher(reader);
+
+        Query query =
+            new BooleanQuery.Builder()
+                .add(new TermQuery(new Term("foo", "A")), BooleanClause.Occur.SHOULD)
+                .add(new TermQuery(new Term("foo", "B")), BooleanClause.Occur.SHOULD)
+                .setMinimumNumberShouldMatch(1)
+                .build();
+
+        int count = searcher.count(query);
+        assertEquals(5, count);

Review Comment: Can you also test corner cases, e.g. one clause has no matches, or both clauses have no matches? I'm not sure how to do it, but it would also be nice to double check we're indeed applying your optimization, and that the heuristic is not disabling the opto.

## lucene/core/src/java/org/apache/lucene/search/BooleanQuery.java:

@@ -179,6 +180,36 @@ boolean isPureDisjunction() {
     return clauses.size() == getClauses(Occur.SHOULD).size() && minimumNumberShouldMatch <= 1;
   }

+  /** Whether this query is a two clause disjunction with two term query clauses. */
+  boolean isTwoClausePureDisjunctionWithTerms() {
+    return clauses.size() == 2
+        && isPureDisjunction()
+        && clauses.get(0).getQuery() instanceof TermQuery
+        && clauses.get(1).getQuery() instanceof TermQuery;
+  }
+
+  /**
+   * Rewrite a single two clause disjunction query with terms to two term queries and a conjunction
+   * query using the inclusion-exclusion principle.
+   */
+  Query[] rewriteTwoClauseDisjunctionWithTermsForCount(IndexSearcher indexSearcher)
+      throws IOException {
+    BooleanQuery.Builder newQuery = new BooleanQuery.Builder();
+    Query[] queries = new Query[3];
+    for (int i = 0; i < clauses.size(); i++) {
+      TermQuery termQuery = (TermQuery) clauses.get(i).getQuery();
+      if (termQuery.getTermStates() == null) {
+        termQuery =
+            new TermQuery(
+                termQuery.getTerm(), TermStates.build(indexSearcher, termQuery.getTerm(), true));

Review Comment: We don't need term statistics here, just a cache for term states.
```suggestion
                termQuery.getTerm(), TermStates.build(indexSearcher, termQuery.getTerm(), false));
```
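The identity behind the rewrite discussed above can be checked independently of Lucene: the count of a two-term disjunction equals the sum of the per-term counts minus the count of the conjunction (inclusion-exclusion). A standalone sketch, using the same six documents as the test:

```java
import java.util.List;
import java.util.Set;

// count(A OR B) = count(A) + count(B) - count(A AND B): the identity the
// count optimization relies on, checked against the test's six documents.
public class InclusionExclusion {
  static long countDisjunction(List<Set<String>> docs, String a, String b) {
    long countA = docs.stream().filter(d -> d.contains(a)).count();
    long countB = docs.stream().filter(d -> d.contains(b)).count();
    long countBoth = docs.stream().filter(d -> d.contains(a) && d.contains(b)).count();
    return countA + countB - countBoth; // inclusion-exclusion
  }

  public static void main(String[] args) {
    List<Set<String>> docs =
        List.of(Set.of("A", "B"), Set.of("A"), Set.of(),
                Set.of("A", "B", "C"), Set.of("B"), Set.of("B", "C"));
    // 3 docs contain A, 4 contain B, 2 contain both: 3 + 4 - 2 = 5.
    System.out.println(countDisjunction(docs, "A", "B")); // 5
  }
}
```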
Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]
benwtrent commented on PR #12962: URL: https://github.com/apache/lucene/pull/12962#issuecomment-1919489643 > but I think we need to run more experiments on smaller dims datasets as well, how about we leave this for the follow up? I am 100% fine with this. It was a crazy idea and it only gives us an incremental change. This PR gives a much better improvement as it is already.
Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]
mayya-sharipova commented on PR #12962: URL: https://github.com/apache/lucene/pull/12962#issuecomment-1919480036 @benwtrent Thanks for running additional tests. Looks like running with dynamic `k` can speed up searches, but I think we need to run more experiments on smaller dims datasets as well, so I think we should leave this for the follow-up. @jimczi @tveasey I've addressed your comments. Are we ok to merge as it is now?

---

The following is left for the follow-up work:

1. Dynamic k based on segment size. Currently we gather at least k documents for every segment before starting to synchronize with the global queue. We need to adjust k based on the segment size.
2. Improve the synchronization schedule with the global queue. Currently we do synchronization with the global queue every 255 visited docs. Can we improve the schedule as per [comment](https://github.com/apache/lucene/pull/12962#discussion_r1453323532)?
Re: [PR] Fail the test if waiters are blocked for more than 100 seconds [lucene]
sabi0 closed pull request #13063: Fail the test if waiters are blocked for more than 100 seconds URL: https://github.com/apache/lucene/pull/13063
Re: [I] TestDocumentsWriterStallControl.assertState() does not do what it appears it would [lucene]
sabi0 commented on issue #13061: URL: https://github.com/apache/lucene/issues/13061#issuecomment-1919391573 You were faster :-)
Re: [I] TestDocumentsWriterStallControl.assertState() does not do what it appears it would [lucene]
s1monw commented on issue #13061: URL: https://github.com/apache/lucene/issues/13061#issuecomment-1919370836 I opened a pr for this.. thanks
[PR] Fix broken loop in TestDocumentsWriterStallControl.assertState() [lucene]
s1monw opened a new pull request, #13062: URL: https://github.com/apache/lucene/pull/13062 The loop in assertState exits prematurely due to a broken break statement. Closes #13061
Re: [I] SegmentDocValuesProducer checkIntegrity might open a dropped segment [lucene]
s1monw commented on issue #13020: URL: https://github.com/apache/lucene/issues/13020#issuecomment-1919352603 @soren-xzk can you reproduce this on a sane file system and can you provide a test-case that reproduces the issue? Sounds like file locking issue on hdfs? A test-case for this would be great. Also can you provide more information about what HDFS filesystem impl you are using?
Re: [I] TestDocumentsWriterStallControl.assertState() does not do what it appears it would [lucene]
sabi0 commented on issue #13061: URL: https://github.com/apache/lucene/issues/13061#issuecomment-1919335755 I think the idea was to wait for the blocked threads for up to 2 minutes, and then fail the test if there are still blocked threads. The progressive sleep time build-up is likely intended to avoid the test taking too long if the threads are unblocked quickly. Currently, however, it just sleeps for 100 ms and then exits. The fix is then to remove the `break` on line 190.
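The intended behavior described above might look like the following standalone sketch (hedged, not the actual assertState code): retry with progressively longer sleeps until nothing is blocked or a deadline passes, with no premature break.

```java
import java.util.function.IntSupplier;

// Sketch of a wait loop that keeps retrying (with progressive back-off)
// until no threads are blocked or the timeout expires, instead of exiting
// after the first sleep.
public class WaitLoopSketch {
  static boolean waitForUnblocked(IntSupplier blockedCount, long timeoutMs)
      throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    long sleepMs = 10;
    while (System.currentTimeMillis() < deadline) {
      if (blockedCount.getAsInt() == 0) {
        return true; // nothing blocked anymore
      }
      Thread.sleep(sleepMs);
      sleepMs = Math.min(sleepMs * 2, 1000); // progressive sleep-time build-up
    }
    return blockedCount.getAsInt() == 0; // fail if still blocked at the deadline
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println(waitForUnblocked(() -> 0, 1000)); // unblocked immediately
    System.out.println(waitForUnblocked(() -> 1, 50));   // still blocked at timeout
  }
}
```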
Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]
mayya-sharipova commented on code in PR #12962: URL: https://github.com/apache/lucene/pull/12962#discussion_r1473009083

## lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java:

@@ -79,24 +82,30 @@ public Query rewrite(IndexSearcher indexSearcher) throws IOException {
       filterWeight = null;
     }

+    BlockingFloatHeap globalScoreQueue = new BlockingFloatHeap(k);

Review Comment: Addressed in the last commits
Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]
mayya-sharipova commented on code in PR #12962: URL: https://github.com/apache/lucene/pull/12962#discussion_r1473008106

## lucene/core/src/java/org/apache/lucene/index/LeafReader.java:

@@ -280,12 +289,20 @@ public final TopDocs searchNearestVectors(
    * @param k the number of docs to return
    * @param acceptDocs {@link Bits} that represents the allowed documents to match, or {@code null}
    *     if they are all allowed to match.
-   * @param visitedLimit the maximum number of nodes that the search is allowed to visit
+   * @param visitedLimit the maximum number of nodes that the search is allowed to visit
+   * @param globalScoreQueue the global score queue used to track the top scores collected across all
+   *     leaves
    * @return the k nearest neighbor documents, along with their (searchStrategy-specific) scores.
    * @lucene.experimental
    */
   public final TopDocs searchNearestVectors(
-      String field, byte[] target, int k, Bits acceptDocs, int visitedLimit) throws IOException {
+      String field,
+      byte[] target,
+      int k,
+      Bits acceptDocs,
+      int visitedLimit,
+      BlockingFloatHeap globalScoreQueue)
+      throws IOException {

Review Comment: Addressed in the last commit

## lucene/core/src/java/org/apache/lucene/index/LeafReader.java:

@@ -280,12 +289,20 @@ public final TopDocs searchNearestVectors(
    * @param k the number of docs to return
    * @param acceptDocs {@link Bits} that represents the allowed documents to match, or {@code null}
    *     if they are all allowed to match.
-   * @param visitedLimit the maximum number of nodes that the search is allowed to visit
+   * @param visitedLimit the maximum number of nodes that the search is allowed to visit
+   * @param globalScoreQueue the global score queue used to track the top scores collected across all
+   *     leaves
    * @return the k nearest neighbor documents, along with their (searchStrategy-specific) scores.
    * @lucene.experimental
    */
   public final TopDocs searchNearestVectors(
-      String field, byte[] target, int k, Bits acceptDocs, int visitedLimit) throws IOException {
+      String field,
+      byte[] target,
+      int k,
+      Bits acceptDocs,
+      int visitedLimit,
+      BlockingFloatHeap globalScoreQueue)
+      throws IOException {

Review Comment: Addressed in the last commits
Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]
benwtrent commented on PR #12962: URL: https://github.com/apache/lucene/pull/12962#issuecomment-1919297975 I fixed my data and ran with 1.5M cohere:

static_k is this PR

dynamic_k is this PR + scaling the `k` explored by

```
float v = (float) Math.log(sumVectorCount / (double) leafSize);
float filterWeightValue = 1 / v;
```

Scaling the `k` is another nice incremental change. Up and to the left is best. We get better recall while visiting fewer nodes, consistently.

```
plt.plot([2304, 2485, 3189, 4028, 5610, 16165], [0.875, 0.886, 0.916, 0.938, 0.960, 0.992], marker='o', label='baseline_single')
plt.plot([43015, 45946, 56864, 69149, 90772, 210727], [0.980, 0.982, 0.989, 0.992, 0.996, 1.000], marker='o', label='baseline_multi')
plt.plot([22706, 23749, 27142, 30893, 37162, 76032], [0.959, 0.962, 0.970, 0.976, 0.983, 0.996], marker='o', label='candidate_static_k')
plt.plot([16099, 17183, 20788, 24809, 31334, 64921], [0.937, 0.945, 0.962, 0.973, 0.983, 0.996], marker='o', label='candidate_dynamic_k')
```

![image](https://github.com/apache/lucene/assets/4357155/acb38940-717b-4a7a-a9c7-1c6564df97fa)
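Read literally, the quoted snippet computes a weight 1/ln(totalVectors/segmentSize); how that weight is applied to `k` isn't shown in the comment, so the application below is an assumption added for illustration, not the PR's actual code.

```java
// Hedged sketch of the "dynamic k" idea: compute
// filterWeightValue = 1 / ln(sumVectorCount / leafSize) as in the quoted
// lines, then (assumption) use it as a multiplier on k, so small segments
// explore proportionally fewer candidates.
public class DynamicKSketch {
  static int scaledK(int k, long sumVectorCount, long leafSize) {
    float v = (float) Math.log(sumVectorCount / (double) leafSize);
    if (v <= 1f) {
      return k; // segment holds a large share of the index: keep the full k
    }
    float filterWeightValue = 1 / v;
    return Math.max(1, Math.round(k * filterWeightValue));
  }

  public static void main(String[] args) {
    System.out.println(scaledK(100, 1_500_000, 1_500_000)); // single segment: full k
    System.out.println(scaledK(100, 1_500_000, 15_000));    // small segment: reduced k
  }
}
```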
Re: [I] TestDocumentsWriterStallControl.assertState() does not do what it appears it would [lucene]
s1monw commented on issue #13061: URL: https://github.com/apache/lucene/issues/13061#issuecomment-1919290712

oh I wish I knew what I was thinking back then 12 years ago 🤣 I think that code tries to break out of only the for loop, which I guess needs a label to go back to the while loop. Do you wanna take a stab at fixing it? I think we need to make sure that the for loop runs through at some point and then step out of the outer while loop. Tricky... @mikemccand do you recall exactly what we tried to test back then?
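A minimal, hypothetical illustration of the labeled-loop pattern being discussed (this is not the actual assertState code, just the general Java mechanism): a labeled `continue` abandons the inner for loop but resumes the outer while loop, whereas a plain `break` would only leave the for loop.

```java
// Sketch: `continue outer` jumps back to the while condition from inside the
// for loop; the plain `break` at the end exits the while loop once the for
// loop runs through without interruption.
public class LabeledLoopSketch {
    public static void main(String[] args) {
        int attempts = 0;
        outer:
        while (attempts < 3) {
            attempts++;
            for (int i = 0; i < 10; i++) {
                if (i == 2) {
                    continue outer; // back to the while condition, not just out of the for loop
                }
            }
            break; // for loop completed: step out of the outer while loop
        }
        System.out.println(attempts);
    }
}
```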
Re: [PR] LUCENE-10366: Override #readVInt and #readVLong for ByteBufferDataInput to avoid the abstraction confusion of #readByte. [lucene]
jpountz commented on PR #592: URL: https://github.com/apache/lucene/pull/592#issuecomment-1919261691

I'm pushing an annotation, this triggered a speedup in PKLookup: http://people.apache.org/~mikemccand/lucenebench/PKLookup.html.
Re: [PR] Modernize BWC testing with parameterized tests [lucene]
s1monw commented on PR #13046: URL: https://github.com/apache/lucene/pull/13046#issuecomment-1919215269

thanks everybody... I will go and backport this to 9.x as well
Re: [PR] Modernize BWC testing with parameterized tests [lucene]
s1monw merged PR #13046: URL: https://github.com/apache/lucene/pull/13046
Re: [I] HnswGraph creates disconnected components [lucene]
benwtrent commented on issue #12627: URL: https://github.com/apache/lucene/issues/12627#issuecomment-1919070837

OK, I did some more digging, and it seems like my data is garbage, or at least how I am reading it in. I looked at one of these extremely disconnected graphs and found that all the float32 vectors were exactly the same. So, this appears to be a bug in my code.
Re: [I] Occasional OOMEs when running the test suite [lucene]
dweiss commented on issue #12949: URL: https://github.com/apache/lucene/issues/12949#issuecomment-1918837748

This looks like the garbage collector (gradle's JVM) got close to the heap limit and it suffocated trying to release memory while other threads kept allocating it (GC overhead limit exceeded). There's a lot going on in gradle's JVM, especially with parallel builds... and some of it is system-dependent (task parallelism, gc speed). I suggest we can keep this open and see how frequently this happens. Could be a gradle bug (memory leak somewhere), could be the heap that's too small for the number of tasks running in parallel. Hard to tell.
Re: [PR] Clean up AnyQueryNode code [lucene]
dweiss merged PR #13053: URL: https://github.com/apache/lucene/pull/13053