[GitHub] [lucene] mocobeta commented on a diff in pull request #917: LUCENE-10531: Disable distribution test (gui test) on windows.
mocobeta commented on code in PR #917: URL: https://github.com/apache/lucene/pull/917#discussion_r878647487

## .github/workflows/distribution.yml:

```
@@ -19,8 +19,9 @@ jobs:
     runs-on: ${{ matrix.os }}
     strategy:
       matrix:
-        # we run the distribution tests on all major OSs.
-        os: [ubuntu-latest, macos-latest, windows-latest]
+        # we want to run the distribution tests on all major OSs, but it's occasionally too slow (or hangs or the forked process is started at all..., not sure the cause) on windows.
```

Review Comment:
```suggestion
        # we want to run the distribution tests on all major OSs, but it's occasionally too slow (or hangs or the forked process is not started at all..., not sure the cause) on windows.
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mocobeta commented on a diff in pull request #917: LUCENE-10531: Disable distribution test (gui test) on windows.
mocobeta commented on code in PR #917: URL: https://github.com/apache/lucene/pull/917#discussion_r878647263

## .github/workflows/distribution.yml:

```
@@ -19,8 +19,9 @@ jobs:
     runs-on: ${{ matrix.os }}
     strategy:
       matrix:
-        # we run the distribution tests on all major OSs.
-        os: [ubuntu-latest, macos-latest, windows-latest]
+        # we want to run the distribution tests on all major OSs, but it's occasionally too slow on windows.
```

Review Comment:
```suggestion
        # we want to run the distribution tests on all major OSs, but it's occasionally too slow (or hangs or the forked process is started at all..., not sure the cause) on windows.
```
[GitHub] [lucene] mocobeta commented on pull request #913: Lucene 10577
mocobeta commented on PR #913: URL: https://github.com/apache/lucene/pull/913#issuecomment-1133536049

> Looks like some other issue blocking the forked process.

Yes I think it could completely hang for some reason.
[GitHub] [lucene] dweiss commented on pull request #913: Lucene 10577
dweiss commented on PR #913: URL: https://github.com/apache/lucene/pull/913#issuecomment-1133533455

> I'm planning to disable the test on windows VM - it looks too slow in GitHub Actions.

I'm not sure it's the slowness anymore. Looks like some other issue blocking the forked process.
[GitHub] [lucene] mocobeta commented on pull request #913: Lucene 10577
mocobeta commented on PR #913: URL: https://github.com/apache/lucene/pull/913#issuecomment-1133532803

Also, it should run on headless JVM; we are running it on CI servers without displays. https://github.com/apache/lucene/blob/71a9acb2e2aa55257021eefce1e5d8d390bc7048/lucene/luke/src/java/org/apache/lucene/luke/app/desktop/LukeMain.java#L75-L78
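The headless check linked above (LukeMain.java#L75-L78) is, in essence, a guard on `java.awt.GraphicsEnvironment.isHeadless()`. A minimal, self-contained sketch of that kind of guard follows; this is an illustration only, not the actual Luke code, and the class/method names are hypothetical:

```java
import java.awt.GraphicsEnvironment;

public class HeadlessGuard {
  // Returns true when a GUI can be started. The real LukeMain performs a
  // similar check and reports an error instead of returning a boolean.
  static boolean guiAvailable() {
    return !GraphicsEnvironment.isHeadless();
  }

  public static void main(String[] args) {
    if (!guiAvailable()) {
      System.out.println("Headless JVM detected; cannot launch the GUI.");
    } else {
      System.out.println("Display available.");
    }
  }
}
```

On a CI server without a display (or with `-Djava.awt.headless=true`), `guiAvailable()` returns false, which is why the GUI test cannot run there without extra setup such as a virtual framebuffer.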
[GitHub] [lucene] mocobeta commented on pull request #893: LUCENE-10531: Add @RequiresGUI test group for GUI tests
mocobeta commented on PR #893: URL: https://github.com/apache/lucene/pull/893#issuecomment-1133530902

We increased the timeout to 120 seconds, but the test still occasionally fails on Windows. I opened #917 to disable it on windows for now.
[GitHub] [lucene] mocobeta opened a new pull request, #917: LUCENE-10531: Disable distribution test (gui test) on windows.
mocobeta opened a new pull request, #917: URL: https://github.com/apache/lucene/pull/917

The test is occasionally too slow and fails on the Windows VM. We can't increase the timeout indefinitely, so I would disable it on GitHub Actions for now (we still run it on the Jenkins servers).
[GitHub] [lucene] mocobeta commented on pull request #913: Lucene 10577
mocobeta commented on PR #913: URL: https://github.com/apache/lucene/pull/913#issuecomment-1133529000

> hm this "luke-can-be-launched" test failed for me locally too. I guess it needs a JVM with a head?

It's not a part of the mandatory test suite but you can run it headlessly (if you are on Linux). Please see: https://github.com/apache/lucene/blob/71a9acb2e2aa55257021eefce1e5d8d390bc7048/help/tests.txt#L131

As for the occasional test failure of the GUI test for Luke, I'm planning to disable the test on windows VM - it looks too slow in GitHub Actions.
[GitHub] [lucene] rmuir commented on a diff in pull request #913: Lucene 10577
rmuir commented on code in PR #913: URL: https://github.com/apache/lucene/pull/913#discussion_r878641687

## lucene/core/src/java/org/apache/lucene/util/VectorUtil.java:

```
@@ -213,4 +213,21 @@ public static void add(float[] u, float[] v) {
       u[i] += v[i];
     }
   }
+
+  public static float dotProduct(BytesRef a, int aOffset, BytesRef b, int bOffset, int len) {
+    // fixme -- move to codec? What if later we want to access the bytes some other way?
+    int total = 0;
+    for (int i = 0; i < len; i++) {
+      total += a.bytes[aOffset++] * b.bytes[bOffset++];
```

Review Comment: i wonder if we can coerce autovectorization of this loop somehow too. it is also worth looking into as it wouldn't require any vector api at all, maybe just manipulating the code. it is integer math: doesn't have the restrictions that floating point does (order of operations etc), so theoretically the compiler could do it.
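For reference, the loop under discussion is a plain scalar byte dot product. A standalone version (using hypothetical plain-array arguments in place of `BytesRef`, purely for illustration) shows why the comment about integer math matters: the accumulation is associative, so the JIT is free to reorder and widen it, which is exactly what autovectorization needs and what floating-point addition forbids:

```java
public class ByteDotProduct {
  // Scalar byte dot product over plain arrays. The bytes are sign-extended
  // to int before multiplying; integer accumulation can be legally
  // reordered by the compiler, unlike float accumulation.
  static int dotProduct(byte[] a, int aOffset, byte[] b, int bOffset, int len) {
    int total = 0;
    for (int i = 0; i < len; i++) {
      total += a[aOffset + i] * b[bOffset + i];
    }
    return total;
  }

  public static void main(String[] args) {
    byte[] a = {1, 2, 3};
    byte[] b = {4, -5, 6};
    // 1*4 + 2*(-5) + 3*6 = 12
    System.out.println(dotProduct(a, 0, b, 0, 3));
  }
}
```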
[GitHub] [lucene] mocobeta opened a new pull request, #916: Refine contribution guide and pull request template
mocobeta opened a new pull request, #916: URL: https://github.com/apache/lucene/pull/916

Some parts of the pull request template have become a bit obsolete, and there is overlap between the contribution guide and the template. I noticed that the majority of developers (including me...) delete all text in the template and start with a blank sheet (90% of PRs, I roughly estimate) - so how about just encouraging people to refer to the contribution guide in the template? To avoid losing information, I moved checklist items that did not appear in the contribution guide into it.
[jira] [Commented] (LUCENE-10312) Add PersianStemmer
[ https://issues.apache.org/jira/browse/LUCENE-10312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17540341#comment-17540341 ]

ASF subversion and git services commented on LUCENE-10312:
----------------------------------------------------------

Commit 71a9acb2e2aa55257021eefce1e5d8d390bc7048 in lucene's branch refs/heads/main from Tomoko Uchida
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=71a9acb2e2a ]

LUCENE-10312: MIGRATE entry and small follow-ups (#908)

> Add PersianStemmer
> ------------------
>
>                 Key: LUCENE-10312
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10312
>             Project: Lucene - Core
>          Issue Type: Wish
>          Components: modules/analysis
>    Affects Versions: 9.0
>            Reporter: Ramin Alirezaee
>            Priority: Major
>             Fix For: 10.0 (main), 9.2
>
>         Attachments: image.png
>
>          Time Spent: 10h 20m
>  Remaining Estimate: 0h

-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mocobeta merged pull request #908: LUCENE-10312: small follow-ups
mocobeta merged PR #908: URL: https://github.com/apache/lucene/pull/908
[GitHub] [lucene] mocobeta commented on a diff in pull request #912: Initial rewrite of MMapDirectory for JDK-19 preview Panama APIs (>= JDK-19-ea+23)
mocobeta commented on code in PR #912: URL: https://github.com/apache/lucene/pull/912#discussion_r878635805

## lucene/core/src/java/org/apache/lucene/store/MMapDirectory.java:

```
@@ -232,55 +224,62 @@ public IndexInput openInput(String name, IOContext context) throws IOException {
     ensureOpen();
     ensureCanRead(name);
     Path path = directory.resolve(name);
-    try (FileChannel c = FileChannel.open(path, StandardOpenOption.READ)) {
-      final String resourceDescription = "MMapIndexInput(path=\"" + path.toString() + "\")";
-      final boolean useUnmap = getUseUnmap();
-      return ByteBufferIndexInput.newInstance(
-          resourceDescription,
-          map(resourceDescription, c, 0, c.size()),
-          c.size(),
-          chunkSizePower,
-          new ByteBufferGuard(resourceDescription, useUnmap ? CLEANER : null));
+    final String resourceDescription = "MemorySegmentIndexInput(path=\"" + path.toString() + "\")";
+
+    // Work around for JDK-8259028: we need to unwrap our test-only file system layers
+    path = Unwrappable.unwrapAll(path);
+
+    boolean success = false;
+    final MemorySession session = MemorySession.openShared();
+    try (var fc = FileChannel.open(path)) {
+      final long fileSize = fc.size();
+      final MemorySegment[] segments = map(session, resourceDescription, fc, fileSize);
+      final IndexInput in =
+          MemorySegmentIndexInput.newInstance(
+              resourceDescription, session, segments, fileSize, chunkSizePower);
+      success = true;
+      return in;
+    } finally {
+      if (success == false) {
+        session.close();
+      }
+    }
   }

-  /** Maps a file into a set of buffers */
-  final ByteBuffer[] map(String resourceDescription, FileChannel fc, long offset, long length)
+  /** Maps a file into a set of segments */
+  final MemorySegment[] map(
+      MemorySession session, String resourceDescription, FileChannel fc, long length)
       throws IOException {
     if ((length >>> chunkSizePower) >= Integer.MAX_VALUE)
-      throw new IllegalArgumentException(
-          "RandomAccessFile too big for chunk size: " + resourceDescription);
+      throw new IllegalArgumentException("File too big for chunk size: " + resourceDescription);

     final long chunkSize = 1L << chunkSizePower;

-    // we always allocate one more buffer, the last one may be a 0 byte one
-    final int nrBuffers = (int) (length >>> chunkSizePower) + 1;
+    // we always allocate one more segments, the last one may be a 0 byte one
+    final int nrSegments = (int) (length >>> chunkSizePower) + 1;

-    ByteBuffer[] buffers = new ByteBuffer[nrBuffers];
+    final MemorySegment[] segments = new MemorySegment[nrSegments];

-    long bufferStart = 0L;
-    for (int bufNr = 0; bufNr < nrBuffers; bufNr++) {
-      int bufSize =
-          (int) ((length > (bufferStart + chunkSize)) ? chunkSize : (length - bufferStart));
-      MappedByteBuffer buffer;
+    long startOffset = 0L;
+    for (int segNr = 0; segNr < nrSegments; segNr++) {
+      long segSize = (length > (startOffset + chunkSize)) ? chunkSize : (length - startOffset);
```

Review Comment: minor: can be final? also, thanks for changing the type to `long` - I was wondering why the old `bufSize` should be int.
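The segment-sizing arithmetic discussed in this review (one extra, possibly zero-length, trailing segment; each segment capped at `1L << chunkSizePower`) can be checked in isolation. The sketch below mirrors the variable names in the patch, but the helper itself is hypothetical and not part of MMapDirectory:

```java
public class ChunkSizing {
  // How many segments a file of `length` bytes is split into, and the size
  // of each, given a chunk size of 1L << chunkSizePower bytes.
  static long[] segmentSizes(long length, int chunkSizePower) {
    final long chunkSize = 1L << chunkSizePower;
    // one more segment than the number of full chunks; the last may be 0 bytes
    final int nrSegments = (int) (length >>> chunkSizePower) + 1;
    final long[] sizes = new long[nrSegments];
    long startOffset = 0L;
    for (int segNr = 0; segNr < nrSegments; segNr++) {
      final long segSize =
          (length > (startOffset + chunkSize)) ? chunkSize : (length - startOffset);
      sizes[segNr] = segSize;
      startOffset += segSize;
    }
    return sizes;
  }

  public static void main(String[] args) {
    // 10 bytes with chunkSizePower = 3 (chunkSize = 8): one full chunk + a 2-byte tail
    for (long s : segmentSizes(10, 3)) {
      System.out.println(s);
    }
  }
}
```

Note that a length that is an exact multiple of the chunk size (e.g. 8 bytes with chunkSizePower = 3) yields a final segment of 0 bytes, which is what the "the last one may be a 0 byte one" comment in the diff refers to.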
[GitHub] [lucene] gsmiller commented on pull request #915: LUCENE-10585: Scrub copy/paste code in the facets module and attempt to simplify a bit
gsmiller commented on PR #915: URL: https://github.com/apache/lucene/pull/915#issuecomment-1133483198

Ran benchmarks to make sure this didn't regress performance in any meaningful way. I'm not seeing any regressions:

```
Task                         QPS baseline  StdDev   QPS candidate  StdDev   Pct diff              p-value
Wildcard                     337.70 (16.7%)   323.50 (16.7%)   -4.2% ( -32% - 35%)   0.426
BrowseMonthTaxoFacets        29.43 (13.8%)    28.72 (19.1%)    -2.4% ( -31% - 35%)   0.649
BrowseDateSSDVFacets         2.68 (17.5%)     2.63 (17.0%)     -1.8% ( -30% - 39%)   0.743
BrowseDayOfYearSSDVFacets    14.94 (16.5%)    14.68 (13.6%)    -1.8% ( -27% - 33%)   0.709
HighTermMonthSort            147.90 (22.4%)   145.50 (19.4%)   -1.6% ( -35% - 51%)   0.807
HighTermDayOfYearSort        61.68 (14.9%)    60.80 (15.3%)    -1.4% ( -27% - 33%)   0.766
OrHighNotHigh                1107.68 (4.1%)   1092.99 (2.8%)   -1.3% ( -7% - 5%)     0.232
LowTerm                      2181.38 (3.5%)   2153.14 (3.1%)   -1.3% ( -7% - 5%)     0.213
OrHighNotMed                 1420.79 (2.7%)   1404.11 (3.1%)   -1.2% ( -6% - 4%)     0.203
TermDTSort                   117.08 (15.5%)   115.77 (15.3%)   -1.1% ( -27% - 35%)   0.818
BrowseRandomLabelTaxoFacets  18.20 (10.0%)    18.01 (13.2%)    -1.1% ( -22% - 24%)   0.769
LowSloppyPhrase              121.57 (4.8%)    120.52 (3.7%)    -0.9% ( -8% - 8%)     0.527
HighTerm                     1427.75 (4.8%)   1416.29 (3.0%)   -0.8% ( -8% - 7%)     0.529
MedSloppyPhrase              92.18 (5.1%)     91.45 (4.0%)     -0.8% ( -9% - 8%)     0.589
BrowseDateTaxoFacets         20.98 (12.4%)    20.82 (15.5%)    -0.7% ( -25% - 30%)   0.866
OrHighNotLow                 1233.19 (4.3%)   1223.99 (3.8%)   -0.7% ( -8% - 7%)     0.557
OrHighLow                    515.20 (2.8%)    511.43 (2.7%)    -0.7% ( -6% - 4%)     0.400
BrowseDayOfYearTaxoFacets    20.91 (12.5%)    20.77 (15.7%)    -0.6% ( -25% - 31%)   0.886
AndHighHigh                  40.18 (4.6%)     40.06 (4.2%)     -0.3% ( -8% - 8%)     0.823
OrNotHighMed                 923.89 (3.3%)    921.97 (2.8%)    -0.2% ( -6% - 6%)     0.830
HighSloppyPhrase             17.59 (6.1%)     17.55 (4.3%)     -0.2% ( -10% - 10%)   0.906
AndHighMed                   137.42 (3.7%)    137.17 (3.5%)    -0.2% ( -7% - 7%)     0.870
PKLookup                     168.11 (2.6%)    168.01 (3.4%)    -0.1% ( -5% - 6%)     0.951
BrowseRandomLabelSSDVFacets  10.14 (3.6%)     10.14 (3.3%)     0.0% ( -6% - 7%)      0.989
IntNRQ                       204.47 (0.9%)    204.53 (0.6%)    0.0% ( -1% - 1%)      0.907
OrNotHighHigh                1480.42 (3.4%)   1480.99 (2.4%)   0.0% ( -5% - 6%)      0.967
Prefix3                      221.07 (15.0%)   221.20 (13.6%)   0.1% ( -24% - 33%)    0.989
Fuzzy1                       79.60 (1.8%)     79.74 (1.4%)     0.2% ( -2% - 3%)      0.727
Fuzzy2                       75.46 (1.9%)     75.62 (1.6%)     0.2% ( -3% - 3%)      0.697
MedTerm                      1563.09 (5.2%)   1567.15 (3.5%)   0.3% ( -8% - 9%)      0.853
MedSpanNear                  27.45 (1.7%)     27.54 (1.9%)     0.3% ( -3% - 3%)      0.579
LowPhrase                    123.19 (2.2%)    123.66 (1.7%)    0.4% ( -3% - 4%)      0.539
Respell                      46.77 (1.8%)     46.97 (1.4%)     0.4% ( -2% - 3%)      0.422
MedPhrase                    476.52 (2.1%)    478.54 (1.9%)    0.4% ( -3% - 4%)      0.497
HighPhrase                   222.77 (2.3%)    223.82 (1.7%)    0.5% ( -3% - 4%)      0.458
AndHighLow                   1161.41 (2.3%)   1167.56 (2.4%)   0.5% ( -4% - 5%)      0.482
OrHighMed                    128.58 (3.7%)    129.28 (4.3%)    0.5% ( -7% - 8%)      0.671
LowSpanNear                  13.26 (2.3%)     13.34 (2.2%)     0.6% ( -3% - 5%)      0.421
HighSpanNear                 10.68 (2.2%)     10.74 (2.1%)     0.6% ( -3% - 5%)      0.392
OrNotHighLow                 1314.95 (3.3%)   1324.37 (2.5%)   0.7% ( -4% - 6%)      0.443
OrHighHigh
```
[GitHub] [lucene] gsmiller opened a new pull request, #915: LUCENE-10585: Scrub copy/paste code in the facets module and attempt to simplify a bit
gsmiller opened a new pull request, #915: URL: https://github.com/apache/lucene/pull/915

# Description

The facets module has had a fair amount of recent development, some resulting in a fair amount of copy/paste duplication. This attempts to clean it up a bit and simplify/refactor some of the internal functionality.

# Solution

Minor internal refactoring of some facet implementations and extraction of a common parent class to contain some duplicated logic in the SSDV faceting implementations.

# Tests

No new tests were written. All existing testing is still passing.

# Checklist

Please review the following and check all that apply:

- [x] I have reviewed the guidelines for [How to Contribute](https://github.com/apache/lucene/blob/main/CONTRIBUTING.md) and my code conforms to the standards described there to the best of my ability.
- [x] I have given Lucene maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
- [x] I have developed this patch against the `main` branch.
- [x] I have run `./gradlew check`.
- [ ] I have added tests for my changes.
[GitHub] [lucene] Yuti-G commented on pull request #777: LUCENE-10488: Optimize Facets#getTopDims in ConcurrentSortedSetDocValuesFacetCounts
Yuti-G commented on PR #777: URL: https://github.com/apache/lucene/pull/777#issuecomment-1133481903

> This looks great! Thanks @Yuti-G! It would be nice if we could create a common abstract class to hold some of the common logic between this and the non-concurrent implementation. Seems like a lot of copy/paste going on. This isn't a new problem though, so let's not do that work as part of this PR. What do you think of opening a separate issue to see if we can consolidate some common logic? You probably have a better idea of how feasible this is after working on these changes, so I'm curious what you think. Thanks again for taking this on!

Thanks @gsmiller for the great idea! I will open a separate issue to consolidate the common logic between ConcurrentSSDVFacetCounts and SSDVFacetCounts after LUCENE-10550 is merged. There is a lot of duplicated code between these two classes in https://github.com/apache/lucene/pull/914 as well.
[GitHub] [lucene] Yuti-G opened a new pull request, #914: LUCENE-10550: Add getAllChildren functionality to facets
Yuti-G opened a new pull request, #914: URL: https://github.com/apache/lucene/pull/914

# Description

This new API returns all children for a specified path and allows users to sort by their desired criteria.

# Solution

* Added a getAllChildren function in the Facets class.
* Overrode getAllChildren in subclasses.

# Tests

Added new tests for all the overridden implementations of getAllChildren in Facets.

# Checklist

Please review the following and check all that apply:

- [X] I have reviewed the guidelines for [How to Contribute](https://github.com/apache/lucene/blob/main/CONTRIBUTING.md) and my code conforms to the standards described there to the best of my ability.
- [X] I have given Lucene maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
- [X] I have developed this patch against the `main` branch.
- [X] I have run `./gradlew check`.
- [X] I have added tests for my changes.
[jira] [Resolved] (LUCENE-10370) Fix classpath/module path of tests forking their own Java (TestNRTReplication)
[ https://issues.apache.org/jira/browse/LUCENE-10370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dawid Weiss resolved LUCENE-10370.
----------------------------------
    Resolution: Fixed

> Fix classpath/module path of tests forking their own Java (TestNRTReplication)
> ------------------------------------------------------------------------------
>
>                 Key: LUCENE-10370
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10370
>             Project: Lucene - Core
>          Issue Type: Sub-task
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>            Priority: Minor
>             Fix For: 9.3
>
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> TestNRTReplication fails because it assumes classpath can just be copied to a sub-process - this is no longer the case.
> PR at: https://github.com/apache/lucene/pull/909
[GitHub] [lucene] msokolov commented on a diff in pull request #913: Lucene 10577
msokolov commented on code in PR #913: URL: https://github.com/apache/lucene/pull/913#discussion_r878580201

## lucene/core/src/java/org/apache/lucene/util/VectorUtil.java:

```
@@ -213,4 +213,21 @@ public static void add(float[] u, float[] v) {
       u[i] += v[i];
     }
   }
+
+  public static float dotProduct(BytesRef a, int aOffset, BytesRef b, int bOffset, int len) {
+    // fixme -- move to codec? What if later we want to access the bytes some other way?
+    int total = 0;
+    for (int i = 0; i < len; i++) {
+      total += a.bytes[aOffset++] * b.bytes[bOffset++];
```

Review Comment: I will give castShape/convert/convertShape a try. I also found reduceLanesToLong that seems promising
[jira] [Updated] (LUCENE-10370) Fix classpath/module path of tests forking their own Java (TestNRTReplication)
[ https://issues.apache.org/jira/browse/LUCENE-10370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dawid Weiss updated LUCENE-10370:
---------------------------------
    Fix Version/s: 9.3

> Fix classpath/module path of tests forking their own Java (TestNRTReplication)
> ------------------------------------------------------------------------------
>
>                 Key: LUCENE-10370
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10370
>             Project: Lucene - Core
>          Issue Type: Sub-task
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>            Priority: Minor
>             Fix For: 9.3
>
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> TestNRTReplication fails because it assumes classpath can just be copied to a sub-process - this is no longer the case.
> PR at: https://github.com/apache/lucene/pull/909
[jira] [Commented] (LUCENE-10370) Fix classpath/module path of tests forking their own Java (TestNRTReplication)
[ https://issues.apache.org/jira/browse/LUCENE-10370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17540315#comment-17540315 ]

ASF subversion and git services commented on LUCENE-10370:
----------------------------------------------------------

Commit ab19b4986aa2eea82d24f0011619eec629191c59 in lucene's branch refs/heads/branch_9x from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=ab19b4986aa ]

LUCENE-10370: pass proper classpath/module arguments for forking jvms from within tests. (#909)

> Fix classpath/module path of tests forking their own Java (TestNRTReplication)
> ------------------------------------------------------------------------------
>
>                 Key: LUCENE-10370
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10370
>             Project: Lucene - Core
>          Issue Type: Sub-task
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>            Priority: Minor
>
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> TestNRTReplication fails because it assumes classpath can just be copied to a sub-process - this is no longer the case.
> PR at: https://github.com/apache/lucene/pull/909
[jira] [Commented] (LUCENE-10370) Fix classpath/module path of tests forking their own Java (TestNRTReplication)
[ https://issues.apache.org/jira/browse/LUCENE-10370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17540314#comment-17540314 ]

ASF subversion and git services commented on LUCENE-10370:
----------------------------------------------------------

Commit 63b66e06cdf648f2da47fb8a430518079f16b83c in lucene's branch refs/heads/main from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=63b66e06cdf ]

LUCENE-10370: pass proper classpath/module arguments for forking jvms from within tests. (#909)

> Fix classpath/module path of tests forking their own Java (TestNRTReplication)
> ------------------------------------------------------------------------------
>
>                 Key: LUCENE-10370
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10370
>             Project: Lucene - Core
>          Issue Type: Sub-task
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>            Priority: Minor
>
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> TestNRTReplication fails because it assumes classpath can just be copied to a sub-process - this is no longer the case.
> PR at: https://github.com/apache/lucene/pull/909
[GitHub] [lucene] dweiss merged pull request #909: LUCENE-10370: pass proper classpath/module arguments for forking jvms from within tests
dweiss merged PR #909: URL: https://github.com/apache/lucene/pull/909
[GitHub] [lucene] dweiss commented on pull request #909: LUCENE-10370: pass proper classpath/module arguments for forking jvms from within tests
dweiss commented on PR #909: URL: https://github.com/apache/lucene/pull/909#issuecomment-1133434751

> If we make it an assumption, it makes the results look a bit nicer?

Depends. It can go unnoticed then because it superficially looks just as if the test was ignored. I'd rather have it fail because then it draws more attention.
[jira] [Comment Edited] (LUCENE-10544) Should ExitableTermsEnum wrap postings and impacts?
[ https://issues.apache.org/jira/browse/LUCENE-10544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17540309#comment-17540309 ] Greg Miller edited comment on LUCENE-10544 at 5/20/22 10:03 PM: As I understood Adrien's suggestion, I think the idea is to create a new {{BulkScorer}} sub-class that would wrap another {{BulkScorer}} (provided as {{in}} in Adrien's code snippet). This class would override the {{score}} method as Adrien shows above to periodically check timeouts, but otherwise just delegate to {{in}} if the query has not yet timed out. I imagine somewhere in {{IndexSearcher}} you would instantiate this new "timeout enforcing bulk scorer", wrapping the {{BulkScorer}} provided by the query's weight. Does that help? Also, can I request that we move this conversation over to [LUCENE-10151|https://issues.apache.org/jira/browse/LUCENE-10151]? This issue is really about modifying {{ExitableTermsEnum}}, which we may want to do eventually, independent of adding timeout support to {{IndexSearcher}}. Since this discussion is really about adding timeout support to {{IndexSearcher}}, it would be best to capture the conversation in LUCENE-10151 to make it easier to dig up in the future. Thank you!
> Should ExitableTermsEnum wrap postings and impacts? > --- > > Key: LUCENE-10544 > URL: https://issues.apache.org/jira/browse/LUCENE-10544 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Reporter: Greg Miller >Priority: Major > > While looking into options for LUCENE-10151, I noticed that > {{ExitableDirectoryReader}} doesn't actually do any timeout checking once you > start iterating postings/impacts. It *does* create an {{ExitableTermsEnum}} > wrapper when loading a {{{}TermsEnum{}}}, but that wrapper doesn't do > anything to wrap postings or impacts. So timeouts will be enforced when > moving to the "next" term, but not when iterating the postings/impacts > associated with a term. > I think we ought to wrap the postings/impacts as well with some form of > timeout checking so timeouts can be enforced on long-running queries. I'm not > sure why this wasn't done originally (back in 2014), but it was questioned > back in 2020 on the original Jira SOLR-5986. Does anyone know of a good > reason why we shouldn't enforce timeouts in this way? > Related, we may also want to wrap things like {{seekExact}} and {{seekCeil}} > given that only {{next}} is being wrapped currently. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
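The wrapping idea described in this comment might look roughly like the following self-contained sketch. This is not Lucene's actual API: `SimpleBulkScorer`, `TimeoutBulkScorer`, the chunk size, and the injectable clock are all invented here for illustration. A real implementation would extend `org.apache.lucene.search.BulkScorer` and would likely throw an exception or record partial results on timeout rather than just setting a flag.

```java
import java.util.function.LongSupplier;

public class TimeoutBulkScorerSketch {
  /** Hypothetical stand-in for Lucene's BulkScorer (collector/liveDocs args omitted). */
  public interface SimpleBulkScorer {
    /** Scores docs in [min, max); returns the next doc to score, or Integer.MAX_VALUE when done. */
    int score(int min, int max);
  }

  /** Delegates to a wrapped scorer in bounded chunks, checking a deadline between chunks. */
  public static class TimeoutBulkScorer implements SimpleBulkScorer {
    private static final int CHUNK = 4096; // docs scored between timeout checks
    private final SimpleBulkScorer in;     // the wrapped scorer ("in" in Adrien's snippet)
    private final LongSupplier clock;      // injectable for testing; System::nanoTime in practice
    private final long deadline;
    public boolean timedOut = false;

    public TimeoutBulkScorer(SimpleBulkScorer in, LongSupplier clock, long timeout) {
      this.in = in;
      this.clock = clock;
      this.deadline = clock.getAsLong() + timeout;
    }

    @Override
    public int score(int min, int max) {
      int doc = min;
      while (doc < max) {
        if (clock.getAsLong() > deadline) {
          timedOut = true; // a real implementation might throw instead
          return Integer.MAX_VALUE;
        }
        int upTo = (int) Math.min((long) doc + CHUNK, max);
        doc = in.score(doc, upTo); // otherwise just delegate a bounded range
      }
      return doc;
    }
  }
}
```

An `IndexSearcher` would instantiate this wrapper around the `BulkScorer` obtained from the query's weight, so timeouts are enforced between chunks regardless of what the underlying scorer does.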
[GitHub] [lucene] jtibshirani commented on a diff in pull request #910: LUCENE-10582: Fix merging of CollectionStatistics in CombinedFieldQuery
jtibshirani commented on code in PR #910: URL: https://github.com/apache/lucene/pull/910#discussion_r878568768 ## lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestCombinedFieldQuery.java: ## @@ -589,4 +589,52 @@ public SimScorer scorer( return new BM25Similarity().scorer(boost, collectionStats, termStats); } } + + public void testDistributedCollectionStatistics() throws IOException { +Directory dir = newDirectory(); +IndexWriterConfig iwc = new IndexWriterConfig(); +iwc.setSimilarity(randomCompatibleSimilarity()); +RandomIndexWriter w = new RandomIndexWriter(random(), dir, iwc); + +String queryString = "foo"; + +Document doc0 = new Document(); +doc0.add(new TextField("f", "foo", Store.NO)); +doc0.add(new TextField("g", "foo baz", Store.NO)); +w.addDocument(doc0); + +IndexReader reader = w.getReader(); +IndexSearcher searcher = +new IndexSearcher(reader) { + @Override + public CollectionStatistics collectionStatistics(String field) throws IOException { +CollectionStatistics shardStatistics = super.collectionStatistics(field); +int extraMaxDoc = randomIntBetween(0, 10); +int extraDocCount = randomIntBetween(0, extraMaxDoc); +int extraSumDocFreq = extraDocCount + randomIntBetween(0, 10); +int extraSumTotalTermFreq = extraSumDocFreq + randomIntBetween(0, 10); +CollectionStatistics globalStatistics = +new CollectionStatistics( +field, +shardStatistics.maxDoc() + extraMaxDoc, +shardStatistics.docCount() + extraDocCount, +shardStatistics.sumTotalTermFreq() + extraSumTotalTermFreq, +shardStatistics.sumDocFreq() + extraSumDocFreq); +return globalStatistics; + } +}; +searcher.setSimilarity(new BM25Similarity()); +CombinedFieldQuery query = +new CombinedFieldQuery.Builder() +.addField("f") +.addField("g") +.addTerm(new BytesRef(queryString)) +.build(); +// just check that search does not fail +searcher.search(query, 10); Review Comment: It'd be nice to assert something stronger here, to check that `CombinedFieldQuery` still works as expected when collection stats 
are overridden. Maybe we could compare the output of two query strategies like we do in `testCopyField`. ## lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestCombinedFieldQuery.java: ## @@ -589,4 +589,52 @@ public SimScorer scorer( return new BM25Similarity().scorer(boost, collectionStats, termStats); } } + + public void testDistributedCollectionStatistics() throws IOException { +Directory dir = newDirectory(); +IndexWriterConfig iwc = new IndexWriterConfig(); +iwc.setSimilarity(randomCompatibleSimilarity()); +RandomIndexWriter w = new RandomIndexWriter(random(), dir, iwc); + +String queryString = "foo"; + +Document doc0 = new Document(); +doc0.add(new TextField("f", "foo", Store.NO)); +doc0.add(new TextField("g", "foo baz", Store.NO)); +w.addDocument(doc0); + +IndexReader reader = w.getReader(); +IndexSearcher searcher = +new IndexSearcher(reader) { + @Override + public CollectionStatistics collectionStatistics(String field) throws IOException { +CollectionStatistics shardStatistics = super.collectionStatistics(field); +int extraMaxDoc = randomIntBetween(0, 10); +int extraDocCount = randomIntBetween(0, extraMaxDoc); +int extraSumDocFreq = extraDocCount + randomIntBetween(0, 10); +int extraSumTotalTermFreq = extraSumDocFreq + randomIntBetween(0, 10); +CollectionStatistics globalStatistics = +new CollectionStatistics( +field, +shardStatistics.maxDoc() + extraMaxDoc, +shardStatistics.docCount() + extraDocCount, +shardStatistics.sumTotalTermFreq() + extraSumTotalTermFreq, +shardStatistics.sumDocFreq() + extraSumDocFreq); +return globalStatistics; + } +}; +searcher.setSimilarity(new BM25Similarity()); Review Comment: It's unusual to search with a different similarity than was used during indexing -- I think we could remove this line. 
## lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestCombinedFieldQuery.java: ## @@ -589,4 +589,52 @@ public SimScorer scorer( return new BM25Similarity().scorer(boost, collectionStats, termStats); } } + + public void testDistributedCollectionStatistics() throws IOException { Review Comment: Small comment, maybe we could call this `testOverrideCollectionStatistics`? Lucene doesn't really have a native concept of "distributed
[GitHub] [lucene] msokolov commented on a diff in pull request #913: Lucene 10577
msokolov commented on code in PR #913: URL: https://github.com/apache/lucene/pull/913#discussion_r878557225 ## lucene/core/src/java/org/apache/lucene/util/VectorUtil.java: ## @@ -213,4 +213,21 @@ public static void add(float[] u, float[] v) { u[i] += v[i]; } } + + public static float dotProduct(BytesRef a, int aOffset, BytesRef b, int bOffset, int len) { +// fixme -- move to codec? What if later we want to access the bytes some other way? +int total = 0; +for (int i = 0; i < len; i++) { + total += a.bytes[aOffset++] * b.bytes[bOffset++]; Review Comment: yeah, I see that now. Well ... this *is* what we want it to do. I mean we could potentially reduce the precision still further (to ~7~ umm 4! bits) and then each pair of bytes could be multiplied safely. But we do still need to add them up, too. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on a diff in pull request #913: Lucene 10577
rmuir commented on code in PR #913: URL: https://github.com/apache/lucene/pull/913#discussion_r878562615 ## lucene/core/src/java/org/apache/lucene/util/VectorUtil.java: ## @@ -213,4 +213,21 @@ public static void add(float[] u, float[] v) { u[i] += v[i]; } } + + public static float dotProduct(BytesRef a, int aOffset, BytesRef b, int bOffset, int len) { +// fixme -- move to codec? What if later we want to access the bytes some other way? +int total = 0; +for (int i = 0; i < len; i++) { + total += a.bytes[aOffset++] * b.bytes[bOffset++]; Review Comment: I think our result vector could be a ShortVector, and we could try to use a conversion (in some way that isn't slow) such as https://docs.oracle.com/en/java/javase/16/docs/api/jdk.incubator.vector/jdk/incubator/vector/VectorOperators.html#B2S in a way that is efficient. For the adding-up part, with the floats we did reduceLanes(), so I think we'd have to use the same trick, converting the result to IntVector/LongVector to create the sum. There are gigantic paragraphs in the Vector.java javadoc on how to do these conversions (maybe they are actually efficient?), but I haven't tried.
[GitHub] [lucene-solr] madrob opened a new pull request, #2660: SOLR-16209
madrob opened a new pull request, #2660: URL: https://github.com/apache/lucene-solr/pull/2660 Depends on #2641; will commit both at the same time
[GitHub] [lucene] msokolov commented on pull request #913: Lucene 10577
msokolov commented on PR #913: URL: https://github.com/apache/lucene/pull/913#issuecomment-1133402474 hm this "luke-can-be-launched" test failed for me locally too. I guess it needs a JVM with a head?
[GitHub] [lucene] rmuir commented on a diff in pull request #913: Lucene 10577
rmuir commented on code in PR #913: URL: https://github.com/apache/lucene/pull/913#discussion_r878555057 ## lucene/core/src/java/org/apache/lucene/util/VectorUtil.java: ## @@ -213,4 +213,21 @@ public static void add(float[] u, float[] v) { u[i] += v[i]; } } + + public static float dotProduct(BytesRef a, int aOffset, BytesRef b, int bOffset, int len) { +// fixme -- move to codec? What if later we want to access the bytes some other way? +int total = 0; +for (int i = 0; i < len; i++) { + total += a.bytes[aOffset++] * b.bytes[bOffset++]; Review Comment: well, it's going to multiply each `a` byte by each `b` byte and store each result in a `c` byte. So you will overflow right there, before you even do any sum.
[GitHub] [lucene] msokolov commented on a diff in pull request #913: Lucene 10577
msokolov commented on code in PR #913: URL: https://github.com/apache/lucene/pull/913#discussion_r878553114 ## lucene/core/src/java/org/apache/lucene/util/VectorUtil.java: ## @@ -213,4 +213,21 @@ public static void add(float[] u, float[] v) { u[i] += v[i]; } } + + public static float dotProduct(BytesRef a, int aOffset, BytesRef b, int bOffset, int len) { +// fixme -- move to codec? What if later we want to access the bytes some other way? +int total = 0; +for (int i = 0; i < len; i++) { + total += a.bytes[aOffset++] * b.bytes[bOffset++]; Review Comment: hm, well in the worst case (128 * 128) = 2^7*2^7 = 2^14 and we can sum lots of these in an int without any concern about overflow. I think the limit is on the dimension of the vector; I mean 2^31/2^14 = 2^17 and I don't think we would ever have a 2^17-dimensional vector. Perhaps this gives us a principled way to choose that max dimension :) I guess 2^17 = 128K. Looking at what `ByteVector.mul()` does ...
[GitHub] [lucene] rmuir commented on a diff in pull request #913: Lucene 10577
rmuir commented on code in PR #913: URL: https://github.com/apache/lucene/pull/913#discussion_r878550812 ## lucene/core/src/java/org/apache/lucene/util/VectorUtil.java: ## @@ -213,4 +213,21 @@ public static void add(float[] u, float[] v) { u[i] += v[i]; } } + + public static float dotProduct(BytesRef a, int aOffset, BytesRef b, int bOffset, int len) { +// fixme -- move to codec? What if later we want to access the bytes some other way? +int total = 0; +for (int i = 0; i < len; i++) { + total += a.bytes[aOffset++] * b.bytes[bOffset++]; Review Comment: Is this really the way the function should work? it is doing the equivalent of casting both `a` and `b` to `int` first, and then adding result to another `int`. That's just how java promotion works. But is this what we want? What about overflow? why `int` and not `long`? This is not at all what `ByteVector.mul()` does.
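The integer promotion and overflow bound discussed in this thread can be checked with a small standalone snippet. This is not Lucene's actual `VectorUtil` (plain `byte[]` arguments stand in for `BytesRef`): Java promotes each byte operand to `int` before the multiply, so a single product is bounded in magnitude by 128 * 128 = 2^14, and an `int` accumulator cannot overflow until roughly 2^31 / 2^14 = 2^17 worst-case dimensions.

```java
// Standalone illustration of the promotion/overflow discussion above.
// Not Lucene's VectorUtil: plain byte[] arguments stand in for BytesRef.
public class ByteDotProduct {
  public static int dotProduct(byte[] a, byte[] b) {
    int total = 0;
    for (int i = 0; i < a.length; i++) {
      // each byte is promoted to int, so the product itself cannot overflow;
      // its magnitude is at most (-128) * (-128) = 2^14
      total += a[i] * b[i];
    }
    return total;
  }

  public static void main(String[] args) {
    // worst-case single product stays well within int range
    System.out.println(dotProduct(new byte[] {-128}, new byte[] {-128})); // prints 16384
  }
}
```

With worst-case values in every lane, the running sum only approaches 2^31 at around 2^17 dimensions, which is where the "principled max dimension" figure in the thread comes from.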
[GitHub] [lucene] msokolov commented on pull request #909: LUCENE-10370: pass proper classpath/module arguments for forking jvms from within tests
msokolov commented on PR #909: URL: https://github.com/apache/lucene/pull/909#issuecomment-1133380108 If we make it an assumption, it makes the results look a bit nicer?
[GitHub] [lucene] jtibshirani commented on pull request #833: LUCENE-10411: Add NN vectors support to ExitableDirectoryReader
jtibshirani commented on PR #833: URL: https://github.com/apache/lucene/pull/833#issuecomment-1133369782 Thanks @mocobeta, it is so nice to have benchmarks for ExitableDirectoryReader now. It looks like the impact is not significant for vector search.
[GitHub] [lucene] msokolov opened a new pull request, #913: Lucene 10577
msokolov opened a new pull request, #913: URL: https://github.com/apache/lucene/pull/913 This patch revisits the idea of storing vectors in 8-bit precision. It incorporates some @rcmuir and @jpountz suggestions from https://github.com/apache/lucene/pull/899 which was the first attempt. The highlights here are: 1. Adds a new VectorSimilarity to represent this use case 2. Expects vectors to be supplied (as float[]) with values in the range [-128,127] 3. All dot-product operations are done over byte using integer math when this similarity is used The good parts: Index size is reduced; I ran a test with 100K (200-dimensional) GloVe vectors and the index size goes from 231M to 173M for an M=200 graph and 154M->97M for an M=100 graph. The vector portion of the storage (.vec files) is reduced nearly 75% as expected. With careful tuning of the scaling factor, recall is pretty much unchanged compared to the full-precision baseline. I think these GloVe vectors are not well-conditioned numerically, so to do the scaling I did not use min/max values as the limits, rather I discarded outlying values and used 0.1/0.99 pctile values. Indexing and search performance as measured with KnnGraphTester improved somewhere in the range of 15-25%. I'd like to run luceneutil and/or ann-benchmarks to get another look, but this was a very pleasant surprise. I think the prospect is bright for vectorization too, but this improvement is achievable using our scalar impl. The not-so-good-parts: I think we probably wouldn't want to commit as-is since it would make changes to the codec that aren't forwards-compatible. To go forward I guess we'd need to create a new version of the HnswVectorsFormat. I had to add some branches and generics to handle the two different data representations. I don't think it's too terribly ugly, but I know we've had concerns about this kind of thing in the past - maybe there's a better way that doesn't require massive code duplication? 
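The scaling step described in the PR ("discard outlying values and use percentile limits") could be sketched as below. This is an invented illustration, not the patch's actual code: `VectorQuantizer`, `quantize`, `lo`, and `hi` are hypothetical names, and the caller is assumed to have already computed the clipping bounds (e.g. low/high percentiles of the value distribution, as the description suggests for poorly conditioned vectors like GloVe).

```java
public class VectorQuantizer {
  /**
   * Maps float values into the signed-byte range [-128, 127], clipping to
   * [lo, hi] first (e.g. percentile bounds chosen by the caller) so that
   * outliers do not waste the limited 8-bit dynamic range.
   */
  public static byte[] quantize(float[] v, float lo, float hi) {
    byte[] out = new byte[v.length];
    float scale = 255f / (hi - lo); // spread [lo, hi] across 256 byte values
    for (int i = 0; i < v.length; i++) {
      float clipped = Math.max(lo, Math.min(hi, v[i])); // discard outliers
      out[i] = (byte) (Math.round((clipped - lo) * scale) - 128);
    }
    return out;
  }
}
```

The same `scale` (inverted) would be used at search time to expand bytes back into the float range expected by the similarity, which is essentially what the `ExpandingRandomAccessVectorValues` wrapper in the earlier PR #899 did.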
[GitHub] [lucene] msokolov commented on a diff in pull request #899: Lucene 10577
msokolov commented on code in PR #899: URL: https://github.com/apache/lucene/pull/899#discussion_r878531565 ## lucene/core/src/java/org/apache/lucene/codecs/lucene92/ExpandingRandomAccessVectorValues.java: ## @@ -0,0 +1,57 @@ +package org.apache.lucene.codecs.lucene92; + +import org.apache.lucene.index.RandomAccessVectorValues; +import org.apache.lucene.index.RandomAccessVectorValuesProducer; +import org.apache.lucene.util.BytesRef; + +import java.io.IOException; + +public class ExpandingRandomAccessVectorValues implements RandomAccessVectorValuesProducer { + + private final RandomAccessVectorValuesProducer delegate; + private final float scale; + + /** + * Wraps an existing vector values producer. Floating point vector values will be produced by scaling + * byte-quantized values read from the values produced by the input. + */ + protected ExpandingRandomAccessVectorValues(RandomAccessVectorValuesProducer in, float scale) { +this.delegate = in; +assert scale != 0; +this.scale = scale; + } + + @Override + public RandomAccessVectorValues randomAccess() throws IOException { +RandomAccessVectorValues delegateValues = delegate.randomAccess(); +float[] value = new float[delegateValues.dimension()];; + +return new RandomAccessVectorValues() { + + @Override + public int size() { +return delegateValues.size(); + } + + @Override + public int dimension() { +return delegateValues.dimension(); + } + + @Override + public float[] vectorValue(int targetOrd) throws IOException { +BytesRef binaryValue = delegateValues.binaryValue(targetOrd); +byte[] bytes = binaryValue.bytes; +for (int i = 0, j = binaryValue.offset; i < value.length; i++, j++) { + value[i] = bytes[j] * scale; Review Comment: I did some git booboo and messed this up -- I'll close and open a new PR since it's a substantially different approach anyway -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
[GitHub] [lucene] msokolov closed pull request #899: Lucene 10577
msokolov closed pull request #899: Lucene 10577 URL: https://github.com/apache/lucene/pull/899
[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this
[ https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17540283#comment-17540283 ] Robert Muir commented on LUCENE-10574: -- The benchmark was originally some code written to reproduce a performance issue: LUCENE-9827 So it is definitely a bit pathological. Purpose was to really zero in on the stored fields merging and nothing else, and not to be noisy. > Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't > do this > --- > > Key: LUCENE-10574 > URL: https://issues.apache.org/jira/browse/LUCENE-10574 > Project: Lucene - Core > Issue Type: Bug >Reporter: Robert Muir >Priority: Major > Fix For: 9.3 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > Remove {{floorSegmentBytes}} parameter, or change lucene's default to a merge > policy that doesn't merge in an O(n^2) way. > I have the feeling it might have to be the latter, as folks seem really wed > to this crazy O(n^2) behavior.
[jira] [Commented] (LUCENE-9625) Benchmark KNN search with ann-benchmarks
[ https://issues.apache.org/jira/browse/LUCENE-9625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17540284#comment-17540284 ] Michael Sokolov commented on LUCENE-9625: - I realized maybe this deserves a better explanation: I didn't use multi-threading in the KnnGraphTester that builds the index since my goal at the time was really to evaluate whether our algorithm implementation is correct and how it performs on a single HNSW graph index. If we use multiple threads, this is going to lead to a more fragmented graph due to the way Lucene indexes segments, and while this would be a useful point of comparison, it also creates a different variable to tune in the benchmark evaluation. If you do want to pursue this, I would suggest configuring the IndexWriterConfig with large buffers so that each thread creates a single segment, and exposing the number of threads/segments as a tunable parameter since it is going to impact the recall and latency reported by the benchmark > Benchmark KNN search with ann-benchmarks > > > Key: LUCENE-9625 > URL: https://issues.apache.org/jira/browse/LUCENE-9625 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Michael Sokolov >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > In addition to benchmarking with luceneutil, it would be good to be able to > make use of ann-benchmarks, which is publishing results from many approximate > knn algorithms, including the hnsw implementation from its authors. We don't > expect to challenge the performance of these native code libraries, however > it would be good to know just how far off we are. > I started looking into this and posted a fork of ann-benchmarks that uses > KnnGraphTester class to run these: > https://github.com/msokolov/ann-benchmarks. It's still a WIP; you have to > manually copy jars and the KnnGraphTester.class to the test host machine > rather than downloading from a distribution. 
KnnGraphTester needs some > modifications in order to support this process - this issue is mostly about > that. > One thing I noticed is that some of the index builds with higher fanout > (efConstruction) settings time out at 2h (on an AWS c5 instance), so this is > concerning and I'll open a separate issue for trying to improve that.
[jira] [Assigned] (LUCENE-10585) Cleanup copy/paste code in facets, particularly in SSDV
[ https://issues.apache.org/jira/browse/LUCENE-10585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Miller reassigned LUCENE-10585: Assignee: Greg Miller > Cleanup copy/paste code in facets, particularly in SSDV > --- > > Key: LUCENE-10585 > URL: https://issues.apache.org/jira/browse/LUCENE-10585 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Greg Miller >Assignee: Greg Miller >Priority: Minor > > We've accumulated some copy/paste code in the facets modules, especially in > SSDV-related classes. I'm going to take a pass at cleaning this up to help > make the code more readable and easier to maintain.
[jira] [Created] (LUCENE-10585) Cleanup copy/paste code in facets, particularly in SSDV
Greg Miller created LUCENE-10585: Summary: Cleanup copy/paste code in facets, particularly in SSDV Key: LUCENE-10585 URL: https://issues.apache.org/jira/browse/LUCENE-10585 Project: Lucene - Core Issue Type: Improvement Components: modules/facet Reporter: Greg Miller We've accumulated some copy/paste code in the facets modules, especially in SSDV-related classes. I'm going to take a pass at cleaning this up to help make the code more readable and easier to maintain.
[jira] [Assigned] (LUCENE-10584) SSDV facets should support hierarchical paths in #getSpecificValue
[ https://issues.apache.org/jira/browse/LUCENE-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Miller reassigned LUCENE-10584: Assignee: Greg Miller > SSDV facets should support hierarchical paths in #getSpecificValue > -- > > Key: LUCENE-10584 > URL: https://issues.apache.org/jira/browse/LUCENE-10584 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Greg Miller >Assignee: Greg Miller >Priority: Major > > We added hierarchical pathing capabilities to SSDV faceting recently but it > looks like we didn't update #getSpecificValue to work with hierarchical paths.
[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this
[ https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17540280#comment-17540280 ] Michael Sokolov commented on LUCENE-10574: -- Very nice! I confess I don't really know how those benchmarks are constructed -- do they have "typical" merge policy/update rate/etc or are they deliberately exploring this pathological case? > Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't > do this > --- > > Key: LUCENE-10574 > URL: https://issues.apache.org/jira/browse/LUCENE-10574 > Project: Lucene - Core > Issue Type: Bug >Reporter: Robert Muir >Priority: Major > Fix For: 9.3 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > Remove {{floorSegmentBytes}} parameter, or change lucene's default to a merge > policy that doesn't merge in an O(n^2) way. > I have the feeling it might have to be the latter, as folks seem really wed > to this crazy O(n^2) behavior.
[jira] [Created] (LUCENE-10584) SSDV facets should support hierarchical paths in #getSpecificValue
Greg Miller created LUCENE-10584: Summary: SSDV facets should support hierarchical paths in #getSpecificValue Key: LUCENE-10584 URL: https://issues.apache.org/jira/browse/LUCENE-10584 Project: Lucene - Core Issue Type: Improvement Components: modules/facet Reporter: Greg Miller We added hierarchical pathing capabilities to SSDV faceting recently but it looks like we didn't update #getSpecificValue to work with hierarchical paths.
[GitHub] [lucene] mikemccand commented on pull request #633: LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.
mikemccand commented on PR #633: URL: https://github.com/apache/lucene/pull/633#issuecomment-1133249474 Hmm I beasted the PR on current `main` branch using `./gradlew :lucene:core:test --tests "org.apache.lucene.index.TestAddIndexes" -Dtests.iters=500 `. The first time, it seemed to deadlock, but I screwed up trying to get the thread stacks. The 2nd time it failed with this exciting failure! It looks like we are missing some locking on updating IW's `HashMap` tracking merges?: ``` org.apache.lucene.index.TestAddIndexes > test suite's output saved to /l/vigya/lucene/core/build/test-results/test/outputs/OUTPUT-org.apache.lucene.index.TestAddIndexes.txt, copied below: 1> java.util.ConcurrentModificationException 1>at java.base/java.util.HashMap$HashIterator.nextNode(HashMap.java:1597) 1>at java.base/java.util.HashMap$KeyIterator.next(HashMap.java:1620) 1>at org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:2122) 1>at org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:2081) 1>at org.apache.lucene.index.TestAddIndexes$CommitAndAddIndexes3.doBody(TestAddIndexes.java:1221) 1>at org.apache.lucene.index.TestAddIndexes$RunAddIndexesThreads$1.run(TestAddIndexes.java:1023) > java.lang.AssertionError > at __randomizedtesting.SeedInfo.seed([3AB557F4BFC8763A:D625EE2D285D2CAD]:0) > at org.junit.Assert.fail(Assert.java:87) > at org.junit.Assert.assertTrue(Assert.java:42) > at org.junit.Assert.assertTrue(Assert.java:53) > at org.apache.lucene.index.TestAddIndexes.testAddIndexesWithRollback(TestAddIndexes.java:1305) > at jdk.internal.reflect.GeneratedMethodAccessor38.invoke(Unknown Source) > at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.base/java.lang.reflect.Method.invoke(Method.java:568) > at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754) > at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:942) > at 
com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:978) > at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:992) > at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44) > at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43) > at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45) > at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60) > at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44) > at org.junit.rules.RunRules.evaluate(RunRules.java:20) > at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) > at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:370) > at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:819) > at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:470) > at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:951) > at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:836) > at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:887) > at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:898) > at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43) > at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) > at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38) > at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40) 
> at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40) > at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) > at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) > at
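The `ConcurrentModificationException` in the failure above is `HashMap`'s fail-fast iterator firing when the map is structurally modified while `forceMerge` iterates over it. A minimal, single-threaded sketch of that failure mode (the class name and map contents are illustrative, not Lucene's actual merge-tracking code):

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

public class CmeDemo {
    // Returns true if structurally modifying a HashMap while iterating over it
    // triggers the fail-fast ConcurrentModificationException.
    static boolean triggersCme() {
        Map<String, String> runningMerges = new HashMap<>(); // stand-in for IW's merge-tracking map
        runningMerges.put("_0", "merge-a");
        runningMerges.put("_1", "merge-b");
        try {
            for (String segment : runningMerges.keySet()) {
                // Adding a new key bumps the map's modCount; the iterator's
                // next() detects the mismatch and throws.
                runningMerges.put("_2", "merge-c");
            }
        } catch (ConcurrentModificationException expected) {
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("fail-fast CME observed: " + triggersCme());
    }
}
```

The usual fixes are to synchronize the iteration on the same monitor that guards the updates, or to iterate over a defensive copy of the key set.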
[jira] [Created] (LUCENE-10583) Deadlock with MMapDirectory while waitForMerges
Thomas Hoffmann created LUCENE-10583: Summary: Deadlock with MMapDirectory while waitForMerges Key: LUCENE-10583 URL: https://issues.apache.org/jira/browse/LUCENE-10583 Project: Lucene - Core Issue Type: Bug Components: core/index Affects Versions: 8.11.1 Environment: Java 17 OS: Windows 2016 Reporter: Thomas Hoffmann Hello, a deadlock situation happened in our application. We are using MMapDirectory on Windows 2016 and got the following stacktrace: {code:java} "https-openssl-nio-443-exec-30" #166 daemon prio=5 os_prio=0 cpu=78703.13ms elapsed=81248.18s tid=0x2860af10 nid=0x237c in Object.wait()  [0x413fc000]   java.lang.Thread.State: TIMED_WAITING (on object monitor)   at java.lang.Object.wait(java.base@17.0.2/Native Method)   - waiting on   at org.apache.lucene.index.IndexWriter.doWait(IndexWriter.java:4983)   - locked <0x0006ef1fc020> (a org.apache.lucene.index.IndexWriter)   at org.apache.lucene.index.IndexWriter.waitForMerges(IndexWriter.java:2697)   - locked <0x0006ef1fc020> (a org.apache.lucene.index.IndexWriter)   at org.apache.lucene.index.IndexWriter.shutdown(IndexWriter.java:1236)   at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1278)   at com.speed4trade.ebs.module.search.SearchService.updateSearchIndex(SearchService.java:1723)   - locked <0x0006d5c00208> (a org.apache.lucene.store.MMapDirectory)   at com.speed4trade.ebs.module.businessrelations.ticket.TicketChangedListener.postUpdate(TicketChangedListener.java:142) ...{code} All threads were waiting to lock <0x0006d5c00208>, which was never released. 
A Lucene thread was also blocked; I don't know if this is relevant: {code:java} "Lucene Merge Thread #0" #18466 daemon prio=5 os_prio=0 cpu=15.63ms elapsed=3499.07s tid=0x459453e0 nid=0x1f8 waiting for monitor entry  [0x5da9e000]   java.lang.Thread.State: BLOCKED (on object monitor)   at org.apache.lucene.store.FSDirectory.deletePendingFiles(FSDirectory.java:346)   - waiting to lock <0x0006d5c00208> (a org.apache.lucene.store.MMapDirectory)   at org.apache.lucene.store.FSDirectory.maybeDeletePendingFiles(FSDirectory.java:363)   at org.apache.lucene.store.FSDirectory.createOutput(FSDirectory.java:248)   at org.apache.lucene.store.LockValidatingDirectoryWrapper.createOutput(LockValidatingDirectoryWrapper.java:44)   at org.apache.lucene.index.ConcurrentMergeScheduler$1.createOutput(ConcurrentMergeScheduler.java:289)   at org.apache.lucene.store.TrackingDirectoryWrapper.createOutput(TrackingDirectoryWrapper.java:43)   at org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.<init>(CompressingStoredFieldsWriter.java:121)   at org.apache.lucene.codecs.compressing.CompressingStoredFieldsFormat.fieldsWriter(CompressingStoredFieldsFormat.java:130)   at org.apache.lucene.codecs.lucene87.Lucene87StoredFieldsFormat.fieldsWriter(Lucene87StoredFieldsFormat.java:141)   at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:227)   at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:105)   at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4757)   at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4361)   at org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:5920)   at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:626)   at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:684){code} It looks like the merge operation never finished and never released the lock. 
Is there any option to prevent this deadlock, or any way to investigate it further? Unfortunately, a load test didn't reproduce the problem.
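The two stack traces in the report form a classic lock-ordering cycle: the application thread holds the `MMapDirectory` monitor (via the synchronized `updateSearchIndex`) while `IndexWriter.close` waits for merges, and the merge thread needs that same directory monitor to create its output files. A self-contained sketch of the cycle (the monitor names are illustrative stand-ins, not the real classes), using `ThreadMXBean` so the demo detects the deadlock instead of hanging:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

public class DeadlockSketch {
    // Builds a two-thread lock-order cycle (directory monitor vs. writer monitor)
    // and returns how many threads the JVM reports as monitor-deadlocked.
    static int countDeadlocked() throws InterruptedException {
        final Object directory = new Object(); // stands in for the MMapDirectory monitor
        final Object writer = new Object();    // stands in for the IndexWriter monitor
        final CountDownLatch bothLocked = new CountDownLatch(2);

        Thread app = new Thread(() -> {
            synchronized (directory) {          // like the synchronized updateSearchIndex(...)
                bothLocked.countDown();
                await(bothLocked);
                synchronized (writer) { }       // like close() -> waitForMerges(): never acquired
            }
        }, "app-thread");
        Thread merge = new Thread(() -> {
            synchronized (writer) {             // merge thread inside IndexWriter
                bothLocked.countDown();
                await(bothLocked);
                synchronized (directory) { }    // like FSDirectory.deletePendingFiles(): never acquired
            }
        }, "merge-thread");
        app.setDaemon(true);   // daemons: the JVM can still exit with the cycle in place
        merge.setDaemon(true);
        app.start();
        merge.start();

        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = null;
        for (int i = 0; i < 200 && ids == null; i++) { // poll up to ~10s for the cycle to form
            Thread.sleep(50);
            ids = mx.findMonitorDeadlockedThreads();
        }
        return ids == null ? 0 : ids.length;
    }

    private static void await(CountDownLatch latch) {
        try { latch.await(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("monitor-deadlocked threads: " + countDeadlocked());
    }
}
```

The standard cure is a consistent lock order: never invoke `IndexWriter.close`/`shutdown` while holding the directory's monitor, which for the reported case means moving the index update out of the block synchronized on the `MMapDirectory`.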
[GitHub] [lucene] uschindler opened a new pull request, #912: Initial rewrite of MMapDirectory for JDK-19 preview Panama APIs (>= JDK-19-ea+23)
uschindler opened a new pull request, #912: URL: https://github.com/apache/lucene/pull/912 **INFO: This is a followup of #518: It's the same code base, but with API changes from JDK 19 applied** This is just a draft PR for a first insight on memory mapping improvements in JDK 19+. Some background information: Starting with JDK-14, there is a new incubating module "jdk.incubator.foreign" that has a new, not yet stable API for accessing off-heap memory (and later it will also support calling functions using classical MethodHandles that are located in libraries like .so or .dll files). This incubator module has several versions: - first version: https://openjdk.java.net/jeps/370 (slow, very buggy and thread confinement, making it unusable with Lucene) - second version: https://openjdk.java.net/jeps/383 (still thread confinement, but now allows transfer of "ownership" to other threads; this is still impossible to use with Lucene) - third version in JDK 16: https://openjdk.java.net/jeps/393 (this version has included "Support for shared segments"). This now allows us to safely use the same external mmapped memory from different threads and also unmap it! This was implemented in the previous pull request #173 - fourth version in JDK 17: https://openjdk.java.net/jeps/412 . This mainly changes the API around the scopes. Instead of having segments explicitly made "shared", we can assign them to some resource scope which controls their behaviour. The resourceScope is produced one time for each IndexInput instance (not clones) and owns all segments. When the resourceScope is closed, all segments become invalid - and we throw `AlreadyClosedException`. The big problem is slowness due to heavy use of new instances just to copy memory between segments and the Java heap. This drives the garbage collector crazy. This was implemented in previous PR #177 - fifth version in JDK 18, included in build 26: https://openjdk.java.net/jeps/419. This mainly cleans up the API. 
From Lucene's perspective the `MemorySegment` API now has `System.arraycopy()`-like APIs to copy memory between the heap and memory segments. This improves speed. It also handles byte-swapping automatically. This version of the PR also uses `ValueLayout` instead of varhandles, as it makes code more readable and type-safe. - sixth version in JDK 19, included with build 23: https://openjdk.java.net/jeps/424 (actual version). This version moves the whole incubation API to a stable "preview API". Some classes were renamed and bugs (e.g. on Windows with huge mappings) were resolved. This new preview API more or less overcomes several problems: - The ByteBuffer API is limited to 32 bits (in fact MMapDirectory has to chunk in 1 GiB portions) - There is no official way to unmap ByteBuffers when the file is no longer used. There is a way to use `sun.misc.Unsafe` and forcefully unmap segments, but any IndexInput accessing the file from another thread will crash the JVM with SIGSEGV or SIGBUS. We learned to live with that and we happily apply the unsafe unmapping, but that's the main issue. @uschindler had many discussions with the team at OpenJDK, and finally, with the third incubator, we have an API that works with Lucene. The discussions were very fruitful (thanks to @mcimadamore !) With the third incubator we are now finally able to do some tests (especially performance). With JDK 19, we can do testing the following way by tweaking the build system a bit: - disable `-Werror` everywhere - upgrade minimum and release Java version to 19 (this breaks linting with ECJ, so precommit won't work, but running tests does) - add `--enable-preview` parameter for all modules and test runner The code basically just modifies `MMapDirectory` to use LONG instead of INT for the chunk size parameter. In addition it adds `MemorySegmentIndexInput`, which is a copy of our `ByteBufferIndexInput` (still there, but unused), but using MemorySegment instead of ByteBuffer behind the scenes. 
It works in exactly the same way, just the try/catch blocks for supporting EOFException or moving to another segment were rewritten. It passes all tests and it looks like you can use it to read indexes. The default chunk size is now 16 GiB (but you can raise or lower it as you like; tests are doing this). Of course you can set it to Long.MAX_VALUE, in which case every index file is always mapped to one big memory mapping. My testing with Windows 10 has shown that this is *not a good idea!!!* Huge mappings fragment address space over time, and as we can only use like 43 or 46 bits (depending on OS), the fragmentation will at some point kill you. So 16 GiB looks like a good compromise: most files will be smaller than 6 GiB anyway (unless you optimize your index to one huge segment). So for most Lucene installations, the number of segments will equal the number of open files, so Elasticsearch
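The 32-bit `ByteBuffer` limitation the PR works around: `MMapDirectory` splits each file into power-of-two chunks (historically 1 GiB) and translates an absolute long file offset into a (buffer index, int offset) pair. A minimal sketch of that address arithmetic (the constant names are illustrative, not Lucene's actual identifiers):

```java
public class ChunkedAddressing {
    // 1 GiB chunks: a power of two that an int-indexed ByteBuffer can address;
    // the Panama MemorySegment API lifts this limit to long-sized mappings.
    static final int CHUNK_POWER = 30;
    static final long CHUNK_MASK = (1L << CHUNK_POWER) - 1;

    // Which mapped buffer holds this absolute file offset?
    static int bufferIndex(long pos) {
        return (int) (pos >>> CHUNK_POWER);
    }

    // Offset within that buffer (always fits in an int).
    static int bufferOffset(long pos) {
        return (int) (pos & CHUNK_MASK);
    }

    public static void main(String[] args) {
        long pos = (1L << 30) + 5; // 1 GiB + 5 bytes into the file
        System.out.println(bufferIndex(pos) + ":" + bufferOffset(pos));
    }
}
```

Because the chunk size is a power of two, the split is a shift and a mask rather than a division, which is why the chunk-size parameter is constrained that way.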
[GitHub] [lucene] uschindler closed pull request #911: Initial rewrite of MMapDirectory for JDK-19 preview Panama APIs (>= JDK-19-ea+23)
uschindler closed pull request #911: Initial rewrite of MMapDirectory for JDK-19 preview Panama APIs (>= JDK-19-ea+23) URL: https://github.com/apache/lucene/pull/911 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mikemccand commented on pull request #633: LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.
mikemccand commented on PR #633: URL: https://github.com/apache/lucene/pull/633#issuecomment-1133136314 I think this is ready! What a great change, giving `MergePolicy` purview over how `addIndexes` does its merging too. I'm also beasting `TestAddIndexes` and will report back. I agree we should push only to `main` to start and let this bake for a while before backporting to 9.x. Thank you @vigyasharma!
[GitHub] [lucene] rmuir commented on pull request #909: LUCENE-10370: pass proper classpath/module arguments for forking jvms from within tests
rmuir commented on PR #909: URL: https://github.com/apache/lucene/pull/909#issuecomment-1133125938 I'm suspicious whether some of these tests would even run directly from the IDE anyway. Even if you fall back to the classpath, TestIndexWriterOnJRECrash might not work from IDEA. I installed IDEA the other day; it defaults to running tests through Gradle, so IMO it's a non-issue for that IDE. For Eclipse it is probably best to simply fail.
[GitHub] [lucene] uschindler opened a new pull request, #911: Initial rewrite of MMapDirectory for JDK-19 preview Panama APIs (>= JDK-19-ea+23)
uschindler opened a new pull request, #911: URL: https://github.com/apache/lucene/pull/911 **INFO: This is a followup of #518: It's the same code base, but with API changes from JDK 19 applied** This is just a draft PR for a first insight on memory mapping improvements in JDK 19+. Some background information: Starting with JDK-14, there is a new incubating module "jdk.incubator.foreign" that has a new, not yet stable API for accessing off-heap memory (and later it will also support calling functions using classical MethodHandles that are located in libraries like .so or .dll files). This incubator module has several versions: - first version: https://openjdk.java.net/jeps/370 (slow, very buggy and thread confinement, making it unusable with Lucene) - second version: https://openjdk.java.net/jeps/383 (still thread confinement, but now allows transfer of "ownership" to other threads; this is still impossible to use with Lucene) - third version in JDK 16: https://openjdk.java.net/jeps/393 (this version has included "Support for shared segments"). This now allows us to safely use the same external mmapped memory from different threads and also unmap it! This was implemented in the previous pull request #173 - fourth version in JDK 17: https://openjdk.java.net/jeps/412 . This mainly changes the API around the scopes. Instead of having segments explicitly made "shared", we can assign them to some resource scope which controls their behaviour. The resourceScope is produced one time for each IndexInput instance (not clones) and owns all segments. When the resourceScope is closed, all segments become invalid - and we throw `AlreadyClosedException`. The big problem is slowness due to heavy use of new instances just to copy memory between segments and the Java heap. This drives the garbage collector crazy. This was implemented in previous PR #177 - fifth version in JDK 18, included in build 26: https://openjdk.java.net/jeps/419. This mainly cleans up the API. 
From Lucene's perspective the `MemorySegment` API now has `System.arraycopy()`-like APIs to copy memory between the heap and memory segments. This improves speed. It also handles byte-swapping automatically. This version of the PR also uses `ValueLayout` instead of varhandles, as it makes code more readable and type-safe. - sixth version in JDK 19, included with build 23: https://openjdk.java.net/jeps/424 (actual version). This version moves the whole incubation API to a stable "preview API". Some classes were renamed and bugs (e.g. on Windows with huge mappings) were resolved. This new preview API more or less overcomes several problems: - The ByteBuffer API is limited to 32 bits (in fact MMapDirectory has to chunk in 1 GiB portions) - There is no official way to unmap ByteBuffers when the file is no longer used. There is a way to use `sun.misc.Unsafe` and forcefully unmap segments, but any IndexInput accessing the file from another thread will crash the JVM with SIGSEGV or SIGBUS. We learned to live with that and we happily apply the unsafe unmapping, but that's the main issue. @uschindler had many discussions with the team at OpenJDK, and finally, with the third incubator, we have an API that works with Lucene. The discussions were very fruitful (thanks to @mcimadamore !) With the third incubator we are now finally able to do some tests (especially performance). With JDK 19, we can do testing the following way by tweaking the build system a bit: - disable `-Werror` everywhere - upgrade minimum and release Java version to 19 (this breaks linting with ECJ, so precommit won't work, but running tests does) - add `--enable-preview` parameter for all modules and test runner The code basically just modifies `MMapDirectory` to use LONG instead of INT for the chunk size parameter. In addition it adds `MemorySegmentIndexInput`, which is a copy of our `ByteBufferIndexInput` (still there, but unused), but using MemorySegment instead of ByteBuffer behind the scenes. 
It works in exactly the same way, just the try/catch blocks for supporting EOFException or moving to another segment were rewritten. It passes all tests and it looks like you can use it to read indexes. The default chunk size is now 16 GiB (but you can raise or lower it as you like; tests are doing this). Of course you can set it to Long.MAX_VALUE, in which case every index file is always mapped to one big memory mapping. My testing with Windows 10 has shown that this is *not a good idea!!!* Huge mappings fragment address space over time, and as we can only use like 43 or 46 bits (depending on OS), the fragmentation will at some point kill you. So 16 GiB looks like a good compromise: most files will be smaller than 6 GiB anyway (unless you optimize your index to one huge segment). So for most Lucene installations, the number of segments will equal the number of open files, so Elasticsearch
[GitHub] [lucene] dweiss commented on pull request #909: LUCENE-10370: pass proper classpath/module arguments for forking jvms from within tests
dweiss commented on PR #909: URL: https://github.com/apache/lucene/pull/909#issuecomment-1133086554 @uschindler Are you ok with the patch as is? Should I rework it into an assumption error? Should I fall back to classpath if nothing is available (I'm not sold on this one)?
[GitHub] [lucene] mocobeta commented on pull request #833: LUCENE-10411: Add NN vectors support to ExitableDirectoryReader
mocobeta commented on PR #833: URL: https://github.com/apache/lucene/pull/833#issuecomment-1133038858 For your information, luceneutil now supports ExitableDirectoryReader (https://github.com/mikemccand/luceneutil/pull/172) and I ran the benchmark with vector values. Data: 500k English Wikipedia docs ([couldn't index 1 million docs for some reason](https://github.com/mikemccand/luceneutil/pull/178)), 200 dimensions. This is a sampled result and I saw similar results in multiple runs: ``` Task | QPS baseline (StdDev) | QPS exitable_directory_reader (StdDev) | Pct diff | p-value AndHighMedVector 430.56 (7.3%) 404.12 (11.1%) -6.1% ( -22% - 13%) 0.057 PKLookup 203.88 (33.2%) 194.26 (29.5%) -4.7% ( -50% - 86%) 0.662 LowTermVector 428.08 (9.7%) 410.96 (9.7%) -4.0% ( -21% - 17%) 0.229 MedTermVector 426.44 (14.4%) 413.25 (12.6%) -3.1% ( -26% - 27%) 0.504 AndHighHighVector 461.11 (15.4%) 457.82 (11.8%) -0.7% ( -24% - 31%) 0.880 HighTermVector 444.54 (17.1%) 443.67 (13.9%) -0.2% ( -26% - 37%) 0.971 AndHighLowVector 431.03 (19.9%) 434.53 (11.7%) 0.8% ( -25% - 40%) 0.885 ``` (Note that I ran the benchmark on a hard disk drive.)
[GitHub] [lucene] ywelsch opened a new pull request, #910: LUCENE-10582: Fix merging of CollectionStatistics in CombinedFieldQuery
ywelsch opened a new pull request, #910: URL: https://github.com/apache/lucene/pull/910 CombinedFieldQuery does not properly combine distributed collection statistics, resulting in an IllegalArgumentException during searches. Originally surfaced in this Elasticsearch issue: https://github.com/elastic/elasticsearch/issues/82817
[jira] [Created] (LUCENE-10582) CombinedFieldQuery fails with distributed field statistics
Yannick Welsch created LUCENE-10582: --- Summary: CombinedFieldQuery fails with distributed field statistics Key: LUCENE-10582 URL: https://issues.apache.org/jira/browse/LUCENE-10582 Project: Lucene - Core Issue Type: Bug Components: modules/sandbox Reporter: Yannick Welsch CombinedFieldQuery does not properly combine distributed collection statistics, resulting in an IllegalArgumentException during searches. Originally surfaced in this Elasticsearch issue: https://github.com/elastic/elasticsearch/issues/82817
[jira] [Resolved] (LUCENE-10579) fix smoketester backwards-check to not parse stdout
[ https://issues.apache.org/jira/browse/LUCENE-10579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved LUCENE-10579. -- Resolution: Fixed > fix smoketester backwards-check to not parse stdout > --- > > Key: LUCENE-10579 > URL: https://issues.apache.org/jira/browse/LUCENE-10579 > Project: Lucene - Core > Issue Type: Bug >Reporter: Robert Muir >Priority: Major > Fix For: 9.3 > > Attachments: backwards.log.gz > > Time Spent: 0.5h > Remaining Estimate: 0h > > The smoketester parses the output of TestBackwardsCompatibility -verbose > looking for certain prints for each index release. > But I think this is a noisier channel than you might expect. I added a hack > to log the stuff its trying to parse... it is legit crazy. See attachment > Let's rethink, maybe we should just examine the zip files?
[jira] [Commented] (LUCENE-10579) fix smoketester backwards-check to not parse stdout
[ https://issues.apache.org/jira/browse/LUCENE-10579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17540069#comment-17540069 ] ASF subversion and git services commented on LUCENE-10579: -- Commit 2090ac431843f689ca6f124723ef37c3b12a2826 in lucene's branch refs/heads/main from Robert Muir [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=2090ac43184 ] LUCENE-10579: fix smoketester backwards-check to not parse stdout (#903) This is very noisy, can contain gradle status updates, various other tests.verbose prints from other threads, you name it. It causes the check to be flaky, and randomly "miss" seeing a test that executed. Instead, let's look at the zip files. We can still preserve the essence of what the test wants to do, but without any flakiness. > fix smoketester backwards-check to not parse stdout > --- > > Key: LUCENE-10579 > URL: https://issues.apache.org/jira/browse/LUCENE-10579 > Project: Lucene - Core > Issue Type: Bug >Reporter: Robert Muir >Priority: Major > Fix For: 9.3 > > Attachments: backwards.log.gz > > Time Spent: 0.5h > Remaining Estimate: 0h > > The smoketester parses the output of TestBackwardsCompatibility -verbose > looking for certain prints for each index release. > But I think this is a noisier channel than you might expect. I added a hack > to log the stuff its trying to parse... it is legit crazy. See attachment > Let's rethink, maybe we should just examine the zip files?
[jira] [Commented] (LUCENE-10579) fix smoketester backwards-check to not parse stdout
[ https://issues.apache.org/jira/browse/LUCENE-10579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17540070#comment-17540070 ] ASF subversion and git services commented on LUCENE-10579: -- Commit 5a5c8bca4e691d91801d04d881aa25560413835c in lucene's branch refs/heads/branch_9x from Robert Muir [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=5a5c8bca4e6 ] LUCENE-10579: fix smoketester backwards-check to not parse stdout (#903) This is very noisy, can contain gradle status updates, various other tests.verbose prints from other threads, you name it. It causes the check to be flaky, and randomly "miss" seeing a test that executed. Instead, let's look at the zip files. We can still preserve the essence of what the test wants to do, but without any flakiness. > fix smoketester backwards-check to not parse stdout > --- > > Key: LUCENE-10579 > URL: https://issues.apache.org/jira/browse/LUCENE-10579 > Project: Lucene - Core > Issue Type: Bug >Reporter: Robert Muir >Priority: Major > Fix For: 9.3 > > Attachments: backwards.log.gz > > Time Spent: 0.5h > Remaining Estimate: 0h > > The smoketester parses the output of TestBackwardsCompatibility -verbose > looking for certain prints for each index release. > But I think this is a noisier channel than you might expect. I added a hack > to log the stuff its trying to parse... it is legit crazy. See attachment > Let's rethink, maybe we should just examine the zip files?
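Checking the backwards-compatibility archives directly, as the commit describes, can be as simple as opening each zip and verifying it contains a Lucene segments file. A hedged sketch of that idea (the helper name and file layout are assumptions for illustration; the real smoketester is a Python script, not this Java code):

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class ZipIndexCheck {
    // Does the archive contain a Lucene segments_N file, i.e. look like a real index?
    static boolean containsSegmentsFile(File zip) throws IOException {
        try (ZipFile zf = new ZipFile(zip)) {
            return zf.stream().anyMatch(e -> e.getName().startsWith("segments"));
        }
    }

    // Self-test: build a tiny zip with a segments_1 entry and inspect it.
    static boolean selfTest() throws IOException {
        File tmp = File.createTempFile("bwc-index", ".zip"); // hypothetical archive name
        tmp.deleteOnExit();
        try (ZipOutputStream out = new ZipOutputStream(new FileOutputStream(tmp))) {
            out.putNextEntry(new ZipEntry("segments_1"));
            out.closeEntry();
        }
        return containsSegmentsFile(tmp);
    }

    public static void main(String[] args) throws IOException {
        System.out.println("archive looks like an index: " + selfTest());
    }
}
```

Inspecting the archives is deterministic, unlike scraping `-verbose` stdout, which interleaves output from gradle and other test threads.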
[GitHub] [lucene] rmuir merged pull request #903: LUCENE-10579: fix smoketester backwards-check to not parse stdout
rmuir merged PR #903: URL: https://github.com/apache/lucene/pull/903 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9625) Benchmark KNN search with ann-benchmarks
[ https://issues.apache.org/jira/browse/LUCENE-9625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17540066#comment-17540066 ] Balmukund Mandal commented on LUCENE-9625: -- Thank you Michael > Benchmark KNN search with ann-benchmarks > > > Key: LUCENE-9625 > URL: https://issues.apache.org/jira/browse/LUCENE-9625 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Michael Sokolov >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > In addition to benchmarking with luceneutil, it would be good to be able to > make use of ann-benchmarks, which is publishing results from many approximate > knn algorithms, including the hnsw implementation from its authors. We don't > expect to challenge the performance of these native code libraries, however > it would be good to know just how far off we are. > I started looking into this and posted a fork of ann-benchmarks that uses > KnnGraphTester class to run these: > https://github.com/msokolov/ann-benchmarks. It's still a WIP; you have to > manually copy jars and the KnnGraphTester.class to the test host machine > rather than downloading from a distribution. KnnGraphTester needs some > modifications in order to support this process - this issue is mostly about > that. > One thing I noticed is that some of the index builds with higher fanout > (efConstruction) settings time out at 2h (on an AWS c5 instance), so this is > concerning and I'll open a separate issue for trying to improve that.
[GitHub] [lucene] dweiss commented on pull request #909: LUCENE-10370: pass proper classpath/module arguments for forking jvms from within tests
dweiss commented on PR #909: URL: https://github.com/apache/lucene/pull/909#issuecomment-1132763931 I thought about it but... Nah. I don't think it makes sense. You can run tests via gradle from IDEs as well if you have to. The only thing I was wondering would be whether to throw an assertion error or an assumption error from direct IDE launches.
[jira] [Commented] (LUCENE-10544) Should ExitableTermsEnum wrap postings and impacts?
[ https://issues.apache.org/jira/browse/LUCENE-10544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17540056#comment-17540056 ] Deepika Sharma commented on LUCENE-10544: - For adding a timeout feature in {{IndexSearcher}} through the {{BulkScorer}} approach, I think there are two ways we can implement this: * One way could be to make another class that wraps {{BulkScorer}} along with a timeout and other methods as required (similar to the existing {{TimeLimitingCollector}} class). * Another way could be to make a helper function inside the {{BulkScorer}} class that wraps the score method of {{BulkScorer}} with a timeout, invoked whenever a timeout is enabled. So I am looking for suggestions on which approach would be better, or whether there is another way I should consider. > Should ExitableTermsEnum wrap postings and impacts? > --- > > Key: LUCENE-10544 > URL: https://issues.apache.org/jira/browse/LUCENE-10544 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Reporter: Greg Miller >Priority: Major > > While looking into options for LUCENE-10151, I noticed that > {{ExitableDirectoryReader}} doesn't actually do any timeout checking once you > start iterating postings/impacts. It *does* create a {{ExitableTermsEnum}} > wrapper when loading a {{{}TermsEnum{}}}, but that wrapper doesn't do > anything to wrap postings or impacts. So timeouts will be enforced when > moving to the "next" term, but not when iterating the postings/impacts > associated with a term. > I think we ought to wrap the postings/impacts as well with some form of > timeout checking so timeouts can be enforced on long-running queries. I'm not > sure why this wasn't done originally (back in 2014), but it was questioned > back in 2020 on the original Jira SOLR-5986. Does anyone know of a good > reason why we shouldn't enforce timeouts in this way? 
> Related, we may also want to wrap things like {{seekExact}} and {{seekCeil}} > given that only {{next}} is being wrapped currently.
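The first option above (a wrapper class that adds timeout checks around scoring, in the spirit of {{TimeLimitingCollector}}) can be illustrated with a self-contained sketch. The {{SimpleScorer}} interface below is a simplified stand-in, not Lucene's actual {{BulkScorer}} API, which takes a {{LeafCollector}}, live-docs {{Bits}}, and a doc range.

```java
import java.util.function.LongSupplier;

// Simplified stand-in for Lucene's BulkScorer: scores docs in [minDoc, maxDoc).
interface SimpleScorer {
  void score(int minDoc, int maxDoc);
}

// Wrapper that checks a deadline periodically while delegating scoring,
// analogous in spirit to TimeLimitingCollector. Assumptions: windowed
// scoring and RuntimeException on timeout are illustrative choices.
class TimeLimitingScorer implements SimpleScorer {
  private final SimpleScorer in;
  private final LongSupplier clock; // e.g. System::nanoTime
  private final long deadlineNanos;

  TimeLimitingScorer(SimpleScorer in, LongSupplier clock, long deadlineNanos) {
    this.in = in;
    this.clock = clock;
    this.deadlineNanos = deadlineNanos;
  }

  @Override
  public void score(int minDoc, int maxDoc) {
    // Score in windows so the timeout is checked between windows rather
    // than only once for the whole doc range.
    final int window = 1024;
    for (int doc = minDoc; doc < maxDoc; doc += window) {
      if (clock.getAsLong() > deadlineNanos) {
        throw new RuntimeException("query timed out at doc " + doc);
      }
      in.score(doc, Math.min(doc + window, maxDoc));
    }
  }

  public static void main(String[] args) {
    long deadline = System.nanoTime() + 1_000_000_000L; // 1 second budget
    SimpleScorer base = (min, max) -> { /* score docs min..max */ };
    new TimeLimitingScorer(base, System::nanoTime, deadline).score(0, 1_000_000);
    System.out.println("scored within budget");
  }
}
```

The second option (a helper inside the scorer itself) would inline the same windowed deadline check rather than introducing a wrapper type.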
[GitHub] [lucene] uschindler commented on pull request #909: LUCENE-10370: pass proper classpath/module arguments for forking jvms from within tests
uschindler commented on PR #909: URL: https://github.com/apache/lucene/pull/909#issuecomment-1132736446 Could we not fall back to the classpath code if the forkoptions file system property is not there? This would allow running the tests in most JVMs. Maybe print a warning then.
[GitHub] [lucene] rmuir commented on a diff in pull request #892: LUCENE-10581: Optimize stored fields bulk merges on the first segment
rmuir commented on code in PR #892: URL: https://github.com/apache/lucene/pull/892#discussion_r877982698 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/compressing/Lucene90CompressingStoredFieldsWriter.java: ## @@ -553,14 +554,20 @@ private void copyChunks( final long toPointer = toDocID == sub.maxDoc ? reader.getMaxPointer() : index.getStartPointer(toDocID); if (fromPointer < toPointer) { - if (numBufferedDocs > 0) { -flush(true); - } final IndexInput rawDocs = reader.getFieldsStream(); rawDocs.seek(fromPointer); do { final int base = rawDocs.readVInt(); final int code = rawDocs.readVInt(); +final boolean dirtyChunk = (code & 2) != 0; +if (copyDirtyChunks) { + if (numBufferedDocs > 0) { +flush(true); + } +} else if (dirtyChunk || numBufferedDocs > 0) { + // Don't copy a dirty chunk or force a flush, which would create a dirty chunk + break; +} Review Comment: I find the way this is organized confusing, especially compared to the previous `if buffered > 0 flush()`. Is there some other way the code could be re-arranged? The if-then-else structure is just especially complicated and hard on the eyes. Maybe we should avoid the use of the `copyDirtyChunks` boolean. This boolean is making the logic especially difficult for me. If instead of `if (copyDirtyChunks)`, it were to read `if (mergeStrategy == DIRTY_BULK)`, then it's on its way to getting better. (Separately, I also hate the DIRTY_BULK and CLEAN_BULK names; I will think about it.)
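The control flow under review can be read as two distinct merge strategies, which is what the suggested `mergeStrategy == DIRTY_BULK` rewrite makes explicit. Here is a self-contained sketch of one possible rearrangement; the enum names, the `flush` callback, and the helper signature are illustrative stand-ins, not the actual Lucene code.

```java
// Illustrative rearrangement of the chunk-copy decision from the diff above.
// DIRTY_BULK / CLEAN_BULK mirror the names floated in the review; they are
// not real Lucene identifiers.
enum MergeStrategy { DIRTY_BULK, CLEAN_BULK }

class ChunkCopyDecision {
  /**
   * Returns true if the next chunk may be bulk-copied. When dirty chunks may
   * be copied, any pending buffered docs are flushed first; otherwise a dirty
   * chunk, or the flush that buffered docs would force (which would itself
   * create a dirty chunk), ends the bulk copy.
   */
  static boolean canCopyChunk(
      MergeStrategy strategy, boolean dirtyChunk, int numBufferedDocs, Runnable flush) {
    if (strategy == MergeStrategy.DIRTY_BULK) {
      if (numBufferedDocs > 0) {
        flush.run();
      }
      return true;
    }
    // CLEAN_BULK: don't copy a dirty chunk, and don't force a flush.
    return dirtyChunk == false && numBufferedDocs == 0;
  }
}
```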
[GitHub] [lucene] rmuir commented on pull request #892: LUCENE-10581: Optimize stored fields bulk merges on the first segment
rmuir commented on PR #892: URL: https://github.com/apache/lucene/pull/892#issuecomment-1132725061 Maybe one way to negate risk is to make the same changes to term vectors. It effectively increases the testing of such changes and at least makes all this bulk merge logic more consistent between the two (e.g. less possibilities of corruption)
[jira] [Updated] (LUCENE-10370) Fix classpath/module path of tests forking their own Java (TestNRTReplication)
[ https://issues.apache.org/jira/browse/LUCENE-10370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss updated LUCENE-10370: - Description: TestNRTReplication fails because it assumes classpath can just be copied to a sub-process - this is no longer the case. PR at: https://github.com/apache/lucene/pull/909 was:TestNRTReplication fails because it assumes classpath can just be copied to a sub-process - this is no longer the case. > Fix classpath/module path of tests forking their own Java (TestNRTReplication) > -- > > Key: LUCENE-10370 > URL: https://issues.apache.org/jira/browse/LUCENE-10370 > Project: Lucene - Core > Issue Type: Sub-task >Reporter: Dawid Weiss >Assignee: Dawid Weiss >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > TestNRTReplication fails because it assumes classpath can just be copied to a > sub-process - this is no longer the case. > PR at: > https://github.com/apache/lucene/pull/909
[GitHub] [lucene] rmuir commented on pull request #892: LUCENE-10581: Optimize stored fields bulk merges on the first segment
rmuir commented on PR #892: URL: https://github.com/apache/lucene/pull/892#issuecomment-1132717121 This adds a little more complexity to the scary bulk merge logic, though. Is it worth it? The change is somewhat terrifying. Can we do something to make it less so?
[GitHub] [lucene] dweiss commented on pull request #901: remove commented-out/obselete AwaitsFix
dweiss commented on PR #901: URL: https://github.com/apache/lucene/pull/901#issuecomment-1132716807 I filed a PR fixing those "forking" tests (not just the ones above, others as well). Rewriting the nrt tests completely so that they don't fork at all is a more complex task and probably deserves an issue in its own right.
[GitHub] [lucene] dweiss opened a new pull request, #909: LUCENE-10370: pass proper classpath/module arguments for forking jvms from within tests
dweiss opened a new pull request, #909: URL: https://github.com/apache/lucene/pull/909 This is putting some lipstick on the pig. Forking those JVMs is really an awkward thing to do. I understand some tests are specifically testing JVM crashes, etc., but, uh, those process builders/ pipes are ugly. Anyway. I tracked down java classpath references in a few test classes and replaced them with a single method on LuceneTestCase that passes proper classpath/ module path/ opens, consistent with those the test classes are running with. I ran nightly tests of the core+replication and all seems fine. The downside is that folks will no longer be able to run those tests from within IDEs, but I think it's an acceptable tradeoff given how complex those jvm arguments can get (some things on module path, some on classpath, etc.).
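The essence of what that LuceneTestCase helper does, passing the parent JVM's own paths to the forked JVM, can be sketched with plain JDK APIs. This is a simplified illustration only: it handles just the classpath, whereas the PR also propagates module path and opens; the main class name in the usage is hypothetical.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class ForkArgs {
  // Build a java command line that reuses the current JVM's classpath,
  // so the forked process sees the same classes as the parent. Module
  // path and --add-opens propagation (which the PR also covers) omitted.
  static List<String> javaCommand(String mainClass, String... args) {
    String javaExe =
        System.getProperty("java.home") + File.separator + "bin" + File.separator + "java";
    List<String> cmd = new ArrayList<>();
    cmd.add(javaExe);
    cmd.add("-cp");
    cmd.add(System.getProperty("java.class.path"));
    cmd.add(mainClass);
    for (String a : args) {
      cmd.add(a);
    }
    return cmd;
  }

  public static void main(String[] unused) {
    // org.example.Child is a hypothetical main class for illustration.
    List<String> cmd = javaCommand("org.example.Child", "arg1");
    System.out.println(String.join(" ", cmd));
    // new ProcessBuilder(cmd).inheritIO().start() would actually fork it.
  }
}
```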
[jira] [Assigned] (LUCENE-10370) Fix classpath/module path of tests forking their own Java (TestNRTReplication)
[ https://issues.apache.org/jira/browse/LUCENE-10370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss reassigned LUCENE-10370: Assignee: Dawid Weiss > Fix classpath/module path of tests forking their own Java (TestNRTReplication) > -- > > Key: LUCENE-10370 > URL: https://issues.apache.org/jira/browse/LUCENE-10370 > Project: Lucene - Core > Issue Type: Sub-task >Reporter: Dawid Weiss >Assignee: Dawid Weiss >Priority: Minor > > TestNRTReplication fails because it assumes classpath can just be copied to a > sub-process - this is no longer the case.
[GitHub] [lucene] jpountz commented on a diff in pull request #897: LUCENE-10266 Move nearest-neighbor search on points to core
jpountz commented on code in PR #897: URL: https://github.com/apache/lucene/pull/897#discussion_r877887683 ## lucene/core/src/java/org/apache/lucene/document/LatLonPoint.java: ## @@ -362,4 +377,72 @@ public static Query newDistanceFeatureQuery( } return query; } + + /** + * Finds the {@code n} nearest indexed points to the provided point, according to Haversine + * distance. + * + * This is functionally equivalent to running {@link MatchAllDocsQuery} with a {@link + * LatLonDocValuesField#newDistanceSort}, but is far more efficient since it takes advantage of + * properties the indexed BKD tree. Currently this only works with {@link Lucene90PointsFormat} Review Comment: this point that it only works with Lucene90PointsFormat is no longer true, can you remove it? ## lucene/core/src/java/org/apache/lucene/search/NearestNeighbor.java: ## @@ -31,12 +31,8 @@ import org.apache.lucene.util.Bits; import org.apache.lucene.util.SloppyMath; -/** - * KNN search on top of 2D lat/lon indexed points. - * - * @lucene.experimental - */ -class NearestNeighbor { +/** KNN search on top of 2D lat/lon indexed points. */ +public class NearestNeighbor { Review Comment: Can we keep it pkg-private if we have it in oal.document? This is especially important as we're promoting this class to core.
[GitHub] [lucene] jpountz commented on pull request #892: LUCENE-10581: Optimize stored fields bulk merges on the first segment
jpountz commented on PR #892: URL: https://github.com/apache/lucene/pull/892#issuecomment-1132621070 Quadratic merging is gone but this change still looks helpful so I updated the title/description.
[GitHub] [lucene] jpountz commented on pull request #907: LUCENE-10357 Ghost fields and postings/points
jpountz commented on PR #907: URL: https://github.com/apache/lucene/pull/907#issuecomment-1132617984 Thanks for tackling this! You are going in the right direction. Could we drop most of the `if (terms != Terms.EMPTY)` checks and handle them like any non-null `Terms` instance?
[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this
[ https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539982#comment-17539982 ] ASF subversion and git services commented on LUCENE-10574: -- Commit e56f7a6336b26adbf2a180a588240ffb75c47ff4 in lucene's branch refs/heads/branch_9x from Adrien Grand [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=e56f7a6336b ] LUCENE-10574: Keep allowing unbalanced merges if they would reclaim lots of deletes. (#905) `TestTieredMergePolicy` caught this special case: if a segment has lots of deletes, we should still allow unbalanced merges. > Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't > do this > --- > > Key: LUCENE-10574 > URL: https://issues.apache.org/jira/browse/LUCENE-10574 > Project: Lucene - Core > Issue Type: Bug >Reporter: Robert Muir >Priority: Major > Fix For: 9.3 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > Remove {{floorSegmentBytes}} parameter, or change lucene's default to a merge > policy that doesn't merge in an O(n^2) way. > I have the feeling it might have to be the latter, as folks seem really wed > to this crazy O(n^2) behavior.
[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this
[ https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539980#comment-17539980 ] ASF subversion and git services commented on LUCENE-10574: -- Commit 5e9dfbed23a9bdd21b16d2e31e5e805f4fd6 in lucene's branch refs/heads/main from Adrien Grand [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=5e9dfbed277 ] LUCENE-10574: Keep allowing unbalanced merges if they would reclaim lots of deletes. (#905) `TestTieredMergePolicy` caught this special case: if a segment has lots of deletes, we should still allow unbalanced merges. > Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't > do this > --- > > Key: LUCENE-10574 > URL: https://issues.apache.org/jira/browse/LUCENE-10574 > Project: Lucene - Core > Issue Type: Bug >Reporter: Robert Muir >Priority: Major > Fix For: 9.3 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > Remove {{floorSegmentBytes}} parameter, or change lucene's default to a merge > policy that doesn't merge in an O(n^2) way. > I have the feeling it might have to be the latter, as folks seem really wed > to this crazy O(n^2) behavior.
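The commit message describes the kind of check involved: a merge dominated by one big segment is normally rejected as unbalanced, unless that big segment carries enough deletes to be worth rewriting. A self-contained sketch of such a decision; the thresholds and the method shape are made up for illustration and are not TieredMergePolicy's actual heuristics.

```java
class MergeBalanceCheck {
  /**
   * Illustrative only: reject merges where the biggest segment accounts for
   * more than half the merged bytes, unless that segment has enough deletes
   * (here, at least 50%) that merging would reclaim a lot of space. Both
   * thresholds are hypothetical, not Lucene's real defaults.
   */
  static boolean allowMerge(
      long biggestSegmentBytes, long totalMergeBytes, double delPctOfBiggest) {
    boolean balanced = biggestSegmentBytes <= totalMergeBytes / 2;
    boolean reclaimsLotsOfDeletes = delPctOfBiggest >= 0.5;
    return balanced || reclaimsLotsOfDeletes;
  }
}
```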
[GitHub] [lucene] jpountz merged pull request #905: LUCENE-10574: Keep allowing unbalanced merges if they would reclaim lots of deletes.
jpountz merged PR #905: URL: https://github.com/apache/lucene/pull/905
[jira] [Commented] (LUCENE-10581) Optimize stored fields merges on the first segment
[ https://issues.apache.org/jira/browse/LUCENE-10581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539976#comment-17539976 ] Adrien Grand commented on LUCENE-10581: --- StoredFieldsBenchmark on the entire geonames dataset gives me
|| Msec to index || BEST_SPEED || BEST_COMPRESSION ||
| main | 105726 | 248489 |
| patch | 91000 | 200811 |
> Optimize stored fields merges on the first segment > -- > > Key: LUCENE-10581 > URL: https://issues.apache.org/jira/browse/LUCENE-10581 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > > This is mostly repurposing LUCENE-10573. Even though our merge policies no > longer perform quadratic merging, it's still possible to configure them with > low merge factors (e.g. 2) or they might decide to create unbalanced merges > where the biggest segment of the merge accounts for a large part of the > merge. In such cases, copying compressed data directly still yields > significant benefits.
[jira] [Created] (LUCENE-10581) Optimize stored fields merges on the first segment
Adrien Grand created LUCENE-10581: - Summary: Optimize stored fields merges on the first segment Key: LUCENE-10581 URL: https://issues.apache.org/jira/browse/LUCENE-10581 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand This is mostly repurposing LUCENE-10573. Even though our merge policies no longer perform quadratic merging, it's still possible to configure them with low merge factors (e.g. 2) or they might decide to create unbalanced merges where the biggest segment of the merge accounts for a large part of the merge. In such cases, copying compressed data directly still yields significant benefits.
[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this
[ https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539959#comment-17539959 ] Adrien Grand commented on LUCENE-10574: --- There was a good speedup on the stored fields benchmark thanks to this change: http://people.apache.org/~mikemccand/lucenebench/stored_fields_benchmarks.html - 2.8x faster for BEST_SPEED - 2.3x faster for BEST_COMPRESSION The other good news is that tests do not take longer despite the removal of LuceneTestCase#avoidPathologicalMerging. http://people.apache.org/~mikemccand/lucenebench/antcleantest.html Other benchmarks are not affected because they would flush large-enough segments to not be subject to the floor segment size. > Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't > do this > --- > > Key: LUCENE-10574 > URL: https://issues.apache.org/jira/browse/LUCENE-10574 > Project: Lucene - Core > Issue Type: Bug >Reporter: Robert Muir >Priority: Major > Fix For: 9.3 > > Time Spent: 2h > Remaining Estimate: 0h > > Remove {{floorSegmentBytes}} parameter, or change lucene's default to a merge > policy that doesn't merge in an O(n^2) way. > I have the feeling it might have to be the latter, as folks seem really wed > to this crazy O(n^2) behavior.