[GitHub] [lucene] mocobeta commented on a diff in pull request #917: LUCENE-10531: Disable distribution test (gui test) on windows.

2022-05-20 Thread GitBox


mocobeta commented on code in PR #917:
URL: https://github.com/apache/lucene/pull/917#discussion_r878647487


##
.github/workflows/distribution.yml:
##
@@ -19,8 +19,9 @@ jobs:
 runs-on: ${{ matrix.os }}
 strategy:
   matrix:
-# we run the distribution tests on all major OSs.
-os: [ubuntu-latest, macos-latest, windows-latest]
+# we want to run the distribution tests on all major OSs, but it's 
occasionally too slow (or hangs or the forked process is started at all..., not 
sure the cause) on windows.

Review Comment:
   ```suggestion
   # we want to run the distribution tests on all major OSs, but it's 
occasionally too slow (or hangs or the forked process is not started at all..., 
not sure the cause) on windows.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mocobeta commented on a diff in pull request #917: LUCENE-10531: Disable distribution test (gui test) on windows.

2022-05-20 Thread GitBox


mocobeta commented on code in PR #917:
URL: https://github.com/apache/lucene/pull/917#discussion_r878647263


##
.github/workflows/distribution.yml:
##
@@ -19,8 +19,9 @@ jobs:
 runs-on: ${{ matrix.os }}
 strategy:
   matrix:
-# we run the distribution tests on all major OSs.
-os: [ubuntu-latest, macos-latest, windows-latest]
+# we want to run the distribution tests on all major OSs, but it's 
occasionally too slow on windows.

Review Comment:
   ```suggestion
   # we want to run the distribution tests on all major OSs, but it's 
occasionally too slow (or hangs or the forked process is started at all..., not 
sure the cause) on windows.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mocobeta commented on pull request #913: Lucene 10577

2022-05-20 Thread GitBox


mocobeta commented on PR #913:
URL: https://github.com/apache/lucene/pull/913#issuecomment-1133536049

   > Looks like some other issue blocking the forked process.
   
   Yes I think it could completely hang for some reason.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dweiss commented on pull request #913: Lucene 10577

2022-05-20 Thread GitBox


dweiss commented on PR #913:
URL: https://github.com/apache/lucene/pull/913#issuecomment-1133533455

   > I'm planning to disable the test on windows VM - it looks too slow in 
GitHub Actions.
   
   I'm not sure it's the slowness anymore. Looks like some other issue blocking 
the forked process.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mocobeta commented on pull request #913: Lucene 10577

2022-05-20 Thread GitBox


mocobeta commented on PR #913:
URL: https://github.com/apache/lucene/pull/913#issuecomment-1133532803

   Also, it should run on headless JVM; we are running it on CI servers without 
displays.
   
https://github.com/apache/lucene/blob/71a9acb2e2aa55257021eefce1e5d8d390bc7048/lucene/luke/src/java/org/apache/lucene/luke/app/desktop/LukeMain.java#L75-L78


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mocobeta commented on pull request #893: LUCENE-10531: Add @RequiresGUI test group for GUI tests

2022-05-20 Thread GitBox


mocobeta commented on PR #893:
URL: https://github.com/apache/lucene/pull/893#issuecomment-1133530902

   We increased the timeout to 120 seconds, but the test still occasionally 
fails on Windows. I opened #917 to disable it on windows for now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mocobeta opened a new pull request, #917: LUCENE-10531: Disable distribution test (gui test) on windows.

2022-05-20 Thread GitBox


mocobeta opened a new pull request, #917:
URL: https://github.com/apache/lucene/pull/917

   The test occasionally is too slow and fails on Windows VM. 
   We can't increase the timeout infinitely, I would disable it on GitHub 
Actions for now (we still run it on Jenkins Servers).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mocobeta commented on pull request #913: Lucene 10577

2022-05-20 Thread GitBox


mocobeta commented on PR #913:
URL: https://github.com/apache/lucene/pull/913#issuecomment-1133529000

   > hm this "luke-can-be-launched" test failed for me locally too. I guess it 
needs a JVM with a head?
   
   It's not a part of the mandatory test suite but you can run it headlessly 
(if you are on Linux). Please see: 
https://github.com/apache/lucene/blob/71a9acb2e2aa55257021eefce1e5d8d390bc7048/help/tests.txt#L131
   
   As for the occasional test failure of the GUI test for Luke, I'm planning to 
disable the test on windows VM - it looks too slow in GitHub Actions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on a diff in pull request #913: Lucene 10577

2022-05-20 Thread GitBox


rmuir commented on code in PR #913:
URL: https://github.com/apache/lucene/pull/913#discussion_r878641687


##
lucene/core/src/java/org/apache/lucene/util/VectorUtil.java:
##
@@ -213,4 +213,21 @@ public static void add(float[] u, float[] v) {
   u[i] += v[i];
 }
   }
+
+  public static float dotProduct(BytesRef a, int aOffset, BytesRef b, int 
bOffset, int len) {
+// fixme -- move to codec? What if later we want to access the bytes some 
other way?
+int total = 0;
+for (int i = 0; i < len; i++) {
+  total += a.bytes[aOffset++] * b.bytes[bOffset++];

Review Comment:
   i wonder if we can coerce autovectorization of this loop somehow too. it is 
also worth looking into as it wouldn't require any vector api at all, maybe 
just manipulating the code. it is integer math: doesn't have the restrictions 
that floating point does (order of operations etc), so theoretically the 
compiler could do it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mocobeta opened a new pull request, #916: Refine contribution guide and pull request template

2022-05-20 Thread GitBox


mocobeta opened a new pull request, #916:
URL: https://github.com/apache/lucene/pull/916

   Some part of the pull request template has been a bit obsoleted, and there 
are overlaps between the contribution guide and the template.
   
   I noticed the majority of developers (including me...) delete all texts in 
the template and start with a blank sheet (90% of PRs, I roughly estimate) - 
then how about just encouraging people to refer to the contribution guide in 
the template?
   
   To avoid information loss, I moved checklist items that did not appear in 
the contribution guide to it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10312) Add PersianStemmer

2022-05-20 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17540341#comment-17540341
 ] 

ASF subversion and git services commented on LUCENE-10312:
--

Commit 71a9acb2e2aa55257021eefce1e5d8d390bc7048 in lucene's branch 
refs/heads/main from Tomoko Uchida
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=71a9acb2e2a ]

LUCENE-10312: MIGRATE entry and small follow-ups (#908)



> Add PersianStemmer
> --
>
> Key: LUCENE-10312
> URL: https://issues.apache.org/jira/browse/LUCENE-10312
> Project: Lucene - Core
>  Issue Type: Wish
>  Components: modules/analysis
>Affects Versions: 9.0
>Reporter: Ramin Alirezaee
>Priority: Major
> Fix For: 10.0 (main), 9.2
>
> Attachments: image.png
>
>  Time Spent: 10h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mocobeta merged pull request #908: LUCENE-10312: small follow-ups

2022-05-20 Thread GitBox


mocobeta merged PR #908:
URL: https://github.com/apache/lucene/pull/908


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mocobeta commented on a diff in pull request #912: Initial rewrite of MMapDirectory for JDK-19 preview Panama APIs (>= JDK-19-ea+23)

2022-05-20 Thread GitBox


mocobeta commented on code in PR #912:
URL: https://github.com/apache/lucene/pull/912#discussion_r878635805


##
lucene/core/src/java/org/apache/lucene/store/MMapDirectory.java:
##
@@ -232,55 +224,62 @@ public IndexInput openInput(String name, IOContext 
context) throws IOException {
 ensureOpen();
 ensureCanRead(name);
 Path path = directory.resolve(name);
-try (FileChannel c = FileChannel.open(path, StandardOpenOption.READ)) {
-  final String resourceDescription = "MMapIndexInput(path=\"" + 
path.toString() + "\")";
-  final boolean useUnmap = getUseUnmap();
-  return ByteBufferIndexInput.newInstance(
-  resourceDescription,
-  map(resourceDescription, c, 0, c.size()),
-  c.size(),
-  chunkSizePower,
-  new ByteBufferGuard(resourceDescription, useUnmap ? CLEANER : null));
+final String resourceDescription = "MemorySegmentIndexInput(path=\"" + 
path.toString() + "\")";
+
+// Work around for JDK-8259028: we need to unwrap our test-only file 
system layers
+path = Unwrappable.unwrapAll(path);
+
+boolean success = false;
+final MemorySession session = MemorySession.openShared();
+try (var fc = FileChannel.open(path)) {
+  final long fileSize = fc.size();
+  final MemorySegment[] segments = map(session, resourceDescription, fc, 
fileSize);
+  final IndexInput in =
+  MemorySegmentIndexInput.newInstance(
+  resourceDescription, session, segments, fileSize, 
chunkSizePower);
+  success = true;
+  return in;
+} finally {
+  if (success == false) {
+session.close();
+  }
 }
   }
 
-  /** Maps a file into a set of buffers */
-  final ByteBuffer[] map(String resourceDescription, FileChannel fc, long 
offset, long length)
+  /** Maps a file into a set of segments */
+  final MemorySegment[] map(
+  MemorySession session, String resourceDescription, FileChannel fc, long 
length)
   throws IOException {
 if ((length >>> chunkSizePower) >= Integer.MAX_VALUE)
-  throw new IllegalArgumentException(
-  "RandomAccessFile too big for chunk size: " + resourceDescription);
+  throw new IllegalArgumentException("File too big for chunk size: " + 
resourceDescription);
 
 final long chunkSize = 1L << chunkSizePower;
 
-// we always allocate one more buffer, the last one may be a 0 byte one
-final int nrBuffers = (int) (length >>> chunkSizePower) + 1;
+// we always allocate one more segments, the last one may be a 0 byte one
+final int nrSegments = (int) (length >>> chunkSizePower) + 1;
 
-ByteBuffer[] buffers = new ByteBuffer[nrBuffers];
+final MemorySegment[] segments = new MemorySegment[nrSegments];
 
-long bufferStart = 0L;
-for (int bufNr = 0; bufNr < nrBuffers; bufNr++) {
-  int bufSize =
-  (int) ((length > (bufferStart + chunkSize)) ? chunkSize : (length - 
bufferStart));
-  MappedByteBuffer buffer;
+long startOffset = 0L;
+for (int segNr = 0; segNr < nrSegments; segNr++) {
+  long segSize = (length > (startOffset + chunkSize)) ? chunkSize : 
(length - startOffset);

Review Comment:
   minor: can be final?
   also, thanks for changing the type to `long` - I was wondering why the old 
`bufSize` should be int.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller commented on pull request #915: LUCENE-10585: Scrub copy/paste code in the facets module and attempt to simplify a bit

2022-05-20 Thread GitBox


gsmiller commented on PR #915:
URL: https://github.com/apache/lucene/pull/915#issuecomment-1133483198

   Ran benchmarks to make sure this didn't regress performance in any 
meaningful way. I'm not seeing any regressions:
   
   ```
   TaskQPS baseline  StdDevQPS candidate  
StdDevPct diff p-value
   Wildcard  337.70 (16.7%)  323.50 
(16.7%)   -4.2% ( -32% -   35%) 0.426
  BrowseMonthTaxoFacets   29.43 (13.8%)   28.72 
(19.1%)   -2.4% ( -31% -   35%) 0.649
   BrowseDateSSDVFacets2.68 (17.5%)2.63 
(17.0%)   -1.8% ( -30% -   39%) 0.743
  BrowseDayOfYearSSDVFacets   14.94 (16.5%)   14.68 
(13.6%)   -1.8% ( -27% -   33%) 0.709
  HighTermMonthSort  147.90 (22.4%)  145.50 
(19.4%)   -1.6% ( -35% -   51%) 0.807
  HighTermDayOfYearSort   61.68 (14.9%)   60.80 
(15.3%)   -1.4% ( -27% -   33%) 0.766
  OrHighNotHigh 1107.68  (4.1%) 1092.99  
(2.8%)   -1.3% (  -7% -5%) 0.232
LowTerm 2181.38  (3.5%) 2153.14  
(3.1%)   -1.3% (  -7% -5%) 0.213
   OrHighNotMed 1420.79  (2.7%) 1404.11  
(3.1%)   -1.2% (  -6% -4%) 0.203
 TermDTSort  117.08 (15.5%)  115.77 
(15.3%)   -1.1% ( -27% -   35%) 0.818
BrowseRandomLabelTaxoFacets   18.20 (10.0%)   18.01 
(13.2%)   -1.1% ( -22% -   24%) 0.769
LowSloppyPhrase  121.57  (4.8%)  120.52  
(3.7%)   -0.9% (  -8% -8%) 0.527
   HighTerm 1427.75  (4.8%) 1416.29  
(3.0%)   -0.8% (  -8% -7%) 0.529
MedSloppyPhrase   92.18  (5.1%)   91.45  
(4.0%)   -0.8% (  -9% -8%) 0.589
   BrowseDateTaxoFacets   20.98 (12.4%)   20.82 
(15.5%)   -0.7% ( -25% -   30%) 0.866
   OrHighNotLow 1233.19  (4.3%) 1223.99  
(3.8%)   -0.7% (  -8% -7%) 0.557
  OrHighLow  515.20  (2.8%)  511.43  
(2.7%)   -0.7% (  -6% -4%) 0.400
  BrowseDayOfYearTaxoFacets   20.91 (12.5%)   20.77 
(15.7%)   -0.6% ( -25% -   31%) 0.886
AndHighHigh   40.18  (4.6%)   40.06  
(4.2%)   -0.3% (  -8% -8%) 0.823
   OrNotHighMed  923.89  (3.3%)  921.97  
(2.8%)   -0.2% (  -6% -6%) 0.830
   HighSloppyPhrase   17.59  (6.1%)   17.55  
(4.3%)   -0.2% ( -10% -   10%) 0.906
 AndHighMed  137.42  (3.7%)  137.17  
(3.5%)   -0.2% (  -7% -7%) 0.870
   PKLookup  168.11  (2.6%)  168.01  
(3.4%)   -0.1% (  -5% -6%) 0.951
BrowseRandomLabelSSDVFacets   10.14  (3.6%)   10.14  
(3.3%)0.0% (  -6% -7%) 0.989
 IntNRQ  204.47  (0.9%)  204.53  
(0.6%)0.0% (  -1% -1%) 0.907
  OrNotHighHigh 1480.42  (3.4%) 1480.99  
(2.4%)0.0% (  -5% -6%) 0.967
Prefix3  221.07 (15.0%)  221.20 
(13.6%)0.1% ( -24% -   33%) 0.989
 Fuzzy1   79.60  (1.8%)   79.74  
(1.4%)0.2% (  -2% -3%) 0.727
 Fuzzy2   75.46  (1.9%)   75.62  
(1.6%)0.2% (  -3% -3%) 0.697
MedTerm 1563.09  (5.2%) 1567.15  
(3.5%)0.3% (  -8% -9%) 0.853
MedSpanNear   27.45  (1.7%)   27.54  
(1.9%)0.3% (  -3% -3%) 0.579
  LowPhrase  123.19  (2.2%)  123.66  
(1.7%)0.4% (  -3% -4%) 0.539
Respell   46.77  (1.8%)   46.97  
(1.4%)0.4% (  -2% -3%) 0.422
  MedPhrase  476.52  (2.1%)  478.54  
(1.9%)0.4% (  -3% -4%) 0.497
 HighPhrase  222.77  (2.3%)  223.82  
(1.7%)0.5% (  -3% -4%) 0.458
 AndHighLow 1161.41  (2.3%) 1167.56  
(2.4%)0.5% (  -4% -5%) 0.482
  OrHighMed  128.58  (3.7%)  129.28  
(4.3%)0.5% (  -7% -8%) 0.671
LowSpanNear   13.26  (2.3%)   13.34  
(2.2%)0.6% (  -3% -5%) 0.421
   HighSpanNear   10.68  (2.2%)   10.74  
(2.1%)0.6% (  -3% -5%) 0.392
   OrNotHighLow 1314.95  (3.3%) 1324.37  
(2.5%)0.7% (  -4% -6%) 0.443
 OrHighHigh   

[GitHub] [lucene] gsmiller opened a new pull request, #915: LUCENE-10585: Scrub copy/paste code in the facets module and attempt to simplify a bit

2022-05-20 Thread GitBox


gsmiller opened a new pull request, #915:
URL: https://github.com/apache/lucene/pull/915

   # Description
   
   The facets module has had a fair amount of recent development, some 
resulting in a fair amount of copy/paste duplication. This attempts to clean it 
up a bit and simplify/refactor some of the internal functionality.
   
   # Solution
   
   Minor internal refactoring of some facet implementations and extraction of a 
common parent class to contain some duplicated logic in the SSDV faceting 
implementations.
   
   # Tests
   
   No new tests were written. All existing testing is still passing.
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [x] I have reviewed the guidelines for [How to 
Contribute](https://github.com/apache/lucene/blob/main/CONTRIBUTING.md) and my 
code conforms to the standards described there to the best of my ability.
   - [x] I have given Lucene maintainers 
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the `main` branch.
   - [x] I have run `./gradlew check`.
   - [ ] I have added tests for my changes.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] Yuti-G commented on pull request #777: LUCENE-10488: Optimize Facets#getTopDims in ConcurrentSortedSetDocValuesFacetCounts

2022-05-20 Thread GitBox


Yuti-G commented on PR #777:
URL: https://github.com/apache/lucene/pull/777#issuecomment-1133481903

   > This looks great! Thanks @Yuti-G! It would be nice if we could create a 
common abstract class to hold some of the common logic between this and the 
non-concurrent implementation. Seems like a lot of copy/paste going on. This 
isn't a new problem though, so let's not do that work as part of this PR. What 
do you think of opening a separate issue to see if we can consolidate some 
common logic? You probably have a better idea of how feasible this is after 
working on these changes, so I'm curious what you think. Thanks again for 
taking this on!
   
   Thanks @gsmiller for the great idea! I will open a separate issue to 
consolidate the common logic between ConcurrentSSDVFacetCounts and 
SSDVFacetCounts after LUCENE-10550 are merged. There are many duplicate codes 
between these two classes in https://github.com/apache/lucene/pull/914 as well.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] Yuti-G opened a new pull request, #914: LUCENE-10550: Add getAllChildren functionality to facets

2022-05-20 Thread GitBox


Yuti-G opened a new pull request, #914:
URL: https://github.com/apache/lucene/pull/914

   # Description
   
   This new API returns all children for a specified path and allows users to 
sort in their desired criteria
   
   # Solution
   
   * Added getAllChildren function in the Facets class.
   * Override getAllChildren in subclasses
   
   # Tests
   
   Added new testing for all the overridden implementations of getAllChildren 
in Facets
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [X] I have reviewed the guidelines for [How to 
Contribute](https://github.com/apache/lucene/blob/main/CONTRIBUTING.md) and my 
code conforms to the standards described there to the best of my ability.
   - [X] I have given Lucene maintainers 
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [X] I have developed this patch against the `main` branch.
   - [X] I have run `./gradlew check`.
   - [X] I have added tests for my changes.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-10370) Fix classpath/module path of tests forking their own Java (TestNRTReplication)

2022-05-20 Thread Dawid Weiss (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss resolved LUCENE-10370.
--
Resolution: Fixed

> Fix classpath/module path of tests forking their own Java (TestNRTReplication)
> --
>
> Key: LUCENE-10370
> URL: https://issues.apache.org/jira/browse/LUCENE-10370
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Minor
> Fix For: 9.3
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> TestNRTReplication fails because it assumes classpath can just be copied to a 
> sub-process - this is no longer the case.
> PR at:
> https://github.com/apache/lucene/pull/909



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] msokolov commented on a diff in pull request #913: Lucene 10577

2022-05-20 Thread GitBox


msokolov commented on code in PR #913:
URL: https://github.com/apache/lucene/pull/913#discussion_r878580201


##
lucene/core/src/java/org/apache/lucene/util/VectorUtil.java:
##
@@ -213,4 +213,21 @@ public static void add(float[] u, float[] v) {
   u[i] += v[i];
 }
   }
+
+  public static float dotProduct(BytesRef a, int aOffset, BytesRef b, int 
bOffset, int len) {
+// fixme -- move to codec? What if later we want to access the bytes some 
other way?
+int total = 0;
+for (int i = 0; i < len; i++) {
+  total += a.bytes[aOffset++] * b.bytes[bOffset++];

Review Comment:
   I will give castShape/convert/convertShape a try. I also found 
reduceLanesToLong that seems promising



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10370) Fix classpath/module path of tests forking their own Java (TestNRTReplication)

2022-05-20 Thread Dawid Weiss (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss updated LUCENE-10370:
-
Fix Version/s: 9.3

> Fix classpath/module path of tests forking their own Java (TestNRTReplication)
> --
>
> Key: LUCENE-10370
> URL: https://issues.apache.org/jira/browse/LUCENE-10370
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Minor
> Fix For: 9.3
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> TestNRTReplication fails because it assumes classpath can just be copied to a 
> sub-process - this is no longer the case.
> PR at:
> https://github.com/apache/lucene/pull/909



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10370) Fix classpath/module path of tests forking their own Java (TestNRTReplication)

2022-05-20 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17540315#comment-17540315
 ] 

ASF subversion and git services commented on LUCENE-10370:
--

Commit ab19b4986aa2eea82d24f0011619eec629191c59 in lucene's branch 
refs/heads/branch_9x from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=ab19b4986aa ]

LUCENE-10370: pass proper classpath/module arguments for forking jvms from 
within tests. (#909)



> Fix classpath/module path of tests forking their own Java (TestNRTReplication)
> --
>
> Key: LUCENE-10370
> URL: https://issues.apache.org/jira/browse/LUCENE-10370
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Minor
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> TestNRTReplication fails because it assumes classpath can just be copied to a 
> sub-process - this is no longer the case.
> PR at:
> https://github.com/apache/lucene/pull/909



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10370) Fix classpath/module path of tests forking their own Java (TestNRTReplication)

2022-05-20 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17540314#comment-17540314
 ] 

ASF subversion and git services commented on LUCENE-10370:
--

Commit 63b66e06cdf648f2da47fb8a430518079f16b83c in lucene's branch 
refs/heads/main from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=63b66e06cdf ]

LUCENE-10370: pass proper classpath/module arguments for forking jvms from 
within tests. (#909)



> Fix classpath/module path of tests forking their own Java (TestNRTReplication)
> --
>
> Key: LUCENE-10370
> URL: https://issues.apache.org/jira/browse/LUCENE-10370
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Minor
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> TestNRTReplication fails because it assumes classpath can just be copied to a 
> sub-process - this is no longer the case.
> PR at:
> https://github.com/apache/lucene/pull/909



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dweiss merged pull request #909: LUCENE-10370: pass proper classpath/module arguments for forking jvms from within tests

2022-05-20 Thread GitBox


dweiss merged PR #909:
URL: https://github.com/apache/lucene/pull/909


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dweiss commented on pull request #909: LUCENE-10370: pass proper classpath/module arguments for forking jvms from within tests

2022-05-20 Thread GitBox


dweiss commented on PR #909:
URL: https://github.com/apache/lucene/pull/909#issuecomment-1133434751

   > If we make it an assumption, it makes the results look a bit nicer?
   
   Depends. It can go unnoticed then because it superficially looks just as if 
the test was ignored. I'd rather have it fail because then it draws more 
attention.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-10544) Should ExitableTermsEnum wrap postings and impacts?

2022-05-20 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17540309#comment-17540309
 ] 

Greg Miller edited comment on LUCENE-10544 at 5/20/22 10:03 PM:


As I understood Adrien's suggestion, I think the idea is to create a new 
{{BulkScorer}} sub-class that would wrap another {{BulkScorer}} (provided as 
{{in}} in Adrien's code snippet). This class would override the {{score}} 
method as Adrien shows above to periodically check timeouts, but otherwise just 
delegate to {{in}} if the query has not yet timed out. I imagine somewhere in 
{{IndexSearcher}} you would instantiate this new "timeout enforcing bulk 
scorer", wrapping the {{BulkScorer}} provided by the query's weight. Does that 
help?

Also, can I request that we move this conversation over to 
[LUCENE-10151|https://issues.apache.org/jira/browse/LUCENE-10151]? This issue 
is really about modifying {{ExitableTermsEnum}}, which we may want to 
eventually due independent of adding timeout support to {{IndexSearcher}}. 
Since this discussion is really about adding timeout support to 
{{IndexSearcher}}, it would be best to capture the conversation in LUCENE-10151 
to make it easier to dig up in the future. Thank you!


was (Author: gsmiller):
As I understood Adrien's suggestion, I think the idea is to create a new 
{{BulkScorer}} sub-class that would wrap another {{BulkScorer}} (provided as 
{{in}} in Adrien's code snippet). This class would override the {{score}} 
method as Adrien shows above to periodically check timeouts, but otherwise just 
delegate to {{in}} if the query has not yet timed out. I image somewhere in 
{{IndexSearcher}} you would instantiate this new "timeout enforcing bulk 
scorer", wrapping the {{BulkScorer}} provided by the query's weight. Does that 
help?

Also, can I request that we move this conversation over to 
[LUCENE-10151|https://issues.apache.org/jira/browse/LUCENE-10151]? This issue 
is really about modifying {{ExitableTermsEnum}}, which we may want to 
eventually due independent of adding timeout support to {{IndexSearcher}}. 
Since this discussion is really about adding timeout support to 
{{IndexSearcher}}, it would be best to capture the conversation in LUCENE-10151 
to make it easier to dig up in the future. Thank you!

> Should ExitableTermsEnum wrap postings and impacts?
> ---
>
> Key: LUCENE-10544
> URL: https://issues.apache.org/jira/browse/LUCENE-10544
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Reporter: Greg Miller
>Priority: Major
>
> While looking into options for LUCENE-10151, I noticed that 
> {{ExitableDirectoryReader}} doesn't actually do any timeout checking once you 
> start iterating postings/impacts. It *does* create a {{ExitableTermsEnum}} 
> wrapper when loading a {{{}TermsEnum{}}}, but that wrapper doesn't do 
> anything to wrap postings or impacts. So timeouts will be enforced when 
> moving to the "next" term, but not when iterating the postings/impacts 
> associated with a term.
> I think we ought to wrap the postings/impacts as well with some form of 
> timeout checking so timeouts can be enforced on long-running queries. I'm not 
> sure why this wasn't done originally (back in 2014), but it was questioned 
> back in 2020 on the original Jira SOLR-5986. Does anyone know of a good 
> reason why we shouldn't enforce timeouts in this way?
> Related, we may also want to wrap things like {{seekExact}} and {{seekCeil}} 
> given that only {{next}} is being wrapped currently.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10544) Should ExitableTermsEnum wrap postings and impacts?

2022-05-20 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17540309#comment-17540309
 ] 

Greg Miller commented on LUCENE-10544:
--

As I understood Adrien's suggestion, I think the idea is to create a new 
{{BulkScorer}} sub-class that would wrap another {{BulkScorer}} (provided as 
{{in}} in Adrien's code snippet). This class would override the {{score}} 
method as Adrien shows above to periodically check timeouts, but otherwise just 
delegate to {{in}} if the query has not yet timed out. I image somewhere in 
{{IndexSearcher}} you would instantiate this new "timeout enforcing bulk 
scorer", wrapping the {{BulkScorer}} provided by the query's weight. Does that 
help?

Also, can I request that we move this conversation over to 
[LUCENE-10151|https://issues.apache.org/jira/browse/LUCENE-10151]? This issue 
is really about modifying {{ExitableTermsEnum}}, which we may want to 
eventually due independent of adding timeout support to {{IndexSearcher}}. 
Since this discussion is really about adding timeout support to 
{{IndexSearcher}}, it would be best to capture the conversation in LUCENE-10151 
to make it easier to dig up in the future. Thank you!

> Should ExitableTermsEnum wrap postings and impacts?
> ---
>
> Key: LUCENE-10544
> URL: https://issues.apache.org/jira/browse/LUCENE-10544
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Reporter: Greg Miller
>Priority: Major
>
> While looking into options for LUCENE-10151, I noticed that 
> {{ExitableDirectoryReader}} doesn't actually do any timeout checking once you 
> start iterating postings/impacts. It *does* create a {{ExitableTermsEnum}} 
> wrapper when loading a {{{}TermsEnum{}}}, but that wrapper doesn't do 
> anything to wrap postings or impacts. So timeouts will be enforced when 
> moving to the "next" term, but not when iterating the postings/impacts 
> associated with a term.
> I think we ought to wrap the postings/impacts as well with some form of 
> timeout checking so timeouts can be enforced on long-running queries. I'm not 
> sure why this wasn't done originally (back in 2014), but it was questioned 
> back in 2020 on the original Jira SOLR-5986. Does anyone know of a good 
> reason why we shouldn't enforce timeouts in this way?
> Related, we may also want to wrap things like {{seekExact}} and {{seekCeil}} 
> given that only {{next}} is being wrapped currently.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jtibshirani commented on a diff in pull request #910: LUCENE-10582: Fix merging of CollectionStatistics in CombinedFieldQuery

2022-05-20 Thread GitBox


jtibshirani commented on code in PR #910:
URL: https://github.com/apache/lucene/pull/910#discussion_r878568768


##
lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestCombinedFieldQuery.java:
##
@@ -589,4 +589,52 @@ public SimScorer scorer(
   return new BM25Similarity().scorer(boost, collectionStats, termStats);
 }
   }
+
+  public void testDistributedCollectionStatistics() throws IOException {
+Directory dir = newDirectory();
+IndexWriterConfig iwc = new IndexWriterConfig();
+iwc.setSimilarity(randomCompatibleSimilarity());
+RandomIndexWriter w = new RandomIndexWriter(random(), dir, iwc);
+
+String queryString = "foo";
+
+Document doc0 = new Document();
+doc0.add(new TextField("f", "foo", Store.NO));
+doc0.add(new TextField("g", "foo baz", Store.NO));
+w.addDocument(doc0);
+
+IndexReader reader = w.getReader();
+IndexSearcher searcher =
+new IndexSearcher(reader) {
+  @Override
+  public CollectionStatistics collectionStatistics(String field) 
throws IOException {
+CollectionStatistics shardStatistics = 
super.collectionStatistics(field);
+int extraMaxDoc = randomIntBetween(0, 10);
+int extraDocCount = randomIntBetween(0, extraMaxDoc);
+int extraSumDocFreq = extraDocCount + randomIntBetween(0, 10);
+int extraSumTotalTermFreq = extraSumDocFreq + randomIntBetween(0, 
10);
+CollectionStatistics globalStatistics =
+new CollectionStatistics(
+field,
+shardStatistics.maxDoc() + extraMaxDoc,
+shardStatistics.docCount() + extraDocCount,
+shardStatistics.sumTotalTermFreq() + extraSumTotalTermFreq,
+shardStatistics.sumDocFreq() + extraSumDocFreq);
+return globalStatistics;
+  }
+};
+searcher.setSimilarity(new BM25Similarity());
+CombinedFieldQuery query =
+new CombinedFieldQuery.Builder()
+.addField("f")
+.addField("g")
+.addTerm(new BytesRef(queryString))
+.build();
+// just check that search does not fail
+searcher.search(query, 10);

Review Comment:
   It'd be nice to assert something stronger here, to check that 
`CombinedFieldQuery` still works as expected when collection stats are 
overridden. Maybe we could compare the output of two query strategies like we 
do in `testCopyField`.



##
lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestCombinedFieldQuery.java:
##
@@ -589,4 +589,52 @@ public SimScorer scorer(
   return new BM25Similarity().scorer(boost, collectionStats, termStats);
 }
   }
+
+  public void testDistributedCollectionStatistics() throws IOException {
+Directory dir = newDirectory();
+IndexWriterConfig iwc = new IndexWriterConfig();
+iwc.setSimilarity(randomCompatibleSimilarity());
+RandomIndexWriter w = new RandomIndexWriter(random(), dir, iwc);
+
+String queryString = "foo";
+
+Document doc0 = new Document();
+doc0.add(new TextField("f", "foo", Store.NO));
+doc0.add(new TextField("g", "foo baz", Store.NO));
+w.addDocument(doc0);
+
+IndexReader reader = w.getReader();
+IndexSearcher searcher =
+new IndexSearcher(reader) {
+  @Override
+  public CollectionStatistics collectionStatistics(String field) 
throws IOException {
+CollectionStatistics shardStatistics = 
super.collectionStatistics(field);
+int extraMaxDoc = randomIntBetween(0, 10);
+int extraDocCount = randomIntBetween(0, extraMaxDoc);
+int extraSumDocFreq = extraDocCount + randomIntBetween(0, 10);
+int extraSumTotalTermFreq = extraSumDocFreq + randomIntBetween(0, 
10);
+CollectionStatistics globalStatistics =
+new CollectionStatistics(
+field,
+shardStatistics.maxDoc() + extraMaxDoc,
+shardStatistics.docCount() + extraDocCount,
+shardStatistics.sumTotalTermFreq() + extraSumTotalTermFreq,
+shardStatistics.sumDocFreq() + extraSumDocFreq);
+return globalStatistics;
+  }
+};
+searcher.setSimilarity(new BM25Similarity());

Review Comment:
   It's unusual to search with a different similarity than was used during 
indexing -- I think we could remove this line.



##
lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestCombinedFieldQuery.java:
##
@@ -589,4 +589,52 @@ public SimScorer scorer(
   return new BM25Similarity().scorer(boost, collectionStats, termStats);
 }
   }
+
+  public void testDistributedCollectionStatistics() throws IOException {

Review Comment:
   Small comment, maybe we could call this `testOverrideCollectionStatistics`? 
Lucene doesn't really have a native concept of "distributed 

[GitHub] [lucene] msokolov commented on a diff in pull request #913: Lucene 10577

2022-05-20 Thread GitBox


msokolov commented on code in PR #913:
URL: https://github.com/apache/lucene/pull/913#discussion_r878557225


##
lucene/core/src/java/org/apache/lucene/util/VectorUtil.java:
##
@@ -213,4 +213,21 @@ public static void add(float[] u, float[] v) {
   u[i] += v[i];
 }
   }
+
+  public static float dotProduct(BytesRef a, int aOffset, BytesRef b, int 
bOffset, int len) {
+// fixme -- move to codec? What if later we want to access the bytes some 
other way?
+int total = 0;
+for (int i = 0; i < len; i++) {
+  total += a.bytes[aOffset++] * b.bytes[bOffset++];

Review Comment:
   yeah, I see that now. Well ... this *is* what we want it to do. I mean we 
could potentially reduce the precision still further (to ~7~ umm 4! bits) and 
then each pair of bytes could multiplied safely. But we do still need to add 
them up, too.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on a diff in pull request #913: Lucene 10577

2022-05-20 Thread GitBox


rmuir commented on code in PR #913:
URL: https://github.com/apache/lucene/pull/913#discussion_r878562615


##
lucene/core/src/java/org/apache/lucene/util/VectorUtil.java:
##
@@ -213,4 +213,21 @@ public static void add(float[] u, float[] v) {
   u[i] += v[i];
 }
   }
+
+  public static float dotProduct(BytesRef a, int aOffset, BytesRef b, int 
bOffset, int len) {
+// fixme -- move to codec? What if later we want to access the bytes some 
other way?
+int total = 0;
+for (int i = 0; i < len; i++) {
+  total += a.bytes[aOffset++] * b.bytes[bOffset++];

Review Comment:
   I think our result vector could be ShortVector and to try to use conversion 
(in some way that isnt slow) such as 
https://docs.oracle.com/en/java/javase/16/docs/api/jdk.incubator.vector/jdk/incubator/vector/VectorOperators.html#B2S
 in a way that is efficient? 
   
   for the adding-up part, with the floats we did reduceLanes(), so I think 
we'd have to use same trick with conversion of result to IntVector/LongVector 
to create the sum.
   
   There are gigantic paragraphs on Vector.java javadoc on how to do these 
conversions (maybe they are actually efficient?), but I haven't tried.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] madrob opened a new pull request, #2660: SOLR-16209

2022-05-20 Thread GitBox


madrob opened a new pull request, #2660:
URL: https://github.com/apache/lucene-solr/pull/2660

   Depends on #2641 will commit both at the same time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] msokolov commented on pull request #913: Lucene 10577

2022-05-20 Thread GitBox


msokolov commented on PR #913:
URL: https://github.com/apache/lucene/pull/913#issuecomment-1133402474

   hm this "luke-can-be-launched" test failed for me locally too. I guess it 
needs a JVM with a head?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] msokolov commented on a diff in pull request #913: Lucene 10577

2022-05-20 Thread GitBox


msokolov commented on code in PR #913:
URL: https://github.com/apache/lucene/pull/913#discussion_r878557225


##
lucene/core/src/java/org/apache/lucene/util/VectorUtil.java:
##
@@ -213,4 +213,21 @@ public static void add(float[] u, float[] v) {
   u[i] += v[i];
 }
   }
+
+  public static float dotProduct(BytesRef a, int aOffset, BytesRef b, int 
bOffset, int len) {
+// fixme -- move to codec? What if later we want to access the bytes some 
other way?
+int total = 0;
+for (int i = 0; i < len; i++) {
+  total += a.bytes[aOffset++] * b.bytes[bOffset++];

Review Comment:
   yeah, I see that now. Well ... this *is* what we want it to do. I mean we 
could potentially reduce the precision still further (to 7 bits) and then each 
pair of bytes could multiplied safely. But we do still need to add them up, too.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on a diff in pull request #913: Lucene 10577

2022-05-20 Thread GitBox


rmuir commented on code in PR #913:
URL: https://github.com/apache/lucene/pull/913#discussion_r878555057


##
lucene/core/src/java/org/apache/lucene/util/VectorUtil.java:
##
@@ -213,4 +213,21 @@ public static void add(float[] u, float[] v) {
   u[i] += v[i];
 }
   }
+
+  public static float dotProduct(BytesRef a, int aOffset, BytesRef b, int 
bOffset, int len) {
+// fixme -- move to codec? What if later we want to access the bytes some 
other way?
+int total = 0;
+for (int i = 0; i < len; i++) {
+  total += a.bytes[aOffset++] * b.bytes[bOffset++];

Review Comment:
   well its going to multiply each `a` byte * each `b` byte and store each 
result in `c` byte. So you will overflow right there, before you even do any 
sum.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] msokolov commented on a diff in pull request #913: Lucene 10577

2022-05-20 Thread GitBox


msokolov commented on code in PR #913:
URL: https://github.com/apache/lucene/pull/913#discussion_r878553114


##
lucene/core/src/java/org/apache/lucene/util/VectorUtil.java:
##
@@ -213,4 +213,21 @@ public static void add(float[] u, float[] v) {
   u[i] += v[i];
 }
   }
+
+  public static float dotProduct(BytesRef a, int aOffset, BytesRef b, int 
bOffset, int len) {
+// fixme -- move to codec? What if later we want to access the bytes some 
other way?
+int total = 0;
+for (int i = 0; i < len; i++) {
+  total += a.bytes[aOffset++] * b.bytes[bOffset++];

Review Comment:
   hm, well in the worst case (128 * 128) = 2^7*2^7 = 2^14 and we can sum lots 
of these in an int without any concern about overflow. I think the limit is on 
the dimension of the vector; I mean 2^31/2^14 = 2^17 and I don't think we would 
ever have a 2^17-dimensional vector. Perhaps this gives us a principled way to 
choose that max dimension :) I guess 2^17 = 128K.
   
   Looking at what `ByteVector.mul()` does ...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on a diff in pull request #913: Lucene 10577

2022-05-20 Thread GitBox


rmuir commented on code in PR #913:
URL: https://github.com/apache/lucene/pull/913#discussion_r878550812


##
lucene/core/src/java/org/apache/lucene/util/VectorUtil.java:
##
@@ -213,4 +213,21 @@ public static void add(float[] u, float[] v) {
   u[i] += v[i];
 }
   }
+
+  public static float dotProduct(BytesRef a, int aOffset, BytesRef b, int 
bOffset, int len) {
+// fixme -- move to codec? What if later we want to access the bytes some 
other way?
+int total = 0;
+for (int i = 0; i < len; i++) {
+  total += a.bytes[aOffset++] * b.bytes[bOffset++];

Review Comment:
   Is this really the way the function should work?
   
   it is doing the equivalent of casting both `a` and `b` to `int` first, and 
then adding result to another `int`. That's just how java promotion works.
   
   But is this what we want? What about overflow? why `int` and not `long`? 
   
   This is not at all what `ByteVector.mul()` does.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] msokolov commented on pull request #909: LUCENE-10370: pass proper classpath/module arguments for forking jvms from within tests

2022-05-20 Thread GitBox


msokolov commented on PR #909:
URL: https://github.com/apache/lucene/pull/909#issuecomment-1133380108

   If we make it an assumption, it makes the results look a bit nicer?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jtibshirani commented on pull request #833: LUCENE-10411: Add NN vectors support to ExitableDirectoryReader

2022-05-20 Thread GitBox


jtibshirani commented on PR #833:
URL: https://github.com/apache/lucene/pull/833#issuecomment-1133369782

   Thanks @mocobeta, it is so nice to have benchmarks for 
ExitableDirectoryReader now. It looks like the impact is not significant for 
vector search.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] msokolov opened a new pull request, #913: Lucene 10577

2022-05-20 Thread GitBox


msokolov opened a new pull request, #913:
URL: https://github.com/apache/lucene/pull/913

   This patch revisits the idea of storing vectors in 8-bit precision. It 
incorporates some @rcmuir and @jpountz suggestions from 
https://github.com/apache/lucene/pull/899 which was the first attempt. The 
highlights here are:
   
   1. Adds a new VectorSimilarity to represent this use case
   2. Expects vectors to be supplied (as float[]) with values in the range 
[-128,127]
   3. All dot-product operations are done over byte using integer math when 
this similarity is used
   
   The good parts:
   
   Index size is reduced; I ran a test with 100K (200-dimensional) GloVe 
vectors and the index size goes from 231M to 173M for an M=200 graph and 
154M->97M for an M=100 graph. The vector portion of the storage (.vec files) is 
reduced nearly 75% as expected.
   
   With careful tuning of the scaling factor, recall is pretty much unchanged 
compared to the full-precision baseline. I think these GloVe vectors are not 
well-conditioned numerically, so to do the scaling I did not use min/max values 
as the limits, rather I discarded outlying values and used 0.1/0.99 pctile 
values.
   
   Indexing and search performance as measured with KnnGraphTester improved 
somewhere in the range of 15-25%. I'd like to run luceneutil and/or 
ann-benchmarks to get another look, but this was a very pleasant surprise. I 
think the prospect is bright for vectorization too, but this improvement is 
achievable using our scalar impl.
   
   The not-so-good-parts:
   
   I think we probably wouldn't want to commit as-is since it would make 
changes to the codec that aren't forwards-compatible. To go forward I guess 
we'd need to create a new version of the HnswVectorsFormat.
   
   I had to add some branches and generics to handle the two different data 
representations. I don't think it's too terribly ugly, but I know we've had 
concerns about this kind of thing in the past - maybe there's a better way that 
doesn't require massive code duplication?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] msokolov commented on a diff in pull request #899: Lucene 10577

2022-05-20 Thread GitBox


msokolov commented on code in PR #899:
URL: https://github.com/apache/lucene/pull/899#discussion_r878531565


##
lucene/core/src/java/org/apache/lucene/codecs/lucene92/ExpandingRandomAccessVectorValues.java:
##
@@ -0,0 +1,57 @@
+package org.apache.lucene.codecs.lucene92;
+
+import org.apache.lucene.index.RandomAccessVectorValues;
+import org.apache.lucene.index.RandomAccessVectorValuesProducer;
+import org.apache.lucene.util.BytesRef;
+
+import java.io.IOException;
+
+public class ExpandingRandomAccessVectorValues implements 
RandomAccessVectorValuesProducer {
+
+  private final RandomAccessVectorValuesProducer delegate;
+  private final float scale;
+
+  /**
+   * Wraps an existing vector values producer. Floating point vector values 
will be produced by scaling
+   * byte-quantized values read from the values produced by the input.
+   */
+  protected ExpandingRandomAccessVectorValues(RandomAccessVectorValuesProducer 
in, float scale) {
+this.delegate = in;
+assert scale != 0;
+this.scale = scale;
+  }
+
+  @Override
+  public RandomAccessVectorValues randomAccess() throws IOException {
+RandomAccessVectorValues delegateValues = delegate.randomAccess();
+float[] value  = new float[delegateValues.dimension()];;
+
+return new RandomAccessVectorValues() {
+
+  @Override
+  public int size() {
+return delegateValues.size();
+  }
+
+  @Override
+  public int dimension() {
+return delegateValues.dimension();
+  }
+
+  @Override
+  public float[] vectorValue(int targetOrd) throws IOException {
+BytesRef binaryValue = delegateValues.binaryValue(targetOrd);
+byte[] bytes = binaryValue.bytes;
+for (int i = 0, j = binaryValue.offset; i < value.length; i++, j++) {
+  value[i] = bytes[j] * scale;

Review Comment:
   I did some git booboo and messed this up -- I'll close and open a new PR 
since it's a substantially different approach anyway



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] msokolov closed pull request #899: Lucene 10577

2022-05-20 Thread GitBox


msokolov closed pull request #899: Lucene 10577
URL: https://github.com/apache/lucene/pull/899


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this

2022-05-20 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17540283#comment-17540283
 ] 

Robert Muir commented on LUCENE-10574:
--

The benchmark was originally some code written to reproduce a performance 
issue: LUCENE-9827

So it is definitely a bit pathological. Purpose was to really zero in on the 
stored fields merging and nothing else, and not to be noisy.

> Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't 
> do this
> ---
>
> Key: LUCENE-10574
> URL: https://issues.apache.org/jira/browse/LUCENE-10574
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Robert Muir
>Priority: Major
> Fix For: 9.3
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Remove {{floorSegmentBytes}} parameter, or change lucene's default to a merge 
> policy that doesn't merge in an O(n^2) way.
> I have the feeling it might have to be the latter, as folks seem really wed 
> to this crazy O(n^2) behavior.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9625) Benchmark KNN search with ann-benchmarks

2022-05-20 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17540284#comment-17540284
 ] 

Michael Sokolov commented on LUCENE-9625:
-

I realized maybe this deserves a better explanation: I didn't use 
multi-threading in the KnnGraphTester that builds the index since my goal at 
the time was really to evaluate whether our algorithm implementation is correct 
and how it performs on a single HNSW graph index. If we use multiple threads, 
this is going to lead to a more fragmented graph due to the way Lucene indexes 
segments, and while this would be a useful point of comparison, it also creates 
a different variable to tune in the benchmark evaluation. If you do want to 
pursue this, I would suggest configuring the IndexWriterConfig with large 
buffers so that each thread creates a single segment, and exposing the number 
of threads/segments as a tunable parameter since it is going to impact the 
recall and latency reported by the benchmark

> Benchmark KNN search with ann-benchmarks
> 
>
> Key: LUCENE-9625
> URL: https://issues.apache.org/jira/browse/LUCENE-9625
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In addition to benchmarking with luceneutil, it would be good to be able to 
> make use of ann-benchmarks, which is publishing results from many approximate 
> knn algorithms, including the hnsw implementation from its authors. We don't 
> expect to challenge the performance of these native code libraries, however 
> it would be good to know just how far off we are.
> I started looking into this and posted a fork of ann-benchmarks that uses 
> KnnGraphTester  class to run these: 
> https://github.com/msokolov/ann-benchmarks. It's still a WIP; you have to 
> manually copy jars and the KnnGraphTester.class to the test host machine 
> rather than downloading from a distribution. KnnGraphTester needs some 
> modifications in order to support this process - this issue is mostly about 
> that.
> One thing I noticed is that some of the index builds with higher fanout 
> (efConstruction) settings time out at 2h (on an AWS c5 instance), so this is 
> concerning and I'll open a separate issue for trying to improve that.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Assigned] (LUCENE-10585) Cleanup copy/paste code in facets, particularly in SSDV

2022-05-20 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller reassigned LUCENE-10585:


Assignee: Greg Miller

> Cleanup copy/paste code in facets, particularly in SSDV
> ---
>
> Key: LUCENE-10585
> URL: https://issues.apache.org/jira/browse/LUCENE-10585
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Greg Miller
>Assignee: Greg Miller
>Priority: Minor
>
> We've accumulated some copy/paste code in the facets modules, especially in 
> SSDV-related classes. I'm going to take a pass at cleaning this up to help 
> make the code more readable and easier to maintain.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-10585) Cleanup copy/paste code in facets, particularly in SSDV

2022-05-20 Thread Greg Miller (Jira)
Greg Miller created LUCENE-10585:


 Summary: Cleanup copy/paste code in facets, particularly in SSDV
 Key: LUCENE-10585
 URL: https://issues.apache.org/jira/browse/LUCENE-10585
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Greg Miller


We've accumulated some copy/paste code in the facets modules, especially in 
SSDV-related classes. I'm going to take a pass at cleaning this up to help make 
the code more readable and easier to maintain.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Assigned] (LUCENE-10584) SSDV facets should support hierarchical paths in #getSpecificValue

2022-05-20 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller reassigned LUCENE-10584:


Assignee: Greg Miller

> SSDV facets should support hierarchical paths in #getSpecificValue
> --
>
> Key: LUCENE-10584
> URL: https://issues.apache.org/jira/browse/LUCENE-10584
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Greg Miller
>Assignee: Greg Miller
>Priority: Major
>
> We added hierarchical pathing capabilities to SSDV faceting recently but it 
> looks like we didn't update #getSpecificValue to work with hierarchical paths.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this

2022-05-20 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17540280#comment-17540280
 ] 

Michael Sokolov commented on LUCENE-10574:
--

Very nice! I confess I don't really know how those benchmarks are constructed 
-- do they have "typical" merge policy/update rate/etc or are they deliberately 
exploring this pathological case?

> Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't 
> do this
> ---
>
> Key: LUCENE-10574
> URL: https://issues.apache.org/jira/browse/LUCENE-10574
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Robert Muir
>Priority: Major
> Fix For: 9.3
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Remove {{floorSegmentBytes}} parameter, or change lucene's default to a merge 
> policy that doesn't merge in an O(n^2) way.
> I have the feeling it might have to be the latter, as folks seem really wed 
> to this crazy O(n^2) behavior.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-10584) SSDV facets should support hierarchical paths in #getSpecificValue

2022-05-20 Thread Greg Miller (Jira)
Greg Miller created LUCENE-10584:


 Summary: SSDV facets should support hierarchical paths in 
#getSpecificValue
 Key: LUCENE-10584
 URL: https://issues.apache.org/jira/browse/LUCENE-10584
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Greg Miller


We added hierarchical pathing capabilities to SSDV faceting recently but it 
looks like we didn't update #getSpecificValue to work with hierarchical paths.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mikemccand commented on pull request #633: LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.

2022-05-20 Thread GitBox


mikemccand commented on PR #633:
URL: https://github.com/apache/lucene/pull/633#issuecomment-1133249474

   Hmm I beasted the PR on current `main` branch using `./gradlew 
:lucene:core:test --tests "org.apache.lucene.index.TestAddIndexes" 
-Dtests.iters=500 `.
   
   The first time, it seemed to deadlock, but I screwed up trying to get the 
thread stacks.
   
   The 2nd time it failed with this exciting failure!  It looks like we are 
missing some locking on updating IW's `HashMap` tracking merges?:
   
   ```
   org.apache.lucene.index.TestAddIndexes > test suite's output saved to 
/l/vigya/lucene/core/build/test-results/test/outputs/OUTPUT-org.apache.lucene.index.TestAddIndexes.txt,
 copied below:
 1> java.util.ConcurrentModificationException
 1>at 
java.base/java.util.HashMap$HashIterator.nextNode(HashMap.java:1597)
 1>at java.base/java.util.HashMap$KeyIterator.next(HashMap.java:1620)
 1>at 
org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:2122)
 1>at 
org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:2081)
 1>at 
org.apache.lucene.index.TestAddIndexes$CommitAndAddIndexes3.doBody(TestAddIndexes.java:1221)
 1>at 
org.apache.lucene.index.TestAddIndexes$RunAddIndexesThreads$1.run(TestAddIndexes.java:1023)
  > java.lang.AssertionError
  > at 
__randomizedtesting.SeedInfo.seed([3AB557F4BFC8763A:D625EE2D285D2CAD]:0)
  > at org.junit.Assert.fail(Assert.java:87)
  > at org.junit.Assert.assertTrue(Assert.java:42)
  > at org.junit.Assert.assertTrue(Assert.java:53)
  > at 
org.apache.lucene.index.TestAddIndexes.testAddIndexesWithRollback(TestAddIndexes.java:1305)
  > at 
jdk.internal.reflect.GeneratedMethodAccessor38.invoke(Unknown Source)
  > at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  > at java.base/java.lang.reflect.Method.invoke(Method.java:568)
  > at 
com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754)
  > at 
com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:942)
  > at 
com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:978)
  > at 
com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:992)
  > at 
org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
  > at 
org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  > at 
org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  > at 
org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  > at 
org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  > at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  > at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  > at 
com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:370)
  > at 
com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:819)
  > at 
com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:470)
  > at 
com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:951)
  > at 
com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:836)
  > at 
com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:887)
  > at 
com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:898)
  > at 
org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  > at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  > at 
org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  > at 
com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  > at 
com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  > at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  > at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  > at 

[jira] [Created] (LUCENE-10583) Deadlock with MMapDirectory while waitForMerges

2022-05-20 Thread Thomas Hoffmann (Jira)
Thomas Hoffmann created LUCENE-10583:


 Summary: Deadlock with MMapDirectory while waitForMerges
 Key: LUCENE-10583
 URL: https://issues.apache.org/jira/browse/LUCENE-10583
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/index
Affects Versions: 8.11.1
 Environment: Java 17

OS: Windows 2016
Reporter: Thomas Hoffmann


Hello,

a deadlock situation happened in our application. We are using MMapDirectory on 
Windows 2016 and got the following stacktrace:
{code:java}
"https-openssl-nio-443-exec-30" #166 daemon prio=5 os_prio=0 cpu=78703.13ms 
"https-openssl-nio-443-exec-30" #166 daemon prio=5 os_prio=0 cpu=78703.13ms 
elapsed=81248.18s tid=0x2860af10 nid=0x237c in Object.wait()  
[0x413fc000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
    at java.lang.Object.wait(java.base@17.0.2/Native Method)
    - waiting on 
    at org.apache.lucene.index.IndexWriter.doWait(IndexWriter.java:4983)
    - locked <0x0006ef1fc020> (a org.apache.lucene.index.IndexWriter)
    at org.apache.lucene.index.IndexWriter.waitForMerges(IndexWriter.java:2697)
    - locked <0x0006ef1fc020> (a org.apache.lucene.index.IndexWriter)
    at org.apache.lucene.index.IndexWriter.shutdown(IndexWriter.java:1236)
    at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1278)
    at 
com.speed4trade.ebs.module.search.SearchService.updateSearchIndex(SearchService.java:1723)
    - locked <0x0006d5c00208> (a org.apache.lucene.store.MMapDirectory)
    at 
com.speed4trade.ebs.module.businessrelations.ticket.TicketChangedListener.postUpdate(TicketChangedListener.java:142)
...{code}
All threads were waiting to lock <0x0006d5c00208> which got never released.

A lucene thread was also blocked, I dont know if this is relevant:
{code:java}
"Lucene Merge Thread #0" #18466 daemon prio=5 os_prio=0 cpu=15.63ms 
elapsed=3499.07s tid=0x459453e0 nid=0x1f8 waiting for monitor entry  
[0x5da9e000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at 
org.apache.lucene.store.FSDirectory.deletePendingFiles(FSDirectory.java:346)
    - waiting to lock <0x0006d5c00208> (a 
org.apache.lucene.store.MMapDirectory)
    at 
org.apache.lucene.store.FSDirectory.maybeDeletePendingFiles(FSDirectory.java:363)
    at org.apache.lucene.store.FSDirectory.createOutput(FSDirectory.java:248)
    at 
org.apache.lucene.store.LockValidatingDirectoryWrapper.createOutput(LockValidatingDirectoryWrapper.java:44)
    at 
org.apache.lucene.index.ConcurrentMergeScheduler$1.createOutput(ConcurrentMergeScheduler.java:289)
    at 
org.apache.lucene.store.TrackingDirectoryWrapper.createOutput(TrackingDirectoryWrapper.java:43)
    at 
org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.(CompressingStoredFieldsWriter.java:121)
    at 
org.apache.lucene.codecs.compressing.CompressingStoredFieldsFormat.fieldsWriter(CompressingStoredFieldsFormat.java:130)
    at 
org.apache.lucene.codecs.lucene87.Lucene87StoredFieldsFormat.fieldsWriter(Lucene87StoredFieldsFormat.java:141)
    at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:227)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:105)
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4757)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4361)
    at 
org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:5920)
    at 
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:626)
    at 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:684){code}
If looks like the merge operation never finished and released the lock.

Is there any option to prevent this deadlock or how to investigate it further?
A load-test didn't show this problem unfortunately.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] uschindler opened a new pull request, #912: Initial rewrite of MMapDirectory for JDK-19 preview Panama APIs (>= JDK-19-ea+23)

2022-05-20 Thread GitBox


uschindler opened a new pull request, #912:
URL: https://github.com/apache/lucene/pull/912

   **INFO: This is a followup of #518: It's the same code base, but with API 
changes from JDK 19 applied**
   
   This is just a draft PR for a first insight on memory mapping improvements 
in JDK 19+.
   
   Some background information: Starting with JDK-14, there is a new incubating 
module "jdk.incubator.foreign" that has a new, not yet stable API for accessing 
off-heap memory (and later it will also support calling functions using 
classical MethodHandles that are located in libraries like .so or .dll files). 
This incubator module has several versions:
   - first version: https://openjdk.java.net/jeps/370 (slow, very buggy and 
thread confinement, so making it unuseable with Lucene)
   - second version: https://openjdk.java.net/jeps/383 (still thread 
confinement, but now allows transfer of "ownership" to other threads; this is 
still impossible to use with Lucene.
   - third version in JDK 16: https://openjdk.java.net/jeps/393 (this version 
has included "Support for shared segments"). This now allows us to safely use 
the same external mmaped memory from different threads and also unmap it! This 
was implemented in the previous pull request #173
   - fourth version in JDK 17: https://openjdk.java.net/jeps/412 . This mainly 
changes the API around the scopes. Instead of having segments explicitely made 
"shared", we can assign them to some resource scope which control their 
behaviour. The resourceScope is produced one time for each IndexInput instance 
(not clones) and owns all segments. When the resourceScope is closed, all 
segments get invalid - and we throw `AlreadyClosedException`. The big problem 
is slowness due to heavy use of new instances just to copy memory between 
segments and java heap. This drives garbage collector crazy. This was 
implemented in previous PR #177 
   - fifth version in JDK 18, included in build 26: 
https://openjdk.java.net/jeps/419. This mainly cleans up the API. From Lucene's 
persepctive the `MemorySegment` API now has `System.arraycopy()`-like APIs to 
copy memory between heap and memory segments. This improves speed. It also 
handles byte-swapping automatically. This version of the PR also uses 
`ValueLayout` instead of varhandles, as it makes code more readable and 
type-safe.
   - sixth version in JDK 19, included with build 23: 
https://openjdk.java.net/jeps/424 (actual version). This version moves the 
whole incubation API as a stable "preview API". Some classes were renamed and 
bugs (e.g. on Windows with huge mappings) were resolved.
   
   This new preview API more or less overcomes several problems:
   - ByteBuffer API is limited to 32bit (in fact MMapDirectory has to chunk in 
1 GiB portions)
   - There is no official way to unmap ByteBuffers when the file is no longer 
used. There is a way to use `sun.misc.Unsafe` and forcefully unmap segments, 
but any IndexInput accessing the file from another thread will crush the JVM 
with SIGSEGV or SIGBUS. We learned to live with that and we happily apply the 
unsafe unmapping, but that's the main issue.
   
   @uschindler had many discussions with the team at OpenJDK and finally with 
the third incubator, we have an API that works with Lucene. It was very 
fruitful discussions (thanks to @mcimadamore !)
   
   With the third incubator we are now finally able to do some tests 
(especially performance). With JDK 19, we can do testing the following way by 
tweaking the build system a bit:
   - disable `-Werror` everywhere
   - upgrade minimum and release Java version to 19 (this breaks linting with 
ECJ, so precommit won't work, but running tests does)
   - add `--enable-preview` parameter for all modules and test runner
   
   The code basically just modifies `MMapDirectory` to use LONG instead of INT 
for the chunk size parameter. In addition it adds `MemorySegmentIndexInput` 
that is a copy of our `ByteBufferIndexInput` (still there, but unused), but 
using MemorySegment instead of ByteBuffer behind the scenes. It works in 
exactly the same way, just the try/catch blocks for supporting EOFException or 
moving to another segment were rewritten.
   
   It passes all tests and it looks like you can use it to read indexes. The 
default chunk size is now 16 GiB (but you can raise or lower it as you like; 
tests are doing this). Of course you can set it to Long.MAX_VALUE, in that case 
every index file is always mapped to one big memory mapping. My testing with 
Windows 10 have shown, that this is *not a good idea!!!*. Huge mappings 
fragment address space over time and as we can only use like 43 or 46 bits 
(depending on OS), the fragmentation will at some point kill you. So 16 GiB 
looks like a good compromise: Most files will be smaller than 6 GiB anyways 
(unless you optimize your index to one huge segment). So for most Lucene 
installations, the number of segments will equal the number of open files, so 
Elasticsearch 

[GitHub] [lucene] uschindler closed pull request #911: Initial rewrite of MMapDirectory for JDK-19 preview Panama APIs (>= JDK-19-ea+23)

2022-05-20 Thread GitBox


uschindler closed pull request #911: Initial rewrite of MMapDirectory for 
JDK-19 preview Panama APIs (>= JDK-19-ea+23)
URL: https://github.com/apache/lucene/pull/911


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mikemccand commented on pull request #633: LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.

2022-05-20 Thread GitBox


mikemccand commented on PR #633:
URL: https://github.com/apache/lucene/pull/633#issuecomment-1133136314

   I think this is ready!
   
   What a great change, giving `MergePolicy` purview over how `addIndexes` does 
its merging too.  I'm also beasting `TestAddIndexes` and will report back.
   
   I agree we should push only to `main` to start and let this bake for a while 
before backporting to 9.x.
   
   Thank you @vigyasharma!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on pull request #909: LUCENE-10370: pass proper classpath/module arguments for forking jvms from within tests

2022-05-20 Thread GitBox


rmuir commented on PR #909:
URL: https://github.com/apache/lucene/pull/909#issuecomment-1133125938

   i'm suspicious if some of these tests would even run directly from the IDE 
anyway. even if you fallback to classpath, TestIndexWriterOnJRECrash might not 
work from IDEA. I installed idea the other day, it defaults to running tests 
from gradle, so IMO its a non-issue for that IDE. for eclipse it is probably 
best to simply fail.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] uschindler opened a new pull request, #911: Initial rewrite of MMapDirectory for JDK-19 preview Panama APIs (>= JDK-19-ea+23)

2022-05-20 Thread GitBox


uschindler opened a new pull request, #911:
URL: https://github.com/apache/lucene/pull/911

   **INFO: This is a followup of #518: It's the same code base, but with API 
changes from JDK 19 applied**
   
   This is just a draft PR for a first insight on memory mapping improvements 
in JDK 19+.
   
   Some background information: Starting with JDK-14, there is a new incubating 
module "jdk.incubator.foreign" that has a new, not yet stable API for accessing 
off-heap memory (and later it will also support calling functions using 
classical MethodHandles that are located in libraries like .so or .dll files). 
This incubator module has several versions:
   - first version: https://openjdk.java.net/jeps/370 (slow, very buggy and 
thread confinement, so making it unuseable with Lucene)
   - second version: https://openjdk.java.net/jeps/383 (still thread 
confinement, but now allows transfer of "ownership" to other threads; this is 
still impossible to use with Lucene.
   - third version in JDK 16: https://openjdk.java.net/jeps/393 (this version 
has included "Support for shared segments"). This now allows us to safely use 
the same external mmaped memory from different threads and also unmap it! This 
was implemented in the previous pull request #173
   - fourth version in JDK 17: https://openjdk.java.net/jeps/412 . This mainly 
changes the API around the scopes. Instead of having segments explicitely made 
"shared", we can assign them to some resource scope which control their 
behaviour. The resourceScope is produced one time for each IndexInput instance 
(not clones) and owns all segments. When the resourceScope is closed, all 
segments get invalid - and we throw `AlreadyClosedException`. The big problem 
is slowness due to heavy use of new instances just to copy memory between 
segments and java heap. This drives garbage collector crazy. This was 
implemented in previous PR #177 
   - fifth version in JDK 18, included in build 26: 
https://openjdk.java.net/jeps/419. This mainly cleans up the API. From Lucene's 
persepctive the `MemorySegment` API now has `System.arraycopy()`-like APIs to 
copy memory between heap and memory segments. This improves speed. It also 
handles byte-swapping automatically. This version of the PR also uses 
`ValueLayout` instead of varhandles, as it makes code more readable and 
type-safe.
   - sixth version in JDK 19, included with build 23: 
https://openjdk.java.net/jeps/424 (actual version). This version moves the 
whole incubation API as a stable "preview API". Some classes were renamed and 
bugs (e.g. on Windows with huge mappings) were resolved.
   
   This new preview API more or less overcomes several problems:
   - ByteBuffer API is limited to 32bit (in fact MMapDirectory has to chunk in 
1 GiB portions)
   - There is no official way to unmap ByteBuffers when the file is no longer 
used. There is a way to use `sun.misc.Unsafe` and forcefully unmap segments, 
but any IndexInput accessing the file from another thread will crush the JVM 
with SIGSEGV or SIGBUS. We learned to live with that and we happily apply the 
unsafe unmapping, but that's the main issue.
   
   @uschindler had many discussions with the team at OpenJDK and finally with 
the third incubator, we have an API that works with Lucene. It was very 
fruitful discussions (thanks to @mcimadamore !)
   
   With the third incubator we are now finally able to do some tests 
(especially performance). With JDK 19, we can do testing the following way by 
tweaking the build system a bit:
   - disable `-Werror` everywhere
   - upgrade minimum and release Java version to 19 (this breaks linting with 
ECJ, so precommit won't work, but running tests does)
   - add `--enable-preview` parameter for all modules and test runner
   
   The code basically just modifies `MMapDirectory` to use LONG instead of INT 
for the chunk size parameter. In addition it adds `MemorySegmentIndexInput` 
that is a copy of our `ByteBufferIndexInput` (still there, but unused), but 
using MemorySegment instead of ByteBuffer behind the scenes. It works in 
exactly the same way, just the try/catch blocks for supporting EOFException or 
moving to another segment were rewritten.
   
   It passes all tests and it looks like you can use it to read indexes. The 
default chunk size is now 16 GiB (but you can raise or lower it as you like; 
tests are doing this). Of course you can set it to Long.MAX_VALUE, in that case 
every index file is always mapped to one big memory mapping. My testing with 
Windows 10 have shown, that this is *not a good idea!!!*. Huge mappings 
fragment address space over time and as we can only use like 43 or 46 bits 
(depending on OS), the fragmentation will at some point kill you. So 16 GiB 
looks like a good compromise: Most files will be smaller than 6 GiB anyways 
(unless you optimize your index to one huge segment). So for most Lucene 
installations, the number of segments will equal the number of open files, so 
Elasticsearch 

[GitHub] [lucene] dweiss commented on pull request #909: LUCENE-10370: pass proper classpath/module arguments for forking jvms from within tests

2022-05-20 Thread GitBox


dweiss commented on PR #909:
URL: https://github.com/apache/lucene/pull/909#issuecomment-1133086554

   @uschindler Are you ok with the patch as is? Should I rework it into an 
assumption error? Should I fall back to classpath if nothing is available (I'm 
not sold on this one)?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mocobeta commented on pull request #833: LUCENE-10411: Add NN vectors support to ExitableDirectoryReader

2022-05-20 Thread GitBox


mocobeta commented on PR #833:
URL: https://github.com/apache/lucene/pull/833#issuecomment-1133038858

   For your information, luceneutil now supports ExitableDirectoryReader 
(https://github.com/mikemccand/luceneutil/pull/172) and I ran the benchmark 
with vector values.
   Data: 500k English Wikipedia docs ([couldn't index 1 million docs for some 
reason](https://github.com/mikemccand/luceneutil/pull/178)), 200 dimensions.
   
   This is a sampled result and I saw similar results in multiple runs:
   ```
  TaskQPS baseline  StdDevQPS 
exitable_directory_reader  StdDevPct diff p-value
   AndHighMedVector  430.56  (7.3%)  404.12 
(11.1%)   -6.1% ( -22% -   13%) 0.057
   PKLookup  203.88 (33.2%)  194.26 
(29.5%)   -4.7% ( -50% -   86%) 0.662
  LowTermVector  428.08  (9.7%)  410.96  
(9.7%)   -4.0% ( -21% -   17%) 0.229
  MedTermVector  426.44 (14.4%)  413.25 
(12.6%)   -3.1% ( -26% -   27%) 0.504
  AndHighHighVector  461.11 (15.4%)  457.82 
(11.8%)   -0.7% ( -24% -   31%) 0.880
 HighTermVector  444.54 (17.1%)  443.67 
(13.9%)   -0.2% ( -26% -   37%) 0.971
   AndHighLowVector  431.03 (19.9%)  434.53 
(11.7%)0.8% ( -25% -   40%) 0.885
   ```
   
   (Note that I ran the benchmark on a hard disk drive.)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] ywelsch opened a new pull request, #910: LUCENE-10582: Fix merging of CollectionStatistics in CombinedFieldQuery

2022-05-20 Thread GitBox


ywelsch opened a new pull request, #910:
URL: https://github.com/apache/lucene/pull/910

   CombinedFieldQuery does not properly combine distributed collection 
statistics, resulting in an IllegalArgumentException during searches.
   
   Originally surfaced in this Elasticsearch issue: 
https://github.com/elastic/elasticsearch/issues/82817


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-10582) CombinedFieldQuery fails with distributed field statistics

2022-05-20 Thread Yannick Welsch (Jira)
Yannick Welsch created LUCENE-10582:
---

 Summary: CombinedFieldQuery fails with distributed field statistics
 Key: LUCENE-10582
 URL: https://issues.apache.org/jira/browse/LUCENE-10582
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/sandbox
Reporter: Yannick Welsch


CombinedFieldQuery does not properly combine distributed collection statistics, 
resulting in an IllegalArgumentException during searches.

Originally surfaced in this Elasticsearch issue: 
https://github.com/elastic/elasticsearch/issues/82817



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-10579) fix smoketester backwards-check to not parse stdout

2022-05-20 Thread Robert Muir (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved LUCENE-10579.
--
Resolution: Fixed

> fix smoketester backwards-check to not parse stdout
> ---
>
> Key: LUCENE-10579
> URL: https://issues.apache.org/jira/browse/LUCENE-10579
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Robert Muir
>Priority: Major
> Fix For: 9.3
>
> Attachments: backwards.log.gz
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The smoketester parses the output of TestBackwardsCompatibility -verbose 
> looking for certain prints for each index release. 
> But I think this is a noisier channel than you might expect. I added a hack 
> to log the stuff its trying to parse... it is legit crazy. See attachment
> Let's rethink, maybe we should just examine the zip files?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10579) fix smoketester backwards-check to not parse stdout

2022-05-20 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17540069#comment-17540069
 ] 

ASF subversion and git services commented on LUCENE-10579:
--

Commit 2090ac431843f689ca6f124723ef37c3b12a2826 in lucene's branch 
refs/heads/main from Robert Muir
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=2090ac43184 ]

LUCENE-10579: fix smoketester backwards-check to not parse stdout (#903)

This is very noisy, can contain gradle status updates, various other 
tests.verbose prints from other threads, you name it.

It causes the check to be flaky, and randomly "miss" seeing a test that 
executed.

Instead, let's look at the zip files. We can still preserve the essence of what 
the test wants to do, but without any flakiness.

> fix smoketester backwards-check to not parse stdout
> ---
>
> Key: LUCENE-10579
> URL: https://issues.apache.org/jira/browse/LUCENE-10579
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Robert Muir
>Priority: Major
> Fix For: 9.3
>
> Attachments: backwards.log.gz
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The smoketester parses the output of TestBackwardsCompatibility -verbose 
> looking for certain prints for each index release. 
> But I think this is a noisier channel than you might expect. I added a hack 
> to log the stuff its trying to parse... it is legit crazy. See attachment
> Let's rethink, maybe we should just examine the zip files?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10579) fix smoketester backwards-check to not parse stdout

2022-05-20 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17540070#comment-17540070
 ] 

ASF subversion and git services commented on LUCENE-10579:
--

Commit 5a5c8bca4e691d91801d04d881aa25560413835c in lucene's branch 
refs/heads/branch_9x from Robert Muir
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=5a5c8bca4e6 ]

LUCENE-10579: fix smoketester backwards-check to not parse stdout (#903)

This is very noisy, can contain gradle status updates, various other 
tests.verbose prints from other threads, you name it.

It causes the check to be flaky, and randomly "miss" seeing a test that 
executed.

Instead, let's look at the zip files. We can still preserve the essence of what 
the test wants to do, but without any flakiness.

> fix smoketester backwards-check to not parse stdout
> ---
>
> Key: LUCENE-10579
> URL: https://issues.apache.org/jira/browse/LUCENE-10579
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Robert Muir
>Priority: Major
> Fix For: 9.3
>
> Attachments: backwards.log.gz
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The smoketester parses the output of TestBackwardsCompatibility -verbose 
> looking for certain prints for each index release. 
> But I think this is a noisier channel than you might expect. I added a hack 
> to log the stuff its trying to parse... it is legit crazy. See attachment
> Let's rethink, maybe we should just examine the zip files?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir merged pull request #903: LUCENE-10579: fix smoketester backwards-check to not parse stdout

2022-05-20 Thread GitBox


rmuir merged PR #903:
URL: https://github.com/apache/lucene/pull/903


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9625) Benchmark KNN search with ann-benchmarks

2022-05-20 Thread Balmukund Mandal (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17540066#comment-17540066
 ] 

Balmukund Mandal commented on LUCENE-9625:
--

Thank you Michael

> Benchmark KNN search with ann-benchmarks
> 
>
> Key: LUCENE-9625
> URL: https://issues.apache.org/jira/browse/LUCENE-9625
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In addition to benchmarking with luceneutil, it would be good to be able to 
> make use of ann-benchmarks, which is publishing results from many approximate 
> knn algorithms, including the hnsw implementation from its authors. We don't 
> expect to challenge the performance of these native code libraries, however 
> it would be good to know just how far off we are.
> I started looking into this and posted a fork of ann-benchmarks that uses 
> KnnGraphTester  class to run these: 
> https://github.com/msokolov/ann-benchmarks. It's still a WIP; you have to 
> manually copy jars and the KnnGraphTester.class to the test host machine 
> rather than downloading from a distribution. KnnGraphTester needs some 
> modifications in order to support this process - this issue is mostly about 
> that.
> One thing I noticed is that some of the index builds with higher fanout 
> (efConstruction) settings time out at 2h (on an AWS c5 instance), so this is 
> concerning and I'll open a separate issue for trying to improve that.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dweiss commented on pull request #909: LUCENE-10370: pass proper classpath/module arguments for forking jvms from within tests

2022-05-20 Thread GitBox


dweiss commented on PR #909:
URL: https://github.com/apache/lucene/pull/909#issuecomment-1132763931

   I though about it but... Nah. I don't think it makes sense. You can run 
tests via gradle from IDEs as well if you have to.
   The only thing I was wondering would be whether to throw an assertion error 
or an assumption error from direct IDE launches.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10544) Should ExitableTermsEnum wrap postings and impacts?

2022-05-20 Thread Deepika Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17540056#comment-17540056
 ] 

Deepika Sharma commented on LUCENE-10544:
-

For adding timeout feature In {{IndexSearcher}} through {{BulkScorer}} 
approach, I think there are two ways we can implement this:
 * One way could be we can make another class which will wrap {{BulkScorer}} 
along with timeout and other methods as per required (similar to existing 
{{TimeLimitingCollector}} class).
 * Another way could be making a helper function inside {{BulkScorer}} class 
which will wrap score method of {{BulkScorer}} along with timeout and it will 
invoke whenever timeout is enabled.

So, I am looking for the suggestion which approach would be better to go ahead 
or is there any other way I should consider?

> Should ExitableTermsEnum wrap postings and impacts?
> ---
>
> Key: LUCENE-10544
> URL: https://issues.apache.org/jira/browse/LUCENE-10544
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Reporter: Greg Miller
>Priority: Major
>
> While looking into options for LUCENE-10151, I noticed that 
> {{ExitableDirectoryReader}} doesn't actually do any timeout checking once you 
> start iterating postings/impacts. It *does* create a {{ExitableTermsEnum}} 
> wrapper when loading a {{{}TermsEnum{}}}, but that wrapper doesn't do 
> anything to wrap postings or impacts. So timeouts will be enforced when 
> moving to the "next" term, but not when iterating the postings/impacts 
> associated with a term.
> I think we ought to wrap the postings/impacts as well with some form of 
> timeout checking so timeouts can be enforced on long-running queries. I'm not 
> sure why this wasn't done originally (back in 2014), but it was questioned 
> back in 2020 on the original Jira SOLR-5986. Does anyone know of a good 
> reason why we shouldn't enforce timeouts in this way?
> Related, we may also want to wrap things like {{seekExact}} and {{seekCeil}} 
> given that only {{next}} is being wrapped currently.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] uschindler commented on pull request #909: LUCENE-10370: pass proper classpath/module arguments for forking jvms from within tests

2022-05-20 Thread GitBox


uschindler commented on PR #909:
URL: https://github.com/apache/lucene/pull/909#issuecomment-1132736446

   Could we not fallback to the classpath code if the forkoptions file system 
property is not there?
   
   This would allow to run tests in most jvms. Maybe print a warning then.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on a diff in pull request #892: LUCENE-10581: Optimize stored fields bulk merges on the first segment

2022-05-20 Thread GitBox


rmuir commented on code in PR #892:
URL: https://github.com/apache/lucene/pull/892#discussion_r877982698


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/compressing/Lucene90CompressingStoredFieldsWriter.java:
##
@@ -553,14 +554,20 @@ private void copyChunks(
 final long toPointer =
 toDocID == sub.maxDoc ? reader.getMaxPointer() : 
index.getStartPointer(toDocID);
 if (fromPointer < toPointer) {
-  if (numBufferedDocs > 0) {
-flush(true);
-  }
   final IndexInput rawDocs = reader.getFieldsStream();
   rawDocs.seek(fromPointer);
   do {
 final int base = rawDocs.readVInt();
 final int code = rawDocs.readVInt();
+final boolean dirtyChunk = (code & 2) != 0;
+if (copyDirtyChunks) {
+  if (numBufferedDocs > 0) {
+flush(true);
+  }
+} else if (dirtyChunk || numBufferedDocs > 0) {
+  // Don't copy a dirty chunk or force a flush, which would create a 
dirty chunk
+  break;
+}

Review Comment:
   I find the way this is organized confusing, especially compared to the 
previous `if buffered > 0 flush()`. Is there some other way the code could be 
re-arranged? the if-then-else structure is just especially complicated and hard 
on the eyes.
   
   Maybe, we should avoid the use of `copyDirtyChunks` boolean. This boolean is 
making the logic especially difficult for me. If instead of `if 
(copyDirtyChunks)`, it were to read `if (mergeStrategy == DIRTY_BULK)`, then 
its on its way to getting better. (separately, i also hate DIRTY_BULK and 
CLEAN_BULK names, i will think about it).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on pull request #892: LUCENE-10581: Optimize stored fields bulk merges on the first segment

2022-05-20 Thread GitBox


rmuir commented on PR #892:
URL: https://github.com/apache/lucene/pull/892#issuecomment-1132725061

   Maybe one way to negate risk is to make the same changes to term vectors. It 
effectively increases the testing of such changes and at least makes all this 
bulk merge logic more consistent between the two (e.g. less possibilities of 
corruption)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10370) Fix classpath/module path of tests forking their own Java (TestNRTReplication)

2022-05-20 Thread Dawid Weiss (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss updated LUCENE-10370:
-
Description: 
TestNRTReplication fails because it assumes classpath can just be copied to a 
sub-process - this is no longer the case.

PR at:
https://github.com/apache/lucene/pull/909

  was:TestNRTReplication fails because it assumes classpath can just be copied 
to a sub-process - this is no longer the case.


> Fix classpath/module path of tests forking their own Java (TestNRTReplication)
> --
>
> Key: LUCENE-10370
> URL: https://issues.apache.org/jira/browse/LUCENE-10370
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> TestNRTReplication fails because it assumes classpath can just be copied to a 
> sub-process - this is no longer the case.
> PR at:
> https://github.com/apache/lucene/pull/909



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on pull request #892: LUCENE-10581: Optimize stored fields bulk merges on the first segment

2022-05-20 Thread GitBox


rmuir commented on PR #892:
URL: https://github.com/apache/lucene/pull/892#issuecomment-1132717121

   This adds a little more complexity to the scary bulk merge logic, though. is 
it worth it? 
   
   change is somewhat terrifying. can we do something to make it less so?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dweiss commented on pull request #901: remove commented-out/obselete AwaitsFix

2022-05-20 Thread GitBox


dweiss commented on PR #901:
URL: https://github.com/apache/lucene/pull/901#issuecomment-1132716807

   I filed a PR fixing those "forking" tests (not just the ones above, others 
as well). Rewriting the nrt tests completely so that they don't fork at all is 
a more complex task and probably deserves an issue on its own rights.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dweiss opened a new pull request, #909: LUCENE-10370: pass proper classpath/module arguments for forking jvms from within tests

2022-05-20 Thread GitBox


dweiss opened a new pull request, #909:
URL: https://github.com/apache/lucene/pull/909

   This is putting some lipstick on the pig. Forking those JVMs is really an 
awkward thing to do. I understand some tests are specifically testing JVM 
crashes, etc., but, uh, those process builders/ pipes are ugly.
   
   Anyway. I tracked down java classpath references in a few test classes and 
replaced them with a single method on LuceneTestCase that passes proper 
classpath/ module path/ opens, consistent with those the test classes are 
running with. I ran nightly tests of the core+replication and all seems fine.
   
   The downside is that folks will no longer be able to run those tests from 
within IDEs, but I think it's an acceptable tradeoff given how complex those 
jvm arguments can get (some things on module path, some on classpath, etc.).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Assigned] (LUCENE-10370) Fix classpath/module path of tests forking their own Java (TestNRTReplication)

2022-05-20 Thread Dawid Weiss (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss reassigned LUCENE-10370:


Assignee: Dawid Weiss

> Fix classpath/module path of tests forking their own Java (TestNRTReplication)
> --
>
> Key: LUCENE-10370
> URL: https://issues.apache.org/jira/browse/LUCENE-10370
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Minor
>
> TestNRTReplication fails because it assumes classpath can just be copied to a 
> sub-process - this is no longer the case.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on a diff in pull request #897: LUCENE-10266 Move nearest-neighbor search on points to core

2022-05-20 Thread GitBox


jpountz commented on code in PR #897:
URL: https://github.com/apache/lucene/pull/897#discussion_r877887683


##
lucene/core/src/java/org/apache/lucene/document/LatLonPoint.java:
##
@@ -362,4 +377,72 @@ public static Query newDistanceFeatureQuery(
 }
 return query;
   }
+
+  /**
+   * Finds the {@code n} nearest indexed points to the provided point, 
according to Haversine
+   * distance.
+   *
+   * This is functionally equivalent to running {@link MatchAllDocsQuery} 
with a {@link
+   * LatLonDocValuesField#newDistanceSort}, but is far more efficient since it 
takes advantage of
+   * properties the indexed BKD tree. Currently this only works with {@link 
Lucene90PointsFormat}

Review Comment:
   this point that it only works with Lucene90PointsFormat is no longer true, 
can you remove it?



##
lucene/core/src/java/org/apache/lucene/search/NearestNeighbor.java:
##
@@ -31,12 +31,8 @@
 import org.apache.lucene.util.Bits;
 import org.apache.lucene.util.SloppyMath;
 
-/**
- * KNN search on top of 2D lat/lon indexed points.
- *
- * @lucene.experimental
- */
-class NearestNeighbor {
+/** KNN search on top of 2D lat/lon indexed points. */
+public class NearestNeighbor {

Review Comment:
   Can we keep it pkg-private if we have it in oal.document? This is especially 
important as we're promoting this class to core.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on pull request #892: LUCENE-10581: Optimize stored fields bulk merges on the first segment

2022-05-20 Thread GitBox


jpountz commented on PR #892:
URL: https://github.com/apache/lucene/pull/892#issuecomment-1132621070

   Quadratic merging is gone but this change still looks helpful so I updated 
the title/description.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on pull request #907: LUCENE-10357 Ghost fields and postings/points

2022-05-20 Thread GitBox


jpountz commented on PR #907:
URL: https://github.com/apache/lucene/pull/907#issuecomment-1132617984

   Thanks for tackling this! Your are going in the right direction. Could we 
drop most of the `if (terms != Terms.EMPTY)` checks and handle them like any 
non-null `Terms` instance?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this

2022-05-20 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539982#comment-17539982
 ] 

ASF subversion and git services commented on LUCENE-10574:
--

Commit e56f7a6336b26adbf2a180a588240ffb75c47ff4 in lucene's branch 
refs/heads/branch_9x from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=e56f7a6336b ]

LUCENE-10574: Keep allowing unbalanced merges if they would reclaim lots of 
deletes. (#905)

`TestTieredMergePolicy` caught this special case: if a segment has lots of
deletes, we should still allow unbalanced merges.

> Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't 
> do this
> ---
>
> Key: LUCENE-10574
> URL: https://issues.apache.org/jira/browse/LUCENE-10574
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Robert Muir
>Priority: Major
> Fix For: 9.3
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Remove {{floorSegmentBytes}} parameter, or change lucene's default to a merge 
> policy that doesn't merge in an O(n^2) way.
> I have the feeling it might have to be the latter, as folks seem really wed 
> to this crazy O(n^2) behavior.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this

2022-05-20 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539980#comment-17539980
 ] 

ASF subversion and git services commented on LUCENE-10574:
--

Commit 5e9dfbed23a9bdd21b16d2e31e5e805f4fd6 in lucene's branch 
refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=5e9dfbed277 ]

LUCENE-10574: Keep allowing unbalanced merges if they would reclaim lots of 
deletes. (#905)

`TestTieredMergePolicy` caught this special case: if a segment has lots of
deletes, we should still allow unbalanced merges.

> Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't 
> do this
> ---
>
> Key: LUCENE-10574
> URL: https://issues.apache.org/jira/browse/LUCENE-10574
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Robert Muir
>Priority: Major
> Fix For: 9.3
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Remove {{floorSegmentBytes}} parameter, or change lucene's default to a merge 
> policy that doesn't merge in an O(n^2) way.
> I have the feeling it might have to be the latter, as folks seem really wed 
> to this crazy O(n^2) behavior.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz merged pull request #905: LUCENE-10574: Keep allowing unbalanced merges if they would reclaim lots of deletes.

2022-05-20 Thread GitBox


jpountz merged PR #905:
URL: https://github.com/apache/lucene/pull/905


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10581) Optimize stored fields merges on the first segment

2022-05-20 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539976#comment-17539976
 ] 

Adrien Grand commented on LUCENE-10581:
---

StoredFieldsBenchmark on the entire geonames dataset gives me

|| Msec to index || BEST_SPEED || BEST_COMPRESSION ||
| main | 105726 | 248489 |
| patch | 91000 | 200811 |

> Optimize stored fields merges on the first segment
> --
>
> Key: LUCENE-10581
> URL: https://issues.apache.org/jira/browse/LUCENE-10581
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>
> This is mostly repurposing LUCENE-10573. Even though our merge policies no 
> longer perform quadratic merging, it's still possible to configure them with 
> low merge factors (e.g. 2) or they might decide to create unbalanced merges 
> where the biggest segment of the merge accounts for a large part of the 
> merge. In such cases, copying compressed data directly still yields 
> significant benefits.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-10581) Optimize stored fields merges on the first segment

2022-05-20 Thread Adrien Grand (Jira)
Adrien Grand created LUCENE-10581:
-

 Summary: Optimize stored fields merges on the first segment
 Key: LUCENE-10581
 URL: https://issues.apache.org/jira/browse/LUCENE-10581
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand


This is mostly repurposing LUCENE-10573. Even though our merge policies no 
longer perform quadratic merging, it's still possible to configure them with 
low merge factors (e.g. 2) or they might decide to create unbalanced merges 
where the biggest segment of the merge accounts for a large part of the merge. 
In such cases, copying compressed data directly still yields significant 
benefits.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this

2022-05-20 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539959#comment-17539959
 ] 

Adrien Grand commented on LUCENE-10574:
---

There was a good speedup on the stored fields benchmark thanks to this change: 
http://people.apache.org/~mikemccand/lucenebench/stored_fields_benchmarks.html
 - 2.8x faster for BEST_SPEED
 - 2.3x faster for BEST_COMPRESSION

The other good news is that tests do not take longer despite the removal of 
LuceneTestCase#avoidPathologicalMerging. 
http://people.apache.org/~mikemccand/lucenebench/antcleantest.html

Other benchmarks are not affected because they would flush large-enough 
segments to not be subject to the floor segment size.

> Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't 
> do this
> ---
>
> Key: LUCENE-10574
> URL: https://issues.apache.org/jira/browse/LUCENE-10574
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Robert Muir
>Priority: Major
> Fix For: 9.3
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Remove {{floorSegmentBytes}} parameter, or change lucene's default to a merge 
> policy that doesn't merge in an O(n^2) way.
> I have the feeling it might have to be the latter, as folks seem really wed 
> to this crazy O(n^2) behavior.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org