Re: [PR] Fix a bug in ShapeTestUtil [lucene]

2024-01-23 Thread via GitHub


heemin32 commented on PR #12287:
URL: https://github.com/apache/lucene/pull/12287#issuecomment-1907155750

   Any update?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add support for similarity-based vector searches [lucene]

2024-01-23 Thread via GitHub


junqiu-lei commented on PR #12679:
URL: https://github.com/apache/lucene/pull/12679#issuecomment-1907152593

   Hi, do we have any scheduled release date for this exciting feature?





Re: [PR] LUCENE-10334: Introduce a BlockReader based on ForUtil and use it for NumericDocValues [lucene]

2024-01-23 Thread via GitHub


github-actions[bot] commented on PR #562:
URL: https://github.com/apache/lucene/pull/562#issuecomment-1907130825

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!





Re: [PR] GH#11922: Allow DisjunctionDISIApproximation to short-circuit [lucene]

2024-01-23 Thread via GitHub


github-actions[bot] commented on PR #11928:
URL: https://github.com/apache/lucene/pull/11928#issuecomment-1907130536

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!





Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-01-23 Thread via GitHub


github-actions[bot] commented on PR #11888:
URL: https://github.com/apache/lucene/pull/11888#issuecomment-1907130580

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!





Re: [PR] Remove redundant code in Lucene search [lucene]

2024-01-23 Thread via GitHub


github-actions[bot] commented on PR #12035:
URL: https://github.com/apache/lucene/pull/12035#issuecomment-1907130465

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!





Re: [PR] Use `instanceof` pattern-matching where possible [lucene]

2024-01-23 Thread via GitHub


github-actions[bot] commented on PR #12295:
URL: https://github.com/apache/lucene/pull/12295#issuecomment-1907130198

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!





Re: [PR] Fix a bug in ShapeTestUtil [lucene]

2024-01-23 Thread via GitHub


github-actions[bot] commented on PR #12287:
URL: https://github.com/apache/lucene/pull/12287#issuecomment-1907130242

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!





Re: [PR] upgrade to OpenNLP 2.3.1 [lucene]

2024-01-23 Thread via GitHub


github-actions[bot] commented on PR #12674:
URL: https://github.com/apache/lucene/pull/12674#issuecomment-1907129893

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!





Re: [PR] Generalize LSBRadixSorter and use it in SortingPostingsEnum [lucene]

2024-01-23 Thread via GitHub


github-actions[bot] commented on PR #12800:
URL: https://github.com/apache/lucene/pull/12800#issuecomment-1907129767

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!





Re: [PR] Add Facets#getBulkSpecificValues method [lucene]

2024-01-23 Thread via GitHub


github-actions[bot] commented on PR #12862:
URL: https://github.com/apache/lucene/pull/12862#issuecomment-1907129639

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!





Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]

2024-01-23 Thread via GitHub


github-actions[bot] commented on PR #12915:
URL: https://github.com/apache/lucene/pull/12915#issuecomment-1907129523

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!





Re: [PR] Forbidden Thread.sleep API [lucene]

2024-01-23 Thread via GitHub


github-actions[bot] commented on PR #13001:
URL: https://github.com/apache/lucene/pull/13001#issuecomment-1907129375

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!





Re: [PR] Fix issues with chunked TaxonomyIndexArray [lucene]

2024-01-23 Thread via GitHub


msfroh commented on code in PR #13028:
URL: https://github.com/apache/lucene/pull/13028#discussion_r1463878454


##
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java:
##
@@ -95,7 +97,8 @@ public TaxonomyIndexArrays(IndexReader reader, TaxonomyIndexArrays copyFrom) thr
 // NRT reader was obtained, even though nothing was changed. this is not very likely
 // to happen.
 int[][] parentArray = allocateChunkedArray(reader.maxDoc(), copyFrom.parents.values.length - 1);
-if (parentArray.length > 0) {
+assert parentArray.length > 0;
+if (parentArray[parentArray.length - 1].length > 0) {

Review Comment:
   Actually, I guess my concern above was about the case where we start with a 
multiple of `CHUNK_SIZE`. This one kicks in when we grow to a multiple of 
`CHUNK_SIZE`.
   
   Either way, I think the condition is going to hit us when we have that 
trailing empty array. 
   
   Could the condition be `if (parentArray[0].length > 0)`, since we have the 
assert to say that `parentArray` always has at least one element (which could 
be empty)?
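
To make the two conditions concrete, here is a minimal sketch. It assumes the chunked layout described in this thread (full chunks of `CHUNK_SIZE`, then one tail chunk that may be empty) and a `CHUNK_SIZE` of 8192. The `allocate` method is a hypothetical stand-in, not the real `allocateChunkedArray`.

```java
class ChunkedArraySketch {
  // Assumed chunk size, taken from the 8192 figure used in this thread.
  static final int CHUNK_SIZE = 8192;

  // Hypothetical stand-in for allocateChunkedArray: full chunks followed by one
  // tail chunk (possibly empty), so the outer array always has at least one element.
  static int[][] allocate(int size) {
    int fullChunks = size / CHUNK_SIZE;
    int[][] chunks = new int[fullChunks + 1][];
    for (int i = 0; i < fullChunks; i++) {
      chunks[i] = new int[CHUNK_SIZE];
    }
    chunks[fullChunks] = new int[size % CHUNK_SIZE];
    return chunks;
  }

  // The condition from the PR: looks only at the tail chunk.
  static boolean lastChunkNonEmpty(int[][] chunks) {
    return chunks[chunks.length - 1].length > 0;
  }

  // The condition suggested in the review: looks only at the first chunk.
  static boolean firstChunkNonEmpty(int[][] chunks) {
    return chunks[0].length > 0;
  }

  public static void main(String[] args) {
    for (int size : new int[] {0, 8190, 8192, 8193}) {
      int[][] chunks = allocate(size);
      System.out.println(
          "size=" + size
              + " chunks=" + chunks.length
              + " lastChunkNonEmpty=" + lastChunkNonEmpty(chunks)
              + " firstChunkNonEmpty=" + firstChunkNonEmpty(chunks));
    }
  }
}
```

Under these assumptions the two checks disagree only when the size is a non-zero multiple of `CHUNK_SIZE`, which is the case this comment is about.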






Re: [PR] Fix issues with chunked TaxonomyIndexArray [lucene]

2024-01-23 Thread via GitHub


msfroh commented on code in PR #13028:
URL: https://github.com/apache/lucene/pull/13028#discussion_r1463869892


##
lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestTaxonomyIndexArrays.java:
##
@@ -59,4 +74,88 @@ public void testMultiplesOfChunkSize() {
   ordinal = newOrdinal;
 }
   }
+
+  public void testConstructFromEmptyIndex() throws IOException {
+Directory dir = newDirectory();
+
+// Produce empty index
+new IndexWriter(dir, newIndexWriterConfig(null)).close();
+
+IndexReader reader = DirectoryReader.open(dir);
+
+TaxonomyIndexArrays tia = new TaxonomyIndexArrays(reader);
+assertEquals(0, tia.parents().length());
+
+tia = new TaxonomyIndexArrays(reader, tia);
+assertEquals(0, tia.parents().length());
+
+reader.close();
+dir.close();
+  }
+
+  public void testRefresh() throws IOException {
+Directory dir = newDirectory();
+
+// Write two chunks worth of ordinals whose parents are ordinals 1 or 2
+TaxonomyWriter taxoWriter = new DirectoryTaxonomyWriter(dir);
+taxoWriter.addCategory(new FacetLabel("a")); // ordinal 1
+taxoWriter.addCategory(new FacetLabel("b")); // ordinal 2
+for (int i = 0; i < 2 * TaxonomyIndexArrays.CHUNK_SIZE; i++) {

Review Comment:
   If you start this with `i = 3`, you should end up with exactly two chunks' 
worth of ordinals (2 × `CHUNK_SIZE`).
   
   I believe you would hit the condition that I called out above where 
`initParents` won't get called.



##
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java:
##
@@ -80,7 +81,8 @@ public int length() {
 
   public TaxonomyIndexArrays(IndexReader reader) throws IOException {
 int[][] parentArray = allocateChunkedArray(reader.maxDoc(), 0);
-if (parentArray.length > 0) {
+assert parentArray.length > 0;
+if (parentArray[parentArray.length - 1].length > 0) {

Review Comment:
   Does this condition work when we grow to a multiple of `CHUNK_SIZE`?
   
   Let's say the total size was 8190, and then we add two more ordinals. Now we 
have a chunk of size 8192, followed by a "tail chunk" of size 0. With this 
condition, I think we would fail to read the two new elements from `reader`.
   
   I think you might be able to verify it with `testRefresh`, where you should 
find that the `parents` arrays are always:
   
   ```
   parents[0][0] == INVALID
   parents[0][1] == 0
   parents[0][2] == 0
   parents[0][3] == 1
   parents[0][4] == 2
   parents[0][5] == 1
   parents[0][6] == 2
   ...
   parents[0][n] == 1 (if n%2 == 1)
   parents[0][n] == 2 (if n%2 == 0)
   ```
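
A check along those lines could be sketched as follows. This is only an illustration: `CHUNK_SIZE = 8192` and `INVALID = -1` are assumed values, and `expectedParent` and `firstMismatch` are hypothetical helpers, not part of the test in the PR. Ordinals past the first chunk are addressed as `parents[ord / CHUNK_SIZE][ord % CHUNK_SIZE]`.

```java
class ExpectedParentsSketch {
  static final int CHUNK_SIZE = 8192; // assumed, as in the discussion above
  static final int INVALID = -1; // assumed sentinel for the root's parent

  // Expected parent of an ordinal in the testRefresh layout listed above.
  static int expectedParent(int ordinal) {
    if (ordinal == 0) {
      return INVALID; // the root has no parent
    }
    if (ordinal <= 2) {
      return 0; // "a" and "b" hang off the root
    }
    return (ordinal % 2 == 1) ? 1 : 2; // odd ordinals -> 1, even ordinals -> 2
  }

  // Returns the first ordinal whose stored parent differs from the expected
  // pattern, or -1 if the first `total` ordinals all match.
  static int firstMismatch(int[][] parents, int total) {
    for (int ord = 0; ord < total; ord++) {
      int actual = parents[ord / CHUNK_SIZE][ord % CHUNK_SIZE];
      if (actual != expectedParent(ord)) {
        return ord;
      }
    }
    return -1;
  }
}
```

If the two newly added ordinals were never read, their slots would presumably keep a default value, and a check like `firstMismatch` would report the first of them.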






Re: [PR] Fix NPE when sampling for quantization in Lucene99HnswScalarQuantizedVectorsFormat [lucene]

2024-01-23 Thread via GitHub


benwtrent merged PR #13027:
URL: https://github.com/apache/lucene/pull/13027





Re: [PR] Enable MemorySegment in MMapDirectory for Java 22+ and Vectorization (incubation) for exact Java 22 [lucene]

2024-01-23 Thread via GitHub


ChrisHegarty commented on PR #12706:
URL: https://github.com/apache/lucene/pull/12706#issuecomment-1905951549

   > I don't think there will be any changes till release, but my plan was to 
merge this on 2024-02-08.
   
   That sounds fine to me. And I agree with merging this at the JDK 22 RC date.





Re: [PR] Fix NPE when sampling for quantization in Lucene99HnswScalarQuantizedVectorsFormat [lucene]

2024-01-23 Thread via GitHub


ChrisHegarty commented on code in PR #13027:
URL: https://github.com/apache/lucene/pull/13027#discussion_r1463192412


##
lucene/CHANGES.txt:
##
@@ -243,6 +243,13 @@ Other
 
 * GITHUB#12934: Cleaning up old references to Lucene/Solr. (Jakub Slowinski)
 
+======================== Lucene 9.9.2 =======================
+
+Bug Fixes
+---------------------
+
+* GITHUB#13027: Fix NPE when sampling for quantization in Lucene99HnswScalarQuantizedVectorsFormat (Ben Trent)

Review Comment:
   After giving it some thought, I agree with targeting this for 9.9.2. This 
kinda bug is exactly why we have patch releases.






Re: [PR] Fix NPE when sampling for quantization in Lucene99HnswScalarQuantizedVectorsFormat [lucene]

2024-01-23 Thread via GitHub


benwtrent commented on code in PR #13027:
URL: https://github.com/apache/lucene/pull/13027#discussion_r1463174260


##
lucene/core/src/java/org/apache/lucene/util/ScalarQuantizer.java:
##
@@ -201,13 +233,26 @@ public String toString() {
*
* @param floatVectorValues the float vector values from which to calculate the quantiles
* @param confidenceInterval the confidence interval used to calculate the quantiles
+   * @param totalVectorCount the total number of live float vectors in the index. This is vital for
+   * accounting for deleted documents when calculating the quantiles.
* @return A new {@link ScalarQuantizer} instance
* @throws IOException if there is an error reading the float vector values
*/
   public static ScalarQuantizer fromVectors(
-  FloatVectorValues floatVectorValues, float confidenceInterval) throws IOException {
+  FloatVectorValues floatVectorValues, float confidenceInterval, int totalVectorCount)

Review Comment:
   Yeah, changing a public definition in a bug fix stinks. I can overload it 
for the bug fix and remove the overload in a minor release. These are 
experimental, so we should be able to change them in minors.






Re: [PR] Initial impl of MMapDirectory and Vectorization (incubation) for Java 22 [lucene]

2024-01-23 Thread via GitHub


uschindler commented on PR #12706:
URL: https://github.com/apache/lucene/pull/12706#issuecomment-1905762807

   I don't think there will be any changes till release, but my plan was to 
merge this on 2024-02-08.





Re: [PR] Initial impl of MMapDirectory and Vectorization (incubation) for Java 22 [lucene]

2024-01-23 Thread via GitHub


uschindler commented on PR #12706:
URL: https://github.com/apache/lucene/pull/12706#issuecomment-1905761622

   Hey, yes. I wanted to wait until the RC phase and check everything before then.
   
   If we want to release a new Lucene version, we can merge this earlier; I 
just want to be on the safe side.





Re: [PR] Initial impl of MMapDirectory and Vectorization (incubation) for Java 22 [lucene]

2024-01-23 Thread via GitHub


ChrisHegarty commented on PR #12706:
URL: https://github.com/apache/lucene/pull/12706#issuecomment-1905715861

   @uschindler This looks like it is in good shape. I think that it is ready 
for merging, right?





Re: [PR] Fix NPE when sampling for quantization in Lucene99HnswScalarQuantizedVectorsFormat [lucene]

2024-01-23 Thread via GitHub


ChrisHegarty commented on code in PR #13027:
URL: https://github.com/apache/lucene/pull/13027#discussion_r1463026638


##
lucene/CHANGES.txt:
##
@@ -243,6 +243,13 @@ Other
 
 * GITHUB#12934: Cleaning up old references to Lucene/Solr. (Jakub Slowinski)
 
+======================== Lucene 9.9.2 =======================
+
+Bug Fixes
+---------------------
+
+* GITHUB#13027: Fix NPE when sampling for quantization in Lucene99HnswScalarQuantizedVectorsFormat (Ben Trent)

Review Comment:
   This is a significant issue and needs addressing with some urgency. However, 
we need to decide exactly where this should land.






Re: [PR] Fix NPE when sampling for quantization in Lucene99HnswScalarQuantizedVectorsFormat [lucene]

2024-01-23 Thread via GitHub


ChrisHegarty commented on code in PR #13027:
URL: https://github.com/apache/lucene/pull/13027#discussion_r1463016033


##
lucene/core/src/java/org/apache/lucene/util/ScalarQuantizer.java:
##
@@ -201,13 +233,26 @@ public String toString() {
*
* @param floatVectorValues the float vector values from which to calculate the quantiles
* @param confidenceInterval the confidence interval used to calculate the quantiles
+   * @param totalVectorCount the total number of live float vectors in the index. This is vital for
+   * accounting for deleted documents when calculating the quantiles.
* @return A new {@link ScalarQuantizer} instance
* @throws IOException if there is an error reading the float vector values
*/
   public static ScalarQuantizer fromVectors(
-  FloatVectorValues floatVectorValues, float confidenceInterval) throws IOException {
+  FloatVectorValues floatVectorValues, float confidenceInterval, int totalVectorCount)

Review Comment:
   Alternatively, if we're concerned about binary compatibility with 9.9.0, 
this could be structured as a new, additional overload accepting 
`totalVectorCount`, while retaining the original signature, which simply calls 
the new overloaded method, passing `floatVectorValues.size()` and asserting that 
there are no deleted docs or something. This could be a bit "trappy", but not 
unreasonable if we think that we need to maintain binary compatibility.
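
A rough sketch of that shape, for illustration only: `VectorSource` and `Result` are hypothetical stand-ins for `FloatVectorValues` and `ScalarQuantizer`, and the quantile computation itself is left out. The only point is that the old two-argument entry point stays and delegates to the new overload.

```java
class ScalarQuantizerOverloadSketch {
  /** Hypothetical stand-in for FloatVectorValues; only size() matters here. */
  interface VectorSource {
    int size();
  }

  /** Hypothetical stand-in for the quantizer; records the inputs it was built from. */
  static final class Result {
    final float confidenceInterval;
    final int totalVectorCount;

    Result(float confidenceInterval, int totalVectorCount) {
      this.confidenceInterval = confidenceInterval;
      this.totalVectorCount = totalVectorCount;
    }
  }

  // Original signature, retained so callers compiled against it still link.
  // It assumes there are no deleted docs, so size() is used as the total count;
  // a real version might assert that assumption here.
  static Result fromVectors(VectorSource vectors, float confidenceInterval) {
    return fromVectors(vectors, confidenceInterval, vectors.size());
  }

  // New overload: the caller states how many live vectors there are.
  static Result fromVectors(
      VectorSource vectors, float confidenceInterval, int totalVectorCount) {
    if (totalVectorCount < 0) {
      throw new IllegalArgumentException("totalVectorCount must be >= 0");
    }
    // The actual quantile sampling is omitted in this sketch.
    return new Result(confidenceInterval, totalVectorCount);
  }
}
```

The "trappy" part is visible in the first method: a caller that still uses the old signature on a segment with deletions would silently pass the wrong count unless the assertion catches it.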






Re: [I] What if we pick up segments in segment size's ascending order in TieredMergePolicy.doFindMerges? [lucene]

2024-01-23 Thread via GitHub


vsop-479 commented on issue #13022:
URL: https://github.com/apache/lucene/issues/13022#issuecomment-1905672274

   > Is findMerges() a bottleneck for your use-case?
   
   No. We do see indexing performance degrade when our index gets too big, but 
the main reason may be that `doStall` takes longer as the data grows (we set 
maxMergeCount to 1 on HDD, following ES's recommendation). I will enable 
`MergeScheduler.verbose` to verify that.
   
   I am just wondering if we can walk segments in increasing size order with 
some contract.
   
   > You can no longer easily do this if you walk segments in increasing size 
order.
   
   You are right, thanks for pointing that out.
   





Re: [PR] Writing a HOWTO migrate codec version [lucene]

2024-01-23 Thread via GitHub


shubhamvishu commented on code in PR #12930:
URL: https://github.com/apache/lucene/pull/12930#discussion_r1462861811


##
dev-docs/codec-version-bump-howto.md:
##
@@ -0,0 +1,74 @@
+
+
+# Lucene Codec Version Bump How-To Manual
+
+Changing the name of the codec in Lucene is required for maintaining backward compatibility and ensuring a smooth transition for users.
+Through explicit versioning, we isolate changes to the codec, and prevent unintended interactions between different codec
+versions. Changes to the codec format version serves as a signal that a change in the format or structure of the index has occurred.
+
+This manual provides a step-by-step guide through the process of "bumping" the version of the Lucene Codec.
+
+### Fork and modify files
+* Creating a new Codec:
+  * Create a new package for your new codec of version XXX: `org.apache.lucene.codecs.luceneXXX`.

Review Comment:
   How about writing this how-to with the idea of changing the codec version 
from `LuceneXY` to `LuceneXZ` (?), that way I believe it'll be more clear to 
understand which codec is referenced in as opposed to writing LuceneXXX or 
saying old vs new. What do you think?









Re: [PR] Writing a HOWTO migrate codec version [lucene]

2024-01-23 Thread via GitHub


shubhamvishu commented on code in PR #12930:
URL: https://github.com/apache/lucene/pull/12930#discussion_r1462857551


##
dev-docs/codec-version-bump-howto.md:
##
@@ -0,0 +1,74 @@
+
+
+# Lucene Codec Version Bump How-To Manual
+
+Changing the name of the codec in Lucene is required for maintaining backward compatibility and ensuring a smooth transition for users.
+Through explicit versioning, we isolate changes to the codec, and prevent unintended interactions between different codec
+versions. Changes to the codec format version serves as a signal that a change in the format or structure of the index has occurred.
+
+This manual provides a step-by-step guide through the process of "bumping" the version of the Lucene Codec.
+
+### Fork and modify files
+* Creating a new Codec:
+  * Create a new package for your new codec of version XXX: `org.apache.lucene.codecs.luceneXXX`.
+  * Copy the previous codec in, updating all references to the old version number.
+  * Set the new codec, as the `defaultCodec` in [Codec.java](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/Codec.java).
+  * Update all references from the old codec version to the new one.
+* Fork Components:
+  * Fork the required codec components, ensure the new codec is using those components.
+  * Fork subclasses impacted by the codec changes and update version numbers accordingly.
+* ForUtil (if modified):
+  * Move the unmodified `gen_ForUtil.py` to the latest `backward_codecs` package. Move the modified `gen_ForUtil.py` to the newly forked codec package.
+  * Change the [generateForUtil.json](https://github.com/apache/lucene/blob/main/lucene/core/src/generated/checksums/generateForUtil.json) to use the newest `luceneXXX` package. Create a [generateForUtilXXX.json](https://github.com/apache/lucene/tree/main/lucene/backward-codecs/src/generated/checksums) for the `gen_ForUtil.py` moved to `backwards_codecs`.
+  * In [forUtil.gradle](https://github.com/apache/lucene/blob/main/gradle/generation/forUtil.gradle), change the genDir to the latest
+codec. Create a new task to `generateForUtil` for the `ForUtil` in `backwards_codecs`.
+  * Generate the `ForUtil`'s by running the command : `./gradlew generateForUtil`.
+
+### Move original files to backwards_codecs
+* Deprecate previous Codec:
+  * Create a new package for the previous codec of version XXX: `org.apache.lucene.backward_codecs.luceneXXX.LuceneXXXCodec`.

Review Comment:
   Are we creating a new package here, or do you mean a new class, e.g. 
`LuceneXXXCodec`, for the old codec as specified here?


