Re: [I] `FSTCompiler.Builder` should have an option to stream the FST bytes directly to Directory [lucene]

2023-10-04 Thread via GitHub


dungba88 commented on issue #12543:
URL: https://github.com/apache/lucene/issues/12543#issuecomment-1748001631

   I put together a PR at https://github.com/apache/lucene/pull/12624.
   
   I also verified with a custom dictionary (~1MB in size) that position does 
not go backward to previously written input, so I think it would not be a 
problem to create a new `FSTWriter` that flush to disk on every add and only 
use the in-mem buffer for processing the current input. That is assuming we 
have `shouldShareSuffix` set to false.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] SOLR-17004: ZkStateReader waitForState should check clusterState before using watchers [lucene-solr]

2023-10-04 Thread via GitHub


risdenk merged PR #2678:
URL: https://github.com/apache/lucene-solr/pull/2678


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] SOLR-17004: ZkStateReader waitForState should check clusterState before using watchers [lucene-solr]

2023-10-04 Thread via GitHub


risdenk opened a new pull request, #2678:
URL: https://github.com/apache/lucene-solr/pull/2678

   Backport SOLR-17004


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Make `byte[]` vector comparisons faster! (if possible) [lucene]

2023-10-04 Thread via GitHub


rmuir commented on issue #12621:
URL: https://github.com/apache/lucene/issues/12621#issuecomment-1747775200

   > Actually it is worse: Java 20 introduced conversion between short/float, 
but we got neither a native `float16` datatype nor vector support. In short: 
completely unuseable.
   
   We should at least fix the field we have in the sandbox to use it and start 
playing with the possibilities/performance? 
https://github.com/apache/lucene/blob/main/lucene/sandbox/src/java/org/apache/lucene/sandbox/document/HalfFloatPoint.java


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Reduce FST block size for BlockTreeTermsWriter (#12604) [lucene-solr]

2023-10-04 Thread via GitHub


risdenk merged PR #2677:
URL: https://github.com/apache/lucene-solr/pull/2677


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Allow FST builder to use different writer (#12543) [lucene]

2023-10-04 Thread via GitHub


dungba88 opened a new pull request, #12624:
URL: https://github.com/apache/lucene/pull/12624

   ### Description
   
   Refactor the method in `BytesStore` needed for FST construction to an 
abstract class and allow it to be passed from `FSTCompiler.Builder`. The 
Builder will still maintain `bytesPageBits(int)` for backward-compatibility, in 
which it will pass the BytesStore.
   
   Issue: https://github.com/apache/lucene/issues/12543


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Use a MergeSorter taking advantage of extra storage for StableMSBRadixSorter [lucene]

2023-10-04 Thread via GitHub


gf2121 opened a new pull request, #12623:
URL: https://github.com/apache/lucene/pull/12623

   ### Description
   
   As `StableMSBRadixSorter` always requires a `O(n)` extra memory. We can use 
a `MergeSorter` taking advantage of the extra memory instead of 
`InPlaceMergeSorter`.
   
   ### Benchmark
   
   This reduces around 15% took of sorting 10,000,000 timestamp values (long) 
with `StableMSBRadisSorter`.
   https://bytedance.feishu.cn/sheets/HSetsPqDrhicnet5lWrcOXMtnRc;
 data-importRangeRawData-range="Sheet1!A1:D2">
   
     | baseline | candidate | took diff
   -- | -- | -- | --
   sort took | 5285 | 4464 | -15.53%
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-04 Thread via GitHub


benwtrent commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1747350135

   @jmazanec15 I agree that SPANN seems more attractive. I would argue though 
we don't need to do clustering (in the paper they do clustering, but with 
minimal effectiveness), but could maybe get away with randomly selecting a 
subset of vectors per segment.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-04 Thread via GitHub


jmazanec15 commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1747329967

   A hybrid disk-memory algorithm would have very strong benefits. I did run a 
few tests recently that confirmed HNSW does not function very well when memory 
gets constrained (which I think everyone already knew). 
   
   I wonder though, instead of DiskANN, what about a partitioning based 
approach such as [SPANN](https://arxiv.org/pdf/2111.08566.pdf)? I think a 
partitioning based approach for Lucene would make merging, updating, filtering 
and indexing a lot easier. Also, it seems it would have better disk-access 
patterns. In the paper, they do show that in a memory constrained environment, 
they were able to outperfrom DiskANN.
   
   I guess the tradeoff might be that partitioning based approaches would 
struggle to achieve really low latency search when in memory compared to 
graph-based approaches. Additionally, partitioning approaches would require a 
potentially expensive "training" or "preprocessing" step such as k-Means and 
performance could degrade if data distribution drifts and the models are not 
updated. But, if PQ is implemented, the same considerations would need to be 
taken as well.  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Improve fallback sorter for BKD [lucene]

2023-10-04 Thread via GitHub


gf2121 commented on code in PR #12610:
URL: https://github.com/apache/lucene/pull/12610#discussion_r1346210779


##
lucene/core/src/java/org/apache/lucene/util/bkd/MutablePointTreeReaderUtils.java:
##
@@ -81,6 +86,40 @@ protected int byteAt(int i, int k) {
   return (reader.getDocID(i) >>> Math.max(0, shift)) & 0xff;
 }
   }
+
+  @Override
+  protected Sorter getFallbackSorter(int k) {

Review Comment:
   Thanks for the review and the great advice! As `Arrays#compareUnsigned` need 
to compare bytes one by one when length < 8, This way should speed things up.
   However, the newest patch (as well as origin patch that using 
`Arrays#compareUnsigned`) shows a slower speed than `main` in my benchmark. I'm 
still digging to understand what happened. I'll change this PR to draft until 
figuring it out.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Improve fallback sorter for BKD [lucene]

2023-10-04 Thread via GitHub


gf2121 commented on code in PR #12610:
URL: https://github.com/apache/lucene/pull/12610#discussion_r1346202698


##
lucene/core/src/java/org/apache/lucene/util/bkd/MutablePointTreeReaderUtils.java:
##
@@ -81,6 +86,40 @@ protected int byteAt(int i, int k) {
   return (reader.getDocID(i) >>> Math.max(0, shift)) & 0xff;
 }
   }
+
+  @Override
+  protected Sorter getFallbackSorter(int k) {
+return new InPlaceMergeSorter() {
+  @Override
+  protected int compare(int i, int j) {
+if (k < config.packedBytesLength) {
+  reader.getValue(i, scratch1);
+  reader.getValue(j, scratch2);
+  int v =
+  Arrays.compareUnsigned(
+  scratch1.bytes,
+  scratch1.offset + k,
+  scratch1.offset + scratch1.length,
+  scratch2.bytes,
+  scratch2.offset + k,
+  scratch2.offset + scratch2.length);
+  if (v != 0) {
+return v;
+  }
+}
+if (bitsPerDocId == 0) {

Review Comment:
   Yes, see this 
[line](https://github.com/apache/lucene/blob/eb14a1a0a69757dcbda4829c4a160ecd1a5530ad/lucene/core/src/java/org/apache/lucene/util/bkd/MutablePointTreeReaderUtils.java#L59).
 We are avoiding doc id comparing with a stable sort if index sorting in not 
enabled :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-04 Thread via GitHub


benwtrent commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1747298348

   > DiskANN is known to be slower at indexing than HNSW and the blog post does 
not compare single threaded index times with Lucene.
   
   @robertvanwinkle1138 this is just one of my concerns.
   
   Additionally, I think we would still have the "segment merging problem". I 
do think we can get HNSW merges down by keeping connections between graphs, I 
just haven't had cycles myself to dig into that. There is a Lucene issue around 
making HNSW merges faster somewhere...
   
   https://github.com/apache/lucene/issues/12440


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-04 Thread via GitHub


robertvanwinkle1138 commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1747228177

   @benwtrent 
   For merges there is "FreshDiskANN: A Fast and Accurate Graph-Based
   ANN Index for Streaming Similarity Search"
   https://arxiv.org/pdf/2105.09613.pdf
   
   DiskANN is known to be slower at indexing than HNSW and the blog post does 
not compare single threaded index times with Lucene.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Make `byte[]` vector comparisons faster! (if possible) [lucene]

2023-10-04 Thread via GitHub


uschindler commented on issue #12621:
URL: https://github.com/apache/lucene/issues/12621#issuecomment-1747206287

   See https://github.com/openjdk/jdk/pull/9422 (Java 20)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Make `byte[]` vector comparisons faster! (if possible) [lucene]

2023-10-04 Thread via GitHub


uschindler commented on issue #12621:
URL: https://github.com/apache/lucene/issues/12621#issuecomment-1747204954

   Actually it is worse: Java 20 introduced conversion between short/float, but 
we got neither a native `float16` datatype nor vector support. In short: 
completely unuseable. 冷


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add readBytes method to RandomAccessInput [lucene]

2023-10-04 Thread via GitHub


iverase commented on PR #12600:
URL: https://github.com/apache/lucene/pull/12600#issuecomment-1747072284

   >@iverase, I think you have to move the changes entry to Lucene 10.
   
   I did it already in ba74da1
   
   >I changed the Policeman Jenkins MMAP job back to Lucene Main branch. The 
next run would have failed as you deleted the branch!
   
   Oops, I delete them mechanically, glad you changed quickly.
   
   > An alternative is to not implement the method and fallback to the default?!
   
   I think that is the best approach, We can do that if there is need to have 
it in the 9.x branch. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Make `byte[]` vector comparisons faster! (if possible) [lucene]

2023-10-04 Thread via GitHub


rmuir commented on issue #12621:
URL: https://github.com/apache/lucene/issues/12621#issuecomment-1747066837

   My recommendation: stop messing around with `byte` and start thinking about 
the new 16-bit half-float support that is present in Java 21. Unfortunately the 
half-float *vectorization* support is not even in openjdk master branch for 
experimentation yet. But even my 3-year old crappy $200 mobile phone can do 
16-bit dot product in hardware, support is more widespread and it would avoid 
these issues while still saving memory/space.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add readBytes method to RandomAccessInput [lucene]

2023-10-04 Thread via GitHub


uschindler commented on PR #12600:
URL: https://github.com/apache/lucene/pull/12600#issuecomment-1747053947

   @iverase, I think you have to move the changes entry to Lucene 10.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Make `byte[]` vector comparisons faster! (if possible) [lucene]

2023-10-04 Thread via GitHub


rmuir commented on issue #12621:
URL: https://github.com/apache/lucene/issues/12621#issuecomment-1747044386

   As far as the ARM goes, the fact it has only 128-bit SIMD is the limiting 
factor.
   
   For e.g. AVX-256, we use 64-bit vector of 8 byte values -> 128 bit vector of 
8 short values -> 256 bit vector of 8 int values.
   
   For ARM/NEON with only 128-bit, we can't do this as we don't have 256-bit 
vectors. So instead we use use 64-bit vector of 8 byte values -> 128 bit vector 
of 8 short values -> 2 128-bit vectors of 4 short values each. It requires 
splitting the vector in half, it is just all we can do. 
   
   If you want it to be faster get an ARM with SVE SIMD which has bigger 
vectors than NEON.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add readBytes method to RandomAccessInput [lucene]

2023-10-04 Thread via GitHub


uschindler commented on PR #12600:
URL: https://github.com/apache/lucene/pull/12600#issuecomment-1747042486

   P.S.: See 
[docs](https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/nio/ByteBuffer.html#get(int,byte%5B%5D,int,int))
 here. The method came with Java 13.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add readBytes method to RandomAccessInput [lucene]

2023-10-04 Thread via GitHub


uschindler commented on PR #12600:
URL: https://github.com/apache/lucene/pull/12600#issuecomment-1747037797

   Hi @iverase,
   oh yeah. The absolute ByteBuffer gets are not available in older Java 
versions.
   
   If you want to backport, you could create a temporary ByteBuffer slice, but 
if you're fine to delay this change, I am fine. An alternative is to not 
implement the method and fallback to the default?!
   
   I changed the Policeman Jenkins MMAP job back to Lucene Main branch. The 
next run would have failed as you deleted the branch!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add readBytes method to RandomAccessInput [lucene]

2023-10-04 Thread via GitHub


iverase commented on PR #12600:
URL: https://github.com/apache/lucene/pull/12600#issuecomment-1747031597

   @uschindler I merged the change. 
   
   I tried to backported but it is not possible ByteBuffer#get(int, byte[], 
int, int) is not available in the java version on line 9.x. I think it is ok to 
leave this change until 10.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add a merge policy wrapper that performs recursive graph bisection on merge. [lucene]

2023-10-04 Thread via GitHub


jpountz commented on PR #12622:
URL: https://github.com/apache/lucene/pull/12622#issuecomment-1747029247

   The diff is large because I had to introduce a new 
`SlowCompositeCodecReaderWrapper`, which effectively does the merge (lazily) 
and can be fed to the reordering logic prior to actually running the merge. The 
rest of the change is reasonable (at least for now).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Make `byte[]` vector comparisons faster! (if possible) [lucene]

2023-10-04 Thread via GitHub


rmuir commented on issue #12621:
URL: https://github.com/apache/lucene/issues/12621#issuecomment-1747026111

   Also their suggested replacement of 3 instructions for the `VPDPBUSD` is: 
   
   > Likewise, for 8-bit values, three instructions are needed - VPMADDUBSW 
which is used to multiply two 8-bit pairs and add them together, followed by a 
VPMADDWD with the value 1 in order to simply up-convert the 16-bit values to 
32-bit values, followed by the VPADDD instruction which adds the result to an 
accumulator. 
   
   I can tell you this is also not what is happening. We have no ability to 
write AVX-512-specific code and currently have to support ARM, machines with 
only AVX-256, etc. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Make `byte[]` vector comparisons faster! (if possible) [lucene]

2023-10-04 Thread via GitHub


rmuir commented on issue #12621:
URL: https://github.com/apache/lucene/issues/12621#issuecomment-1747002969

   the type conversions are what makes it slow. for float case it is the equiv 
of:
   ```
   float x = something;
   float y = something;
   float z = something;
   // no conversions
   float result = x + y * z;
   ```
   
   for the binary case it is the equivalent of:
   ```
   byte a = something;
   byte b = something;
   byte c = something;
   // multiply b + c avoiding overflow
   short z1 = (short)b * (short)c;
   // add a + z1 avoiding overflow
   int z = (int)a + (int)z1;
   ```
   
   You can understand the limitations at the hardware level better by reading 
https://en.wikichip.org/wiki/x86/avx512_vnni
   `VPDPBUSD` is currently not supported by openjdk either: 
https://bugs.openjdk.org/browse/JDK-8215891


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Add readBytes method to RandomAccessInput [lucene]

2023-10-04 Thread via GitHub


iverase closed issue #12599: Add readBytes method to RandomAccessInput
URL: https://github.com/apache/lucene/issues/12599


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add readBytes method to RandomAccessInput [lucene]

2023-10-04 Thread via GitHub


iverase merged PR #12600:
URL: https://github.com/apache/lucene/pull/12600


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] FST#Compiler allocates too much memory [lucene]

2023-10-04 Thread via GitHub


risdenk commented on issue #12598:
URL: https://github.com/apache/lucene/issues/12598#issuecomment-1746863472

   FWIW I was looking into this a bit when I saw this issue come in. 
Specifically on Solr 8.11, but as far as I can tell the changes in #12604 apply 
to 8.x as well.
   
   In a 30s async profiler memory allocation profile, can see that ~4% of total 
is this same FST byte array.
   https://github.com/apache/lucene/assets/3384157/7eefeeed-04f6-442b-b9f3-0ac394581b60;>
   
   I put up a backport PR for lucene-solr branch_8_11 - 
https://github.com/apache/lucene-solr/pull/2677


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Reduce FST block size for BlockTreeTermsWriter (#12604) [lucene-solr]

2023-10-04 Thread via GitHub


risdenk opened a new pull request, #2677:
URL: https://github.com/apache/lucene-solr/pull/2677

   Backport of https://github.com/apache/lucene/pull/12604


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Add a merge policy wrapper that performs recursive graph bisection on merge. [lucene]

2023-10-04 Thread via GitHub


jpountz opened a new pull request, #12622:
URL: https://github.com/apache/lucene/pull/12622

   This adds `BPReorderingMergePolicy`, a merge policy wrapper that reorders 
doc IDs on merge using a `BPIndexReorderer`.
- Reordering always run on forced merges.
- A `minNaturalMergeNumDocs` parameter helps only enable reordering on the 
larger merged segments. This way, small merges retain all merging optimizations 
like bulk copying of stored fields, and only the larger segments - which are 
the most important for search performance - get reordered.
- If not enough RAM is available to perform reordering, reordering is 
skipped.
   
   To make this work, I had to add the ability for any merge to reorder doc IDs 
of the merged segment via `OneMerge#reorder`. `MockRandomMergePolicy` from the 
test framework randomly reverts the order of documents in a merged segment to 
make sure this logic is properly exercised.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Write VLong in opposite order for better outputs sharing in the FST [lucene]

2023-10-04 Thread via GitHub


mikemccand commented on issue #12620:
URL: https://github.com/apache/lucene/issues/12620#issuecomment-1746831073

   This might be needle moving on the size of the FSTs created by block tree 
for the terms index, since it encodes long as `vLong` in its output.  We should 
only try this "reverse vLong" in that context (private to FST encoding) ... 
e.g. it would not likely help for FST `Outputs` since the encoded output 
is a true `long` value not a `byte[]` leading with an encoded `vLong` and most 
of the written `long` values in that case will already be smallish.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[I] Make `byte[]` vector comparisons faster! (if possible) [lucene]

2023-10-04 Thread via GitHub


benwtrent opened a new issue, #12621:
URL: https://github.com/apache/lucene/issues/12621

   ### Description
   
   While testing and digging around, I noticed that our float comparisons are 
way faster than byte on my Macbook (M1) and pretty much the same as our byte 
comparisons on a GCP Intel Sapphire Rapids CPU.
   
   This seems counter-intuitive to me. I would expect Panama to be able to do 
more `byte` operations per cycle than `float`. My guess is the intrinsics are 
weird? Panama Vector just doesn't support or detect the required operations?
   
   Here are two benchmark results using @rmuir's helpful vectorbench project:
   
   MacBook (Apple Silicon [128bits], JDK21):
   
   ```
   FloatDotProductBenchmark.dotProductNew 768  thrpt5  21.781 ± 0.254  
ops/us
   FloatDotProductBenchmark.dotProductNew1024  thrpt5  15.091 ± 0.217  
ops/us
   BinaryDotProductBenchmark.dotProductNew 768  thrpt5  8.041 ± 0.108  
ops/us
   BinaryDotProductBenchmark.dotProductNew1024  thrpt5  6.085 ± 0.133  
ops/us
   ```
   
   GCP (Intel Sapphire Rapids [avx512], JDK21):
   
   ```
   FloatDotProductBenchmark.dotProductNew 768  thrpt5  20.169 ± 0.385  
ops/us
   FloatDotProductBenchmark.dotProductNew1024  thrpt5  18.334 ± 0.180  
ops/us
   BinaryDotProductBenchmark.dotProductNew 768  thrpt5  19.686 ± 0.050  
ops/us
   BinaryDotProductBenchmark.dotProductNew1024  thrpt5  14.934 ± 0.014  
ops/us
   ```
   
   
   cpu-flags
   
   ```
   Flags:   fpu vme de pse tsc msr pae mce cx8 apic sep 
mtrr pge mca cmov pat pse36 clflus
h mmx fxsr sse sse2 ss ht syscall nx 
pdpe1gb rdtscp lm constant_tsc rep_good n
opl xtopology nonstop_tsc cpuid 
tsc_known_freq pni pclmulqdq ssse3 fma cx16 pc
id sse4_1 sse4_2 x2apic movbe popcnt aes 
xsave avx f16c rdrand hypervisor lahf
_lm abm 3dnowprefetch invpcid_single ssbd 
ibrs ibpb stibp ibrs_enhanced fsgsba
se tsc_adjust bmi1 avx2 smep bmi2 erms 
invpcid rtm avx512f avx512dq rdseed adx
 smap avx512ifma clflushopt clwb avx512cd 
sha_ni avx512bw avx512vl xsaveopt xs
avec xgetbv1 xsaves avx512_bf16 arat 
avx512vbmi umip avx512_vbmi2 gfni vaes vp
clmulqdq avx512_vnni avx512_bitalg 
avx512_vpopcntdq rdpid cldemote movdiri mov
dir64b fsrm md_clear serialize 
arch_capabilities
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add readBytes method to RandomAccessInput [lucene]

2023-10-04 Thread via GitHub


iverase commented on code in PR #12600:
URL: https://github.com/apache/lucene/pull/12600#discussion_r1345506266


##
lucene/core/src/java19/org/apache/lucene/store/MemorySegmentIndexInput.java:
##
@@ -168,6 +168,28 @@ private void readBytesBoundary(byte[] b, int offset, int 
len) throws IOException
 }
   }
 
+  private void readBytesBoundary(long pos, byte[] b, int offset, int len) 
throws IOException {

Review Comment:
   > You removed the absolute/randomaccess readBytesBoundary method, correct?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add readBytes method to RandomAccessInput [lucene]

2023-10-04 Thread via GitHub


iverase commented on code in PR #12600:
URL: https://github.com/apache/lucene/pull/12600#discussion_r1345506266


##
lucene/core/src/java19/org/apache/lucene/store/MemorySegmentIndexInput.java:
##
@@ -168,6 +168,28 @@ private void readBytesBoundary(byte[] b, int offset, int 
len) throws IOException
 }
   }
 
+  private void readBytesBoundary(long pos, byte[] b, int offset, int len) 
throws IOException {

Review Comment:
   > You removed the absolute/randomaccess readBytesBoundary method, correct?
   
   Yes, I left the sequential case unchanged.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add readBytes method to RandomAccessInput [lucene]

2023-10-04 Thread via GitHub


uschindler commented on code in PR #12600:
URL: https://github.com/apache/lucene/pull/12600#discussion_r1345489213


##
lucene/core/src/java19/org/apache/lucene/store/MemorySegmentIndexInput.java:
##
@@ -168,6 +168,28 @@ private void readBytesBoundary(byte[] b, int offset, int 
len) throws IOException
 }
   }
 
+  private void readBytesBoundary(long pos, byte[] b, int offset, int len) 
throws IOException {

Review Comment:
   You removed the absolute/randomaccess readBytesBoundary method, correct?
   
   I think for the random access case it is fine. The sequential read case is 
more optimized and it uses "catch exception program flow", which is there for 
performance. The separate method is also there for Hotspot to not fall into 
method-bytecode-size limitations while inlining (because readBytesBoundary is 
called exceptionally and should not increate methodSize of readBytes).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add readBytes method to RandomAccessInput [lucene]

2023-10-04 Thread via GitHub


iverase commented on code in PR #12600:
URL: https://github.com/apache/lucene/pull/12600#discussion_r1345475483


##
lucene/core/src/java19/org/apache/lucene/store/MemorySegmentIndexInput.java:
##
@@ -168,6 +168,28 @@ private void readBytesBoundary(byte[] b, int offset, int 
len) throws IOException
 }
   }
 
+  private void readBytesBoundary(long pos, byte[] b, int offset, int len) 
throws IOException {

Review Comment:
   I have moved the method behind the random access readBytes method



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] `FSTCompiler.Builder` should have an option to stream the FST bytes directly to Directory [lucene]

2023-10-04 Thread via GitHub


dungba88 commented on issue #12543:
URL: https://github.com/apache/lucene/issues/12543#issuecomment-1746403380

   Thanks @mikemccand ! Let's continue the discuss in this issue instead.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Reduce FST block size for BlockTreeTermsWriter [lucene]

2023-10-04 Thread via GitHub


gf2121 merged PR #12604:
URL: https://github.com/apache/lucene/pull/12604


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] FST#Compiler allocates too much memory [lucene]

2023-10-04 Thread via GitHub


gf2121 closed issue #12598: FST#Compiler allocates too much memory
URL: https://github.com/apache/lucene/issues/12598


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[I] Write VLong in opposite order for better outputs sharing in the FST [lucene]

2023-10-04 Thread via GitHub


gf2121 opened a new issue, #12620:
URL: https://github.com/apache/lucene/issues/12620

   ### Description
   
   > We also should really explore the TODO above to write vLong in opposite 
byte order -- this might save quite a bit of storage in the FST since outputs 
would share more prefixes. Again, separate issue 
   
   From https://github.com/apache/lucene/pull/12604#discussion_r1344641303
   
   Related to this 
[TODO](https://github.com/apache/lucene/blob/21b6858fb6473842edd3088b037e76d02547790c/lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Lucene90BlockTreeTermsWriter.java#L475)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[I] Make FST BytesStore grow smoothly [lucene]

2023-10-04 Thread via GitHub


gf2121 opened a new issue, #12619:
URL: https://github.com/apache/lucene/issues/12619

   ### Description
   
   > Too bad we don't have a writer that uses tiny (like 8 bytes) block at 
first, but doubles size for each new block (16 bytes, 32 bytes next, etc.). 
Then we would naturally use log(size) number of blocks without over-allocating.
   > 
   > But then reading bytes is a bit tricky because we'd need to take discrete 
log (base 2) of the address. Maybe it wouldn't be so bad -- we could do this 
with Long.numberOfLeadingZeros maybe? But that's a bigger change ... we can do 
this separately/later.
   
   From https://github.com/apache/lucene/pull/12604#discussion_r1344639608


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org