Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2024-03-04 Thread via GitHub
MarcusSorealheis commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1977567565 > HNSW and Vamana are "competing" proximity graphs, which differ mainly in the number of layers in the graph (n vs 1) and the pruning algorithm used. I do not think

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2024-03-01 Thread via GitHub
kevindrosendahl commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1974089752 Think I agree with your points @benwtrent, will just jot down my thinking on HNSW vs Vamana vs DiskANN in case it's useful. HNSW and Vamana are "competing" proximity

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2024-02-06 Thread via GitHub
benwtrent commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1929761190 So, I did some of my own experiments. I tested Vamana (vectors in-graph) & HNSW, both with `int8` quantization (here is my Lucene branch:

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2024-01-29 Thread via GitHub
jmazanec15 commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1915207682 @kevindrosendahl This is really cool! I had a couple questions around product quantization implementation. I see in

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2024-01-11 Thread via GitHub
rmuir commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1888419082 I would be extremely careful around io_uring, it is disabled in many environments (e.g. by default in container environments) for security reasons: *

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2024-01-11 Thread via GitHub
kevindrosendahl commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1888208867 > How is segment merging implemented by Lucene Vamana? I didn't do anything special for Vamana in these experiments, the index construction and merging are

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2024-01-09 Thread via GitHub
MarcusSorealheis commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1883963445 Great to finally see you in the Lucene repo @kevindrosendahl after all these years. The work you have done here is stellar and the whole world welcomes the diligence. I

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-12-22 Thread via GitHub
robertvanwinkle1138 commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1868130520 @kevindrosendahl Pretty interesting, thanks for the low level details, `io_uring` is fancy! How is segment merging implemented by Lucene Vamana? Correct

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-12-22 Thread via GitHub
kevindrosendahl commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1868095892 Hi all, thanks for the patience, some interesting and exciting results to share. TL;DR: - DiskANN doesn't seem to lose any performance relative to HNSW when fully

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-11-20 Thread via GitHub
kevindrosendahl commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1819863900 @navneet1v I've been using [oshi](https://github.com/oshi/oshi) in my testing framework, particularly

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-11-20 Thread via GitHub
navneet1v commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1819800985 @kevindrosendahl, unrelated to the thread I see that you have added a column named page faults. Can you provide me some details around how you got the page faults? I am doing

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-11-13 Thread via GitHub
kevindrosendahl commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1808717985 @benwtrent > Thank you @kevindrosendahl this does seem to confirm my suspicion that the improvement isn't necessarily due to the data structure, but due to

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-11-13 Thread via GitHub
jbellis commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1808489834 > recall actually improves when introducing pq, and only starts to decrease at a factor of 16 I would guess that either there is a bug or you happen to be testing with a

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-11-13 Thread via GitHub
benwtrent commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1808340050 Thank you @kevindrosendahl this does seem to confirm my suspicion that the improvement isn't necessarily due to the data structure, but due to quantization. But, this does

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-11-13 Thread via GitHub
mikemccand commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1808286805 > I've got my framework set up for testing larger than memory indexes and have some somewhat interesting first results. Thank you for setting this up @kevindrosendahl --

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-11-10 Thread via GitHub
kevindrosendahl commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1806615864 I've got my framework set up for testing larger than memory indexes and have some somewhat interesting first results. TL;DR: - the main thing driving jvector's

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-11-10 Thread via GitHub
kevindrosendahl commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1806614314 @**[benwtrent](https://github.com/benwtrent)**: > if I am reading the code correctly, it does the following: > - Write int8 quantized vectors alongside the vector

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-11-10 Thread via GitHub
robertvanwinkle1138 commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1805417696 Another notable difference in the Lucene implementation is delta variable byte encoding of node ids. The increase in disk space requires the user to purchase more RAM
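
The "delta variable byte encoding of node ids" mentioned above can be illustrated with a minimal sketch. This is a generic version of the technique, not Lucene's actual code: the class name `DeltaVarint` and both method names are hypothetical, and the sketch assumes each neighbor list holds non-negative node ids sorted in ascending order. Each id is stored as the gap from the previous id, and each gap is written 7 bits at a time with a continuation bit.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper for illustration only; not part of Lucene.
final class DeltaVarint {

  // Encodes ascending, non-negative node ids as variable-length gaps.
  static byte[] encode(int[] sortedIds) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int prev = 0;
    for (int id : sortedIds) {
      int delta = id - prev; // non-negative because the ids are sorted
      // Emit 7 bits per byte, low bits first; the high bit means "more bytes follow".
      while ((delta & ~0x7F) != 0) {
        out.write((delta & 0x7F) | 0x80);
        delta >>>= 7;
      }
      out.write(delta);
      prev = id;
    }
    return out.toByteArray();
  }

  // Decodes `count` ids written by encode().
  static int[] decode(byte[] bytes, int count) {
    int[] ids = new int[count];
    int pos = 0;
    int prev = 0;
    for (int i = 0; i < count; i++) {
      int delta = 0;
      int shift = 0;
      byte b;
      do {
        b = bytes[pos++];
        delta |= (b & 0x7F) << shift;
        shift += 7;
      } while ((b & 0x80) != 0);
      prev += delta; // undo the delta step
      ids[i] = prev;
    }
    return ids;
  }
}
```

For the ids `{3, 10, 200, 100000}` the gaps are 3, 7, 190 and 99800, which take 1, 1, 2 and 3 bytes, so the list occupies 7 bytes instead of 16 as fixed-width 32-bit integers. The trade-off is that a neighbor list has to be decoded sequentially instead of being read at fixed offsets.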

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-11-09 Thread via GitHub
robertvanwinkle1138 commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1804858072 Perhaps much of the jvector performance improvement is simply from on heap caching.

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-11-08 Thread via GitHub
benwtrent commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1802705393 @kevindrosendahl if I am reading the code correctly, it does the following: - Write int8 quantized vectors alongside the vector ordinals in the graph (`.vex` or whatever

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-11-07 Thread via GitHub
kevindrosendahl commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1800243525 > I haven't had a chance to read your branch yet, but hope to soon. Great, thanks! To save you a bit of time, the tl;dr of going from HNSW to vamana is that it's

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-11-07 Thread via GitHub
benwtrent commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1798357836 @kevindrosendahl this looks really interesting! Thank you for digging in and starting the experimentation! I haven't had a chance to read your branch yet, but hope to soon.

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-11-03 Thread via GitHub
kevindrosendahl commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1793186041 Hey @benwtrent and all, just wanted to let you know that I'm experimenting some with different index structures for larger than memory indexes. I have a working

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-11-01 Thread via GitHub
benwtrent commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1788822945 So, I replicated the jvector benchmark (the lucene part) using the new int8 quantization. Note, this is with `0` fan out or extra top-k gathered. Since the benchmark on

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-20 Thread via GitHub
jbellis commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1772724758 > Or perhaps we "just" make a Lucene Codec component (KnnVectorsFormat) that wraps jvector? (https://github.com/jbellis/jvector) I'm happy to support anyone who wants to try

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-20 Thread via GitHub
jbellis commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1772722737 > It is possible that the candidate postings (gathered via HNSW) don't contain ANY filtered docs. This would require gathering more candidate postings. This was a big problem

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-20 Thread via GitHub
jbellis commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1772711571 > DiskANN is known to be slower at indexing than HNSW I don't remember the numbers here, maybe 10% slower? It wasn't material enough to make me worry about it. (This is

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-20 Thread via GitHub
jbellis commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1772704049 Responding top to bottom, > I wonder how much the speed difference is due to (1) Vectors being out of memory (and if they used PQ for diskann, if they did, we should test PQ

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-19 Thread via GitHub
dsmiley commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1772084185 What say you @jbellis :-) I recommended a module of Lucene when we spoke at Community-over-Code. A dependency outside is okay for non-core.

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-07 Thread via GitHub
mikemccand commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1751743673 Or perhaps we "just" make a Lucene Codec component (KnnVectorsFormat) that wraps jvector? (https://github.com/jbellis/jvector)

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-07 Thread via GitHub
mikemccand commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1751740152 (listening to @jbellis talk at Community over Code).

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-07 Thread via GitHub
mikemccand commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1751739864 SPANN is another option? https://www.researchgate.net/publication/356282356_SPANN_Highly-efficient_Billion-scale_Approximate_Nearest_Neighbor_Search

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-06 Thread via GitHub
benwtrent commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1750872046 > This search with filter method seems to throw an error. LOL, I thought it was supported, I must have read a github issue and made an assumption. > Couldn't that be

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-06 Thread via GitHub
robertvanwinkle1138 commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1750770721 > QDrant's HNSW filter solution is the exact same as Lucene's Interesting thanks. > as candidate posting lists are gathered, ensure they have some

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-05 Thread via GitHub
benwtrent commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1749322588 > QDrant has a filter solution however the methodology described in their blog is opaque. QDrant's HNSW filter solution is the exact same as Lucene's. You can look at the

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-05 Thread via GitHub
robertvanwinkle1138 commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1749187258 The SPANN paper does not address efficient filtered queries. Lucene's HNSW calculates the similarity score for every record, regardless of the record matching the

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-04 Thread via GitHub
benwtrent commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1747350135 @jmazanec15 I agree that SPANN seems more attractive. I would argue though we don't need to do clustering (in the paper they do clustering, but with minimal effectiveness), but

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-04 Thread via GitHub
jmazanec15 commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1747329967 A hybrid disk-memory algorithm would have very strong benefits. I did run a few tests recently that confirmed HNSW does not function very well when memory gets constrained

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-04 Thread via GitHub
benwtrent commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1747298348 > DiskANN is known to be slower at indexing than HNSW and the blog post does not compare single threaded index times with Lucene. @robertvanwinkle1138 this is just one of

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-04 Thread via GitHub
robertvanwinkle1138 commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1747228177 @benwtrent For merges there is "FreshDiskANN: A Fast and Accurate Graph-Based ANN Index for Streaming Similarity Search" https://arxiv.org/pdf/2105.09613.pdf

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-03 Thread via GitHub
benwtrent commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1744763221 I do think Lucene's read-only segment based architecture lends itself to support quantization (required for DiskANN). It would be an interesting experiment to see how
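
Since several comments in this thread refer to `int8` quantization, here is a minimal sketch of the general idea: linear scalar quantization of each float dimension to a single byte. It is an illustration under assumptions, not Lucene's or jvector's implementation. The class name `Int8Quantizer`, the fixed `[min, max]` range supplied by the caller, and the choice of 127 levels are all hypothetical; a real codec would derive the range from the data in a segment.

```java
// Hypothetical linear scalar quantizer for illustration only.
final class Int8Quantizer {
  private final float min;
  private final float max;
  private final float step; // width of one quantization bucket

  Int8Quantizer(float min, float max) {
    if (!(max > min)) {
      throw new IllegalArgumentException("max must exceed min");
    }
    this.min = min;
    this.max = max;
    this.step = (max - min) / 127f;
  }

  // Clamps x into [min, max] and maps it to a bucket index in 0..127.
  byte quantize(float x) {
    float clamped = Math.max(min, Math.min(max, x));
    return (byte) Math.round((clamped - min) / step);
  }

  // Maps a bucket index back to an approximate float value.
  float dequantize(byte q) {
    return min + q * step;
  }

  // Quantizes a whole vector: one byte per dimension instead of four.
  byte[] quantize(float[] vector) {
    byte[] out = new byte[vector.length];
    for (int i = 0; i < vector.length; i++) {
      out[i] = quantize(vector[i]);
    }
    return out;
  }
}
```

With a range of `[-1, 1]` the bucket width is 2/127, so any in-range value is reconstructed to within about 0.008. The sketch also shows why a read-only segment is convenient for this: the range can be fixed once when the segment is written and never has to change afterwards.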