Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2024-03-04 Thread via GitHub


MarcusSorealheis commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1977567565

   > HNSW and Vamana are "competing" proximity graphs, which differ mainly in 
the number of layers in the graph (n vs 1) and the pruning algorithm used. 
   
   I do not think about them as competing. They're different implementations 
with tradeoffs for certain uses, and the data here will evolve over time. Your 
preliminary work opens a new window that I hope you continue to explore for the 
benefit of all of us and all the people who depend on Lucene.
   
   > From a purely academic point of view I find Vamana more appealing due to 
its simplicity
   
   My short time dealing with academia was not about simplicity, so this made me 
laugh. Thank you for that.
   





Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2024-03-01 Thread via GitHub


kevindrosendahl commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1974089752

   I think I agree with your points @benwtrent; I'll just jot down my thinking on 
HNSW vs Vamana vs DiskANN in case it's useful.
   
   HNSW and Vamana are "competing" proximity graphs, which differ mainly in the 
number of layers in the graph (`n` vs 1) and the pruning algorithm used. From a 
purely academic point of view I find Vamana more appealing due to its 
simplicity, namely not having to keep track of levels and there being a 1:1 
relationship between the number of nodes and the number of vectors being 
indexed vs an M:1. Practically speaking they provide roughly the same 
interface, so given we have a working HNSW graph and nothing compelling enough 
to replace it as of now, I'd agree there wouldn't be a reason to.
   
   I think of DiskANN as the algorithm consisting of an initial ANN search 
using compressed vectors followed by a reranking phase on full fidelity 
vectors. There are a number of decisions that can be made for where to store 
the graph, compressed vectors, and full fidelity vectors. If you choose to 
store the full fidelity vectors in-line with the graph (as suggested by the 
original DiskANN paper), then Vamana may be more appealing than HNSW due to its 
1:1 node:vector relationship. However, the results above seem to show that this 
implementation didn't benefit much from placing vectors inline with the graph. 
Given all other benefits of storing vectors in a flat file in ordinal order 
(including the potential for asynchronous I/O) that would seem like the 
pragmatic choice, in which case you could pretty easily use an HNSW graph as 
the proximity graph for the DiskANN algorithm.
   
   @jmazanec15 the constants used were taken from JVector, whose 
performance/behavior I was initially trying to emulate in Lucene. I didn't 
spend much time fiddling with them.





Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2024-02-06 Thread via GitHub


benwtrent commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1929761190

   So, I did some of my own experiments. I tested Vamana (vectors in-graph) & 
HNSW, both with `int8` quantization (here is my Lucene branch: 
https://github.com/apache/lucene/compare/main...benwtrent:lucene:feature/diskann-exp
 & lucene-util branch: 
https://github.com/mikemccand/luceneutil/compare/master...benwtrent:luceneutil:feature/vamana-testing)
 
   
   In low memory environments, HNSW performed better (confirming the results 
here: https://github.com/apache/lucene/issues/12615#issuecomment-1806615864). 
When the vectors are stored in the graph for Vamana, there were many more page faults 
(NOTE: I was not using PQ, but trying an apples-to-apples comparison of HNSW & 
Vamana under the same conditions).
   
   Additionally, looking at previous results 
(https://github.com/apache/lucene/issues/12615#issuecomment-1868095892):
   
   > vectors: out-of-graph, rerank: sequential | 46.9 ms latency | 170 qps
   
   This indicates that there is very little benefit to Vamana. For DiskANN, one 
of the touted benefits is being able to "get raw vectors for free" via 
disk read-ahead when searching the in-memory graph (PQ). If a PQ'd search that 
reranks with vectors stored outside of the graph performs almost as well (without 
io_uring), it stands to reason that HNSW with PQ would do just as well. And 
with a better/smarter PQ implementation, less reranking may be necessary 
(combining with OPQ or something).
   
   I don't see any stand-out evidence that Vamana has a significant advantage 
over HNSW when it comes to being a graph based vector index.





Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2024-01-29 Thread via GitHub


jmazanec15 commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1915207682

   @kevindrosendahl This is really cool! I had a couple of questions about the 
product quantization implementation. I see in 
   
[VectorSandboxVamanaVectorsWriter](https://github.com/kevindrosendahl/lucene/blob/vamana2/lucene/core/src/java/org/apache/lucene/codecs/vectorsandbox/VectorSandboxVamanaVectorsWriter.java#L295),
 you create a new codebook for each segment. Do you know roughly how long it 
took to generate? Also, how did you choose 6 iterations of k-means? The results 
look pretty promising - just curious how you arrived there.





Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2024-01-11 Thread via GitHub


rmuir commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1888419082

   I would be extremely careful around io_uring; it is disabled in many 
environments (e.g. by default in container environments) for security reasons: 
   * 
https://security.googleblog.com/2023/06/learnings-from-kctf-vrps-42-linux.html
   * https://github.com/moby/moby/pull/46762
   * https://github.com/containerd/containerd/pull/9320
   
   To me, use of io_uring seems like a totally separate issue from KNN search. 
I don't think it is a good idea to entangle the two things.





Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2024-01-11 Thread via GitHub


kevindrosendahl commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1888208867

   > How is segment merging implemented by Lucene Vamana?
   
   I didn't do anything special for Vamana in these experiments; the index 
construction and merging are practically identical to HNSW. The implication, of 
course, is that the vectors need to be in memory when building the merged 
graph or performance falls off a cliff. To make this useful operationally you'd 
want to ensure your max segment size does not exceed available memory, and 
ideally you would build the index on a dedicated node so as not to displace 
pages being used to serve queries.
   
   The DiskANN paper suggests clustering the vectors and assigning each vector 
to the two closest clusters, building a graph for each cluster, then overlaying 
the graphs and doing another round of pruning. You can then limit the amount of 
memory required based on the number of vectors and clusters. I didn't explore 
this, but it'd be a worthwhile improvement in the long term.
   
   > Correct me if I'm wrong, looks like the Java ANN implementations examine 
the node ids in a more or less random order, filtering requires a `Bits` 
object. I've been wondering if SPANN solves this by enabling ascending doc id 
intersection.
   
   That's correct, the graph is explored (roughly) in proximity order rather 
than doc id order. It's an interesting question about SPANN; intuitively, it 
seems like you may be able to do as you describe by keeping each vector postings 
list sorted in doc id order, then essentially doing a streaming merge sort over 
the relevant centroids' postings as you consume the candidates and iterate 
through the passed-in sorted doc id iterator.
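
   For illustration only, a minimal sketch of that streaming idea (hypothetical types, not SPANN or Lucene code): keep each centroid's postings sorted by doc id and merge them with a small heap so candidates come out in ascending doc id order, ready to intersect with a sorted filter iterator.

```java
import java.util.*;

// Hypothetical sketch (not SPANN or Lucene code): k-way merge of per-centroid
// posting lists, each already sorted by doc id, so candidates come out in
// ascending doc id order and can be intersected with a sorted filter iterator.
class CentroidPostingsMerge {
  record Posting(int docId, int vectorOrd) {}

  static List<Posting> mergeByDocId(List<List<Posting>> centroidPostings) {
    // heap entries: {docId, listIndex, positionInList}
    PriorityQueue<int[]> heap = new PriorityQueue<>(Comparator.comparingInt(e -> e[0]));
    for (int i = 0; i < centroidPostings.size(); i++) {
      if (!centroidPostings.get(i).isEmpty()) {
        heap.add(new int[] {centroidPostings.get(i).get(0).docId(), i, 0});
      }
    }
    List<Posting> merged = new ArrayList<>();
    int lastDoc = -1;
    while (!heap.isEmpty()) {
      int[] top = heap.poll();
      Posting p = centroidPostings.get(top[1]).get(top[2]);
      if (p.docId() != lastDoc) { // de-duplicate docs assigned to multiple centroids
        merged.add(p);
        lastDoc = p.docId();
      }
      int next = top[2] + 1;
      if (next < centroidPostings.get(top[1]).size()) {
        heap.add(new int[] {centroidPostings.get(top[1]).get(next).docId(), top[1], next});
      }
    }
    return merged;
  }
}
```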
   
   > I have general concerns about large copy overheads
   
   I believe all of today's IO layer implementations in Lucene ([including mmap 
on java 
21](https://github.com/apache/lucene/blob/8bee41880e41024109bf5729584ebc5dd1003717/lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java#L214))
 copy the bytes from the page cache onto the heap one way or another. This 
could potentially be improved in the future by passing a `MemorySegment` 
around, but it seems to have gotten things pretty far so far without problems.
   
   > I know that is a newer Linux kernel module/interface, so that would lead 
me to believe actually taking this implementation forward would require a bit 
of comprehensive testing. ... The other concern I have is that I suspect the 
community would probably need to keep an eye on io_uring given its age.
   
   Yeah, I wouldn't suggest that this be the default implementation, or perhaps 
even included in Lucene core itself at all, given how platform specific it is 
(and this POC at least required a small amount (~50 LOC) of C glue code linking 
in `liburing` to make things smooth). My main goal here was to highlight to the 
community a few things that may be relevant beyond vector indexes and DiskANN:
   - modern parallel I/O can provide significant performance improvements if 
you're able to remove I/O dependencies in the algorithms
   - the kernel is not doing a great job in this case of determining the best 
pages to keep in memory
   
   There's clearly a ton of perf left on the table by using sequential I/O and 
only the page cache in this case, but it's of course also much simpler both in 
terms of implementation and operational complexity.
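
   To make the parallel-I/O point concrete, a hedged sketch (not the io_uring POC, which used C glue and `liburing`): because the reranking reads are independent, even a plain thread pool issuing positional reads against a flat, ordinal-ordered vector file can overlap the I/O. The file layout, dimensions, and little-endian byte order are assumptions here.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

// Hedged sketch, not the io_uring POC: overlap the independent phase-two reads
// with a thread pool and positional reads against a flat vector file.
class ParallelRerankReads {
  static float[][] readVectors(FileChannel vecFile, int[] candidateOrds, int dims,
                               ExecutorService pool) throws Exception {
    int bytesPerVector = dims * Float.BYTES;
    List<Future<float[]>> futures = new ArrayList<>();
    for (int ord : candidateOrds) {
      futures.add(pool.submit(() -> {
        ByteBuffer buf = ByteBuffer.allocate(bytesPerVector).order(ByteOrder.LITTLE_ENDIAN);
        long offset = (long) ord * bytesPerVector; // vectors stored flat in ordinal order
        int read = 0;
        while (read < bytesPerVector) {
          int n = vecFile.read(buf, offset + read); // positional read, safe across threads
          if (n < 0) throw new IOException("unexpected EOF reading vector " + ord);
          read += n;
        }
        buf.flip();
        float[] vector = new float[dims];
        buf.asFloatBuffer().get(vector);
        return vector;
      }));
    }
    float[][] vectors = new float[candidateOrds.length][];
    for (int i = 0; i < vectors.length; i++) {
      vectors[i] = futures.get(i).get(); // exact distances can now be computed for reranking
    }
    return vectors;
  }
}
```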





Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2024-01-09 Thread via GitHub


MarcusSorealheis commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1883963445

   Great to finally see you in the Lucene repo @kevindrosendahl after all these 
years. The work you have done here is stellar and the whole world welcomes 
the diligence. I hope we can keep this momentum. I have a few concerns that are 
not intended to block the project.
   
   I second the question, "How is segment merging implemented by Lucene 
Vamana?" I have a suspicion that extending `MergePolicy` could be in order for 
a production-ready implementation, but my knowledge here is too limited to say 
for sure.
   
   I have general concerns about large copy overheads and the restrictions of 
POSIX AIO from experience dealing with the failures of storing data on disk, as 
well as random disk failures.  `io_uring` certainly offers some theoretical 
promise in my mind. I know that is a newer Linux kernel module/interface, so 
that would lead me to believe actually taking this implementation forward would 
require a bit of comprehensive testing. 
   
   One particular concern I have, as you specifically might have guessed, would 
be around soak testing. Has anyone looked into this issue since November, or 
has anyone run the tests for an extended period?  
   
   The other concern I have is that I suspect the community would probably need 
to keep an eye on `io_uring` given its age. I was instructed long ago to use 
old interfaces whenever possible, but the person who told me retired so I cannot 
ask why. I would suspect the newer interfaces tend to evolve more quickly than 
the ones that have been around for a while. Surely that burden would likely fall 
to the Policeman. In general, this is an interesting development and I look 
forward to continuing to explore. 





Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-12-22 Thread via GitHub


robertvanwinkle1138 commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1868130520

   @kevindrosendahl Pretty interesting, thanks for the low level details, 
`io_uring` is fancy!
   
   How is segment merging implemented by Lucene Vamana?  
   
   Correct me if I'm wrong, but it looks like the Java ANN implementations examine 
the node ids in a more or less random order, so filtering requires a `Bits` 
object. I've been wondering if SPANN solves this by enabling ascending doc id 
intersection.





Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-12-22 Thread via GitHub


kevindrosendahl commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1868095892

   Hi all, thanks for the patience, some interesting and exciting results to 
share.
   
   TL;DR:
   - DiskANN doesn't seem to lose any performance relative to HNSW when fully 
in memory, and may actually be faster
   - able to slightly beat JVector's larger-than-memory performance in Lucene 
using the same algorithm
   - able to get similar performance (sometimes better sometimes worse 
depending on the amount of available memory) when _not_ storing the full 
fidelity vectors in the graph, and doing the reranking without having cached 
the vectors during graph traversal
   - able to further improve performance ~7-10x by parallelizing the I/O during 
the reranking phase using [io_uring](https://en.wikipedia.org/wiki/Io_uring) 
when storing the vectors outside of the graph
   - with parallelized reranking and pinning the (relatively small) graph and 
PQ vectors in memory, latency against a 100 GB index when constraining memory 
to only 8 GB is only about 40% slower than fully in-memory Lucene HNSW (and 
achieves ~10% higher recall given the tested configurations)
   - without pinning the graph and PQ vectors, the benefits of parallel I/O 
during reranking depend on how good the page cache is at keeping the graph and 
PQ vectors in memory. If they get paged out, the page faults during graph 
traversal can negate some or most of the gains that parallel I/O offers
   ### Algorithm Analysis
   The DiskANN algorithm consists of two phases:
   1. graph traversal using PQ vectors for distance calculations
   2. reranking the top *k* after finishing graph traversal by recalculating 
distance using the full fidelity vector instead of PQ vectors
   
   The original algorithm (and JVector's implementation) stores the full 
fidelity vector and the adjacency list for a graph node together on disk. This 
means that when you retrieve the adjacency list during the first phase, the 
full fidelity vector will also be in memory due to locality. Immediately after 
retrieving the adjacency list, you cache the full fidelity vector. When 
re-ranking in phase 2, you simply use the cached vectors you've acquired during 
the first phase, and can rerank without any I/O.
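
   As a rough sketch of the two phases described above (hypothetical `Graph`/`PqVectors`/`RawVectors` interfaces, not the Lucene, JVector, or DiskANN code): traverse the graph scoring candidates with the cheap PQ distances, then rerank the collected beam with exact distances before returning the top k.

```java
import java.util.*;

// Minimal sketch of the two-phase search; the interfaces are placeholders.
class TwoPhaseSearch {
  interface Graph { int entryPoint(); int[] neighbors(int node); }
  interface PqVectors { float approxDistance(int node, float[] query); }  // phase 1: cheap, in memory
  interface RawVectors { float exactDistance(int node, float[] query); }  // phase 2: full fidelity, on disk

  static int[] search(float[] query, int k, int beamWidth, Graph g, PqVectors pq, RawVectors raw) {
    // Phase 1: best-first graph traversal using PQ distances only.
    PriorityQueue<float[]> frontier = new PriorityQueue<>(Comparator.comparingDouble(e -> e[0]));  // {dist, node}
    PriorityQueue<float[]> results = new PriorityQueue<>(Comparator.comparingDouble(e -> -e[0]));  // worst first
    Set<Integer> visited = new HashSet<>();
    int entry = g.entryPoint();
    frontier.add(new float[] {pq.approxDistance(entry, query), entry});
    visited.add(entry);
    while (!frontier.isEmpty()) {
      float[] cur = frontier.poll();
      if (results.size() >= beamWidth && cur[0] > results.peek()[0]) break;  // frontier can't improve the beam
      results.add(cur);
      if (results.size() > beamWidth) results.poll();
      for (int nbr : g.neighbors((int) cur[1])) {
        if (visited.add(nbr)) {
          frontier.add(new float[] {pq.approxDistance(nbr, query), nbr});
        }
      }
    }
    // Phase 2: rerank the beam with exact distances from the full-fidelity vectors.
    // (In the vectors-in-graph variant, raw vectors seen during traversal would be
    // cached instead of re-read here.)
    return results.stream()
        .sorted(Comparator.comparingDouble(e -> raw.exactDistance((int) e[1], query)))
        .limit(k)
        .mapToInt(e -> (int) e[1])
        .toArray();
  }
}
```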
   
   There are two interesting nuances to this algorithm:
   - the working set for a given workload is quite large, since it must consist 
of the adjacency list, PQ vectors, _and full fidelity vectors_ of all nodes 
that are visited during graph traversal
   - there are dependencies between each I/O due to the I/Os being tied to 
sequential graph traversal
   
   If you instead do not store the vectors in the graph and do not cache them 
during traversal, the above two are a little different, and give some intuition 
into the performance of the pinned graph/PQ vectors with `io_uring`:
   - the working set for the first phase is relatively small, just the 
adjacency lists and PQ vectors of the nodes that are visited
   - there are no ordering dependencies between the I/Os for the second phase, 
and thus they can be done in parallel
   
   ### Results
   The following results are from indexes built against the 
[Cohere/wiki-22-12-en-embeddings](https://huggingface.co/datasets/Cohere/wikipedia-22-12-en-embeddings)
 data set (35 million vectors, 768 dimensions) using L2 distance. The indexes 
are all around ~100 GB. The performance tests were run on the same setup as the 
prior results (`c7gd.16xlarge` using docker to artificially constrain available 
memory) and all ran 8 query threads (each query thread executing its given 
query in a single thread). Recall was calculated using 10k test vectors that 
were excluded from the data set used to build the index, and performance tests 
were run using 100k vectors that were included in the data set used to build 
the index (a large number of test vectors were required to prevent a small 
working set from being established). Pinning parts of the Lucene Vamana index 
into memory was done using `mlock`.
   
   Some high level information about the indexes:
   
   | Type | Parameters | Recall |
   |-|-|-|
   | Lucene HNSW | maxConn: 16, beamWidth: 100 | 0.78 |
   | Lucene Vamana In-Graph Vectors | maxConn: 32, beamWidth: 100, alpha: 1.2, pqFactor: 8 | 0.88 |
   | Lucene Vamana Out-of-Graph Vectors | maxConn: 32, beamWidth: 100, alpha: 1.2, pqFactor: 8 | 0.88 |
   | JVector | maxConn: 16, beamWidth: 100, alpha: 1.2, pqFactor: 8 | 0.88 |
   
   NB: in Lucene HNSW, `maxConn` configures all levels except level 0, and level 
0 uses `2*maxConn`; in JVector the single-layer graph uses `2*maxConn`; and in 
Lucene Vamana the single-layer graph uses `maxConn`. Put differently, the 
bottom layers of all the graphs allow the same maximum number of connections 
even though they appear to be configured differently.
   
   #### Fully in Memory
   
   | Type | Configuration | Average Latency | QPS | 
   |-|-|-|-|
   | Lucene HNSW | N/A | 1.59 

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-11-20 Thread via GitHub


kevindrosendahl commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1819863900

   @navneet1v I've been using [oshi](https://github.com/oshi/oshi) in my 
testing framework, particularly 
[OSThread::getMajorFaults](https://www.oshi.ooo/oshi-core-java11/apidocs/com.github.oshi/oshi/software/os/OSThread.html#getMajorFaults()).
 If you need to do it with zero external dependencies, you'd need to implement 
it for each environment you'd be running on. On Linux you can gather the info 
from `/proc/thread-self/stat`.
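
   For anyone who wants the zero-dependency route, a minimal Linux-only sketch (field positions per proc(5); majflt is the 12th field of `/proc/.../stat`), not the oshi-based code used in the benchmark framework:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Linux-only sketch: read the current thread's major page fault count.
final class MajorFaults {
  static long currentThreadMajorFaults() throws IOException {
    String stat = Files.readString(Path.of("/proc/thread-self/stat"));
    // The comm field (2) is wrapped in parentheses and may contain spaces,
    // so parse from the last ')' onward.
    String rest = stat.substring(stat.lastIndexOf(')') + 2);
    String[] fields = rest.split(" ");
    // fields[0] is field 3 (state), so field 12 (majflt) is fields[9].
    return Long.parseLong(fields[9]);
  }
}
```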





Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-11-20 Thread via GitHub


navneet1v commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1819800985

   @kevindrosendahl, unrelated to the thread: I see that you have added a column 
named page faults. Can you provide some details on how you measured the page 
faults? I am doing some other tests and want to know how to get the page 
faults.





Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-11-13 Thread via GitHub


kevindrosendahl commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1808717985

   @benwtrent 
   > Thank you @kevindrosendahl this does seem to confirm my suspicion that the 
improvement isn't necessarily due to the data structure, but due to 
quantization. But, this does confuse me as Vamana is supposedly good when 
things don't all fit in memory. It may be due to how we fetch things (MMAP). I 
wonder if Vamana is any good at all when using MMAP...
   
   Yeah, upon reflection the result makes sense. DiskANN only uses the 
in-memory PQ vectors for the distance calculations for deciding which new 
candidates it should consider visiting in the future. This is the critical 
piece which reduces I/O. The in-graph vectors mean that when you have decided 
the next candidate, you get the full fidelity vector for free on the I/O to get 
the adjacency list, but at that point you've already done the distance 
calculation against that candidate. If you were to grab the full fidelity 
vector from the graph for the distance calculation like the non-PQ vamana 
implementations do, you've moved the I/O complexity back from 
`O(candidates_explored)` to `O(nodes_considered)`, and probably need to fetch 
the page again when you've decided to actually consider a candidate to re-grab 
its adjacency list.
   
   What that full fidelity vector is useful for is re-scoring, so you can just 
keep that full fidelity vector in memory until the end of the search ([jvector 
reference](https://github.com/jbellis/jvector/blob/main/jvector-base/src/main/java/io/github/jbellis/jvector/graph/GraphSearcher.java#L218-L220))
 and resort the final result list using their true distances ([jvector 
reference](https://github.com/jbellis/jvector/blob/main/jvector-base/src/main/java/io/github/jbellis/jvector/graph/GraphSearcher.java#L258C6-L259)).
   
   I'm not sure how impactful this re-ranking is, hoping to A/B it when I get 
PQ working, but assuming it's impactful, the index structure could save up to 
`numCandidates` I/Os. In this case jvector was only doing ~70 I/Os, so adding 
up to 100 on top could be a pretty big deal.
   
   The main thing MMAP may prevent us from "easily" doing is parallelizing the 
I/O, which I believe the DiskANN paper does but jvector does not currently 
(possibly due to the results looking pretty decent without it, and it being 
hard with mmap/Java).
   
   > Or is `pq=` not the number of subspaces, but `vectorDim/pq == 
numberOfSubSpaces`? If so, then recall should reduce as it increases.
   
   This is correct, `pq` in the tables relates to the `pqFactor` in JVector, 
which is as you've described. Agreed with you and @jbellis that the observed 
result is bizarre.
   
   > For your testing infra, int8 quantization might close the gap at that 
scale.
   
   I ran some tests with int8 quantization, nothing very surprising in the 
results. If you can get the full index to fit in memory it performs great, but 
performance degrades significantly once it falls out of memory proportional to 
I/O. The scalar quant vectors were 7.3 GB with the HNSW graph being 329 MB. For 
reference, jvector `pqFactor=8` was 32.45 GB graph with vectors + 0.97 GB 
quantized vectors.
   | index configuration | memory configuration | average recall | average latency | average page faults |
   |-|-|-|-|-|
   | hnsw scalar quant | fully in memory | 0.79819 | 0.001259s | 0 |
   | jvector `pqFactor=8` | fully in memory | 0.9121 | 0.00148s | 0 |
   | hnsw scalar quant | 10GB system 2GB heap | 0.79819 | 0.00119s | 0 |
   | hnsw scalar quant |  8GB system 4GB heap | 0.79819 | 0.856s | 606.98 |
   | jvector `pqFactor=8` | 4GB system 2GB heap | 0.9121 | 0.151s | 80.15 |
   
   So fundamentally with these graph structures, the key is to have the vectors 
needed for distance calculations of potential neighbor candidates in memory. 
Scalar quantization helps the cause here by reducing the size of the vectors by 
4x. PQ can help even more by reducing the memory footprint far more drastically. 
JVector further ensures that the quantized vectors are in memory by storing 
them on the heap.
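
   As a rough worked example with the numbers above: the 768-dimension float32 
vectors are 768 * 4 = 3072 bytes each (~31 GB for the ~10.1M-vector dataset), 
int8 scalar quantization cuts that by 4x (the ~7.3 GB above), and `pqFactor=8` 
reduces each vector to 768 / 8 = 96 one-byte codes, i.e. roughly 1 GB of 
quantized vectors, which lines up with the 0.97 GB figure and can comfortably 
stay in memory.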
   
   > FYI, a significant (and needed) refactor occurred recently for int8 
quantization & HNSW, so your testing branch might be significantly impacted :(.
   
   Is there any expected performance difference, or is it mainly 
organizational? The code in my branch is not in a state I would be comfortable 
suggesting for upstreaming, so if it's "just" a problem for that, I'm ok to 
keep my branch a bit outdated, and we could make any suitable ideas fit in 
if/when we thought upstreaming could be worthwhile.
   
   
   > Two significant issues with a Lucene implementation I can think of are:
   > 
   > * Segment merge time: We can maybe do some fancy things with better 
starter centroids in Lloyd's algorithm, but I think we will have to rerun 
Lloyd's algorithm on every merge. Additionally the graph building probably 
cannot be done 

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-11-13 Thread via GitHub


jbellis commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1808489834

   > recall actually improves when introducing pq, and only starts to decrease 
at a factor of 16
   
   I would guess that either there is a bug or you happen to be testing with a 
really unusual dataset. PQ is fundamentally lossy compression and can't 
magically create similarity that didn't exist in the original.
   
   > Regardless, is there any oversampling that is occurring when PQ is enabled 
in JVector?
   
   Today it's up to the caller (so on the Cassandra side, in Astra) but it's 
possible that it should move into JVector.
   
   > Additionally the graph building probably cannot be done with the PQ'd 
vectors.
   
   I suppose it's not impossible that you could compress first and then build 
but I have not seen anyone do it yet.  JVector follows DiskANN's lead and 
builds the graph using uncompressed vectors. 





Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-11-13 Thread via GitHub


benwtrent commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1808340050

   Thank you @kevindrosendahl this does seem to confirm my suspicion that the 
improvement isn't necessarily due to the data structure, but due to 
quantization. But, this does confuse me as Vamana is supposedly good when 
things don't all fit in memory. It may be due to how we fetch things (MMAP). I 
wonder if Vamana is any good at all when using MMAP...
   
   For your testing infra, int8 quantization might close the gap at that scale. 
FYI, a significant (and needed) refactor occurred recently for int8 
quantization & HNSW, so your testing branch might be significantly impacted :(.
   
   > recall actually improves when introducing pq, and only starts to decrease 
at a factor of 16
   
   I am surprised it decreases as the number of sub-spaces increases. This 
makes me think that JVector's PQ implementation is weird.
   
   Or is `pq=` not the number of subspaces, but `vectorDim/pq == 
numberOfSubSpaces`? If so, then recall should reduce as it increases.
   
   Regardless, is there any oversampling that is occurring when PQ is enabled 
in JVector?
   
   >  It's great that PQ goes quite a ways before hurting recall.
   
   PQ is a sharp tool; at scale it can have significant drawbacks (eventually 
you have to start dramatically oversampling as centroids get very noisy). 
Though, I am not sure there is a significant recall cliff.
   
   Two significant issues with a Lucene implementation I can think of are:
   
- Segment merge time: We can maybe do some fancy things with better starter 
centroids in Lloyd's algorithm, but I think we will have to rerun Lloyd's 
algorithm on every merge. Additionally the graph building probably cannot be 
done with the PQ'd vectors.
- Graph quality: I don't think we can build the graph with PQ'd vectors and 
retain good recall. Meaning at merge time, we have to page in larger raw (or 
differently quantized) vectors and only do PQ after graph creation.
   
   There are [interesting approaches to PQ w/ graph 
exploration](https://medium.com/@masajiro.iwasaki/fusion-of-graph-based-indexing-and-product-quantization-for-ann-search-7d1f0336d0d0)
 and a different, worthwhile PQ implementation from Microsoft: 
[OPQ](https://www.microsoft.com/en-us/research/wp-content/uploads/2013/11/pami13opq.pdf)
   
   
   





Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-11-13 Thread via GitHub


mikemccand commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1808286805

   > I've got my framework set up for testing larger than memory indexes and 
have some somewhat interesting first results.
   
   Thank you for setting this up @kevindrosendahl -- these are very interesting 
results.  It's great that PQ goes quite a ways before hurting recall.





Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-11-10 Thread via GitHub


kevindrosendahl commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1806615864

   I've got my framework set up for testing larger than memory indexes and have 
some somewhat interesting first results.
   
   TL;DR:
   - the main thing driving jvector's larger-than-memory performance is keeping 
product-quantized vectors in memory, which avoids the need for I/O to do 
distance calculations. This is reflected in ~24x fewer page faults, and a 
similar improvement in query latency
   - for this dataset, introducing PQ does not have a negative effect on recall 
until reaching a factor of 16, after which recall begins to drop off. Recall 
actually slightly improves with each factor increase up to 4, and does not 
materially change at 8
   - storing the vectors inline in the graph does not seem to have a large 
impact on larger than memory performance
   - other differences like having a cache of neighbors close to the entry 
point or storing offsets in memory vs having consistent offsets are marginal 
differences, and don't account for the order of magnitude improvement
   
   Next steps:
   - measure lucene performance with scalar quantization
   - incorporate PQ into the lucene vamana implementation
   - explore SPANN to see how it performs relative to these implementations
   ### Setup
   I used the 
[Cohere/wikipedia-22-12-es-embeddings](https://huggingface.co/datasets/Cohere/wikipedia-22-12-es-embeddings)
 (768 dimensions) with L2 similarity for these. I split the dataset into 10,000 
vectors randomly sampled from the dataset as the test set, with the remaining 
10,114,929 vectors as the training set.
   
   I built and ran the query tests on a `c7gd.16xlarge` (128 GB memory, 64 
threads. Thank you for the addition of concurrent merging, it helped speed up 
the builds drastically!). The queries with restricted system memory were done 
by purging the host page cache, then running in a docker container with `-m 
8g`, a Coretto Java 21 JVM with `-Xms4g` and `-Xmx4g` and the Panama vector API 
enabled, and with the dataset and indexes bind mounted into the container. 
There is a warmup phase of 2 iterations of the 10,000 test vectors followed by 
3 measured iterations of the 10,000 vectors, all iterations running 48 queries 
at a time with a single thread per query.
   
   Note that the indexes aren't quite apples-to-apples, as the Lucene HNSW 
index configuration (`maxConn: 16`, `beamWidth: 100`) achieves 0.8247 recall 
(and slightly better latency when in memory) whereas both Vamana configurations 
(`M: 32`, `beamWidth: 100`, `alpha: 1.2`) achieve ~0.91 recall, but the broad 
strokes are obvious enough to draw some conclusions.
   
   #### Index size
   | configuration | size | breakdown |
   |-|-|-|
   | lucene hnsw | 31.42 GB  | 31.07 GB vectors + 0.35 GB graph |
   | lucene vamana | 31.73 GB | 31.73 GB graph with vectors |
   | jvector pq=0 | 32.45 GB | 32.45 GB graph with vectors |
   | jvector pq=2 | 36.33 GB | 32.45 GB graph with vectors + 3.88 GB quantized vectors |
   | jvector pq=4 | 34.39 GB | 32.45 GB graph with vectors + 1.94 GB quantized vectors |
   | jvector pq=8 | 33.42 GB | 32.45 GB graph with vectors + 0.97 GB quantized vectors |
   | jvector pq=16 | 32.93 GB | 32.45 GB graph with vectors + 0.48 GB quantized vectors |
   
   #### Queries fully in memory (124 GB of RAM available)
   |configuration | recall | average duration |
   |-|-|-|
   | lucene hnsw | 0.8247 | 0.00187s |
   | lucene vamana | 0.9086 | 0.00257s |
   | jvector pq=0 | 0.9108 | 0.00495s |
   | jvector pq=2 | 0.9109 | 0.00259s |
   | jvector pq=4 | 0.9122 | 0.00291s |
   | jvector pq=8 | 0.9121 | 0.00148s |
   | jvector pq=16 | 0.8467 | 0.0012s |
   
   #### Queries with 8 GB system memory, 4 GB heap
   |configuration | recall | average duration | average page faults | total page faults|
   |-|-|-|-|-|
   | lucene hnsw | 0.8247 | 2.950s | 1464.65 | 4.39395E7 | 
   | lucene vamana | 0.9086 | 3.122s | 1662.26 | 4.9867932E7 |
   | jvector pq=0 | 0.9108 | 3.651s | 1716.03 | 5.1481117E7 |
   | jvector pq=2 | 0.9109 | 0.127s | 69.65 | 2089449 |
   | jvector pq=4 | 0.9122 | 0.131s | 68.94 | 2068274 |
   | jvector pq=8 | 0.9121 | 0.119s | 70.35 | 2110418 |
   | jvector pq=16 | 0.8467 | 0.126s | 72.64 | 2179330 |
   
   #### Conclusions
   |conclusion|reasoning|
   |-|-|
   | storing the vectors inline in the graph does not seem to have a large impact on larger than memory performance | lucene hnsw and lucene vamana performance is in the same order of magnitude with similar numbers of page faults |
   | jvector's neighbor cache and consistent offsets are not large determinants in improving larger than memory performance | lucene vamana (which has no cache and in-memory offsets) and jvector vamana `pq=0` performance is in the same order of magnitude |
   | pq is the largest determinant in improving performance | jvector `pq=2` performance is ~28x better than jvector `pq=0` |

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-11-10 Thread via GitHub


kevindrosendahl commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1806614314

   @**[benwtrent](https://github.com/benwtrent)**:
   > if I am reading the code correctly, it does the following:
   > - Write int8 quantized vectors alongside the vector ordinals in the graph 
(`.vex` or whatever has a copy of each vector).
   > - Continue to write vectors in `.vec`. I am guessing this is a stop-gap 
and you are thinking of removing this? Maybe not?
   
   The implementation will write the vectors at their requested 
encoding/quantization level into the graph. So if your vectors are `float[]`s 
with no quantization, `float[]`s go into the graph. If your vectors are 
`byte[]`s or `float[]`s with quantization, `byte[]`s go into the graph. If you 
do enable quantization, it keeps the full fidelity vectors around on the side 
for good measure, otherwise it only keeps a copy in the graph.
   
   > I have a concern around index sorting. How does building the graph & 
subsequent `getFloatVectors()` play with index sorting? Usually when folks sort 
an index, they expect to be able to iterate values in that given order. Is it 
horrifically slow to iterate `.vex` in this scenario?
   >
   > What do we think about always keeping `.vec` around? Probably should for 
re-ranking purposes once more extreme quantization measures are used.
   
   Interesting re: sorting. I hadn't given that much deep thought yet. To 
level-set a little on how I'm planning on approaching this, I think it's still 
a bit of an open question what performance is theoretically possible, and what 
type of index could possibly get us there.
   
   I'm trying first to get a sense of the possibilities there, so my main focus 
first is on measuring the single segment performance in order to rule out ideas 
not worth pursuing before investing too much time in them.
   
   If a single segment index works well, I think the next step would probably 
be to make sure it would work well with concurrent query execution over 
multiple segments, and then to start thinking about how it could be 
productionized (sorted indexes, merges, etc.).
   
   Thoughts on this approach?
   
   > One more question, have you tested your implementation in the situation 
where `.vex` cannot all be paged into memory and it fared ok?
   
   I've just got some first results for larger than memory performance, will 
post a separate comment.





Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-11-10 Thread via GitHub


robertvanwinkle1138 commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1805417696

   Another notable difference in the Lucene implementation is delta variable 
byte encoding of node ids. Skipping that encoding increases disk space, which 
requires the user to purchase more RAM per server. On the other hand, variable 
byte encoding is significantly slower to decode than raw integers.
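
   For reference, an illustrative sketch of the trade-off (not the actual Lucene or JVector encoders): delta + variable-byte encoding of a sorted neighbor list packs small gaps into one or two bytes each, at the cost of a decode loop, while raw int32s are a fixed 4 bytes per neighbor but read directly.

```java
import java.io.ByteArrayOutputStream;

// Illustrative sketch of delta + variable-byte encoding vs. raw int32s.
final class NeighborEncoding {
  static byte[] deltaVByte(int[] sortedNeighbors) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int prev = 0;
    for (int id : sortedNeighbors) {
      int delta = id - prev;           // small gaps between sorted ids
      prev = id;
      while ((delta & ~0x7F) != 0) {   // 7 bits per byte, high bit = continuation
        out.write((delta & 0x7F) | 0x80);
        delta >>>= 7;
      }
      out.write(delta);
    }
    return out.toByteArray();          // often 1-2 bytes per neighbor...
  }

  static int rawSizeBytes(int[] neighbors) {
    return neighbors.length * Integer.BYTES;  // ...vs. a fixed 4 bytes per neighbor
  }
}
```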
   
   In the example 
[listed](https://github.com/apache/lucene/issues/12615#issuecomment-1793186041) 
the JVector index size is a whopping 32% greater.
   
   
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsReader.java#L593
   
   
https://github.com/jbellis/jvector/blob/main/jvector-base/src/main/java/io/github/jbellis/jvector/disk/OnDiskGraphIndex.java#L130
   
   
   





Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-11-09 Thread via GitHub


robertvanwinkle1138 commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1804858072

   Perhaps much of the jvector performance improvement is simply from on-heap 
caching.
   
   
https://github.com/jbellis/jvector/blob/main/jvector-base/src/main/java/io/github/jbellis/jvector/disk/GraphCache.java
   
   





Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-11-08 Thread via GitHub


benwtrent commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1802705393

   @kevindrosendahl if I am reading the code correctly, it does the following:
   
- Write int8 quantized vectors alongside the vector ordinals in the graph 
(`.vex` or whatever has a copy of each vector).
- Continue to write vectors in `.vec`. I am guessing this is a stop-gap and 
you are thinking of removing this? Maybe not?
   
   I have a concern around index sorting. How does building the graph & 
subsequent `getFloatVectors()` play with index sorting? Usually when folks sort 
an index, they expect to be able to iterate values in that given order. Is it 
horrifically slow to iterate `.vex` in this scenario? 
   
   What do we think about always keeping `.vec` around? Probably should for 
re-ranking purposes once more extreme quantization measures are used.
   
   
   One more question, have you tested your implementation in the situation 
where `.vex` cannot all be paged into memory and it fared ok?





Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-11-07 Thread via GitHub


kevindrosendahl commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1800243525

   > I haven't had a chance to read your branch yet, but hope to soon.
   
   Great, thanks! To save you a bit of time, the tl;dr of going from HNSW to 
vamana is that it's actually fairly simple in the end. For the most part, I 
just removed all references to levels and changed the [diversity 
check](https://github.com/kevindrosendahl/lucene/blob/075116396e8af35b0a19a277491be9acab150beb/lucene/core/src/java/org/apache/lucene/util/vamana/VamanaGraphBuilder.java#L376-L404)
 to incorporate vamana's `alpha` parameter (essentially just adding a second, 
outer loop). There were a few small other implementation detail changes as well 
like [pruning down all the way to `M` when passing the overflow threshold 
instead of just removing one 
neighbor](https://github.com/kevindrosendahl/lucene/blob/075116396e8af35b0a19a277491be9acab150beb/lucene/core/src/java/org/apache/lucene/util/vamana/VamanaGraphBuilder.java#L360-L369).
   
   Then on the storage side, you just [write vectors into the graph instead of 
into the `.vec` 
file](https://github.com/kevindrosendahl/lucene/blob/075116396e8af35b0a19a277491be9acab150beb/lucene/core/src/java/org/apache/lucene/codecs/vectorsandbox/VectorSandboxVamanaVectorsWriter.java#L703-L734),
 and [read from the right offsets to get the vector values as 
appropriate](https://github.com/kevindrosendahl/lucene/blob/075116396e8af35b0a19a277491be9acab150beb/lucene/core/src/java/org/apache/lucene/codecs/vectorsandbox/VectorSandboxVamanaVectorsReader.java#L598-L684).
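
   To give a flavor of that change, a rough sketch of an alpha-aware diversity check in the spirit described above (placeholder types, not the branch's actual code): an outer pass with alpha = 1.0 behaves like the HNSW-style diversity check, and a second, relaxed pass with the configured alpha fills out the neighbor list.

```java
import java.util.*;

// Rough sketch of alpha-aware neighbor selection; Candidate and DistanceFn are placeholders.
final class AlphaPrune {
  record Candidate(int node, float distToTarget) {}

  interface DistanceFn { float distance(int a, int b); }

  // Select up to maxConn diverse neighbors from candidates sorted by distance to the
  // target node; the loop over {1.0, alpha} is the "second, outer loop" mentioned above.
  static List<Candidate> select(List<Candidate> candidates, int maxConn, float alpha, DistanceFn dist) {
    candidates.sort(Comparator.comparingDouble(Candidate::distToTarget));
    List<Candidate> selected = new ArrayList<>();
    for (float a : new float[] {1.0f, alpha}) {
      for (Candidate c : candidates) {
        if (selected.size() >= maxConn) return selected;
        if (selected.contains(c)) continue;
        boolean diverse = true;
        for (Candidate s : selected) {
          // prune c if an already-selected neighbor is so close to c that the edge
          // target -> c adds little; a larger alpha keeps more (longer) edges
          if (a * dist.distance(c.node(), s.node()) <= c.distToTarget()) {
            diverse = false;
            break;
          }
        }
        if (diverse) selected.add(c);
      }
    }
    return selected;
  }
}
```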





Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-11-07 Thread via GitHub


benwtrent commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1798357836

   @kevindrosendahl this looks really interesting! Thank you for digging in and 
starting the experimentation!
   
   I haven't had a chance to read your branch yet, but hope to soon.





Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-11-03 Thread via GitHub


kevindrosendahl commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1793186041

   Hey @benwtrent and all, just wanted to let you know that I'm experimenting 
some with different index structures for larger than memory indexes.
   
   I have a working implementation of Vamana based off the existing HNSW 
implementation (with vectors colocated with the adjacency lists in storage) in 
[this branch](https://github.com/kevindrosendahl/lucene/tree/vamana2), which 
I'm currently working on integrating scalar quantization with, and a 
benchmarking framework 
[here](https://github.com/kevindrosendahl/java-ann-bench) which can run various 
Lucene and JVector configurations.
   
   I don't have many numbers to share yet besides the fact that the Vamana 
graph implementation in a single segment seems competitive while in memory with 
Lucene HNSW (single segment) and JVector on small data sets. For 
glove-100-angular from ann-benchmarks (k=10):
   ```
   lucene_hnsw_maxConn:16-beamWidth:100_numCandidates:150
   average recall 0.82274001
   average duration PT0.000968358S
   index size: 529M

   jvector_vamana_M:16-beamWidth:100-neighborOverflow:2.0-alpha:1.2_pqFactor:0-numCandidates:100
   average recall 0.82422001
   average duration PT0.00124232S
   index size: 703M

   lucene_sandbox-vamana_maxConn:32-beamWidth:100-alpha:1.2-_numCandidates:100
   average recall 0.82553
   average duration PT0.000940756S
   index size: 554M
   ```
   
   I plan on testing vamana without PQ, vamana with PQ (a la DiskANN), as well 
as SPANN. Happy to collaborate with anyone interested.





Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-11-01 Thread via GitHub


benwtrent commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1788822945

   So, I replicated the jvector benchmark (the lucene part) using the new int8 
quantization. 
   
   Note, this is with `0` fan out or extra top-k gathered. Since the benchmark 
on JVector didn't specify any recall, etc. I just did the absolute baseline of 
`top-100`.
   
   I reserved 12GB to heap, thus reducing off-heap memory to at most 30GB.
   
   ```
   1 thread over 37 segments:
   completed 1000 searches in 18411 ms: 54 QPS CPU time=18231ms
   checking results
   0.777  18.23   1   0   16  100 100 0   1.00    post-filter
   ```
   
   Since kNN allows the segments to be searched in parallel, I used 8 threads 
for the query rewrite
   ```
   8 threads over 37 segments:
   completed 1000 searches in 2996 ms: 333 QPS CPU time=218ms
   checking results
   0.777   0.22   1   0   16  100 100 0   1.00    post-filter
   ```
   
   I am currently force-merging to a single segment to see what a single graph 
gives us. 
   
   FYI: the data set would require > 40GB of RAM to be held in memory. With 
int8 quantization, it's down to around 10GB.





Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-20 Thread via GitHub


jbellis commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1772724758

   > Or perhaps we "just" make a Lucene Codec component (KnnVectorsFormat) that 
wraps jvector? (https://github.com/jbellis/jvector)
   
   I'm happy to support anyone who wants to try this, including modifying 
JVector to make it a better fit if necessary.  I won't have time to tackle this 
myself in the medium-term, however.





Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-20 Thread via GitHub


jbellis commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1772722737

   > It is possible that the candidate postings (gathered via HNSW) don't 
contain ANY filtered docs. This would require gathering more candidate postings.
   
   This was a big problem for our initial deployment; we'd filter down to a few 
valid items and then spend forever searching the graph for them. We had to add a 
fairly accurate estimate of how many comparisons the index would need, and use 
that to decide whether to brute-force the comparison instead.  (This is in 
Cassandra, not JVector, specifically VectorMemtableIndex.expectedNodesVisited.) 
 I don't remember seeing code for this in Lucene but I mostly only looked at 
the HNSW code so I could have missed it.
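
   For what it's worth, a deliberately naive sketch of that decision (not Cassandra's `VectorMemtableIndex.expectedNodesVisited` heuristic): if only a small fraction of docs pass the filter, a graph search roughly needs k / fraction visits to surface k passing candidates, and once that exceeds the size of the filtered set it is cheaper to brute-force the filtered docs.

```java
// Naive placeholder cost model for choosing between graph search and brute force.
final class FilteredSearchPlanner {
  static boolean preferBruteForce(int k, int filteredDocCount, int totalDocCount) {
    if (filteredDocCount == 0) return true;
    double passFraction = (double) filteredDocCount / totalDocCount;
    double estimatedGraphVisits = k / passFraction;   // naive estimate, not the real heuristic
    return filteredDocCount <= estimatedGraphVisits;  // scanning the filter set is cheaper
  }
}
```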





Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-20 Thread via GitHub


jbellis commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1772711571

   > DiskANN is known to be slower at indexing than HNSW 
   
   I don't remember the numbers here, maybe 10% slower?  It wasn't material 
enough to make me worry about it.  (This is with using an HNSW-style 
incremental build, not the two-pass build described in the paper.)
   
   > and the blog post does not compare single threaded index times with Lucene.
   
   It was about 30% worse several months ago with my concurrent hnsw patch; it 
should be significantly improved now since JVector (1) doesn't need the 
CompletionTracker to keep the layers consistent, b/c it's single layer, (2) 
added a DenseIntMap concurrent collection to replace ConcurrentHashMap for the 
nodes.





Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-20 Thread via GitHub


jbellis commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1772704049

   Responding top to bottom,
   
   > I wonder how much the speed difference is due to (1) Vectors being out of 
memory (and if they used PQ for diskann, if they did, we should test PQ with 
HNSW). (2) Not merging to a single segment and searching multiple segments 
serially.
   
   (1) 90% of it is the PQ, yes.  I assume that storing the vector inline w/ 
the graph also helps some but I did not test that separately.  You could 
definitely get a big speed up just using PQ on top of HNSW.  
   
   (2) Single segment in both cases.  (JVector leaves segment management as an 
exercise for the user.)





Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-19 Thread via GitHub


dsmiley commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1772084185

   What say you @jbellis :-)
   I recommended making it a Lucene module when we spoke at Community-over-Code. 
 An outside dependency is okay for non-core.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-07 Thread via GitHub


mikemccand commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1751743673

   Or perhaps we "just" make a Lucene Codec component (KnnVectorsFormat) that 
wraps jvector?  (https://github.com/jbellis/jvector)
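
   The shape of such a wrapper would be small; KnnVectorsFormat essentially just 
asks for a per-segment writer and reader.  A skeleton under that assumption 
(the bodies are placeholders, since the jvector-side plumbing isn't sketched 
here):
   
   ```java
   import java.io.IOException;
   import org.apache.lucene.codecs.KnnVectorsFormat;
   import org.apache.lucene.codecs.KnnVectorsReader;
   import org.apache.lucene.codecs.KnnVectorsWriter;
   import org.apache.lucene.index.SegmentReadState;
   import org.apache.lucene.index.SegmentWriteState;
   
   // Skeleton of a jvector-backed vectors format; only the Lucene-facing shape
   // is shown, the jvector integration itself is left as a placeholder.
   public final class JVectorFormat extends KnnVectorsFormat {
     public JVectorFormat() {
       super("JVectorFormat"); // the name is recorded per segment for SPI lookup
     }
   
     @Override
     public KnnVectorsWriter fieldsWriter(SegmentWriteState state)
         throws IOException {
       // Here the wrapper would feed added vectors to jvector's graph builder
       // and flush its index files into the segment directory.
       throw new UnsupportedOperationException("jvector-backed writer not sketched");
     }
   
     @Override
     public KnnVectorsReader fieldsReader(SegmentReadState state)
         throws IOException {
       // Here it would open those files and answer Lucene's kNN search calls.
       throw new UnsupportedOperationException("jvector-backed reader not sketched");
     }
   }
   ```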


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-07 Thread via GitHub


mikemccand commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1751740152

   (listening to @jbellis talk at Community over Code).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-07 Thread via GitHub


mikemccand commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1751739864

   SPANN is another option?
   
   
https://www.researchgate.net/publication/356282356_SPANN_Highly-efficient_Billion-scale_Approximate_Nearest_Neighbor_Search


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-06 Thread via GitHub


benwtrent commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1750872046

   > This search with filter method seems to throw an error.
   
   LOL, I thought it was supported; I must have read a GitHub issue and made an 
assumption.
   
   > Couldn't that be done with the current Lucene HNSW implementation?
   
   I think so. The tricky thing is making sure enough matching postings are 
gathered to give an accurate number for recall. Having to go back to the HNSW 
graph to get more postings lists could get expensive if we keep getting too few 
matches.
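
   Roughly, the loop would look something like this (the centroid/posting 
interfaces are hypothetical, not existing Lucene APIs): keep asking for more 
posting lists until the gathered candidates contain enough filter matches.
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   
   // Hypothetical sketch: widen the probe count until enough filtered matches
   // are gathered; a real implementation would only scan the newly added lists.
   final class CandidateGatherer {
     interface CentroidGraph { int[] nearestCentroids(float[] query, int n); }
     interface PostingLists  { int[] postings(int centroid); }
     interface Filter        { boolean matches(int doc); }
   
     static List<Integer> gather(float[] query, int k, CentroidGraph graph,
                                 PostingLists lists, Filter filter) {
       List<Integer> matched = new ArrayList<>();
       int probes = 4;                       // start with a few posting lists
       while (matched.size() < k && probes <= 1024) {
         matched.clear();
         for (int centroid : graph.nearestCentroids(query, probes)) {
           for (int doc : lists.postings(centroid)) {
             if (filter.matches(doc)) {
               matched.add(doc);
             }
           }
         }
         probes *= 2;                        // too few matches: go back for more
       }
       return matched;
     }
   }
   ```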


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-06 Thread via GitHub


robertvanwinkle1138 commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1750770721

   > QDrant's HNSW filter solution is the exact same as Lucene's
   
   Interesting thanks.
   
   > as candidate posting lists are gathered, ensure they have some candidates
   
   Couldn't that be done with the current Lucene HNSW implementation?
   
   > The [SPANN repository](https://github.com/microsoft/SPTAG) supports 
filtering, and it's OSS, so we could always just read what they did.
   
   This search with filter method seems to throw an error.
   
   
https://github.com/microsoft/SPTAG/blob/main/AnnService/src/Core/SPANN/SPANNIndex.cpp#L262
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-05 Thread via GitHub


benwtrent commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1749322588

   > QDrant has a filter solution however the methodology described in their 
blog is opaque.
   
   QDrant's HNSW filter solution is the exact same as Lucene's. You can look at 
the code: they don't filter candidate exploration, but filter result collection.
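
   In sketch form, "filter at collection, not at exploration" looks roughly like 
this (hypothetical interfaces, not the actual Lucene classes):
   
   ```java
   // Hypothetical sketch: the traversal expands neighbors regardless of the
   // filter (so it can reach matching docs through non-matching ones), but only
   // docs the filter accepts are handed to the result collector.
   final class FilteredCollection {
     interface Neighbors       { int[] neighborsOf(int node); }
     interface Bits            { boolean get(int doc); }
     interface ResultCollector { void collect(int doc, float score); } // keeps k best
   
     static void visit(int node, float score, Neighbors graph, Bits acceptDocs,
                       ResultCollector results) {
       if (acceptDocs.get(node)) {
         results.collect(node, score);  // only filtered docs reach the result set
       }
       for (int neighbor : graph.neighborsOf(node)) {
         // score `neighbor` and push it onto the exploration queue, whether or
         // not it matches the filter
       }
     }
   }
   ```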
   
   You are correct that filtering with SPANN would be different. Though I am 
not sure it's intractable.
   
   It is possible that the candidate postings (gathered via HNSW) don't contain 
ANY filtered docs. This would require gathering more candidate postings.
   
   But I think we can do that before scoring. So, as candidate posting lists 
are gathered, ensure they have some candidates. 
   
   But I am pretty sure the SPANN repository supports filtering, and it's OSS, 
so we could always just read what they did.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-05 Thread via GitHub


robertvanwinkle1138 commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1749187258

   The SPANN paper does not address efficient filtered queries.  Lucene's HNSW 
calculates the similarity score for every record it visits, regardless of 
whether the record matches the filter.
   
   Filtered-DiskANN [1] describes a solution for efficient filtered queries.
   
   QDrant has a filter solution; however, the methodology described in their 
blog is opaque.
   
   1. https://dl.acm.org/doi/pdf/10.1145/3543507.3583552
   
   > As Approximate Nearest Neighbor Search (ANNS)-based dense retrieval 
becomes ubiquitous for search and recommendation scenarios, efficiently 
answering filtered ANNS queries has become a critical requirement. Filtered 
ANNS queries ask for the nearest neighbors of a query’s embedding from the 
points in the index that match the query’s labels such as date, price range, 
language. There has been little prior work on algorithms that use label 
metadata associated with vector data to build efficient indices for filtered 
ANNS queries. Consequently, current indices have high search latency or low 
recall which is not practical in interactive web-scenarios. We present two 
algorithms with native support for faster and more accurate filtered ANNS 
queries: one with streaming support, and another based on batch construction. 
Central to our algorithms is the construction of a graph-structured index 
which forms connections not only based on the geometry of the vector data, but 
also the associated label set. On real-world data with natural labels, both 
algorithms are an order of magnitude or more efficient for filtered queries 
than the current state of the art algorithms. The generated indices can also 
be queried from an SSD and support thousands of queries per second at over 90% 
recall@10.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-04 Thread via GitHub


benwtrent commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1747350135

   @jmazanec15 I agree that SPANN seems more attractive. I would argue, though, 
that we don't need to do clustering (in the paper they do clustering, but with 
minimal effectiveness), and could maybe get away with randomly selecting a 
subset of vectors per segment.
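
   Picking those "centroids" could be as simple as a partial shuffle over the 
segment's ordinals, something like this sketch (hypothetical, not Lucene code):
   
   ```java
   import java.util.Arrays;
   import java.util.Random;
   
   // Hypothetical sketch: sample ordinals uniformly; the sampled vectors act as
   // the posting-list heads that the remaining vectors get assigned to.
   final class RandomCentroids {
     static int[] pick(int numVectors, int numCentroids, long seed) {
       Random rnd = new Random(seed);
       int[] ords = new int[numVectors];
       for (int i = 0; i < numVectors; i++) {
         ords[i] = i;
       }
       // Partial Fisher-Yates shuffle: the first numCentroids slots are the sample.
       for (int i = 0; i < numCentroids; i++) {
         int j = i + rnd.nextInt(numVectors - i);
         int tmp = ords[i];
         ords[i] = ords[j];
         ords[j] = tmp;
       }
       return Arrays.copyOf(ords, numCentroids);
     }
   }
   ```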


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-04 Thread via GitHub


jmazanec15 commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1747329967

   A hybrid disk-memory algorithm would have very strong benefits. I did run a 
few tests recently that confirmed HNSW does not function very well when memory 
gets constrained (which I think everyone already knew). 
   
   I wonder though, instead of DiskANN, what about a partitioning-based 
approach such as [SPANN](https://arxiv.org/pdf/2111.08566.pdf)? I think a 
partitioning-based approach for Lucene would make merging, updating, filtering 
and indexing a lot easier. Also, it seems it would have better disk-access 
patterns. In the paper, they do show that in a memory-constrained environment, 
they were able to outperform DiskANN.
   
   I guess the tradeoff might be that partitioning-based approaches would 
struggle to achieve really low-latency search when in memory, compared to 
graph-based approaches. Additionally, partitioning approaches would require a 
potentially expensive "training" or "preprocessing" step such as k-Means, and 
performance could degrade if the data distribution drifts and the models are 
not updated. But if PQ is implemented, the same considerations would apply as 
well.
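
   For completeness, the "preprocessing" step a partitioning index needs is 
just assigning every vector to the posting list of its nearest centroid (SPANN 
assigns to several nearby centroids, but one is enough to show the idea).  A 
sketch under that simplification, with hypothetical types:
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   
   // Hypothetical sketch of posting-list construction for a SPANN-like index.
   final class PostingListBuilder {
     static List<List<Integer>> assign(float[][] vectors, float[][] centroids) {
       List<List<Integer>> postings = new ArrayList<>();
       for (int c = 0; c < centroids.length; c++) {
         postings.add(new ArrayList<>());
       }
       for (int ord = 0; ord < vectors.length; ord++) {
         int best = 0;
         double bestDist = Double.MAX_VALUE;
         for (int c = 0; c < centroids.length; c++) {
           double d = squaredDistance(vectors[ord], centroids[c]);
           if (d < bestDist) {
             bestDist = d;
             best = c;
           }
         }
         postings.get(best).add(ord);  // this vector lives in centroid `best`'s list
       }
       return postings;
     }
   
     static double squaredDistance(float[] a, float[] b) {
       double sum = 0;
       for (int i = 0; i < a.length; i++) {
         double d = a[i] - b[i];
         sum += d * d;
       }
       return sum;
     }
   }
   ```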


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-04 Thread via GitHub


benwtrent commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1747298348

   > DiskANN is known to be slower at indexing than HNSW and the blog post does 
not compare single threaded index times with Lucene.
   
   @robertvanwinkle1138 this is just one of my concerns.
   
   Additionally, I think we would still have the "segment merging problem". I 
do think we can bring HNSW merge times down by keeping connections between 
graphs; I just haven't had the cycles myself to dig into that. There is a 
Lucene issue around making HNSW merges faster somewhere...
   
   https://github.com/apache/lucene/issues/12440


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-04 Thread via GitHub


robertvanwinkle1138 commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1747228177

   @benwtrent 
   For merges there is "FreshDiskANN: A Fast and Accurate Graph-Based
   ANN Index for Streaming Similarity Search"
   https://arxiv.org/pdf/2105.09613.pdf
   
   DiskANN is known to be slower at indexing than HNSW and the blog post does 
not compare single threaded index times with Lucene.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-03 Thread via GitHub


benwtrent commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1744763221

   I do think Lucene's read-only, segment-based architecture lends itself to 
supporting quantization (required for DiskANN). 
   
   It would be an interesting experiment to see how indexing times, segment 
merges, etc. are all handled with DiskANN.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org