MarcusSorealheis commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1977567565
> HNSW and Vamana are "competing" proximity graphs, which differ mainly in
the number of layers in the graph (n vs 1) and the pruning algorithm used.
I do not think
kevindrosendahl commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1974089752
Think I agree with your points @benwtrent, will just jot down my thinking on
HNSW vs Vamana vs DiskANN in case it's useful.
HNSW and Vamana are "competing" proximity
benwtrent commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1929761190
So, I did some of my own experiments. I tested Vamana (vectors in-graph) &
HNSW, both with `int8` quantization (here is my Lucene branch:
jmazanec15 commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1915207682
@kevindrosendahl This is really cool! I had a couple questions around
product quantization implementation. I see in
rmuir commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1888419082
I would be extremely careful around io_uring, it is disabled in many
environments (e.g. by default in container environments) for security reasons:
*
kevindrosendahl commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1888208867
> How is segment merging implemented by Lucene Vamana?
I didn't do anything special for Vamana in these experiments, the index
construction and merging are
MarcusSorealheis commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1883963445
Great to finally see you in the Lucene repo @kevindrosendahl after all these
years. The work you have done here is stellar and the whole world welcomes
the diligence. I
robertvanwinkle1138 commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1868130520
@kevindrosendahl Pretty interesting, thanks for the low level details,
`io_uring` is fancy!
How is segment merging implemented by Lucene Vamana?
Correct
kevindrosendahl commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1868095892
Hi all, thanks for the patience, some interesting and exciting results to
share.
TL;DR:
- DiskANN doesn't seem to lose any performance relative to HNSW when fully
kevindrosendahl commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1819863900
@navneet1v I've been using [oshi](https://github.com/oshi/oshi) in my
testing framework, particularly
navneet1v commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1819800985
@kevindrosendahl, unrelated to the thread I see that you have added a column
named page faults. Can you provide me some details around how you got the page
faults? I am doing
kevindrosendahl commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1808717985
@benwtrent
> Thank you @kevindrosendahl this does seem to confirm my suspicion that the
improvement isn't necessarily due to the data structure, but due to
jbellis commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1808489834
> recall actually improves when introducing pq, and only starts to decrease
at a factor of 16
I would guess that either there is a bug or you happen to be testing with a
benwtrent commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1808340050
Thank you @kevindrosendahl this does seem to confirm my suspicion that the
improvement isn't necessarily due to the data structure, but due to
quantization. But, this does
mikemccand commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1808286805
> I've got my framework set up for testing larger than memory indexes and
have some somewhat interesting first results.
Thank you for setting this up @kevindrosendahl --
kevindrosendahl commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1806615864
I've got my framework set up for testing larger than memory indexes and have
some somewhat interesting first results.
TL;DR:
- the main thing driving jvector's
kevindrosendahl commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1806614314
@**[benwtrent](https://github.com/benwtrent)**:
> if I am reading the code correctly, it does the following:
> - Write int8 quantized vectors alongside the vector
robertvanwinkle1138 commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1805417696
Another notable difference in the Lucene implementation is delta variable
byte encoding of node ids. The increase in disk space requires the user to
purchase more RAM
robertvanwinkle1138 commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1804858072
Perhaps much of the jvector performance improvement is simply from on-heap
caching.
benwtrent commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1802705393
@kevindrosendahl if I am reading the code correctly, it does the following:
- Write int8 quantized vectors alongside the vector ordinals in the graph
(`.vex` or whatever
kevindrosendahl commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1800243525
> I haven't had a chance to read your branch yet, but hope to soon.
Great, thanks! To save you a bit of time, the tl;dr of going from HNSW to
vamana is that it's
benwtrent commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1798357836
@kevindrosendahl this looks really interesting! Thank you for digging in and
starting the experimentation!
I haven't had a chance to read your branch yet, but hope to soon.
kevindrosendahl commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1793186041
Hey @benwtrent and all, just wanted to let you know that I'm experimenting
some with different index structures for larger than memory indexes.
I have a working
benwtrent commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1788822945
So, I replicated the jvector benchmark (the lucene part) using the new int8
quantization.
Note, this is with `0` fan out or extra top-k gathered. Since the benchmark
on
jbellis commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1772724758
> Or perhaps we "just" make a Lucene Codec component (KnnVectorsFormat) that
wraps jvector? (https://github.com/jbellis/jvector)
I'm happy to support anyone who wants to try
jbellis commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1772722737
> It is possible that the candidate postings (gathered via HNSW) don't
contain ANY filtered docs. This would require gathering more candidate postings.
This was a big problem
jbellis commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1772711571
> DiskANN is known to be slower at indexing than HNSW
I don't remember the numbers here, maybe 10% slower? It wasn't material
enough to make me worry about it. (This is
jbellis commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1772704049
Responding top to bottom,
> I wonder how much the speed difference is due to (1) Vectors being out of
memory (and if they used PQ for diskann, if they did, we should test PQ
dsmiley commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1772084185
What say you @jbellis :-)
I recommended a module of Lucene when we spoke at Community-over-Code. A
dependency outside is okay for non-core.
--
This is an automated message
mikemccand commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1751743673
Or perhaps we "just" make a Lucene Codec component (KnnVectorsFormat) that
wraps jvector? (https://github.com/jbellis/jvector)
mikemccand commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1751740152
(listening to @jbellis talk at Community over Code).
mikemccand commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1751739864
SPANN is another option?
https://www.researchgate.net/publication/356282356_SPANN_Highly-efficient_Billion-scale_Approximate_Nearest_Neighbor_Search
benwtrent commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1750872046
> This search with filter method seems to throw an error.
LOL, I thought it was supported, I must have read a GitHub issue and made an
assumption.
> Couldn't that be
robertvanwinkle1138 commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1750770721
> QDrant's HNSW filter solution is the exact same as Lucene's
Interesting thanks.
> as candidate posting lists are gathered, ensure they have some
benwtrent commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1749322588
> QDrant has a filter solution however the methodology described in their
blog is opaque.
QDrant's HNSW filter solution is the exact same as Lucene's. You can look at
the
robertvanwinkle1138 commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1749187258
The SPANN paper does not address efficient filtered queries. Lucene's HNSW
calculates the similarity score for every record, regardless of the record
matching the
benwtrent commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1747350135
@jmazanec15 I agree that SPANN seems more attractive. I would argue though
we don't need to do clustering (in the paper they do clustering, but with
minimal effectiveness), but
jmazanec15 commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1747329967
A hybrid disk-memory algorithm would have very strong benefits. I did run a
few tests recently that confirmed HNSW does not function very well when memory
gets constrained
benwtrent commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1747298348
> DiskANN is known to be slower at indexing than HNSW and the blog post does
not compare single threaded index times with Lucene.
@robertvanwinkle1138 this is just one of
robertvanwinkle1138 commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1747228177
@benwtrent
For merges there is "FreshDiskANN: A Fast and Accurate Graph-Based
ANN Index for Streaming Similarity Search"
https://arxiv.org/pdf/2105.09613.pdf
benwtrent commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1744763221
I do think Lucene's read-only, segment-based architecture lends itself to
supporting quantization (required for DiskANN).
It would be an interesting experiment to see how