Sorry to interrupt, but I think we got side-tracked from the original
discussion about increasing the vector dimension limit.
I think improving vector indexing performance is one thing, and
making sure Lucene does not crash when the vector dimension limit is
increased is another.
I think it is
> Keep in mind, there may be other ways to do it. In general if merging
something is going to be "heavyweight", we should think about it to
prevent things from going really bad overall.
Yep, I agree. Personally I don't see how we can solve this without prior
knowledge of the vectors. Faiss has a
> It is designed to build an in-memory datastructure and "merge" means
"rebuild".
The main idea, IMO, in the DiskANN paper is to build the graph with the full
dimensions to preserve the quality of the neighbors. At query time it uses
the reduced dimensions (via product quantization) to compute
Personally I'd have to re-read the paper, but in general the merging
issue has to be addressed somehow to fix the overall indexing-time
problem. It seems it gets "dodged" with huge RAM buffers in the emails
here.
I am also not sure that DiskANN would solve the merging issue. The idea
described in the paper is to run k-means first to create multiple graphs, one
per cluster. In our case the vectors in each segment could belong to
different clusters, so I don't see how we could merge them efficiently.
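A toy illustration of why that partitioning step fights segment merging. This is not DiskANN's code; the centroids are random stand-ins for a trained k-means result, and the per-cluster graph building is omitted:

```python
import numpy as np

rng = np.random.default_rng(1)

centroids = rng.normal(size=(4, 16))   # pretend k-means already produced these
segment_a = rng.normal(size=(10, 16))  # vectors from one segment
segment_b = rng.normal(size=(10, 16))  # vectors from another segment

def assign(vectors, centroids):
    """Nearest-centroid assignment, one cluster id per vector."""
    dists = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

# Vectors from the two segments scatter across the same clusters, so a
# "merge" would mean re-partitioning and rebuilding per-cluster graphs,
# not concatenating two existing graphs.
print(assign(segment_a, centroids))
print(assign(segment_b, centroids))
```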
The inference time (and cost) to generate these big vectors must be quite
large too ;).
Regarding the RAM buffer, we could drastically reduce its size by writing
the vectors to disk instead of keeping them in the heap. With 1k dimensions
the RAM buffer fills up with these vectors quite rapidly.
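Back-of-the-envelope numbers for that fill rate, assuming float32 vectors and ignoring all per-document overhead (so the real count flushes even sooner):

```python
dims = 1024
bytes_per_vector = dims * 4                  # float32 = 4 bytes per dimension
ram_buffer_mb = 1994                         # IW buffer size used in the tests in this thread

vectors_until_flush = ram_buffer_mb * 1024 * 1024 // bytes_per_vector
print(vectors_until_flush)  # 510464
```

So a ~2 GB buffer holds only about half a million 1024-dim raw vectors before forcing a flush, which is why these runs lean on such large buffers.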
On Fri, Apr 7, 2023 at 5:13 PM Benjamin Trent wrote:
>
> From all I have seen when hooking up JFR when indexing a medium number of
> vectors(1M +), almost all the time is spent simply comparing the vectors
> (e.g. dot_product).
>
> This indicates to me that another algorithm won't really help
From all I have seen when hooking up JFR when indexing a medium number of
vectors(1M +), almost all the time is spent simply comparing the vectors
(e.g. dot_product).
This indicates to me that another algorithm won't really help index build
time tremendously, unless others do dramatically fewer
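The observation is that each vector comparison is O(d), so total build cost is roughly (number of comparisons) × d regardless of which graph algorithm schedules those comparisons. A minimal scalar illustration (Lucene's real implementation is vectorized, this is just the arithmetic):

```python
def dot_product(a, b):
    """Naive dot product: exactly d multiply-adds per vector pair."""
    assert len(a) == len(b)
    return sum(x * y for x, y in zip(a, b))

# If a build performs C comparisons over d-dim vectors, it does about
# C * d multiply-adds; doubling d doubles the dominant cost.
print(dot_product([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
```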
On Fri, Apr 7, 2023 at 7:47 AM Michael Sokolov wrote:
>
> 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
> 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer size=1994)
>
> Robert, since you're the only on-the-record veto here, does this
> change your thinking
That's an important data point, and it doesn't seem too bad or too good.
Shouldn't acceptable performance be decided by the user? What do you all think?
On Fri, Apr 7, 2023 at 8:20 AM Michael Sokolov wrote:
> one more data point:
>
> 32M 100dim (fp32) vectors indexed in 1h20m (M=16, IW cache=1994,
one more data point:
32M 100dim (fp32) vectors indexed in 1h20m (M=16, IW cache=1994, heap=4GB)
On Fri, Apr 7, 2023 at 8:52 AM Michael Sokolov wrote:
>
> I also want to add that we do impose some other limits on graph
> construction to help ensure that HNSW-based vector fields remain
>
I also want to add that we do impose some other limits on graph
construction to help ensure that HNSW-based vector fields remain
manageable; M is limited to <= 512, and maximum segment size also
helps limit merge costs
On Fri, Apr 7, 2023 at 7:45 AM Michael Sokolov wrote:
>
> Thanks Kent - I
Thanks Kent - I tried something similar to what you did, I think. I took
a set of 256d vectors I had and concatenated them to make bigger ones,
then shifted the dimensions to make more of them. Here are a few
single-threaded indexing test runs. I ran all tests with M=16.
8M 100d float vectors
Hi,
I have been testing Lucene with a custom vector similarity and loaded 192M
byte vectors of dim 512. (Yes, segment merges use a lot of Java memory...)
As this was a performance test, the 192M vectors were derived by dithering
47k original vectors in such a way as to allow realistic ANN
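Kent doesn't spell out how the dithering was done; one common way to expand a small vector set like this (a guess at the approach, not his code - the noise scale and re-normalization are assumptions) is to jitter each vector with small random noise:

```python
import numpy as np

rng = np.random.default_rng(0)

def dither(vectors, copies, scale=0.01):
    """Expand a small vector set by jittering each vector `copies` times."""
    out = []
    for v in vectors:
        jittered = v + rng.normal(scale=scale, size=(copies, v.shape[0]))
        # re-normalize so similarity scores stay in a realistic range
        out.append(jittered / np.linalg.norm(jittered, axis=1, keepdims=True))
    return np.vstack(out)

base = rng.normal(size=(3, 8))      # stand-in for the 47k originals
expanded = dither(base, copies=4)   # stand-in for the 192M derived set
print(expanded.shape)  # (12, 8)
```

Small-scale noise keeps each synthetic vector a realistic near-neighbor of its source, which is what makes the expanded set usable for ANN recall testing.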
You might want to use SentenceBERT to generate vectors:
https://sbert.net
For example, the model "all-mpnet-base-v2" generates vectors with
dimension 768.
We have SentenceBERT running as a web service, which we could open for
these tests, but because of network latency it should be
I've started to look on the internet, and surely someone will come, but the
challenge I suspect is that these vectors are expensive to generate so
people have not gone all in on generating such large vectors for large
datasets. They certainly have not made them easy to find. Here is the most