Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-07 Thread Michael Wechner
sorry to interrupt, but I think we are getting side-tracked from the original discussion about increasing the vector dimension limit. Improving vector indexing performance is one thing, and making sure Lucene does not crash when increasing the vector dimension limit is another. I think it is

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-07 Thread jim ferenczi
> Keep in mind, there may be other ways to do it. In general if merging something is going to be "heavyweight", we should think about it to prevent things from going really bad overall. Yep, I agree. Personally I don't see how we can solve this without prior knowledge of the vectors. Faiss has a

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-07 Thread jim ferenczi
> It is designed to build an in-memory datastructure and "merge" means "rebuild". The main idea, imo, in the diskann paper is to build the graph with the full dimensions to preserve the quality of the neighbors. At query time it uses the reduced dimensions (via product quantization) to compute
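The scheme described above (build the graph at full dimension, answer queries against product-quantized vectors) can be sketched in a few lines. This is a toy illustration, not DiskANN's actual implementation: the sub-vector count, codebook size, and random "centroids" are all stand-ins (real PQ trains codebooks with k-means and uses 256 centroids, one byte per code).

```python
import random

random.seed(0)

DIM = 8      # toy full dimension (real systems: hundreds or thousands)
SUBS = 4     # number of sub-vectors
SUB_DIM = DIM // SUBS
K = 4        # centroids per sub-space (real PQ: 256)

# Toy codebooks: K random centroids per sub-space.
codebooks = [[[random.uniform(-1, 1) for _ in range(SUB_DIM)] for _ in range(K)]
             for _ in range(SUBS)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def encode(vec):
    """Replace each sub-vector with the index of its nearest centroid."""
    codes = []
    for s in range(SUBS):
        sub = vec[s * SUB_DIM:(s + 1) * SUB_DIM]
        codes.append(min(range(K), key=lambda k: sq_dist(sub, codebooks[s][k])))
    return codes

def approx_sq_dist(query, codes):
    """Asymmetric distance: full-precision query vs. quantized stored vector."""
    total = 0.0
    for s, code in enumerate(codes):
        sub = query[s * SUB_DIM:(s + 1) * SUB_DIM]
        total += sq_dist(sub, codebooks[s][code])
    return total

vec = [random.uniform(-1, 1) for _ in range(DIM)]
query = [v + 0.01 for v in vec]
codes = encode(vec)
print(len(codes))  # 4 small codes stored instead of 8 full floats
```

The payoff is that graph traversal at query time touches only the compact codes, while graph construction saw the full-precision vectors.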

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-07 Thread Robert Muir
Personally I'd have to re-read the paper, but in general the merging issue has to be addressed somehow to fix the overall indexing time problem. It seems it gets "dodged" with huge RAM buffers in the emails here. Keep in mind, there may be other ways to do it. In general if merging something is

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-07 Thread jim ferenczi
I am also not sure that diskann would solve the merging issue. The idea described in the paper is to run k-means first to create multiple graphs, one per cluster. In our case the vectors in each segment could belong to different clusters, so I don’t see how we could merge them efficiently. On Fri, 7

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-07 Thread jim ferenczi
The inference time (and cost) to generate these big vectors must be quite large too ;). Regarding the RAM buffer, we could drastically reduce the size by writing the vectors on disk instead of keeping them in the heap. With 1k dimensions the RAM buffer is filled with these vectors quite rapidly.
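How quickly the buffer fills is simple arithmetic. A sketch, using the ~1994 MB IW buffer size that appears in the benchmark runs elsewhere in this thread; per-document and per-field overhead is ignored, so the real flush point comes even sooner:

```python
DIMS = 1024
BYTES_PER_FLOAT = 4

# Raw payload of one float vector, ignoring heap object overhead.
bytes_per_vector = DIMS * BYTES_PER_FLOAT          # 4096 bytes = 4 KiB

buffer_mb = 1994
vectors_until_flush = buffer_mb * 1024 * 1024 // bytes_per_vector
print(bytes_per_vector)       # 4096
print(vectors_until_flush)    # 510464 — roughly half a million vectors per flush
```

So at 1k dimensions, every ~510k documents force a flush (and later a merge), which is why keeping raw vectors off-heap is attractive.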

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-07 Thread Robert Muir
On Fri, Apr 7, 2023 at 5:13 PM Benjamin Trent wrote: > > From all I have seen when hooking up JFR when indexing a medium number of > vectors (1M+), almost all the time is spent simply comparing the vectors > (e.g. dot_product). > > This indicates to me that another algorithm won't really help

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-07 Thread Benjamin Trent
From all I have seen when hooking up JFR when indexing a medium number of vectors (1M+), almost all the time is spent simply comparing the vectors (e.g. dot_product). This indicates to me that another algorithm won't really help index build time tremendously. Unless others do dramatically fewer
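The JFR observation fits a simple cost model: build time is roughly (vectors inserted) × (distance computations per insert) × (cost of one dot product). The per-insert comparison count below is a hypothetical stand-in; HNSW's real number is data-dependent and driven by beam width:

```python
def flops_estimate(num_vectors, dims, comparisons_per_insert):
    """One dot product costs ~2*dims flops (a multiply and an add per dimension)."""
    return num_vectors * comparisons_per_insert * 2 * dims

# Hypothetical: ~1000 distance computations per inserted node.
est = flops_estimate(1_000_000, 768, 1000)
print(f"{est:.2e}")  # 1.54e+12 flops for 1M 768d vectors
```

All three factors multiply, which is why a different graph algorithm (same comparison count) helps less than doing fewer or cheaper comparisons, e.g. via SIMD or quantization.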

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-07 Thread Robert Muir
On Fri, Apr 7, 2023 at 7:47 AM Michael Sokolov wrote: > > 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994) > 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer size=1994) > > Robert, since you're the only on-the-record veto here, does this > change your thinking
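Those two runs are a matched pair: they index the same total number of floats, and the near-identical wall times are consistent with comparison cost scaling with count × dimension:

```python
run_a = 8_000_000 * 1024   # 8M 1024d float vectors -> 1h48m
run_b = 4_000_000 * 2048   # 4M 2048d float vectors -> 1h44m
print(run_a == run_b)      # True: identical total float count, similar wall time
```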

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-07 Thread Marcus Eagan
An important data point, and it doesn't seem too bad or too good. Shouldn't acceptable performance be decided by the user? What do you all think? On Fri, Apr 7, 2023 at 8:20 AM Michael Sokolov wrote: > one more data point: > > 32M 100dim (fp32) vectors indexed in 1h20m (M=16, IW cache=1994,

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-07 Thread Michael Sokolov
one more data point: 32M 100dim (fp32) vectors indexed in 1h20m (M=16, IW cache=1994, heap=4GB) On Fri, Apr 7, 2023 at 8:52 AM Michael Sokolov wrote: > > I also want to add that we do impose some other limits on graph > construction to help ensure that HNSW-based vector fields remain >

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-07 Thread Michael Sokolov
I also want to add that we do impose some other limits on graph construction to help ensure that HNSW-based vector fields remain manageable; M is limited to <= 512, and maximum segment size also helps limit merge costs On Fri, Apr 7, 2023 at 7:45 AM Michael Sokolov wrote: > > Thanks Kent - I
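One reason the M cap matters: per-node link storage grows linearly with M. A rough sizing sketch, assuming 4-byte int neighbor ids and counting only M links per node on the base layer (a simplification — real HNSW typically stores 2M links on layer 0 plus smaller upper layers):

```python
def graph_link_bytes(num_vectors, m, bytes_per_id=4):
    """Approximate neighbor-list storage for an HNSW base layer."""
    return num_vectors * m * bytes_per_id

for m in (16, 512):
    gib = graph_link_bytes(10_000_000, m) / 2**30
    print(f"M={m}: {gib:.2f} GiB")   # M=16: 0.60 GiB, M=512: 19.07 GiB
```

At the M=512 ceiling, the graph links alone for 10M vectors approach 20 GiB, before counting the vectors themselves, so the limit is doing real work.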

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-07 Thread Michael Sokolov
Thanks Kent - I tried something similar to what you did, I think. I took a set of 256d vectors I had and concatenated them to make bigger ones, then shifted the dimensions to make more of them. Here are a few single-threaded indexing test runs. I ran all tests with M=16. 8M 100d float vectors
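The synthesis described above (concatenate existing vectors to raise dimensionality, then shift dimensions to multiply the count) could look roughly like this; the exact scheme isn't specified in the email, so this is one plausible interpretation with toy sizes:

```python
def concat(vec_a, vec_b):
    """Make a higher-dimensional vector from two lower-dimensional ones."""
    return vec_a + vec_b

def shifted_variants(vec, n):
    """Derive n distinct vectors by cyclically shifting the dimensions."""
    return [vec[i:] + vec[:i] for i in range(n)]

base = [0.0, 1.0, 2.0, 3.0]        # stand-in for a real 256d vector
big = concat(base, base)           # 8d here; 512d in the actual test
variants = shifted_variants(big, 3)
print(len(big))       # 8
print(len(variants))  # 3
```

Shifting preserves the norm and value distribution of the originals, which keeps the synthetic corpus statistically plausible for benchmarking.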

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-07 Thread Kent Fitch
Hi, I have been testing Lucene with a custom vector similarity and loaded 192M vectors of dim 512 (byte values). (Yes, segment merges use a lot of Java memory...) As this was a performance test, the 192M vectors were derived by dithering 47k original vectors in such a way as to allow realistic ANN
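The dithering trick mentioned above (expanding 47k real vectors into 192M realistic ones) can be sketched as adding small random perturbations to each component. The noise scale and distribution here are assumptions, and this sketch uses floats where the actual test used 512-dim byte vectors:

```python
import random

random.seed(42)

def dither(vec, scale=0.01):
    """Perturb each component slightly to derive a new, realistic vector."""
    return [v + random.uniform(-scale, scale) for v in vec]

original = [0.5, -0.25, 0.75]                   # stand-in for a real vector
derived = [dither(original) for _ in range(5)]  # 5 variants from 1 original
print(len(derived))  # 5
```

Because each derivative stays within `scale` of its source, nearest-neighbor structure is preserved, so recall measurements on the inflated corpus remain meaningful.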

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-07 Thread Michael Wechner
you might want to use SentenceBERT (https://sbert.net) to generate vectors; for example, the model "all-mpnet-base-v2" generates vectors with dimension 768. We have SentenceBERT running as a web service, which we could open for these tests, but because of network latency it should be

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-07 Thread Marcus Eagan
I've started to look on the internet, and surely someone will come along, but the challenge I suspect is that these vectors are expensive to generate, so people have not gone all in on generating such large vectors for large datasets. They certainly have not made them easy to find. Here is the most