Re: CEP-30: Approximate Nearest Neighbor(ANN) Vector Search via Storage-Attached Indexes

2023-05-17 Thread David Capwell
Thanks for the update, LGTM
> On May 17, 2023, at 5:35 AM, Jasonstack Zhao Yang wrote:
>
> Hi,
>
> I have updated the CEP with some details about distributed queries in the Approach section.
>
> David:
>
> > given results have a real ranking, the current 2i logic may yield incorrect

Re: CEP-30: Approximate Nearest Neighbor(ANN) Vector Search via Storage-Attached Indexes

2023-05-17 Thread Jasonstack Zhao Yang
Hi, I have updated the CEP with some details about distributed queries in the *Approach* section. David: > given results have a real ranking, the current 2i logic may yield incorrect results C* internal iterators are all in primary key order. So we need two in-memory top-k filters, one at
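A minimal sketch of the in-memory top-k filter idea mentioned above, assuming rows arrive in primary key order and each already carries a similarity score. The function name `top_k_by_score` and the `(primary_key, score)` tuple shape are hypothetical and not taken from the CEP or from Cassandra's code. It shows a single filter only; where the two filters sit is not visible in the excerpt.

```python
import heapq
from typing import Iterable, List, Tuple


def top_k_by_score(rows: Iterable[Tuple[int, float]], k: int) -> List[Tuple[int, float]]:
    """Keep the k highest-scoring rows from a stream ordered by primary key.

    rows yields hypothetical (primary_key, similarity_score) pairs. The input
    order is primary key order, not score order, so every row is inspected.
    """
    if k <= 0:
        return []
    heap: List[Tuple[float, int]] = []  # min-heap keyed on score
    for pk, score in rows:
        if len(heap) < k:
            heapq.heappush(heap, (score, pk))
        elif score > heap[0][0]:
            # New row beats the weakest retained row, so swap it in.
            heapq.heapreplace(heap, (score, pk))
    # Return best-first as (primary_key, score) pairs.
    return [(pk, score) for score, pk in sorted(heap, reverse=True)]


rows_in_pk_order = [(1, 0.2), (2, 0.9), (3, 0.5), (4, 0.7)]
best_two = top_k_by_score(rows_in_pk_order, 2)
```

The bounded heap keeps memory at O(k) regardless of how many rows the iterator produces, which is the usual reason to filter this way when the source cannot be read in score order.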

Re: CEP-30: Approximate Nearest Neighbor(ANN) Vector Search via Storage-Attached Indexes

2023-05-09 Thread Jeremy Hanna
Just wanted to add that I don't have any special knowledge of CEP-30 beyond what Jonathan posted and am just trying to help clarify and answer questions as I can with some knowledge and experience from DSE Search and SAI. Thanks to Caleb for helping validate some things as well. And to be clear

Re: CEP-30: Approximate Nearest Neighbor(ANN) Vector Search via Storage-Attached Indexes

2023-05-09 Thread Jeremy Hanna
I talked to David and some others in Slack to hopefully clarify: With SAI, can you have partial results? When you have a query that is non-key based, you need to have full token range coverage of the results. If that isn't possible, will Vector Search/SAI return partial results? Anything can

Re: CEP-30: Approximate Nearest Neighbor(ANN) Vector Search via Storage-Attached Indexes

2023-05-09 Thread Caleb Rackliffe
Anyone on this ML who still remembers DSE Search (or has experience w/ Elastic or SolrCloud) probably also knows that there are some significant pieces of an optimized scatter/gather apparatus for IR (even without sorting, which also doesn't exist yet) that do not exist in C* or its range query

Re: CEP-30: Approximate Nearest Neighbor(ANN) Vector Search via Storage-Attached Indexes

2023-05-09 Thread Benedict
HNSW can in principle be made into a distributed index. But that would be quite a different paradigm to SAI.
> On 9 May 2023, at 19:30, Patrick McFadin wrote:
> Under the goals section, there is this line:
> Scatter/gather across replicas, combining topK from each to get global topK.
> But what I'm hearing

Re: CEP-30: Approximate Nearest Neighbor(ANN) Vector Search via Storage-Attached Indexes

2023-05-09 Thread Patrick McFadin
Under the goals section, there is this line: 1. Scatter/gather across replicas, combining topK from each to get global topK. But what I'm hearing is, exactly how will that happen? Maybe this is an SAI question too. How is that verified in SAI? On Tue, May 9, 2023 at 11:07 AM David
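To make the quoted goal concrete, here is a rough sketch of combining per-replica topK lists into a global topK on the coordinator. It is an illustration under assumptions, not the CEP's design: the function name `merge_top_k`, the `(primary_key, score)` tuple shape, and the de-duplication by primary key (assuming more than one replica may return the same row) are all hypothetical.

```python
import heapq
from typing import Dict, Iterable, List, Tuple


def merge_top_k(per_replica: Iterable[List[Tuple[int, float]]], k: int) -> List[Tuple[int, float]]:
    """Combine each replica's local top-k into one global top-k.

    per_replica holds one list of hypothetical (primary_key, score) pairs per
    replica response. Rows seen more than once keep their highest score.
    """
    best: Dict[int, float] = {}
    for results in per_replica:
        for pk, score in results:
            if pk not in best or score > best[pk]:
                best[pk] = score
    # nlargest returns the k best (primary_key, score) pairs, best-first.
    return heapq.nlargest(k, best.items(), key=lambda item: item[1])


replica_a = [(1, 0.9), (2, 0.8)]
replica_b = [(2, 0.8), (3, 0.85)]
global_top = merge_top_k([replica_a, replica_b], 2)
```

The merge is only as good as its inputs: if a replica's local topK is approximate or the queried ranges do not cover the full token ring, the global result inherits that gap, which is the verification question raised in this message.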

Re: CEP-30: Approximate Nearest Neighbor(ANN) Vector Search via Storage-Attached Indexes

2023-05-09 Thread David Capwell
The Approach section doesn’t go over how this will handle cross-replica search; this would be good to flesh out… given results have a real ranking, the current 2i logic may yield incorrect results… so would think we need num_ranges / rf queries in the best case, with some new capability to sort the