Re: [POLL] Vector type for ML

2023-05-05 Thread Rahul Xavier Singh
Love it. Thank you folks for coming to a decision on this. This is very helpful to move forward on planning for the current Python frameworks: - Langchain.CassandraVectorStore - Langchain.CassandraVectorRetriever - Langchain.CassandraVectorStoreAgent -

Re: [POLL] Vector type for ML

2023-05-05 Thread David Capwell
https://issues.apache.org/jira/browse/CASSANDRA-18504 > On May 5, 2023, at 12:27 PM, David Capwell wrote: > > Yep, fair point…. SPARSE VECTOR better maps to NON NULL MAP > >> On May 5, 2023, at 11:58 AM, David Capwell wrote: >> >>> If we ever add sparse vectors, we can assume that DENSE is

Re: [POLL] Vector type for ML

2023-05-05 Thread David Capwell
Yep, fair point…. SPARSE VECTOR better maps to NON NULL MAP > On May 5, 2023, at 11:58 AM, David Capwell wrote: > >> If we ever add sparse vectors, we can assume that DENSE is the default and >> allow to use either DENSE, SPARSE or nothing. > > I have been feeling that sparse is just a fixed

Re: [POLL] Vector type for ML

2023-05-05 Thread Jonathan Ellis
Sparse vector in ML has the semantics that elements not explicitly set are zero. I believe most (all?) sparse vector implementations use a map under the hood; the point is to save a lot of space when you have 10K zeros and 100 that are nonzero. On Fri, May 5, 2023 at 2:00 PM David Capwell
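
To illustrate the sparse-vector semantics described in this message (unset elements are zero, with a map as the backing store), here is a minimal Python sketch. The function names `densify` and `sparsify` are hypothetical and are not part of Cassandra or of any proposal in this thread; the sketch only shows the idea of storing the nonzero entries and treating everything else as zero.

```python
from typing import Dict, List


def densify(sparse: Dict[int, float], dimension: int) -> List[float]:
    """Expand an index -> value map into a dense list of the given dimension.

    Positions that are not present in the map are treated as 0.0.
    """
    dense = [0.0] * dimension
    for index, value in sparse.items():
        if not 0 <= index < dimension:
            raise IndexError(f"index {index} is outside dimension {dimension}")
        dense[index] = value
    return dense


def sparsify(dense: List[float]) -> Dict[int, float]:
    """Keep only the nonzero entries of a dense list as an index -> value map."""
    return {index: value for index, value in enumerate(dense) if value != 0}
```

With 10K zeros and 100 nonzero values, the map holds 100 entries instead of 10,100, which is the space saving the message refers to.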

Re: [POLL] Vector type for ML

2023-05-05 Thread David Capwell
> If we ever add sparse vectors, we can assume that DENSE is the default and > allow to use either DENSE, SPARSE or nothing. I have been feeling that sparse is just a fixed size list with nulls… so array… if you insert {0: 42, 3: 17} then you get an array of [42, null, null, 17]? One negative

Re: [POLL] Vector type for ML

2023-05-05 Thread Andrés de la Peña
My vote is: 1. VECTOR 2. DENSE VECTOR 3. type[dimension] If we ever add sparse vectors, we can assume that DENSE is the default and allow to use either DENSE, SPARSE or nothing. Perhaps the dimension could be separated from the type, such as in VECTOR[dimension] or VECTOR(dimension). On Fri, 5

Re: [POLL] Vector type for ML

2023-05-05 Thread David Capwell
>> ...where, just to be clear, VECTOR means a frozen fixed >> size array w/ no null values? > Assuming this is the case The current agreed requirements are: 1) non-null elements 2) fixed length 3) frozen You pointed out 3 isn’t actually required, but that would be a different conversation to
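
The three agreed requirements listed in this message (non-null elements, fixed length, frozen) can be sketched as a small Python check. This is only an illustration under stated assumptions: `make_vector` is a hypothetical helper, not Cassandra code, and "frozen" is approximated here by returning an immutable tuple.

```python
from typing import Optional, Sequence, Tuple


def make_vector(values: Sequence[Optional[float]], dimension: int) -> Tuple[float, ...]:
    """Validate the values against the three requirements and return a tuple.

    1) non-null elements: no element may be None
    2) fixed length: len(values) must equal the declared dimension
    3) frozen: the result is an immutable tuple, replaced only as a whole
    """
    if len(values) != dimension:
        raise ValueError(f"expected {dimension} elements, got {len(values)}")
    if any(value is None for value in values):
        raise ValueError("vector elements must not be null")
    return tuple(values)
```

For example, `make_vector([1.0, 2.0, 3.0], 3)` succeeds, while a list of the wrong length or one containing `None` raises `ValueError`.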

Re: [POLL] Vector type for ML

2023-05-05 Thread Mike Adamson
> > ...where, just to be clear, VECTOR means a frozen fixed > size array w/ no null values? > Assuming this is the case, my vote is: 1. VECTOR 2. DENSE VECTOR I don't really have a 3rd vote because I think that *type[dimension]* is too ambiguous. On Fri, 5 May 2023 at 18:32, Derek Chen-Becker

Re: [POLL] Vector type for ML

2023-05-05 Thread Derek Chen-Becker
LOL, I'm holding you to that at the summit :) In all seriousness, I'm glad to see a robust debate around it. I guess for completeness, my order of preference is 1 - NONNULL FROZEN> 2 - NONNULL TYPE (which part of this implies frozen? The NONNULL or the cardinality?) 3 - DENSE_VECTOR I guess my

Re: [POLL] Vector type for ML

2023-05-05 Thread Patrick McFadin
Derek, despite your preference, I would hang out with you at a party. On Fri, May 5, 2023 at 9:44 AM Derek Chen-Becker wrote: > Speaking as someone who likes Erlang, maybe that's why I also like NONNULL > FROZEN>. It's unambiguous what Cassandra is going to do with that > type. DENSE VECTOR

Re: [POLL] Vector type for ML

2023-05-05 Thread Patrick McFadin
My vote is: 1. DENSE VECTOR 2. VECTOR 3. ARRAY On Fri, May 5, 2023 at 9:43 AM David Capwell wrote: > Went through and created a spreadsheet of current votes… For Patrick and > Mike, I don’t see a clear vote, so I put a ? where I “think” your > preference is… for Mick, I only put one vote as

Re: [POLL] Vector type for ML

2023-05-05 Thread David Capwell
Sorry, DENSE_VECTOR was pointing to the wrong row, updated score. Syntax: Score. VECTOR: 16; DENSE VECTOR: 11; type[dimension]: 9; NON NULL [dimension]: 6; VECTOR type[n]: 5; DENSE_VECTOR: 3; NON-NULL FROZEN: 3; ARRAY: 0 > On May 5, 2023, at 10:01 AM, David Capwell wrote: > > Updated > > Syntax > Jonathan

Re: [POLL] Vector type for ML

2023-05-05 Thread David Capwell
Updated Syntax Jonathan Ellis David Capwell Josh McKenzie Caleb Rackliffe Patrick McFadin Brandon Williams Mike Adamson Benedict Mick Semb Wever Derek Chen-Becker VECTOR 1 2 2 1 ? 3 2 DENSE VECTOR 2 1 ? ? type[dimension] 3 3 3 1 3 2 DENSE_VECTOR 1 NON NULL [dimension] 1

Re: [POLL] Vector type for ML

2023-05-05 Thread Mick Semb Wever
On Fri, 5 May 2023 at 18:43, David Capwell wrote: > Went through and created a spreadsheet of current votes… For Patrick and > Mike, I don’t see a clear vote, so I put a ? where I “think” your > preference is… for Mick, I only put one vote as the list looked like a > summary, but you mentioned

Re: [POLL] Vector type for ML

2023-05-05 Thread Derek Chen-Becker
Speaking as someone who likes Erlang, maybe that's why I also like NONNULL FROZEN>. It's unambiguous what Cassandra is going to do with that type. DENSE VECTOR means I need to go read docs (and then probably double-check in the source to be sure) to be sure what exactly is going on. Cheers,

Re: [POLL] Vector type for ML

2023-05-05 Thread David Capwell
Went through and created a spreadsheet of current votes… For Patrick and Mike, I don’t see a clear vote, so I put a ? where I “think” your preference is… for Mick, I only put one vote as the list looked like a summary, but you mentioned the first was your preference Syntax Jonathan Ellis David

Re: [POLL] Vector type for ML

2023-05-05 Thread Caleb Rackliffe
...where, just to be clear, VECTOR means a frozen fixed size array w/ no null values? On Fri, May 5, 2023 at 11:23 AM Jonathan Ellis wrote: > +10 for not inflicting unwieldy keywords on ML users. > > Re Josh's summary, mostly agreed, my only objection to adding the DENSE > keyword is that I

Re: [POLL] Vector type for ML

2023-05-05 Thread Jonathan Ellis
+10 for not inflicting unwieldy keywords on ML users. Re Josh's summary, mostly agreed, my only objection to adding the DENSE keyword is that I don't see a foreseeable future where we also support sparse vectors, so it would end up being unnecessary extra verbosity. So my preference would be 1.

Re: [POLL] Vector type for ML

2023-05-05 Thread David Capwell
> The hnsw index can be built just as easily from a non-frozen array. I have 0 issues removing that limitation =) > I am in favour of enforcing non-null on the elements of an array by default. This is why I feel DENSE or NON NULL are the best prefix, as those both imply elements may not be

Re: [POLL] Vector type for ML

2023-05-05 Thread Patrick McFadin
I hope we are willing to consider developers that use our system because if I had to teach people to use "NON-NULL FROZEN" I'm pretty sure the response would be: Did you tell me to go write a distributed map-reduce job in Erlang? I believe I did, Bob. On Fri, May 5, 2023 at 8:05 AM Josh McKenzie

Re: [POLL] Vector type for ML

2023-05-05 Thread Josh McKenzie
Idiomatically, to my mind, there's a question of "what space are we thinking about this datatype in"? - In the context of mathematics, nullability in a vector would be 0 - In the context of Cassandra, nullability tends to mean a tombstone (or nothing) - In the context of programming languages,

Re: [POLL] Vector type for ML

2023-05-05 Thread Patrick McFadin
I think we are still discussing implementation here when I'm talking about developer experience. I want developers to adopt this quickly, easily and be successful. Vector search is already a thing. People use it every day. A successful outcome, in my view, is developers picking up this feature

Re: [POLL] Vector type for ML

2023-05-05 Thread Mike Adamson
> > Then we can have the indexing apparatus only accept *frozen* for > the HNSW case. > I'm inclined to agree with Benedict that the index will need to be specifically selected by option rather than inferred based on type. As such there is no real reason for the *frozen* requirement on the type. The

Re: [POLL] Vector type for ML

2023-05-04 Thread Caleb Rackliffe
Even in the ML case, sparse can just mean zeros rather than nulls, and they should compress similarly anyway. If we really want null values, I'd rather leave that in collections space. On Thu, May 4, 2023 at 8:59 PM Caleb Rackliffe wrote: > I actually still prefer *type[dimension]*, because I

Re: [POLL] Vector type for ML

2023-05-04 Thread Caleb Rackliffe
I actually still prefer *type[dimension]*, because I think I intuitively read this as a primitive (meaning no null elements) array. Then we can have the indexing apparatus only accept *frozen* for the HNSW case. If that isn't intuitive to anyone else, I don't really have a strong

Re: [POLL] Vector type for ML

2023-05-04 Thread Patrick McFadin
I agree with David's reasoning and the use of DENSE (and maybe eventually SPARSE). This is terminology well established in the data world, and it would lead to much easier adoption from users. VECTOR is close, but I can see having to create a lot of content around "How to use it and not get in

Re: [POLL] Vector type for ML

2023-05-04 Thread David Capwell
My views have changed over time on syntax and I feel type[dimension] may not be the best, so it has gone lower in my own personal ranking… this is my current preference 1) DENSE [dimension] | NON NULL [dimension] 2) VECTOR 3) type[dimension] My reasoning for this order * type[dimension] looks

Re: [POLL] Vector type for ML

2023-05-04 Thread Brandon Williams
1. VECTOR 2. VECTOR FLOAT[n] 3. FLOAT[N] (Non null by default) Redundant or not, I think having the VECTOR keyword helps signify what the app is generally about and helps get buy-in from ML stakeholders. On Thu, May 4, 2023 at 3:45 AM Benedict wrote: > > Hurrah for initial agreement. > > For

Re: [POLL] Vector type for ML

2023-05-04 Thread Mike Adamson
That's fair comment. In this case I would be happy with any of your suggestions although I would prefer that the datatype did not support nulls. On Thu, 4 May 2023 at 11:55, Benedict wrote: > I would expect that the type of index would be specified anyway? > > I don’t think it’s good API design

Re: [POLL] Vector type for ML

2023-05-04 Thread Benedict
I would expect that the type of index would be specified anyway? I don’t think it’s good API design to have the field define the index you create - only to shape what is permitted. A HNSW index is very specific and should be asked for specifically, not implicitly, IMO. On 4 May 2023, at 11:47, Mike

Re: [POLL] Vector type for ML

2023-05-04 Thread Mike Adamson
> > For syntax, I think one option was just FLOAT[N]. In VECTOR FLOAT[N], > VECTOR is redundant - FLOAT[N] is fully descriptive by itself. I don’t > think VECTOR should be used to simply imply non-null, as this would be very > unintuitive. More logical would be NONNULL, if this is the only

Re: [POLL] Vector type for ML

2023-05-04 Thread Benedict
Hurrah for initial agreement. For syntax, I think one option was just FLOAT[N]. In VECTOR FLOAT[N], VECTOR is redundant - FLOAT[N] is fully descriptive by itself. I don’t think VECTOR should be used to simply imply non-null, as this would be very unintuitive. More logical would be NONNULL, if

Re: [POLL] Vector type for ML

2023-05-04 Thread Mick Semb Wever
> > Did we agree on a CQL syntax? > > I don’t believe there has been a poll on CQL syntax… my understanding > reading all the threads is that there are ~4-5 options and none are -1ed, so > believe we are waiting for majority rule on this? > Re-reading that thread, IIUC the valid choices remaining

Re: [POLL] Vector type for ML

2023-05-03 Thread David Capwell
> Did we agree on a CQL syntax? I don’t believe there has been a poll on CQL syntax… my understanding reading all the threads is that there are ~4-5 options and none are -1ed, so believe we are waiting for majority rule on this? > On May 3, 2023, at 1:23 PM, Jeremiah D Jordan wrote: > >> To

Re: [POLL] Vector type for ML

2023-05-03 Thread Jeremiah D Jordan
> To be clear, I support the general agreement David and Jonathan seem to have > reached. +1 as well. > On May 3, 2023, at 3:07 PM, Caleb Rackliffe wrote: > > To be clear, I support the general agreement David and Jonathan seem to have > reached. > > On Wed, May 3, 2023 at 3:05 PM Caleb

Re: [POLL] Vector type for ML

2023-05-03 Thread Caleb Rackliffe
To be clear, I support the general agreement David and Jonathan seem to have reached. On Wed, May 3, 2023 at 3:05 PM Caleb Rackliffe wrote: > Did we agree on a CQL syntax? > > On Wed, May 3, 2023 at 2:06 PM Rahul Xavier Singh < > rahul.xavier.si...@gmail.com> wrote: > >> I like this approach.

Re: [POLL] Vector type for ML

2023-05-03 Thread Caleb Rackliffe
Did we agree on a CQL syntax? On Wed, May 3, 2023 at 2:06 PM Rahul Xavier Singh < rahul.xavier.si...@gmail.com> wrote: > I like this approach. Thank you for those working on this vector search > initiative. > > Here's the feedback from my "user" hat for someone who is looking at > databases /

Re: [POLL] Vector type for ML

2023-05-03 Thread Rahul Xavier Singh
I like this approach. Thank you for those working on this vector search initiative. Here's the feedback from my "user" hat for someone who is looking at databases / indexes for my next LLM app. Can I take some python code and go from using an in memory vector store like numpy or FAISS to

Re: [POLL] Vector type for ML

2023-05-02 Thread Patrick McFadin
\o/ Bring it in team. Group hug. Now if you'll excuse me, I'm going to go build my preso on how Cassandra is the only distributed database you can do vector search in an ACID transaction. Patrick On Tue, May 2, 2023 at 3:27 PM Jonathan Ellis wrote: > I had a call with David. We agreed that

Re: [POLL] Vector type for ML

2023-05-02 Thread Dinesh Joshi
I'm also in favor of having a general data type that is not tied to numeric data types alone. On 2023/05/02 22:27:24 Jonathan Ellis wrote: > I had a call with David. We agreed that we want a "vector" data type with > these properties > > - Fixed length > - No nulls > - Random access not

Re: [POLL] Vector type for ML

2023-05-02 Thread Jonathan Ellis
I had a call with David. We agreed that we want a "vector" data type with these properties - Fixed length - No nulls - Random access not supported Where we disagreed was on my proposal to restrict vectors to only numeric data. David's points were that (1) He has a use case today for a data

Re: [POLL] Vector type for ML

2023-05-02 Thread David Capwell
> How about it, David? Did you already make this? I checked out the patch, fixed serialize/deserialize, added the constraints, then added a composeForFloat(ByteBuffer), with this the impact to the POC patch was the following 1) move away from VectorType.instance.serializer().deserialize(bb)

Re: [POLL] Vector type for ML

2023-05-02 Thread Jeremy Hanna
I'm all for bringing more functionality to the masses sooner, but the original idea has a very very specific use case. Do we have use cases for a general purpose Vector/Array data structure? If so, awesome. I just wondered if generalizing provides value, beyond being straightforward to

Re: [POLL] Vector type for ML

2023-05-02 Thread Patrick McFadin
Yeah, it's a bit of a mess but mailing list yo. People reading this would have no idea we are friends. ;) (Which we are, for anyone reading this later!) I must have missed the point of this already being done. How about it, David? Did you already make this? "FWIW, my interpretation of the votes

Re: [POLL] Vector type for ML

2023-05-02 Thread Benedict
But it’s so trivial it was already implemented by David in the span of ten minutes? If anything, we’re slowing progress down by refusing to do the extra types, as we’re busy arguing about it rather than delivering a feature? FWIW, my interpretation of the votes today is that we SHOULD NOT (ever)

Re: [POLL] Vector type for ML

2023-05-02 Thread Patrick McFadin
I'll speak up on that one. If you look at my ranked voting, that is where my head is. I get accused of scope creep (a lot) and looking at the initial proposal Jonathan put on the ML it was mostly "Developers are adopting vector search at a furious pace and I think I have a simple way of adding

Re: [POLL] Vector type for ML

2023-05-02 Thread Benedict
Could folk voting against a general purpose type (that could well be called a vector) briefly explain their reasoning? We established in the other thread that it’s technically trivial, meaning folk must think it is strictly superior to only support float rather than eg all numeric types (note: for

Re: [POLL] Vector type for ML

2023-05-02 Thread Patrick McFadin
A > B > C on both polls. Having talked to several users in the community that are highly excited about this change, this gets to what developers want to do at Cassandra scale: store embeddings and retrieve them. On Tue, May 2, 2023 at 11:47 AM Andrés de la Peña wrote: > A > B > C > > I don't

Re: [POLL] Vector type for ML

2023-05-02 Thread Andrés de la Peña
A > B > C I don't think that ML is such a niche application that it can't have its own CQL data type. Also, vectors are mathematical elements that have more applications than ML. On Tue, 2 May 2023 at 19:15, Mick Semb Wever wrote: > > > On Tue, 2 May 2023 at 17:14, Jonathan Ellis wrote: > >>

Re: [POLL] Vector type for ML

2023-05-02 Thread Mick Semb Wever
On Tue, 2 May 2023 at 17:14, Jonathan Ellis wrote: > Should we add a vector type to Cassandra designed to meet the needs of > machine learning use cases, specifically feature and embedding vectors for > training, inference, and vector search? > > ML vectors are fixed-dimension (fixed-length)

Re: [POLL] Vector type for ML

2023-05-02 Thread David Capwell
> B) Should we introduce a type that is general purpose, and supports all > Cassandra types, so that this may be used to support ML (and perhaps other) > workloads I vote B only as well... > On May 2, 2023, at 9:02 AM, Benedict wrote: > > This is not the poll I thought we would be

Re: [POLL] Vector type for ML

2023-05-02 Thread Benedict
This is not the poll I thought we would be conducting, and I don’t really support its framing. There are two parallel questions: what the functionality should be and how they should be exposed. This poll compresses the optionality poorly. Whether or not we support a “vector” concept (or something

Re: [POLL] Vector type for ML

2023-05-02 Thread Jonathan Ellis
My preference: A > B > C. Vectors are distinct enough from arrays that we should not make adding the latter a prerequisite for adding the former. On Tue, May 2, 2023 at 10:13 AM Jonathan Ellis wrote: > Should we add a vector type to Cassandra designed to meet the needs of > machine learning