Re: [DISCUSS] New data type for vector search

2023-05-02 Thread Benedict
If we agree we’re delivering some general-purpose array type that supports all types as elements (i.e., is logically equivalent to a frozen list of fixed length, however it is actually implemented), I think we are in technical agreement and it’s just a matter of presentation. At which point I think

Re: [DISCUSS] New data type for vector search

2023-05-02 Thread Jonathan Ellis
To make sure I understand correctly -- are you saying that you're fine with a vector type, but you want to see it implemented as a special case of arrays, or that you are not fine with a vector type because you would prefer to only add arrays and that should be "good enough" for ML? On Mon, May

Re: [DISCUSS] New data type for vector search

2023-05-02 Thread Mick Semb Wever
I have no problem with `VECTOR` hanging around forever as an alias for `NON-NULL FROZEN`. Even without ANN, it makes sense and will stick with new C* users. A plug-in system would be great, but it shouldn't hold back this work imho. On Mon, 1 May 2023 at 22:17, Benedict wrote: > I have

Re: [DISCUSS] New data type for vector search

2023-05-01 Thread J. D. Jordan
Yes. Plugging in a new type server side is very easy. Adding that type to every client is not. Cassandra already supports plugging in custom types through a jar. What a given client does when encountering a custom type it doesn’t know about depends on the client. I was recently looking at this for

Re: [DISCUSS] New data type for vector search

2023-05-01 Thread David Capwell
> A data type plug-in is actually really easy today, I think? Sadly not, the client reads the class from our schema tables and has to have duplicate logic to serialize/deserialize results… types are easy to add if you are ok with client not understanding them (and will some clients fail due to

Re: [DISCUSS] New data type for vector search

2023-05-01 Thread Benedict
A data type plug-in is actually really easy today, I think? But, developing further hooks should probably be thought through as they’re necessary. I think in this case it would be simpler to deliver a general purpose type, which is why I’m trying to propose types that would be acceptable. I also

Re: [DISCUSS] New data type for vector search

2023-05-01 Thread Josh McKenzie
> If we want to make an ML-specific data type, it should be in an ML plug-in. How can we encourage a healthier plug-in ecosystem? As far as I know it's been pretty anemic historically: cassandra: https://cassandra.apache.org/doc/latest/cassandra/plugins/index.html postgres:

Re: [DISCUSS] New data type for vector search

2023-05-01 Thread David Capwell
> I think a simple and easy case can be made for fixed length array types that > do not seem to create random bits of cruft in the language that dangle by > themselves should this play not pan out. If I am understanding you correctly, then a "VECTOR FLOAT[n]" is fine as it's an array type but

Re: [DISCUSS] New data type for vector search

2023-05-01 Thread Benedict
I have explained repeatedly why I am opposed to ML-specific data types. If we want to make an ML-specific data type, it should be in an ML plug-in. We should not pollute the general purpose language with hastily-considered features that target specific bandwagons - at best partially - no matter

Re: [DISCUSS] New data type for vector search

2023-05-01 Thread Mick Semb Wever
Yes! What you (David) and Benedict write beautifully supports `VECTOR FLOAT[n]` imho. You are definitely bringing up valid implementation details, and that can be dealt with during patch review. This thread is about the CQL API addition. No matter which way the technical review goes with the

Re: [DISCUSS] New data type for vector search

2023-05-01 Thread David Capwell
> I think it is totally reasonable that the ANN patch (and Jonathan) is not > asked to implement on top of, or towards, other array (or other) new data > types. This impacts serialization; if you do not think about this on day 1, you can’t add it later on without having to worry about migration
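
An illustrative sketch of why the serialization choice is hard to revisit later. It contrasts two hypothetical encodings of the same float elements: a fixed-width layout with no per-element header (which cannot represent a null element) and a length-prefixed layout in which a negative length marks a null. The function names and both byte layouts are assumptions made for illustration, not the format used by the patch under discussion.

```python
import struct
from typing import Optional, Sequence


def encode_fixed(values: Sequence[float]) -> bytes:
    """Hypothetical fixed-width layout: big-endian float32 values, no headers."""
    return struct.pack(f">{len(values)}f", *values)


def encode_length_prefixed(values: Sequence[Optional[float]]) -> bytes:
    """Hypothetical layout: element count, then a 4-byte length per element.

    A length of -1 marks a null element and carries no payload.
    """
    parts = [struct.pack(">i", len(values))]
    for value in values:
        if value is None:
            parts.append(struct.pack(">i", -1))
        else:
            parts.append(struct.pack(">i", 4) + struct.pack(">f", value))
    return b"".join(parts)
```

Under these assumptions the two layouts are not byte-compatible, so data written in one cannot be read as the other without a migration step.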

Re: [DISCUSS] New data type for vector search

2023-05-01 Thread Benedict
Has anybody yet claimed it would be hard? Several folk seem ready to jump to the conclusion that this would be onerous, but as somebody with a good understanding of the storage layer I can assert with reasonable confidence that it would not be. As previously stated, the implementation largely

Re: [DISCUSS] New data type for vector search

2023-05-01 Thread Mick Semb Wever
> > > > But suggesting that Jonathan should work on implementing general purpose > arrays seems to fall outside the scope of this discussion, since the result > of such work wouldn't even fill the need Jonathan is targeting for here. > > Every comment I have made so far I have argued that the v1

Re: [DISCUSS] New data type for vector search

2023-05-01 Thread David Capwell
> In particular it makes no sense at all from an ML perspective to have vector > types of anything other than numerics Back to what Benedict was saying, if the proposal was an ML plug-in, then this limitation makes sense, but that is not the proposal at hand. If you wish to change the scope to

Re: [DISCUSS] New data type for vector search

2023-04-28 Thread Henrik Ingo
By my superficial reading I get the impression that the main distinction is that vectors don't need to support random access into a single element/float. I haven't looked at what Jonathan is doing, but I assume, and it seems Jonathan assumes or knows that this makes implementation both easier and

Re: [DISCUSS] New data type for vector search

2023-04-28 Thread Patrick McFadin
> > So is the goal here to provide something specific and idiomatic for the ML > community or is the goal to make a primitive that's C*-centric that then > another layer can write to? I personally argue for the former; I don't see > this specific data type going away any time soon. +1 on this

Re: [DISCUSS] New data type for vector search

2023-04-28 Thread Benedict
I and others have claimed that an array concept will work, since it is isomorphic with a vector. I have seen the following counterclaims:
1. Vectors don’t need to support index lookups
2. Vectors don’t need to support ordered indexes
3. Vectors don’t need to support other types besides float
None of

Re: [DISCUSS] New data type for vector search

2023-04-28 Thread Henrik Ingo
Benedict, I don't quite see why that matters? The argument is merely that this kind of vector, for this use case, a) is different from arrays, and b) arrays apparently don't serve the use case well enough (or at all). Now, if from the above it follows a discussion that a vector type cannot be a

Re: [DISCUSS] New data type for vector search

2023-04-28 Thread Benedict
pgvector is a plug-in. If you were proposing a plug-in you could ignore these considerations. On 28 Apr 2023, at 16:58, Jonathan Ellis wrote: I'm proposing a vector data type for ML use cases. It's not the same thing as an array or a list and it's not supposed to be. While it's true that it would

Re: [DISCUSS] New data type for vector search

2023-04-28 Thread Jonathan Ellis
I'm proposing a vector data type for ML use cases. It's not the same thing as an array or a list and it's not supposed to be. While it's true that it would be possible to build a vector type on top of an array type, it's not necessary to do it that way, and given the lack of interest in an array

Re: [DISCUSS] New data type for vector search

2023-04-28 Thread Benedict
But you’re proposing introducing a general purpose type - this isn’t an ML plug-in, it’s modifying the core language in a manner that makes targeting your workload easier. Which is fine, but that means you have to consider its impact on the general language, not just your target use case. On 28 Apr

Re: [DISCUSS] New data type for vector search

2023-04-28 Thread Jonathan Ellis
That's exactly right. In particular it makes no sense at all from an ML perspective to have vector types of anything other than numerics. And as I mentioned in the POC thread (but I did not mention here), float is overwhelmingly the most frequently used vector type, to the point that Pinecone

Re: [DISCUSS] New data type for vector search

2023-04-28 Thread Benedict
This feature may be targeting ML users but it isn’t part of some “ML plug-in”; it’s a general purpose type available to all users that happens to permit the use of ANN. So it needs to make sense in a general context, not just to ML users. I also doubt users will struggle with understanding an array

Re: [DISCUSS] New data type for vector search

2023-04-27 Thread steve landiss via dev
+1 On Thursday, April 27, 2023 at 07:36:19 PM PDT, Caleb Rackliffe wrote: I don’t have a lot to add here, other than to say I’m broadly in agreement w/ David on syntax preference, element selectability, and making this a new type that roughly corresponds to a primitive

Re: [DISCUSS] New data type for vector search

2023-04-27 Thread Caleb Rackliffe
I don’t have a lot to add here, other than to say I’m broadly in agreement w/ David on syntax preference, element selectability, and making this a new type that roughly corresponds to a primitive (non-null-allowing) array. On Apr 27, 2023, at 9:18 PM, Anthony Grasso wrote: It would be strange for

Re: [DISCUSS] New data type for vector search

2023-04-27 Thread Anthony Grasso
It would be strange for this declaration to look different from other collection types. We may want to reconsider using the collection syntax. I also like the idea of the vector dimensions being declared with the VECTOR keyword. An alternative syntax option to explore is: VECTOR[size] On Fri, 28

Re: [DISCUSS] New data type for vector search

2023-04-27 Thread Josh McKenzie
> From a machine learning perspective, vectors are a well-known concept that are > effectively immutable fixed-length n-dimensional values that are then later > used either as part of a model or in conjunction with a model after the fact. While we could have this be non-frozen and not call it a

Re: [DISCUSS] New data type for vector search

2023-04-27 Thread David Capwell
> but as you point out it has the problem of allowing nulls. If nulls are not allowed for the elements, then either we need a) a new type, or b) add some way to say elements may not be null…. As much as I do like b, I am leaning towards new type for this use case. So, to flesh out the type
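
A minimal sketch of the two constraints discussed here: a declared, fixed number of elements, and no null elements. The function name `make_vector` and its behaviour are hypothetical and only illustrate the idea; they are not taken from the patch or from any Cassandra API.

```python
from typing import Optional, Sequence, Tuple


def make_vector(values: Sequence[Optional[float]], dimension: int) -> Tuple[float, ...]:
    """Validate a hypothetical fixed-length, non-null float vector."""
    # The length must match the declared dimension exactly.
    if len(values) != dimension:
        raise ValueError(f"expected {dimension} elements, got {len(values)}")
    # Unlike a list or tuple column, no element may be null.
    for index, value in enumerate(values):
        if value is None:
            raise ValueError(f"element {index} is null")
    # Return an immutable value, mirroring the frozen semantics discussed.
    return tuple(float(value) for value in values)
```

For example, `make_vector([1, 2, 3], 3)` is accepted, while a value with a missing element or the wrong length is rejected before it is ever stored.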

Re: [DISCUSS] New data type for vector search

2023-04-27 Thread Benedict
That’s a bounded ring buffer, not a fixed length array. This definitely isn’t a tuple because the types are all the same, which is pretty crucial for matrix operations. Matrix libraries generally work on arrays of known dimensionality, or sparse representations. Whether we draw any semantic link

Re: [DISCUSS] New data type for vector search

2023-04-27 Thread Jeff Jirsa
On Thu, Apr 27, 2023 at 7:39 AM Jonathan Ellis wrote: > It's been a while, so I may be missing something, but do we already have > fixed-size lists? If not, I don't see why we'd try to make this fit into a > List-shaped problem. > We do not. The proposal got closed as wont-fix

Re: [DISCUSS] New data type for vector search

2023-04-27 Thread Jonathan Ellis
It's been a while, so I may be missing something, but do we already have fixed-size lists? If not, I don't see why we'd try to make this fit into a List-shaped problem. A tuple would be a better fit from that perspective, but as you point out it has the problem of allowing nulls. The key thing

Re: [DISCUSS] New data type for vector search

2023-04-26 Thread Andrés de la Peña
If we are going to use FLOAT[N] as sugar for another CQL data type, maybe tuples are more convenient than lists. So FLOAT[N] could be equivalent to TUPLE. Unlike collections, tuples have a fixed size, they are always frozen and I think they don't support random access. These properties

Re: [DISCUSS] New data type for vector search

2023-04-26 Thread Mick Semb Wever
> > My inclination then would be to say you declare an ARRAY (which > is semantic sugar for FROZEN>). This is very consistent with > our existing style. We then simply permit such columns to define ANN > indexes. > So long as nulls aren't a problem as David questions, an alternative is:

Re: [DISCUSS] New data type for vector search

2023-04-26 Thread David Capwell
Benedict's comments also make me wonder: can any of the values in the vector be null? The patch sent works with float arrays, so null isn’t possible… is null not valid for a vector type? If so, this would help justify why a vector is not an array or a list (both allow null) > On Apr 26,

Re: [DISCUSS] New data type for vector search

2023-04-26 Thread David Capwell
Thanks for starting this thread! > In the initial commits and thread, this was DENSE FLOAT32. Nobody really > loved that, so we considered a bunch of alternatives, including > > - `FLOAT[N]`: This minimal option resembles C and Java array syntax, which > would make it familiar for many users.

Re: [DISCUSS] New data type for vector search

2023-04-26 Thread Benedict Elliott Smith
I think we need to briefly step back and think about what the syntax means and how it fits into existing syntax. It seems that the dimensionality verbiage assumes we’re logically introducing N vector fields, so that each row adopts a value for all of the vector fields or none. But in practice we