Re: Cassandra 5.0 Beta1 - vector searching results

2024-03-25 Thread Brebner, Paul via user
Hi all, I'm curious whether any open-source Kafka Connect Cassandra Sink 
connectors support the new Cassandra vector data type, i.e. for writing 
vector data to Cassandra from Kafka. Regards, Paul

From: Caleb Rackliffe 
Date: Friday, 22 March 2024 at 1:52 pm
To: user@cassandra.apache.org 
Subject: Re: Cassandra 5.0 Beta1 - vector searching results


To expand on Jonathan’s response, the best way to get SAI to perform on the 
read side is to use it as a tool for large-partition search. In other words, if 
you can model your data such that your queries will be restricted to a single 
partition, two things will happen…

1.) With all queries (not just ANN queries), you will only hit as many nodes as 
your read consistency level and replication factor require. For vector 
searches, that means you should only hit one node, and it should be the 
coordinating node w/ a properly configured, token-aware client.

2.) You can use LCS (or UCS configured to mimic LCS) instead of STCS as your 
table compaction strategy. This will essentially guarantee your 
(partition-restricted) SAI query hits a small number of SSTable-attached 
indexes. (It’ll hit Memtable-attached indexes as well for any recently added 
data, so if you’re seeing latencies shoot up, it’s possible there could be 
contention on the Memtable-attached index that supports ANN queries. I haven’t 
done a deep dive on it. You can always flush Memtables directly before queries 
to factor that out.)
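
As a rough illustration of the two points above, here is a minimal CQL sketch against the table from the original post. It assumes the query is run as a prepared statement: the bind markers, the selected columns and the LIMIT of 10 are placeholders, not values from this thread.

```cql
-- Partition-restricted ANN query (sketch). Both partition key columns
-- (uuid, type) are bound, so the search stays inside a single partition.
-- The bound query vector must have the same dimension as the
-- embeddings column.
SELECT fieldname, offset, sourceurl
FROM doc.embeddings_googleflant5large
WHERE uuid = ? AND type = ?
ORDER BY embeddings ANN OF ?
LIMIT 10;

-- Switch the table from STCS to LCS so that a partition-restricted
-- query touches fewer SSTable-attached indexes.
ALTER TABLE doc.embeddings_googleflant5large
WITH compaction = {'class': 'LeveledCompactionStrategy'};
```

To take Memtable-attached indexes out of the picture before measuring, running nodetool flush doc embeddings_googleflant5large on each node ahead of the queries should do it.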

If you can do all of the above, the simple performance of the local index query 
and its post-filtering reads is probably the place to explore further. If you 
manage to collect any profiling data (JFR, flamegraphs via async-profiler, etc) 
I’d be happy to dig into it with you.

Thanks for kicking the tires!


On Mar 21, 2024, at 8:20 PM, Brebner, Paul via user wrote:

Hi Joe,

Have you considered submitting something for Community Over Code NA 2024? The 
CFP is still open for a few more weeks. Options include my Performance 
Engineering track, the Cassandra track, or both:

https://www.linkedin.com/pulse/cfp-community-over-code-na-denver-2024-performance-track-paul-brebner-nagmc/?trackingId=PlmmMjMeQby0Mozq8cnIpA%3D%3D

Regards, Paul Brebner



From: Joe Obernberger 
Date: Friday, 22 March 2024 at 3:19 am
To: user@cassandra.apache.org 
Subject: Cassandra 5.0 Beta1 - vector searching results




Hi All - I'd like to share some initial results for vector search on
Cassandra 5.0 beta1.  The setup is a 3-node cluster running in Kubernetes
with fast NetApp storage.

Have a table (doc.embeddings_googleflant5large) with definition:

CREATE TABLE doc.embeddings_googleflant5large (
 uuid text,
 type text,
 fieldname text,
 offset int,
 sourceurl text,
 textdata text,
 creationdate timestamp,
 embeddings vector,
 metadata boolean,
 source text,
 PRIMARY KEY ((uuid, type), fieldname, offset, sourceurl, textdata)
) WITH CLUSTERING ORDER BY (fieldname ASC, offset ASC, sourceurl ASC,
textdata ASC)
 AND additional_write_policy = '99p'
 AND allow_auto_snapshot = true
 AND bloom_filter_fp_chance = 0.01
 AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
 AND cdc = false
 AND comment = ''
 AND compaction = {'class':
'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
'max_threshold': '32', 'min_threshold': '4'}
 AND compression = {'chunk_length_in_kb': '16', 'class':
'org.apache.cassandra.io.compress.LZ4Compressor'}
 AND memtable = 'default'
 AND crc_check_chance = 1.0
 AND default_time_to_live = 0
 AND extensions = {}
 AND gc_grace_seconds = 864000
 AND incremental_backups = true
 AND max_index_interval = 2048
 AND memtable_flush_period_in_ms = 0
 AND min_index_interval = 128
 AND read_repair = 'BLOCKING'
 AND speculative_retry = '99p';

CREATE CUSTOM INDEX ann_index_googleflant5large ON
doc.embeddings_googleflant5large (embeddings) USING 'sai';
CREATE CUSTOM INDEX offset_index_googleflant5large ON
doc.embeddings_googleflant5large (offset) USING 'sai';

nodetool status -r

UN  cassandra-1.cassandra5.cassandra5-jos.svc.cluster.local  18.02 GiB  128  100.0%  f2989dea-908b-4c06-9caa-4aacad8ba0e8  rack1
UN  cassandra-2.cassandra5.cassandra5-jos.svc.cluster.local  17.98 GiB  128  100.0%  ec4e506d-5f0d-475a-a3c1-aafe58399412  rack1
UN  cassandra-0.cassandra5.cassandra5-jos.svc.cluster.local  18.16 GiB  128  100.0%  92c6d909-ee01-4124-ae03-3b9e2d5e74c0  rack1

nodetool tablestats doc.embeddings_googleflant5large

Total number of tables: 1

Keyspace: doc
 Read Count: 0
 Read Latency: NaN ms
 Write Count: 2893108
 Write Latency: 

Apache Cassandra Virtual Meetups this week

2024-03-25 Thread Paul Au
Hello Cassandra community!

There are two virtual events happening this week. Hope to see you all
there.

*Cassandra Contributor Call*
*CEP-34: mTLS Based Client and Internode Authenticators*
Presented by Jyothsna Konica & Dinesh Josh
Tuesday, March 26 at 10:00AM PDT
https://www.meetup.com/cassandra-global/events/299617622/

*Cassandra Town Hall*
*Scalable Objects Persistence V2* | Gerardo Recinto
*Cassandra Corner: Behind the Scenes* | Aaron Ploetz
*State of Cassandra Quarterly Update* | Josh McKenzie
Thursday, March 28th at 8:00AM PDT
https://www.meetup.com/cassandra-global/events/299617844/


*Paul Au*
Community Manager
Constantia / DoK Community / Data Mesh Learning / Apache Cassandra
Contributor