Re: Cassandra 5.0 Beta1 - vector searching results

Brebner, Paul via user Mon, 25 Mar 2024 16:26:32 -0700

Hi all, is there support for the new Cassandra vector data type in any 
open-source Kafka Connect Cassandra Sink connectors, i.e. for writing vector 
data to Cassandra from Kafka? Regards, Paul

From: Caleb Rackliffe <calebrackli...@gmail.com>
Date: Friday, 22 March 2024 at 1:52 pm
To: user@cassandra.apache.org <user@cassandra.apache.org>
Subject: Re: Cassandra 5.0 Beta1 - vector searching results

To expand on Jonathan’s response, the best way to get SAI to perform on the 
read side is to use it as a tool for large-partition search. In other words, if 
you can model your data such that your queries will be restricted to a single 
partition, two things will happen…

1.) With all queries (not just ANN queries), you will only hit as many nodes as 
your read consistency level and replication factor require. For vector 
searches, that means you should only hit one node, and it should be the 
coordinating node w/ a properly configured, token-aware client.

2.) You can use LCS (or UCS configured to mimic LCS) instead of STCS as your 
table compaction strategy. This will essentially guarantee your 
(partition-restricted) SAI query hits a small number of SSTable-attached 
indexes. (It’ll hit Memtable-attached indexes as well for any recently added 
data, so if you’re seeing latencies shoot up, it’s possible there could be 
contention on the Memtable-attached index that supports ANN queries. I haven’t 
done a deep dive on it. You can always flush Memtables directly before queries 
to factor that out.)
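
As a rough sketch of points 1 and 2 against the doc.embeddings_googleflant5large table quoted further down this thread: the first statement restricts the ANN query to a single partition by supplying both partition key columns (uuid, type), and the second switches the table from STCS to LCS. The bind markers stand in for values supplied through a prepared statement. This assumes the ANN vector can be bound that way and that the default LCS options are acceptable, so treat it as a starting point rather than a tested recipe.

```cql
-- Partition-restricted ANN query: uuid and type together form the partition key,
-- so the search stays within one partition.
SELECT uuid, offset, type, textdata
FROM doc.embeddings_googleflant5large
WHERE uuid = ? AND type = ?
ORDER BY embeddings ANN OF ?
LIMIT 20;

-- Switch compaction from STCS to LCS (default LCS options assumed).
ALTER TABLE doc.embeddings_googleflant5large
WITH compaction = {'class': 'LeveledCompactionStrategy'};
```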

If you can do all of the above, the simple performance of the local index query 
and its post-filtering reads is probably the place to explore further. If you 
manage to collect any profiling data (JFR, flamegraphs via async-profiler, etc) 
I’d be happy to dig into it with you.

Thanks for kicking the tires!

On Mar 21, 2024, at 8:20 PM, Brebner, Paul via user <user@cassandra.apache.org> 
wrote:

Hi Joe,

Have you considered submitting something for Community Over Code NA 2024? The 
CFP is still open for a few more weeks. Options could be my Performance 
Engineering track or the Cassandra track – or both 😊

https://www.linkedin.com/pulse/cfp-community-over-code-na-denver-2024-performance-track-paul-brebner-nagmc/?trackingId=PlmmMjMeQby0Mozq8cnIpA%3D%3D

Regards, Paul Brebner

From: Joe Obernberger <joseph.obernber...@gmail.com>
Date: Friday, 22 March 2024 at 3:19 am
To: user@cassandra.apache.org <user@cassandra.apache.org>
Subject: Cassandra 5.0 Beta1 - vector searching results

Hi All - I'd like to share some initial results for the vector search on
Cassandra 5.0 beta1.  3 node cluster running in kubernetes; fast Netapp
storage.

I have a table (doc.embeddings_googleflant5large) with this definition:

CREATE TABLE doc.embeddings_googleflant5large (
     uuid text,
     type text,
     fieldname text,
     offset int,
     sourceurl text,
     textdata text,
     creationdate timestamp,
     embeddings vector<float, 768>,
     metadata boolean,
     source text,
     PRIMARY KEY ((uuid, type), fieldname, offset, sourceurl, textdata)
) WITH CLUSTERING ORDER BY (fieldname ASC, offset ASC, sourceurl ASC, textdata ASC)
     AND additional_write_policy = '99p'
     AND allow_auto_snapshot = true
     AND bloom_filter_fp_chance = 0.01
     AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
     AND cdc = false
     AND comment = ''
     AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
     AND compression = {'chunk_length_in_kb': '16', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
     AND memtable = 'default'
     AND crc_check_chance = 1.0
     AND default_time_to_live = 0
     AND extensions = {}
     AND gc_grace_seconds = 864000
     AND incremental_backups = true
     AND max_index_interval = 2048
     AND memtable_flush_period_in_ms = 0
     AND min_index_interval = 128
     AND read_repair = 'BLOCKING'
     AND speculative_retry = '99p';

CREATE CUSTOM INDEX ann_index_googleflant5large ON
doc.embeddings_googleflant5large (embeddings) USING 'sai';
CREATE CUSTOM INDEX offset_index_googleflant5large ON
doc.embeddings_googleflant5large (offset) USING 'sai';

nodetool status -r

UN  cassandra-1.cassandra5.cassandra5-jos.svc.cluster.local  18.02 GiB  128  100.0%  f2989dea-908b-4c06-9caa-4aacad8ba0e8  rack1
UN  cassandra-2.cassandra5.cassandra5-jos.svc.cluster.local  17.98 GiB  128  100.0%  ec4e506d-5f0d-475a-a3c1-aafe58399412  rack1
UN  cassandra-0.cassandra5.cassandra5-jos.svc.cluster.local  18.16 GiB  128  100.0%  92c6d909-ee01-4124-ae03-3b9e2d5e74c0  rack1

nodetool tablestats doc.embeddings_googleflant5large

Total number of tables: 1
----------------
Keyspace: doc
         Read Count: 0
         Read Latency: NaN ms
         Write Count: 2893108
         Write Latency: 326.3586520174843 ms
         Pending Flushes: 0
                 Table: embeddings_googleflant5large
                 SSTable count: 6
                 Old SSTable count: 0
                 Max SSTable size: 5.108GiB
                 Space used (live): 19318114423
                 Space used (total): 19318114423
                 Space used by snapshots (total): 0
                 Off heap memory used (total): 4874912
                 SSTable Compression Ratio: 0.97448
                 Number of partitions (estimate): 58399
                 Memtable cell count: 0
                 Memtable data size: 0
                 Memtable off heap memory used: 0
                 Memtable switch count: 16
                 Speculative retries: 0
                 Local read count: 0
                 Local read latency: NaN ms
                 Local write count: 2893108
                 Local write latency: NaN ms
                 Local read/write ratio: 0.00000
                 Pending flushes: 0
                 Percent repaired: 100.0
                 Bytes repaired: 9.066GiB
                 Bytes unrepaired: 0B
                 Bytes pending repair: 0B
                 Bloom filter false positives: 7245
                 Bloom filter false ratio: 0.00286
                 Bloom filter space used: 87264
                 Bloom filter off heap memory used: 87216
                 Index summary off heap memory used: 34624
                 Compression metadata off heap memory used: 4753072
                 Compacted partition minimum bytes: 2760
                 Compacted partition maximum bytes: 4866323
                 Compacted partition mean bytes: 154523
                 Average live cells per slice (last five minutes): NaN
                 Maximum live cells per slice (last five minutes): 0
                 Average tombstones per slice (last five minutes): NaN
                 Maximum tombstones per slice (last five minutes): 0
                 Droppable tombstone ratio: 0.00000

nodetool tablehistograms doc.embeddings_googleflant5large

doc/embeddings_googleflant5large histograms
Percentile  Read Latency  Write Latency  SSTables  Partition Size  Cell Count
            (micros)      (micros)                 (bytes)
50%         0.00          0.00           0.00      105778          124
75%         0.00          0.00           0.00      182785          215
95%         0.00          0.00           0.00      379022          446
98%         0.00          0.00           0.00      545791          642
99%         0.00          0.00           0.00      654949          924
Min         0.00          0.00           0.00      2760            4
Max         0.00          0.00           0.00      4866323         5722

Running a query such as:

select uuid,offset,type,textdata from doc.embeddings_googleflant5large
order by embeddings ANN OF [768 dimension vector] limit 20;

Works fine - typically less than 5 seconds to return.  Subsequent
queries are even faster.  If I'm actively adding data to the table, the
searches can sometimes time out (using cqlsh).
If I add something to the where clause, the performance drops
significantly:

select uuid,offset,type,textdata from doc.embeddings_googleflant5large
where offset=1 order by embeddings ANN OF [] limit 20;

That query times out in cqlsh even when no data is being added to the
table.
We've been running a Weaviate database side-by-side with Cassandra 4,
and would love to drop Weaviate if we can do all the vector searches
inside of Cassandra.
What else can I try?  Anything to increase performance?
Thanks all!

-Joe

