Hi Joe,

Have you considered submitting something for Community Over Code NA 2024? The
CFP is still open for a few more weeks. Options could be my Performance
Engineering track or the Cassandra track, or both 😊

https://www.linkedin.com/pulse/cfp-community-over-code-na-denver-2024-performance-track-paul-brebner-nagmc/?trackingId=PlmmMjMeQby0Mozq8cnIpA%3D%3D

Regards, Paul Brebner



From: Joe Obernberger <joseph.obernber...@gmail.com>
Date: Friday, 22 March 2024 at 3:19 am
To: user@cassandra.apache.org <user@cassandra.apache.org>
Subject: Cassandra 5.0 Beta1 - vector searching results

Hi All - I'd like to share some initial results for vector search on
Cassandra 5.0 beta1.  The setup is a 3-node cluster running in Kubernetes
on fast NetApp storage.

I have a table (doc.embeddings_googleflant5large) with this definition:

CREATE TABLE doc.embeddings_googleflant5large (
     uuid text,
     type text,
     fieldname text,
     offset int,
     sourceurl text,
     textdata text,
     creationdate timestamp,
     embeddings vector<float, 768>,
     metadata boolean,
     source text,
     PRIMARY KEY ((uuid, type), fieldname, offset, sourceurl, textdata)
) WITH CLUSTERING ORDER BY (fieldname ASC, offset ASC, sourceurl ASC, textdata ASC)
     AND additional_write_policy = '99p'
     AND allow_auto_snapshot = true
     AND bloom_filter_fp_chance = 0.01
     AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
     AND cdc = false
     AND comment = ''
     AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
     AND compression = {'chunk_length_in_kb': '16', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
     AND memtable = 'default'
     AND crc_check_chance = 1.0
     AND default_time_to_live = 0
     AND extensions = {}
     AND gc_grace_seconds = 864000
     AND incremental_backups = true
     AND max_index_interval = 2048
     AND memtable_flush_period_in_ms = 0
     AND min_index_interval = 128
     AND read_repair = 'BLOCKING'
     AND speculative_retry = '99p';

CREATE CUSTOM INDEX ann_index_googleflant5large ON
doc.embeddings_googleflant5large (embeddings) USING 'sai';
CREATE CUSTOM INDEX offset_index_googleflant5large ON
doc.embeddings_googleflant5large (offset) USING 'sai';

nodetool status -r

UN  cassandra-1.cassandra5.cassandra5-jos.svc.cluster.local  18.02 GiB  128  100.0%  f2989dea-908b-4c06-9caa-4aacad8ba0e8  rack1
UN  cassandra-2.cassandra5.cassandra5-jos.svc.cluster.local  17.98 GiB  128  100.0%  ec4e506d-5f0d-475a-a3c1-aafe58399412  rack1
UN  cassandra-0.cassandra5.cassandra5-jos.svc.cluster.local  18.16 GiB  128  100.0%  92c6d909-ee01-4124-ae03-3b9e2d5e74c0  rack1

nodetool tablestats doc.embeddings_googleflant5large

Total number of tables: 1
----------------
Keyspace: doc
         Read Count: 0
         Read Latency: NaN ms
         Write Count: 2893108
         Write Latency: 326.3586520174843 ms
         Pending Flushes: 0
                 Table: embeddings_googleflant5large
                 SSTable count: 6
                 Old SSTable count: 0
                 Max SSTable size: 5.108GiB
                 Space used (live): 19318114423
                 Space used (total): 19318114423
                 Space used by snapshots (total): 0
                 Off heap memory used (total): 4874912
                 SSTable Compression Ratio: 0.97448
                 Number of partitions (estimate): 58399
                 Memtable cell count: 0
                 Memtable data size: 0
                 Memtable off heap memory used: 0
                 Memtable switch count: 16
                 Speculative retries: 0
                 Local read count: 0
                 Local read latency: NaN ms
                 Local write count: 2893108
                 Local write latency: NaN ms
                 Local read/write ratio: 0.00000
                 Pending flushes: 0
                 Percent repaired: 100.0
                 Bytes repaired: 9.066GiB
                 Bytes unrepaired: 0B
                 Bytes pending repair: 0B
                 Bloom filter false positives: 7245
                 Bloom filter false ratio: 0.00286
                 Bloom filter space used: 87264
                 Bloom filter off heap memory used: 87216
                 Index summary off heap memory used: 34624
                 Compression metadata off heap memory used: 4753072
                 Compacted partition minimum bytes: 2760
                 Compacted partition maximum bytes: 4866323
                 Compacted partition mean bytes: 154523
                 Average live cells per slice (last five minutes): NaN
                 Maximum live cells per slice (last five minutes): 0
                 Average tombstones per slice (last five minutes): NaN
                 Maximum tombstones per slice (last five minutes): 0
                 Droppable tombstone ratio: 0.00000
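
For a rough sense of how much of the on-disk size comes from the embeddings
themselves, here is a minimal back-of-the-envelope sketch. It assumes each
vector<float, 768> value is stored as 768 four-byte floats, and it treats the
local write count above as an approximate row count. That second assumption
is a guess: overwrites or repeated writes would make the real row count
smaller. The estimate ignores the other columns, the SAI indexes, and any
per-row overhead.

```python
# Rough estimate of raw embedding bytes on one node.
DIMS = 768              # from the column type vector<float, 768>
BYTES_PER_FLOAT = 4     # a CQL float is a 32-bit value
ROWS = 2_893_108        # ASSUMPTION: local write count used as the row count

bytes_per_vector = DIMS * BYTES_PER_FLOAT
raw_vector_bytes = ROWS * bytes_per_vector
raw_vector_gib = raw_vector_bytes / 2**30

print(f"bytes per vector : {bytes_per_vector}")
print(f"raw vector bytes : {raw_vector_bytes}")
print(f"raw vector GiB   : {raw_vector_gib:.2f}")
```

Under those assumptions the raw float data alone is several GiB per node,
and float data tends to compress poorly, which would be consistent with the
compression ratio close to 1 reported above.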

nodetool tablehistograms doc.embeddings_googleflant5large

doc/embeddings_googleflant5large histograms
Percentile  Read Latency  Write Latency  SSTables  Partition Size  Cell Count
            (micros)      (micros)                 (bytes)
50%         0.00          0.00           0.00      105778          124
75%         0.00          0.00           0.00      182785          215
95%         0.00          0.00           0.00      379022          446
98%         0.00          0.00           0.00      545791          642
99%         0.00          0.00           0.00      654949          924
Min         0.00          0.00           0.00      2760            4
Max         0.00          0.00           0.00      4866323         5722

Running a query such as:

select uuid,offset,type,textdata from doc.embeddings_googleflant5large
order by embeddings ANN OF [768 dimension vector] limit 20;

Works fine - it typically returns in less than 5 seconds.  Subsequent
queries are even faster.  If I'm actively adding data to the table, the
searches can sometimes time out (using cqlsh).
If I add something to the where clause, the performance drops
significantly:

select uuid,offset,type,textdata from doc.embeddings_googleflant5large
where offset=1 order by embeddings ANN OF [] limit 20;

That query will time out when run in cqlsh, even when no data is being
added to the table.
We've been running a Weaviate database side-by-side with Cassandra 4,
and would love to drop Weaviate if we can do all the vector searches
inside of Cassandra.
What else can I try?  Anything to increase performance?
Thanks all!

-Joe
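
For anyone who wants to reproduce the two query shapes above without pasting
a 768-element literal by hand, here is a minimal sketch that only builds the
CQL text. The helper name build_ann_query and its parameters are hypothetical
and not part of any Cassandra tooling; the table and column names are taken
from the schema in this post. It does not connect to a cluster or execute
anything.

```python
from typing import Optional, Sequence

# Table name taken from the post.
TABLE = "doc.embeddings_googleflant5large"


def build_ann_query(vector: Sequence[float], limit: int = 20,
                    offset: Optional[int] = None, dims: int = 768) -> str:
    """Build the CQL text for an ANN search (hypothetical helper).

    If offset is given, an equality filter on the offset column is added,
    matching the second, slower query shape from the post.
    """
    if len(vector) != dims:
        raise ValueError(f"expected {dims} dimensions, got {len(vector)}")
    # Render the vector as a CQL list-style literal: [0.1, 0.2, ...]
    literal = "[" + ", ".join(repr(float(x)) for x in vector) + "]"
    where = f" WHERE offset = {int(offset)}" if offset is not None else ""
    return (f"SELECT uuid, offset, type, textdata FROM {TABLE}"
            f"{where} ORDER BY embeddings ANN OF {literal} LIMIT {int(limit)};")


if __name__ == "__main__":
    dummy = [0.5] * 768  # placeholder values, not a real embedding
    unfiltered = build_ann_query(dummy)
    filtered = build_ann_query(dummy, offset=1)
    print(unfiltered[:90], "...")
    print(filtered[:110], "...")
```

The generated text can be pasted into cqlsh or passed to a driver, which
makes it easier to time the filtered and unfiltered variants against the
same vector.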

