Cassandra 5.0 Beta1 - vector searching results

Joe Obernberger Thu, 21 Mar 2024 09:18:55 -0700

Hi All - I'd like to share some initial results for vector search on Cassandra 5.0 beta1. 3 node cluster running in Kubernetes; fast NetApp storage.


I have a table (doc.embeddings_googleflant5large) with this definition:


CREATE TABLE doc.embeddings_googleflant5large (
    uuid text,
    type text,
    fieldname text,
    offset int,
    sourceurl text,
    textdata text,
    creationdate timestamp,
    embeddings vector<float, 768>,
    metadata boolean,
    source text,
    PRIMARY KEY ((uuid, type), fieldname, offset, sourceurl, textdata)
) WITH CLUSTERING ORDER BY (fieldname ASC, offset ASC, sourceurl ASC, textdata ASC)
    AND additional_write_policy = '99p'
    AND allow_auto_snapshot = true
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND cdc = false
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '16', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND memtable = 'default'
    AND crc_check_chance = 1.0
    AND default_time_to_live = 0
    AND extensions = {}
    AND gc_grace_seconds = 864000
    AND incremental_backups = true
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair = 'BLOCKING'
    AND speculative_retry = '99p';

CREATE CUSTOM INDEX ann_index_googleflant5large ON doc.embeddings_googleflant5large (embeddings) USING 'sai';
CREATE CUSTOM INDEX offset_index_googleflant5large ON doc.embeddings_googleflant5large (offset) USING 'sai';


nodetool status -r

UN cassandra-1.cassandra5.cassandra5-jos.svc.cluster.local 18.02 GiB 128 100.0% f2989dea-908b-4c06-9caa-4aacad8ba0e8 rack1
UN cassandra-2.cassandra5.cassandra5-jos.svc.cluster.local 17.98 GiB 128 100.0% ec4e506d-5f0d-475a-a3c1-aafe58399412 rack1
UN cassandra-0.cassandra5.cassandra5-jos.svc.cluster.local 18.16 GiB 128 100.0% 92c6d909-ee01-4124-ae03-3b9e2d5e74c0 rack1


nodetool tablestats doc.embeddings_googleflant5large

Total number of tables: 1
----------------
Keyspace: doc
        Read Count: 0
        Read Latency: NaN ms
        Write Count: 2893108
        Write Latency: 326.3586520174843 ms
        Pending Flushes: 0
                Table: embeddings_googleflant5large
                SSTable count: 6
                Old SSTable count: 0
                Max SSTable size: 5.108GiB
                Space used (live): 19318114423
                Space used (total): 19318114423
                Space used by snapshots (total): 0
                Off heap memory used (total): 4874912
                SSTable Compression Ratio: 0.97448
                Number of partitions (estimate): 58399
                Memtable cell count: 0
                Memtable data size: 0
                Memtable off heap memory used: 0
                Memtable switch count: 16
                Speculative retries: 0
                Local read count: 0
                Local read latency: NaN ms
                Local write count: 2893108
                Local write latency: NaN ms
                Local read/write ratio: 0.00000
                Pending flushes: 0
                Percent repaired: 100.0
                Bytes repaired: 9.066GiB
                Bytes unrepaired: 0B
                Bytes pending repair: 0B
                Bloom filter false positives: 7245
                Bloom filter false ratio: 0.00286
                Bloom filter space used: 87264
                Bloom filter off heap memory used: 87216
                Index summary off heap memory used: 34624
                Compression metadata off heap memory used: 4753072
                Compacted partition minimum bytes: 2760
                Compacted partition maximum bytes: 4866323
                Compacted partition mean bytes: 154523
                Average live cells per slice (last five minutes): NaN
                Maximum live cells per slice (last five minutes): 0
                Average tombstones per slice (last five minutes): NaN
                Maximum tombstones per slice (last five minutes): 0
                Droppable tombstone ratio: 0.00000
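
As a quick sanity check on the numbers above, the byte counts can be converted to the GiB units that nodetool status prints. This is only a sketch: the helper name `to_gib` is made up for illustration, the 4 bytes per element assumes the vector is stored as 32-bit floats, and it says nothing about how much of the on-disk space belongs to the SAI indexes.

```python
# Rough arithmetic on figures copied from the tablestats output above.
GIB = 1024 ** 3

SPACE_USED_LIVE_BYTES = 19_318_114_423  # "Space used (live)"
VECTOR_DIMENSIONS = 768                 # from vector<float, 768>
BYTES_PER_FLOAT = 4                     # assumption: 32-bit float per element


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB (hypothetical helper for this sketch)."""
    return num_bytes / GIB


# Raw payload of one embedding, before any per-cell or per-row overhead.
raw_vector_bytes = VECTOR_DIMENSIONS * BYTES_PER_FLOAT

if __name__ == "__main__":
    print(f"Space used (live): {to_gib(SPACE_USED_LIVE_BYTES):.2f} GiB")
    print(f"Raw bytes per embedding: {raw_vector_bytes}")
```

That works out to about 17.99 GiB for this table, close to the 17.98 to 18.16 GiB load that nodetool status reports per node, and 3072 raw bytes per embedding.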

nodetool tablehistograms doc.embeddings_googleflant5large

doc/embeddings_googleflant5large histograms

Percentile  Read Latency  Write Latency  SSTables  Partition Size  Cell Count
            (micros)      (micros)                 (bytes)
50%         0.00          0.00           0.00      105778          124
75%         0.00          0.00           0.00      182785          215
95%         0.00          0.00           0.00      379022          446
98%         0.00          0.00           0.00      545791          642
99%         0.00          0.00           0.00      654949          924
Min         0.00          0.00           0.00      2760            4
Max         0.00          0.00           0.00      4866323         5722


Running a query such as:

select uuid,offset,type,textdata from doc.embeddings_googleflant5large order by embeddings ANN OF [768 dimension vector] limit 20;

Works fine - typically less than 5 seconds to return. Subsequent queries are even faster. If I'm actively adding data to the table, the searches can sometimes time out (using cqlsh).

If I add something to the where clause, the performance drops significantly:

select uuid,offset,type,textdata from doc.embeddings_googleflant5large where offset=1 order by embeddings ANN OF [] limit 20;

That query will time out when running in cqlsh, even with no data being added to the table.

We've been running a Weaviate database side-by-side with Cassandra 4, and would love to drop Weaviate if we can do all the vector searches inside of Cassandra.
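
For anyone who wants to reproduce the two query shapes above with a real embedding, a small helper along these lines could generate the CQL text. It is a minimal sketch, not how the queries were actually produced: `vector_literal` and `ann_query` are hypothetical names (not part of cqlsh or any driver), and it only builds the statement string without connecting to a cluster.

```python
import math
from typing import Optional, Sequence

TABLE = "doc.embeddings_googleflant5large"  # table name from the post
DIMENSIONS = 768                            # matches vector<float, 768>


def vector_literal(vector: Sequence[float]) -> str:
    """Render an embedding as a CQL vector literal such as [0.5, 0.25, ...]."""
    if len(vector) != DIMENSIONS:
        raise ValueError(f"expected {DIMENSIONS} dimensions, got {len(vector)}")
    if not all(math.isfinite(x) for x in vector):
        raise ValueError("vector contains NaN or infinite values")
    return "[" + ", ".join(repr(float(x)) for x in vector) + "]"


def ann_query(vector: Sequence[float], limit: int = 20,
              offset: Optional[int] = None) -> str:
    """Build the ANN query text, optionally restricted by the offset column."""
    where = f" WHERE offset = {int(offset)}" if offset is not None else ""
    return (
        f"SELECT uuid, offset, type, textdata FROM {TABLE}"
        f"{where} ORDER BY embeddings ANN OF {vector_literal(vector)}"
        f" LIMIT {int(limit)};"
    )


if __name__ == "__main__":
    embedding = [0.5] * DIMENSIONS          # placeholder, not a real embedding
    print(ann_query(embedding)[:120])       # unfiltered form
    print(ann_query(embedding, offset=1)[:120])  # filtered form that times out
```

The resulting string can be pasted into cqlsh, which makes it easy to time the filtered and unfiltered forms against the same vector.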

What else can I try?  Anything to increase performance?
Thanks all!

-Joe


