Re: Cassandra 5.0 Beta1 - vector searching results

2024-03-21 Thread Caleb Rackliffe
To expand on Jonathan’s response, the best way to get SAI to perform on the read side is to use it as a tool for large-partition search. In other words, if you can model your data such that your queries will be restricted to a single partition, two things will happen…

1.) With all queries (not just ANN queries), you will only hit as many nodes as your read consistency level and replication factor require. For vector searches, that means you should only hit one node, and it should be the coordinating node w/ a properly configured, token-aware client.

2.) You can use LCS (or UCS configured to mimic LCS) instead of STCS as your table compaction strategy. This will essentially guarantee your (partition-restricted) SAI query hits a small number of SSTable-attached indexes. (It’ll hit Memtable-attached indexes as well for any recently added data, so if you’re seeing latencies shoot up, it’s possible there could be contention on the Memtable-attached index that supports ANN queries. I haven’t done a deep dive on it. You can always flush Memtables directly before queries to factor that out.)

If you can do all of the above, the simple performance of the local index query and its post-filtering reads is probably the place to explore further. If you manage to collect any profiling data (JFR, flamegraphs via async-profiler, etc) I’d be happy to dig into it with you.

Thanks for kicking the tires!
Re: Cassandra 5.0 Beta1 - vector searching results

2024-03-21 Thread Brebner, Paul via user
Hi Joe,

Have you considered submitting something for Community Over Code NA 2024? The
CFP is still open for a few more weeks; options could be my Performance
Engineering track or the Cassandra track, or both.

https://www.linkedin.com/pulse/cfp-community-over-code-na-denver-2024-performance-track-paul-brebner-nagmc/?trackingId=PlmmMjMeQby0Mozq8cnIpA%3D%3D

Regards, Paul Brebner





Re: Cassandra 5.0 Beta1 - vector searching results

2024-03-21 Thread Jonathan Ellis
Hi Joe,

Thanks for testing out vector search!

Cassandra 5.0 is about six months behind on vector search progress.  Part
of this is keeping up with JVector releases but more of it is core
improvements to SAI.  Unfortunately there's no easy fix for the impedance
mismatch between a field where the state of the art is improving almost
daily, and a project with a release cycle measured in years.

DataStax's cutting-edge vector search work is public and open source [1]
but it's going to be a while before we have bandwidth to upstream it to
Apache, and longer before it can be released in 5.1 or 6.0.  If you're
interested in collaborating on this, I'm happy to get you pointed in the
right direction.

In the meantime, I can also recommend trying out DataStax's Astra [2]
service, where we deploy improvements regularly.  My guesstimate is that
Astra will be 2x faster at vanilla ANN queries (with no WHERE clause) and
10x-100x faster at queries with additional predicates, depending on the
cardinality.  (As an example of what needs to be upstreamed, we added a
primitive cost-based analyzer back in January to fix the kind of timeouts
you're seeing with offset=1, and we just committed a more sophisticated one
this week [3].)

If you're stuck with 5.0, my best advice is to compact as aggressively as
possible, since SAI queries are O(N) in the number of sstables.

[1] https://github.com/datastax/cassandra/tree/vsearch
[2] https://www.datastax.com/products/datastax-astra
[3] https://github.com/datastax/cassandra/commit/eeb33dd62b9b74ecf818a263fd73dbe6714b0df0


Cassandra 5.0 Beta1 - vector searching results

2024-03-21 Thread Joe Obernberger
Hi All - I'd like to share some initial results for vector search on
Cassandra 5.0 beta1.  The setup is a 3-node cluster running in Kubernetes
on fast NetApp storage.

I have a table (doc.embeddings_googleflant5large) with this definition:

CREATE TABLE doc.embeddings_googleflant5large (
    uuid text,
    type text,
    fieldname text,
    offset int,
    sourceurl text,
    textdata text,
    creationdate timestamp,
    embeddings vector,
    metadata boolean,
    source text,
    PRIMARY KEY ((uuid, type), fieldname, offset, sourceurl, textdata)
) WITH CLUSTERING ORDER BY (fieldname ASC, offset ASC, sourceurl ASC, textdata ASC)
    AND additional_write_policy = '99p'
    AND allow_auto_snapshot = true
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND cdc = false
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '16', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND memtable = 'default'
    AND crc_check_chance = 1.0
    AND default_time_to_live = 0
    AND extensions = {}
    AND gc_grace_seconds = 864000
    AND incremental_backups = true
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair = 'BLOCKING'
    AND speculative_retry = '99p';

CREATE CUSTOM INDEX ann_index_googleflant5large ON doc.embeddings_googleflant5large (embeddings) USING 'sai';
CREATE CUSTOM INDEX offset_index_googleflant5large ON doc.embeddings_googleflant5large (offset) USING 'sai';


nodetool status -r

UN  cassandra-1.cassandra5.cassandra5-jos.svc.cluster.local  18.02 GiB  128  100.0%  f2989dea-908b-4c06-9caa-4aacad8ba0e8  rack1
UN  cassandra-2.cassandra5.cassandra5-jos.svc.cluster.local  17.98 GiB  128  100.0%  ec4e506d-5f0d-475a-a3c1-aafe58399412  rack1
UN  cassandra-0.cassandra5.cassandra5-jos.svc.cluster.local  18.16 GiB  128  100.0%  92c6d909-ee01-4124-ae03-3b9e2d5e74c0  rack1


nodetool tablestats doc.embeddings_googleflant5large

Total number of tables: 1

Keyspace: doc
    Read Count: 0
    Read Latency: NaN ms
    Write Count: 2893108
    Write Latency: 326.3586520174843 ms
    Pending Flushes: 0
    Table: embeddings_googleflant5large
    SSTable count: 6
    Old SSTable count: 0
    Max SSTable size: 5.108GiB
    Space used (live): 19318114423
    Space used (total): 19318114423
    Space used by snapshots (total): 0
    Off heap memory used (total): 4874912
    SSTable Compression Ratio: 0.97448
    Number of partitions (estimate): 58399
    Memtable cell count: 0
    Memtable data size: 0
    Memtable off heap memory used: 0
    Memtable switch count: 16
    Speculative retries: 0
    Local read count: 0
    Local read latency: NaN ms
    Local write count: 2893108
    Local write latency: NaN ms
    Local read/write ratio: 0.0
    Pending flushes: 0
    Percent repaired: 100.0
    Bytes repaired: 9.066GiB
    Bytes unrepaired: 0B
    Bytes pending repair: 0B
    Bloom filter false positives: 7245
    Bloom filter false ratio: 0.00286
    Bloom filter space used: 87264
    Bloom filter off heap memory used: 87216
    Index summary off heap memory used: 34624
    Compression metadata off heap memory used: 4753072
    Compacted partition minimum bytes: 2760
    Compacted partition maximum bytes: 4866323
    Compacted partition mean bytes: 154523
    Average live cells per slice (last five minutes): NaN
    Maximum live cells per slice (last five minutes): 0
    Average tombstones per slice (last five minutes): NaN
    Maximum tombstones per slice (last five minutes): 0
    Droppable tombstone ratio: 0.0

nodetool tablehistograms doc.embeddings_googleflant5large

doc/embeddings_googleflant5large histograms
Percentile  Read Latency  Write Latency  SSTables  Partition Size  Cell Count
                (micros)       (micros)                   (bytes)
50%                 0.00           0.00      0.00          105778         124
75%                 0.00           0.00      0.00          182785         215
95%                 0.00           0.00      0.00          379022         446
98%                 0.00           0.00      0.00          545791         642
99%                 0.00           0.00      0.00          654949