Re: Replication factor, LOCAL_QUORUM write consistency and materialized views

2024-05-17 Thread Gábor Auth
Hi,

On Fri, May 17, 2024 at 6:18 PM Jon Haddad  wrote:

> I strongly suggest you don't use materialized views at all.  There are
> edge cases that in my opinion make them unsuitable for production, both in
> terms of cluster stability as well as data integrity.
>

Oh, there is already a fresh, open Jira ticket about it:
https://issues.apache.org/jira/browse/CASSANDRA-19383

Bye,
Gábor AUTH


Re: Replication factor, LOCAL_QUORUM write consistency and materialized views

2024-05-17 Thread Gábor Auth
Hi,

On Fri, May 17, 2024 at 6:18 PM Jon Haddad  wrote:

> I strongly suggest you don't use materialized views at all.  There are
> edge cases that in my opinion make them unsuitable for production, both in
> terms of cluster stability as well as data integrity.
>

I totally agree with you. But it looks like a strange and
interesting issue... the affected table has only ~1300 rows and less than
200 kB of data. :)

Also, I found the same issue reported here:
https://dba.stackexchange.com/questions/325140/single-node-failure-in-cassandra-4-0-7-causes-cluster-to-run-into-high-cpu

Bye,
Gábor AUTH


> On Fri, May 17, 2024 at 8:58 AM Gábor Auth  wrote:
>
>> Hi,
>>
>> I know, I know, the materialized view is experimental... :)
>>
>> So, I ran into a strange error. Among others, I have a very small 4-node
>> cluster with very minimal data (~100 MB in total), the keyspace's
>> replication factor is 3, and everything works fine... except: if I restart
>> a node, I get a lot of errors about materialized views and consistency
>> level ONE, but only for those tables that have more than one materialized
>> view.
>>
>> Tables without a materialized view are unaffected and work fine.
>> Tables with exactly one materialized view also work fine.
>> But with a table that has more than one materialized view, whoops, the
>> cluster crashes temporarily, and I can also see on the calling side (Java
>> backend) that no nodes are responding:
>>
>> Caused by: com.datastax.driver.core.exceptions.WriteFailureException:
>> Cassandra failure during write query at consistency LOCAL_QUORUM (2
>> responses were required but only 1 replica responded, 2 failed)
>>
>> I am surprised by this behavior because there is so little data
>> involved, and it occurs only when there is more than one materialized
>> view, so it might be a concurrency issue under the hood.
>>
>> Have you seen an issue like this?
>>
>> Here is a stack trace from Cassandra's side:
>>
>> [cassandra-dc03-1] ERROR [MutationStage-1] 2024-05-17 08:51:47,425
>> Keyspace.java:652 - Unknown exception caught while attempting to update
>> MaterializedView! pope.unit
>> [cassandra-dc03-1] org.apache.cassandra.exceptions.UnavailableException:
>> Cannot achieve consistency level ONE
>> [stack trace frames snipped; identical to the original message below]
>>
>> --
>> Bye,
>> Gábor AUTH
>>
>


Re: Replication factor, LOCAL_QUORUM write consistency and materialized views

2024-05-17 Thread Jon Haddad
I strongly suggest you don't use materialized views at all.  There are edge
cases that in my opinion make them unsuitable for production, both in terms
of cluster stability as well as data integrity.

Jon

On Fri, May 17, 2024 at 8:58 AM Gábor Auth  wrote:

> Hi,
>
> I know, I know, the materialized view is experimental... :)
>
> So, I ran into a strange error. Among others, I have a very small 4-node
> cluster with very minimal data (~100 MB in total), the keyspace's
> replication factor is 3, and everything works fine... except: if I restart
> a node, I get a lot of errors about materialized views and consistency
> level ONE, but only for those tables that have more than one materialized
> view.
>
> Tables without a materialized view are unaffected and work fine.
> Tables with exactly one materialized view also work fine.
> But with a table that has more than one materialized view, whoops, the
> cluster crashes temporarily, and I can also see on the calling side (Java
> backend) that no nodes are responding:
>
> Caused by: com.datastax.driver.core.exceptions.WriteFailureException:
> Cassandra failure during write query at consistency LOCAL_QUORUM (2
> responses were required but only 1 replica responded, 2 failed)
>
> I am surprised by this behavior because there is so little data involved,
> and it occurs only when there is more than one materialized view, so it
> might be a concurrency issue under the hood.
>
> Have you seen an issue like this?
>
> Here is a stack trace from Cassandra's side:
>
> [cassandra-dc03-1] ERROR [MutationStage-1] 2024-05-17 08:51:47,425
> Keyspace.java:652 - Unknown exception caught while attempting to update
> MaterializedView! pope.unit
> [cassandra-dc03-1] org.apache.cassandra.exceptions.UnavailableException:
> Cannot achieve consistency level ONE
> [stack trace frames snipped; identical to the original message below]
>
> --
> Bye,
> Gábor AUTH
>


Replication factor, LOCAL_QUORUM write consistency and materialized views

2024-05-17 Thread Gábor Auth
Hi,

I know, I know, the materialized view is experimental... :)

So, I ran into a strange error. Among others, I have a very small 4-node
cluster with very minimal data (~100 MB in total), the keyspace's
replication factor is 3, and everything works fine... except: if I restart a
node, I get a lot of errors about materialized views and consistency level
ONE, but only for those tables that have more than one materialized view.

Tables without a materialized view are unaffected and work fine.
Tables with exactly one materialized view also work fine.
But with a table that has more than one materialized view, whoops, the
cluster crashes temporarily, and I can also see on the calling side (Java
backend) that no nodes are responding:

Caused by: com.datastax.driver.core.exceptions.WriteFailureException:
Cassandra failure during write query at consistency LOCAL_QUORUM (2
responses were required but only 1 replica responded, 2 failed)
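[Editor's note: the numbers in that exception follow from simple majority
arithmetic. A small illustrative Python sketch, not Cassandra code:]

```python
def quorum(replication_factor):
    # A quorum is a strict majority of the replicas.
    return replication_factor // 2 + 1

# RF=3: LOCAL_QUORUM needs 2 of the 3 local replicas, matching the
# "2 responses were required" in the exception above. A single
# restarting node still leaves 2 of 3 replicas up, so base-table
# writes at LOCAL_QUORUM should normally succeed; the stack trace
# below shows it is the internal view update, performed at CL ONE,
# that cannot find a live replica.
assert quorum(3) == 2
assert quorum(5) == 3
```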

I am surprised by this behavior because there is so little data involved,
and it occurs only when there is more than one materialized view, so it
might be a concurrency issue under the hood.

Have you seen an issue like this?

Here is a stack trace from Cassandra's side:

[cassandra-dc03-1] ERROR [MutationStage-1] 2024-05-17 08:51:47,425
Keyspace.java:652 - Unknown exception caught while attempting to update
MaterializedView! pope.unit
[cassandra-dc03-1] org.apache.cassandra.exceptions.UnavailableException:
Cannot achieve consistency level ONE
[cassandra-dc03-1]  at
org.apache.cassandra.exceptions.UnavailableException.create(UnavailableException.java:37)
[cassandra-dc03-1]  at
org.apache.cassandra.locator.ReplicaPlans.assureSufficientLiveReplicas(ReplicaPlans.java:170)
[cassandra-dc03-1]  at
org.apache.cassandra.locator.ReplicaPlans.assureSufficientLiveReplicasForWrite(ReplicaPlans.java:113)
[cassandra-dc03-1]  at
org.apache.cassandra.locator.ReplicaPlans.forWrite(ReplicaPlans.java:354)
[cassandra-dc03-1]  at
org.apache.cassandra.locator.ReplicaPlans.forWrite(ReplicaPlans.java:345)
[cassandra-dc03-1]  at
org.apache.cassandra.locator.ReplicaPlans.forWrite(ReplicaPlans.java:339)
[cassandra-dc03-1]  at
org.apache.cassandra.service.StorageProxy.wrapViewBatchResponseHandler(StorageProxy.java:1312)
[cassandra-dc03-1]  at
org.apache.cassandra.service.StorageProxy.mutateMV(StorageProxy.java:1004)
[cassandra-dc03-1]  at
org.apache.cassandra.db.view.TableViews.pushViewReplicaUpdates(TableViews.java:167)
[cassandra-dc03-1]  at
org.apache.cassandra.db.Keyspace.applyInternal(Keyspace.java:647)
[cassandra-dc03-1]  at
org.apache.cassandra.db.Keyspace.applyFuture(Keyspace.java:477)
[cassandra-dc03-1]  at
org.apache.cassandra.db.Mutation.applyFuture(Mutation.java:210)
[cassandra-dc03-1]  at
org.apache.cassandra.db.MutationVerbHandler.doVerb(MutationVerbHandler.java:58)
[cassandra-dc03-1]  at
org.apache.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:78)
[cassandra-dc03-1]  at
org.apache.cassandra.net.InboundSink.accept(InboundSink.java:97)
[cassandra-dc03-1]  at
org.apache.cassandra.net.InboundSink.accept(InboundSink.java:45)
[cassandra-dc03-1]  at
org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:432)
[cassandra-dc03-1]  at
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown
Source)
[cassandra-dc03-1]  at
org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:165)
[cassandra-dc03-1]  at
org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:137)
[cassandra-dc03-1]  at
org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:119)
[cassandra-dc03-1]  at
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
[cassandra-dc03-1]  at java.base/java.lang.Thread.run(Unknown Source)

-- 
Bye,
Gábor AUTH


Re: Change num_tokens in a live cluster

2024-05-16 Thread Gábor Auth
Hi,

On Thu, 16 May 2024, 17:40 Bowen Song via user, 
wrote:

> Replacing nodes one by one in the existing DC is not the same as replacing
> an entire DC.
>
> For example, if you change from 256 vnodes to 4 vnodes on a 100-node
> single-DC cluster. Before you start, each node owns ~1% of the cluster's
> data. But after changing 99 nodes, the last remaining node will own ~39% of
> the cluster's data. Will that node have enough storage and computing
> capacity to handle that? Unless you have significantly over-provisioned
> node size, the answer is definitely no. The way to work around this is to
> gradually reduce the vnodes number. E.g. reducing from 256 to 128 will
> require the last node to have 2x the capacity, which is much more doable
> than 39x. To do it this way, you will need to repeat the process to reduce
> vnodes number from 256 to 128, then to 64, 32, 16, 8 and finally 4.
>
> So, the most significant difference is, how many times do the data need to
> be moved?
>
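[Editor's note: Bowen's ~1% and ~39% figures can be reproduced with a quick
back-of-the-envelope sketch. Illustrative only; it ignores replication
factor and random-token variance:]

```python
def ownership_during_migration(total_nodes, old_vnodes, new_vnodes, migrated):
    # Approximate share of the token ring owned by one remaining
    # old-vnode node while the cluster is converted node by node.
    remaining = total_nodes - migrated
    total_tokens = migrated * new_vnodes + remaining * old_vnodes
    return old_vnodes / total_tokens

# 100-node DC going from 256 to 4 vnodes: after 99 nodes are converted,
# the last 256-vnode node owns roughly 39% of the ring.
print(f"{ownership_during_migration(100, 256, 4, 99):.0%}")    # -> 39%

# Halving to 128 vnodes instead only doubles its share (~1% -> ~2%).
print(f"{ownership_during_migration(100, 256, 128, 99):.1%}")  # -> 2.0%
```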
Thank you for the explanation, this will help others who are thinking about
changing num_tokens and searching for information about it... :)

I am aware of it, but in my current case there are only 4 nodes, with a
total of maybe ~25 GB of data. So, creating a new DC is more hassle for
me than replacing the nodes one by one.

My question was whether there is a simpler solution. And it looks like
there isn't one... :(

Bye,
Gábor AUTH


Re: Change num_tokens in a live cluster

2024-05-16 Thread Bowen Song via user
Replacing nodes one by one in the existing DC is not the same as 
replacing an entire DC.


For example, if you change from 256 vnodes to 4 vnodes on a 100-node
single-DC cluster. Before you start, each node owns ~1% of the cluster's
data. But after changing 99 nodes, the last remaining node will own ~39%
of the cluster's data. Will that node have enough storage and computing 
capacity to handle that? Unless you have significantly over-provisioned 
node size, the answer is definitely no. The way to work around this is 
to gradually reduce the vnodes number. E.g. reducing from 256 to 128 
will require the last node to have 2x the capacity, which is much more 
doable than 39x. To do it this way, you will need to repeat the process 
to reduce vnodes number from 256 to 128, then to 64, 32, 16, 8 and 
finally 4.


So, the most significant difference is, how many times do the data need 
to be moved?



On 16/05/2024 15:54, Gábor Auth wrote:

Hi,

On Thu, 16 May 2024, 10:37 Bowen Song via user, 
 wrote:


You can also add a new DC with the desired number of nodes and
num_tokens on each node with auto bootstrap disabled, then rebuild
the new DC from the existing DC before decommissioning the existing
DC. This method only needs to copy data once, and can copy from/to
multiple nodes concurrently, therefore is significantly faster, at
the cost of doubling the number of nodes temporarily.

For me it's easier to replace the nodes one by one in the same DC, 
so I don't need any new technique... :)


Thanks,
Gábor AUTH

Re: Change num_tokens in a live cluster

2024-05-16 Thread Gábor Auth
Hi,

On Thu, 16 May 2024, 16:55 Jon Haddad,  wrote:

> Unless your cluster is very small, using the method of adding / removing
> nodes will eventually result in putting a much larger portion of your
> dataset on a very small number of nodes.  I *highly* discourage this.
>

Each node has ~15 GB of data and there are only 4 nodes, so I'd call it
very small. :)

Bye,
Gábor AUTH


Re: Change num_tokens in a live cluster

2024-05-16 Thread Gábor Auth
Hi,

On Thu, 16 May 2024, 10:37 Bowen Song via user, 
wrote:

> You can also add a new DC with the desired number of nodes and num_tokens
> on each node with auto bootstrap disabled, then rebuild the new DC from the
> existing DC before decommissioning the existing DC. This method only needs to
> copy data once, and can copy from/to multiple nodes concurrently, therefore
> is significantly faster, at the cost of doubling the number of nodes
> temporarily.
>
For me it's easier to replace the nodes one by one in the same DC, so I
don't need any new technique... :)

Thanks,
Gábor AUTH


Re: Change num_tokens in a live cluster

2024-05-16 Thread Jon Haddad
Unless your cluster is very small, using the method of adding / removing
nodes will eventually result in putting a much larger portion of your
dataset on a very small number of nodes.  I *highly* discourage this.

The only correct, safe path is Bowen's suggestion of adding another DC and
decommissioning the old one.

Jon

On Thu, May 16, 2024 at 1:37 AM Bowen Song via user <
user@cassandra.apache.org> wrote:

> You can also add a new DC with the desired number of nodes and num_tokens
> on each node with auto bootstrap disabled, then rebuild the new DC from the
> existing DC before decommissioning the existing DC. This method only needs to
> copy data once, and can copy from/to multiple nodes concurrently, therefore
> is significantly faster, at the cost of doubling the number of nodes
> temporarily.
> On 16/05/2024 09:21, Gábor Auth wrote:
>
> Hi.
>
> Is there a newer/easier workflow to change num_tokens in an existing
> cluster than adding a new node with the other num_tokens value and
> decommissioning an old one, rinse and repeat through all nodes?
>
> --
> Bye,
> Gábor AUTH
>
>


Re: Change num_tokens in a live cluster

2024-05-16 Thread Bowen Song via user
You can also add a new DC with the desired number of nodes and 
num_tokens on each node with auto bootstrap disabled, then rebuild the 
new DC from the existing DC before decommissioning the existing DC. This 
method only needs to copy data once, and can copy from/to multiple nodes 
concurrently, therefore is significantly faster, at the cost of doubling 
the number of nodes temporarily.
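[Editor's note: the two node-level settings involved in this procedure might
look like the following in cassandra.yaml on the new DC's nodes. Values are
illustrative; auto_bootstrap defaults to true and is usually absent from the
file:]

```yaml
# cassandra.yaml fragment for nodes joining the new DC (illustrative)
num_tokens: 4          # desired vnode count for the new DC
auto_bootstrap: false  # join without streaming; data is copied later
                       # with "nodetool rebuild -- <existing-dc-name>"
```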


On 16/05/2024 09:21, Gábor Auth wrote:

Hi.

Is there a newer/easier workflow to change num_tokens in an existing
cluster than adding a new node with the other num_tokens value and
decommissioning an old one, rinse and repeat through all nodes?


--
Bye,
Gábor AUTH

Change num_tokens in a live cluster

2024-05-16 Thread Gábor Auth
Hi.

Is there a newer/easier workflow to change num_tokens in an existing
cluster than adding a new node with the other num_tokens value and
decommissioning an old one, rinse and repeat through all nodes?

-- 
Bye,
Gábor AUTH


Re: null values injected while drop compact storage was executed

2024-05-14 Thread Matthias Pfau via user
This happened with version 3.11.10.

We have been analyzing the impact until now, as this happened only on our
production systems. There was probably not enough concurrency on our staging
environments, so it did not happen there.

We will start to write a reproducer and file an issue afterwards.

Thanks for the feedback Jeff and Scott!

Best,
Matthias



Re: null values injected while drop compact storage was executed

2024-05-07 Thread C. Scott Andreas

If you don't have an explicit goal of dropping compact storage, it's not
necessary as a prerequisite to upgrading to 4.x+. Development community
members recognized that introducing mandatory schema changes as a
prerequisite to upgrading to 4.x would increase operator and user overhead
and limit adoption of the release. To address this, support for compact
storage was reintroduced into 4.x, which eliminated the requirement to drop
it: https://issues.apache.org/jira/browse/CASSANDRA-16217

@Matthias, would you be willing to share the Apache Cassandra version you
were running when you observed this behavior and to file a Jira ticket?

– Scott

On May 7, 2024, at 7:18 AM, Jeff Jirsa wrote:

> This sounds a lot like CASSANDRA-13004, which was fixed but broke data
> being read-repaired during an alter statement. I suspect it's not actually
> that same bug, but may be close/related. Reproducing it reliably would be
> a huge help.
>
> - Jeff

On May 7, 2024, at 1:50 AM, Matthias Pfau via user wrote:

> Hi there,
> we just ran drop compact storage in order to prepare for the upgrade to
> version 4. We observed that column values have been written as null if
> they were inserted while the drop compact storage statement was running.
> This only happened for the couple of seconds the drop compact storage
> statement ran. Did anyone else observe this? What are the proposed
> strategies to prevent data loss?
>
> Best,
> Matthias

Re: null values injected while drop compact storage was executed

2024-05-07 Thread Jeff Jirsa
This sounds a lot like CASSANDRA-13004, which was fixed but broke data
being read-repaired during an alter statement.

I suspect it's not actually that same bug, but may be close/related.
Reproducing it reliably would be a huge help.

- Jeff



> On May 7, 2024, at 1:50 AM, Matthias Pfau via user 
>  wrote:
> 
> Hi there,
> we just ran drop compact storage in order to prepare for the upgrade to 
> version 4.
> 
> We observed that column values have been written as null if they were
> inserted while the drop compact storage statement was running. This only
> happened for the couple of seconds the drop compact storage statement ran.
> 
> Did anyone else observe this? What are the proposed strategies to prevent
> data loss?
> 
> Best,
> Matthias


null values injected while drop compact storage was executed

2024-05-07 Thread Matthias Pfau via user
Hi there,
we just ran drop compact storage in order to prepare for the upgrade to version 
4.

We observed that column values have been written as null if they were
inserted while the drop compact storage statement was running. This only
happened for the couple of seconds the drop compact storage statement ran.

Did anyone else observe this? What are the proposed strategies to prevent
data loss?

Best,
Matthias


Re: storage engine series

2024-05-02 Thread Michael Shuler

On 4/29/24 18:23, Jon Haddad wrote:
[4] https://www.youtube.com/live/yj0NQw9DgcE?si=ra1zqusMdSs6vl4T 


Optimizations (upcoming) URL:
[4] https://www.youtube.com/watch?v=MAxQ0QygcKk

:)


Re: storage engine series

2024-04-30 Thread Ranjib Dey
Great set of learning material Jon, thank you so much for the hard work

Sincerely
Ranjib

On Mon, Apr 29, 2024 at 4:24 PM Jon Haddad  wrote:

> Hey everyone,
>
> I'm doing a 4 week YouTube series on the C* storage engine.  My first
> video was last week where I gave an overview into some of the storage
> engine internals [1].
>
> The next 3 weeks are looking at the new Trie indexes coming in 5.0 [2],
> running Cassandra on EBS [3], and finally looking at some potential
> optimizations [4] that could be done to improve things even further in the
> future.
>
> I hope these videos are useful to the community, and I welcome feedback!
>
> Jon
>
> [1] https://www.youtube.com/live/yj0NQw9DgcE?si=ra1zqusMdSs6vl4T
> [2] https://www.youtube.com/live/ZdzwtH0cJDE?si=CumcPny2UG8zwtsw
> [3] https://www.youtube.com/live/kcq1TC407U4?si=pZ8AkXkMzIylQgB6
> [4] https://www.youtube.com/live/yj0NQw9DgcE?si=ra1zqusMdSs6vl4T
>


Re: storage engine series

2024-04-30 Thread Jon Haddad
Thanks Aaron!

Just realized I made a mistake, the 4th week's URL is
https://www.youtube.com/watch?v=MAxQ0QygcKk.

Jon

On Tue, Apr 30, 2024 at 4:58 AM Aaron Ploetz  wrote:

> Nice! This sounds awesome, Jon.
>
> On Mon, Apr 29, 2024 at 6:25 PM Jon Haddad  wrote:
>
>> Hey everyone,
>>
>> I'm doing a 4 week YouTube series on the C* storage engine.  My first
>> video was last week where I gave an overview into some of the storage
>> engine internals [1].
>>
>> The next 3 weeks are looking at the new Trie indexes coming in 5.0 [2],
>> running Cassandra on EBS [3], and finally looking at some potential
>> optimizations [4] that could be done to improve things even further in the
>> future.
>>
>> I hope these videos are useful to the community, and I welcome feedback!
>>
>> Jon
>>
>> [1] https://www.youtube.com/live/yj0NQw9DgcE?si=ra1zqusMdSs6vl4T
>> [2] https://www.youtube.com/live/ZdzwtH0cJDE?si=CumcPny2UG8zwtsw
>> [3] https://www.youtube.com/live/kcq1TC407U4?si=pZ8AkXkMzIylQgB6
>> [4] https://www.youtube.com/live/yj0NQw9DgcE?si=ra1zqusMdSs6vl4T
>>
>


Re: storage engine series

2024-04-30 Thread Aaron Ploetz
Nice! This sounds awesome, Jon.

On Mon, Apr 29, 2024 at 6:25 PM Jon Haddad  wrote:

> Hey everyone,
>
> I'm doing a 4 week YouTube series on the C* storage engine.  My first
> video was last week where I gave an overview into some of the storage
> engine internals [1].
>
> The next 3 weeks are looking at the new Trie indexes coming in 5.0 [2],
> running Cassandra on EBS [3], and finally looking at some potential
> optimizations [4] that could be done to improve things even further in the
> future.
>
> I hope these videos are useful to the community, and I welcome feedback!
>
> Jon
>
> [1] https://www.youtube.com/live/yj0NQw9DgcE?si=ra1zqusMdSs6vl4T
> [2] https://www.youtube.com/live/ZdzwtH0cJDE?si=CumcPny2UG8zwtsw
> [3] https://www.youtube.com/live/kcq1TC407U4?si=pZ8AkXkMzIylQgB6
> [4] https://www.youtube.com/live/yj0NQw9DgcE?si=ra1zqusMdSs6vl4T
>


storage engine series

2024-04-29 Thread Jon Haddad
Hey everyone,

I'm doing a 4 week YouTube series on the C* storage engine.  My first video
was last week where I gave an overview into some of the storage engine
internals [1].

The next 3 weeks are looking at the new Trie indexes coming in 5.0 [2],
running Cassandra on EBS [3], and finally looking at some potential
optimizations [4] that could be done to improve things even further in the
future.

I hope these videos are useful to the community, and I welcome feedback!

Jon

[1] https://www.youtube.com/live/yj0NQw9DgcE?si=ra1zqusMdSs6vl4T
[2] https://www.youtube.com/live/ZdzwtH0cJDE?si=CumcPny2UG8zwtsw
[3] https://www.youtube.com/live/kcq1TC407U4?si=pZ8AkXkMzIylQgB6
[4] https://www.youtube.com/live/yj0NQw9DgcE?si=ra1zqusMdSs6vl4T


Re: compaction trigger after every fix interval

2024-04-28 Thread Bowen Song via user
There are many things that can trigger a compaction; knowing the type of
compaction can help narrow it down.


Have you looked at the nodetool compactionstats command output while it
is happening? What is the compaction type? It can be "compaction", but it
can also be something else, such as "validation" or "cleanup".



On 28/04/2024 10:49, Prerna Jain wrote:

Hi team,

I have a query: in our prod environment, there are multiple keyspaces 
and tables. According to requirements, every table has a different 
compaction strategy (levelled/time-window/size-tiered).
Somehow, when I checked the compaction history, I noticed that 
compaction occurs every 6 hours for every table.
We did not trigger any job manually, nor did I find any such 
configuration. Also, write traffic is not arriving at a fixed 
interval on those tables.

Can you please help me find out the root cause of this case?

I appreciate any help you can provide.

Regards
Prerna Jain

compaction trigger after every fix interval

2024-04-28 Thread Prerna Jain
Hi team,

I have a query: in our prod environment, there are multiple keyspaces and
tables. According to requirements, every table has a different compaction
strategy (levelled/time-window/size-tiered).
Somehow, when I checked the compaction history, I noticed that compaction
occurs every 6 hours for every table.
We did not trigger any job manually, nor did I find any such configuration.
Also, write traffic is not arriving at a fixed interval on those tables.
Can you please help me find out the root cause of this case?

I appreciate any help you can provide.

Regards
Prerna Jain


Re: compaction trigger after every fix interval

2024-04-28 Thread manish khandelwal
Hi Prerna

Compactions are triggered automatically based on the compaction strategy.
Since you are seeing compactions triggered every 6 hours, the most likely
explanation is a traffic pattern with lots of writes every 6 hours.

PS: Please use the user mailing list (user@cassandra.apache.org) for
posting such queries.

Regards
Manish

On Sun, Apr 28, 2024 at 2:26 PM Prerna Jain  wrote:

> Hi team,
>
> I have a query, in our prod environment, there are multiple key spaces and
> tables. According to requirements, every table has different compaction
> strategies like level/time/size.
> Somehow, when I checked the compaction history, I noticed that compaction
> occurs every 6 hr for every table.
> We did not trigger any job manual and neither did I find any configuration.
> Can you please help me find out the root cause of this case.
>
> I appreciate any help you can provide.
>
> Regards
> Prerna Jain
>


Apache Cassandra Contributor Call - Next Tuesday April 30th

2024-04-26 Thread Paul Au
Hi Everyone!

The Apache Cassandra Contributor Call will take place next *Tuesday, April
30th at 10AM PDT / 1PM EDT / 19:00 CET*.

This session will feature *Shailaja Koppu* who will be discussing *CEP-33 |
CIDR Filtering Authorizer*. You can register for the event on the Planet
Cassandra Global Meetup page


Best,
Paul Au


*Paul Au*
Community Manager
Constantia / DoK Community / Data Mesh Learning / Apache Cassandra
Contributor
LinkedIn 


Quick poll on content

2024-04-24 Thread Patrick McFadin
Hi everyone,

Yesterday, I did a live stream on "GenAI for Cassandra Teams" you can see
it on YouTube[1].

I love creating content that helps you work through problems or new things.
The GenAI thing has been hitting Cassandra teams with requests for new app
features and there are a lot of topics I could cover there. I put together
a quick poll and would love your feedback:
https://www.surveymonkey.com/r/S2XLR7B

It's just one question "What kind of content would be helpful for you?"
with multi-checkbox. If you don't see what you are looking for, add
something in the "Other" box.

Thanks for your time!

Patrick

1: https://www.youtube.com/live/k7EBhN_xXHA?si=H5iN27qUinx-bH6b


Re: Trouble with using group commitlog_sync

2024-04-24 Thread Bowen Song via user

Okay, that proves I was wrong on the client side bottleneck.

On 24/04/2024 17:55, Nathan Marz wrote:
I tried running two client processes in parallel and the numbers were 
unchanged. The max throughput is still a single client doing 10 
in-flight BatchStatements, each containing 100 inserts.


On Tue, Apr 23, 2024 at 10:24 PM Bowen Song via user 
 wrote:


You might have run into the bottleneck of the driver's IO thread.
Try increasing the driver's connections-per-server limit to 2 or 3
if you've only got 1 server in the cluster. Or alternatively, run
two client processes in parallel.
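
If raising the pool size, with the DataStax Java driver 4.x this is an
application.conf setting rather than code; a sketch (the values here are
illustrative, not recommendations):

```hocon
datastax-java-driver {
  advanced.connection {
    pool.local.size = 2              # connections per node in the local DC
    max-requests-per-connection = 1024
  }
}
```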


On 24/04/2024 07:19, Nathan Marz wrote:

Tried it again with one more client thread, and that had no
effect on performance. This is unsurprising as there are only 2 CPUs
on this node and they were already at 100%. These were good
ideas, but I'm still unable to even match the performance of
batch commit mode with group commit mode.

On Tue, Apr 23, 2024 at 12:46 PM Bowen Song via user
 wrote:

To achieve 10k loop iterations per second, each iteration
must take 0.1 milliseconds or less. Considering that each
iteration needs to lock and unlock the semaphore (two
syscalls) and make network requests (more syscalls), that's a
lot of context switches. It may be a bit too much to ask of a
single thread. I would suggest trying multi-threading or
multi-processing, and seeing if the combined insert rate is higher.

I should also note that executeAsync() also has implicit
limits on the number of in-flight requests, which default to
1024 requests per connection and 1 connection per server. See

https://docs.datastax.com/en/developer/java-driver/4.17/manual/core/pooling/
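
The semaphore + loop pattern under discussion can be sketched without a
cluster at all; below is a minimal asyncio version, where insert_one() is a
hypothetical stand-in for the driver's executeAsync() call:

```python
import asyncio

# Minimal sketch of the bounded in-flight pattern from this thread, with
# asyncio tasks standing in for the driver's executeAsync() futures.

MAX_IN_FLIGHT = 10  # mirrors the 10 in-flight BatchStatements discussed above

async def insert_one(i, stats):
    # Pretend round trip; track how many "requests" are in flight.
    stats["active"] += 1
    stats["peak"] = max(stats["peak"], stats["active"])
    await asyncio.sleep(0)  # yield, simulating waiting on the server
    stats["active"] -= 1
    stats["done"] += 1

async def run(n):
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)
    stats = {"active": 0, "peak": 0, "done": 0}

    async def bounded(i):
        async with sem:  # blocks once MAX_IN_FLIGHT requests are pending
            await insert_one(i, stats)

    await asyncio.gather(*(bounded(i) for i in range(n)))
    return stats

stats = asyncio.run(run(1000))
print(f"completed={stats['done']} peak_in_flight={stats['peak']}")
```

Note that a single event loop is still one thread; adding threads or
processes only helps once that thread is saturated, which is what the 100%
CPU numbers in this thread suggest.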


On 23/04/2024 23:18, Nathan Marz wrote:

It's using the async API, so why would it need multiple
threads? Using the exact same approach I'm able to get 38k /
second with periodic commitlog_sync. For what it's worth, I
do see 100% CPU utilization in every single one of these tests.

On Tue, Apr 23, 2024 at 11:01 AM Bowen Song via user
 wrote:

Have you checked the thread CPU utilisation of the
client side? You likely will need more than one thread
to do insertion in a loop to achieve tens of thousands
of inserts per second.


On 23/04/2024 21:55, Nathan Marz wrote:

Thanks for the explanation.

I tried again with commitlog_sync_group_window at 2ms,
concurrent_writes at 512, and doing 1000 individual
inserts at a time with the same loop + semaphore
approach. This only nets 9k / second.

I got much higher throughput for the other modes with
BatchStatement of 100 inserts rather than 100x more
individual inserts.

On Tue, Apr 23, 2024 at 10:45 AM Bowen Song via user
 wrote:

I suspect you are abusing batch statements. Batch
statements should only be used where atomicity or
isolation is needed. Using batch statements won't
make inserting multiple partitions faster. In fact,
it often will make that slower.

Also, the linear relationship between
commitlog_sync_group_window and write throughput is
expected. That's because the max number of
uncompleted writes is limited by the write
concurrency, and a write is not considered
"complete" before it is synced to disk when
commitlog sync is in group or batch mode. That
means within each interval, only a limited number of
writes can be done. The ways to increase that
include: adding more nodes, syncing the commitlog at
shorter intervals, and allowing more concurrent writes.
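
The knobs being tuned in this thread live in cassandra.yaml; a sketch of a
group-mode configuration (the values are the ones experimented with here,
not recommendations, and older 4.0 configs spell the window as
commitlog_sync_group_window_in_ms):

```yaml
commitlog_sync: group
commitlog_sync_group_window: 10ms   # sync interval; shorter window = higher throughput ceiling
concurrent_writes: 128              # max writes in flight inside the node
```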


On 23/04/2024 20:43, Nathan Marz wrote:

Thanks. I raised concurrent_writes to 128 and
set commitlog_sync_group_window to 20ms. This
causes a single execute of a BatchStatement
containing 100 inserts to succeed. However, the
throughput I'm seeing is atrocious.

With these settings, I'm executing 10
BatchStatement concurrently at a time using the
semaphore + loop approach I showed in my first
message. So as requests complete, more are sent
out such that there are 10 in-flight at a time.
Each BatchStatement has 100 individual inserts.
I'm seeing only 730 inserts / second. Again, with
periodic mode I see 38k / second and with batch I
see 14k / second. My expectation was that group
commit mode throughput would be somewhere between

Re: Trouble with using group commitlog_sync

2024-04-24 Thread Nathan Marz
I tried running two client processes in parallel and the numbers were
unchanged. The max throughput is still a single client doing 10 in-flight
BatchStatements, each containing 100 inserts.

On Tue, Apr 23, 2024 at 10:24 PM Bowen Song via user <
user@cassandra.apache.org> wrote:

> You might have run into the bottleneck of the driver's IO thread. Try
> increasing the driver's connections-per-server limit to 2 or 3 if you've only
> got 1 server in the cluster. Or alternatively, run two client processes in
> parallel.
>
>
> On 24/04/2024 07:19, Nathan Marz wrote:
>
> Tried it again with one more client thread, and that had no effect on
> performance. This is unsurprising as there are only 2 CPUs on this node and
> they were already at 100%. These were good ideas, but I'm still unable to
> even match the performance of batch commit mode with group commit mode.
>
> On Tue, Apr 23, 2024 at 12:46 PM Bowen Song via user <
> user@cassandra.apache.org> wrote:
>
>> To achieve 10k loop iterations per second, each iteration must take 0.1
>> milliseconds or less. Considering that each iteration needs to lock and
>> unlock the semaphore (two syscalls) and make network requests (more
>> syscalls), that's a lot of context switches. It may be a bit too much to ask
>> of a single thread. I would suggest trying multi-threading or
>> multi-processing, and seeing if the combined insert rate is higher.
>>
>> I should also note that executeAsync() also has implicit limits on the
>> number of in-flight requests, which default to 1024 requests per connection
>> and 1 connection per server. See
>> https://docs.datastax.com/en/developer/java-driver/4.17/manual/core/pooling/
>>
>>
>> On 23/04/2024 23:18, Nathan Marz wrote:
>>
>> It's using the async API, so why would it need multiple threads? Using
>> the exact same approach I'm able to get 38k / second with periodic
>> commitlog_sync. For what it's worth, I do see 100% CPU utilization in every
>> single one of these tests.
>>
>> On Tue, Apr 23, 2024 at 11:01 AM Bowen Song via user <
>> user@cassandra.apache.org> wrote:
>>
>>> Have you checked the thread CPU utilisation of the client side? You
>>> likely will need more than one thread to do insertion in a loop to achieve
>>> tens of thousands of inserts per second.
>>>
>>>
>>> On 23/04/2024 21:55, Nathan Marz wrote:
>>>
>>> Thanks for the explanation.
>>>
>>> I tried again with commitlog_sync_group_window at 2ms, concurrent_writes
>>> at 512, and doing 1000 individual inserts at a time with the same loop +
>>> semaphore approach. This only nets 9k / second.
>>>
>>> I got much higher throughput for the other modes with BatchStatement of
>>> 100 inserts rather than 100x more individual inserts.
>>>
>>> On Tue, Apr 23, 2024 at 10:45 AM Bowen Song via user <
>>> user@cassandra.apache.org> wrote:
>>>
 I suspect you are abusing batch statements. Batch statements should
 only be used where atomicity or isolation is needed. Using batch statements
 won't make inserting multiple partitions faster. In fact, it often will
 make that slower.

 Also, the linear relationship between commitlog_sync_group_window and
 write throughput is expected. That's because the max number of uncompleted
 writes is limited by the write concurrency, and a write is not considered
 "complete" before it is synced to disk when commitlog sync is in group or
 batch mode. That means within each interval, only a limited number of writes
 can be done. The ways to increase that include: adding more nodes, syncing
 the commitlog at shorter intervals, and allowing more concurrent writes.
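
For contrast with the misuse described above, a batch is appropriate when
the writes target the same partition or genuinely need atomicity; a
hypothetical example (keyspace, table and column names are made up):

```cql
-- Unlogged batch to a single partition: applied as one mutation,
-- with no cross-partition coordination overhead.
BEGIN UNLOGGED BATCH
  INSERT INTO app.events (device_id, ts, reading) VALUES ('d1', '2024-04-23 20:00:00', 1.2);
  INSERT INTO app.events (device_id, ts, reading) VALUES ('d1', '2024-04-23 20:00:01', 1.3);
APPLY BATCH;
```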


 On 23/04/2024 20:43, Nathan Marz wrote:

 Thanks. I raised concurrent_writes to 128 and
 set commitlog_sync_group_window to 20ms. This causes a single execute of a
 BatchStatement containing 100 inserts to succeed. However, the throughput
 I'm seeing is atrocious.

 With these settings, I'm executing 10 BatchStatement concurrently at a
 time using the semaphore + loop approach I showed in my first message. So
 as requests complete, more are sent out such that there are 10 in-flight at
 a time. Each BatchStatement has 100 individual inserts. I'm seeing only 730
 inserts / second. Again, with periodic mode I see 38k / second and with
 batch I see 14k / second. My expectation was that group commit mode
 throughput would be somewhere between those two.

 If I set commitlog_sync_group_window to 100ms, the throughput drops to
 14 / second.

 If I set commitlog_sync_group_window to 10ms, the throughput increases
 to 1587 / second.

 If I set commitlog_sync_group_window to 5ms, the throughput increases
 to 3200 / second.

 If I set commitlog_sync_group_window to 1ms, the throughput increases
 to 13k / second, which is slightly less than batch commit mode.

 Is group commit mode supposed to have better performance than batch
 mode?


 On Tue, Apr 23, 

Re: Mixed Cluster 4.0 and 4.1

2024-04-24 Thread Bowen Song via user

Hi Paul,

IMO, if they are truly risk-averse, they should follow the tested and 
proven best practices, instead of doing things in a less tested way 
which is also known to pose a danger to data correctness.


If they must do this over a long period of time, then they may need to 
temporarily increase the gc_grace_seconds on all tables, and ensure that 
no DDL or repair is run before the upgrade completes. It is unknown 
whether this route is safe, because it's a less tested route to upgrade 
a cluster.


Please be aware that if they do deletes frequently, increasing the 
gc_grace_seconds may cause some reads to fail due to the elevated number 
of tombstones.
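
If they go down that road, the change above is a per-table CQL setting; a
sketch (the table name is hypothetical, and it should be reverted once the
upgrade and repair are complete):

```cql
-- Raise gc_grace_seconds from the default 10 days (864000) to 30 days.
ALTER TABLE my_keyspace.my_table WITH gc_grace_seconds = 2592000;
```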


Cheers,
Bowen

On 24/04/2024 17:25, Paul Chandler wrote:

Hi Bowen,

Thanks for your quick reply.

Sorry, I used the wrong term there; it is a maintenance window rather than 
an outage. This is a key system, and its vital nature means that the 
customer is rightly very risk-averse, so we will only ever get permission to 
upgrade one DC per night via a rolling upgrade, meaning this will always 
take more than a week.

So we can’t shorten the time the cluster is in mixed mode, but I am concerned 
about having a schema mismatch for this long time. Should I be concerned, or 
have others upgraded in a similar way?

Thanks

Paul


On 24 Apr 2024, at 17:02, Bowen Song via user  wrote:

Hi Paul,

You don't need to plan for or introduce an outage for a rolling upgrade, which 
is the preferred route. It isn't advisable to take down an entire DC to do an 
upgrade.

You should aim to complete upgrading the entire cluster and finish a full 
repair within the shortest gc_grace_seconds (default to 10 days) of all tables. 
Failing to do that may cause data resurrections.

During the rolling upgrade, you should not run repair or any DDL query (such as 
ALTER TABLE, TRUNCATE, etc.).

You don't need to do the rolling upgrade node by node. You can do it rack by 
rack. Stopping all nodes in a single rack and upgrading them concurrently is much 
faster. The number of nodes doesn't matter that much to the time required to 
complete a rolling upgrade; it's the number of DCs and racks that matters.

Cheers,
Bowen

On 24/04/2024 16:16, Paul Chandler wrote:

Hi all,

We have some large clusters ( 1000+  nodes ), these are across multiple 
datacenters.

When we perform upgrades we would normally upgrade a DC at a time during a 
planned outage for one DC. This means that a cluster might be in a mixed mode 
with multiple versions for a week or 2.

We have noticed during our testing that upgrading to 4.1 causes a schema 
mismatch due to the new tables added to the system keyspace.

Is this going to be an issue if this schema mismatch lasts for maybe several 
weeks? I assume that running any DDL during that time would be a bad idea; are 
there any other issues to look out for?

Thanks

Paul Chandler


Re: Mixed Cluster 4.0 and 4.1

2024-04-24 Thread Paul Chandler
Hi Bowen,

Thanks for your quick reply. 

Sorry, I used the wrong term there; it is a maintenance window rather than 
an outage. This is a key system, and its vital nature means that the 
customer is rightly very risk-averse, so we will only ever get permission to 
upgrade one DC per night via a rolling upgrade, meaning this will always 
take more than a week.

So we can’t shorten the time the cluster is in mixed mode, but I am concerned 
about having a schema mismatch for this long time. Should I be concerned, or 
have others upgraded in a similar way?

Thanks

Paul

> On 24 Apr 2024, at 17:02, Bowen Song via user  
> wrote:
> 
> Hi Paul,
> 
> You don't need to plan for or introduce an outage for a rolling upgrade, 
> which is the preferred route. It isn't advisable to take down an entire DC to 
> do an upgrade.
> 
> You should aim to complete upgrading the entire cluster and finish a full 
> repair within the shortest gc_grace_seconds (default to 10 days) of all 
> tables. Failing to do that may cause data resurrections.
> 
> During the rolling upgrade, you should not run repair or any DDL query (such 
> as ALTER TABLE, TRUNCATE, etc.).
> 
> You don't need to do the rolling upgrade node by node. You can do it rack by 
> rack. Stopping all nodes in a single rack and upgrading them concurrently is 
> much faster. The number of nodes doesn't matter that much to the time 
> required to complete a rolling upgrade; it's the number of DCs and racks 
> that matters.
> 
> Cheers,
> Bowen
> 
> On 24/04/2024 16:16, Paul Chandler wrote:
>> Hi all,
>> 
>> We have some large clusters ( 1000+  nodes ), these are across multiple 
>> datacenters.
>> 
>> When we perform upgrades we would normally upgrade a DC at a time during a 
>> planned outage for one DC. This means that a cluster might be in a mixed 
>> mode with multiple versions for a week or 2.
>> 
>> We have noticed during our testing that upgrading to 4.1 causes a 
>> schema mismatch due to the new tables added to the system keyspace.
>> 
>> Is this going to be an issue if this schema mismatch lasts for maybe several 
>> weeks? I assume that running any DDL during that time would be a bad idea; 
>> are there any other issues to look out for?
>> 
>> Thanks
>> 
>> Paul Chandler



Re: Mixed Cluster 4.0 and 4.1

2024-04-24 Thread Bowen Song via user

Hi Paul,

You don't need to plan for or introduce an outage for a rolling upgrade, 
which is the preferred route. It isn't advisable to take down an entire 
DC to do an upgrade.


You should aim to complete upgrading the entire cluster and finish a 
full repair within the shortest gc_grace_seconds (default to 10 days) of 
all tables. Failing to do that may cause data resurrections.


During the rolling upgrade, you should not run repair or any DDL query 
(such as ALTER TABLE, TRUNCATE, etc.).


You don't need to do the rolling upgrade node by node. You can do it 
rack by rack. Stopping all nodes in a single rack and upgrading them 
concurrently is much faster. The number of nodes doesn't matter that 
much to the time required to complete a rolling upgrade; it's the number 
of DCs and racks that matters.


Cheers,
Bowen

On 24/04/2024 16:16, Paul Chandler wrote:

Hi all,

We have some large clusters ( 1000+  nodes ), these are across multiple 
datacenters.

When we perform upgrades we would normally upgrade a DC at a time during a 
planned outage for one DC. This means that a cluster might be in a mixed mode 
with multiple versions for a week or 2.

We have noticed during our testing that upgrading to 4.1 causes a schema 
mismatch due to the new tables added to the system keyspace.

Is this going to be an issue if this schema mismatch lasts for maybe several 
weeks? I assume that running any DDL during that time would be a bad idea; are 
there any other issues to look out for?

Thanks

Paul Chandler


Mixed Cluster 4.0 and 4.1

2024-04-24 Thread Paul Chandler
Hi all,

We have some large clusters ( 1000+  nodes ), these are across multiple 
datacenters. 

When we perform upgrades we would normally upgrade a DC at a time during a 
planned outage for one DC. This means that a cluster might be in a mixed mode 
with multiple versions for a week or 2.

We have noticed during our testing that upgrading to 4.1 causes a schema 
mismatch due to the new tables added to the system keyspace.

Is this going to be an issue if this schema mismatch lasts for maybe several 
weeks? I assume that running any DDL during that time would be a bad idea; are 
there any other issues to look out for?

Thanks

Paul Chandler
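
During the mixed-version window, the mismatch itself can be observed from
any node; for example (assuming nodetool is on the PATH):

```shell
# Shows one schema version per Cassandra version while 4.0 and 4.1 nodes
# coexist; the versions should converge once every node is on 4.1.
nodetool describecluster
```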

Re: Trouble with using group commitlog_sync

2024-04-24 Thread Bowen Song via user
You might have run into the bottleneck of the driver's IO thread. Try 
increasing the driver's connections-per-server limit to 2 or 3 if you've 
only got 1 server in the cluster. Or alternatively, run two client 
processes in parallel.



On 24/04/2024 07:19, Nathan Marz wrote:
Tried it again with one more client thread, and that had no effect on 
performance. This is unsurprising as there are only 2 CPUs on this node 
and they were already at 100%. These were good ideas, but I'm still 
unable to even match the performance of batch commit mode with group 
commit mode.


On Tue, Apr 23, 2024 at 12:46 PM Bowen Song via user 
 wrote:


To achieve 10k loop iterations per second, each iteration must
take 0.1 milliseconds or less. Considering that each iteration
needs to lock and unlock the semaphore (two syscalls) and make
network requests (more syscalls), that's a lot of context
switches. It may be a bit too much to ask of a single thread. I
would suggest trying multi-threading or multi-processing, and
seeing if the combined insert rate is higher.

I should also note that executeAsync() also has implicit limits on
the number of in-flight requests, which default to 1024 requests
per connection and 1 connection per server. See
https://docs.datastax.com/en/developer/java-driver/4.17/manual/core/pooling/


On 23/04/2024 23:18, Nathan Marz wrote:

It's using the async API, so why would it need multiple threads?
Using the exact same approach I'm able to get 38k / second with
periodic commitlog_sync. For what it's worth, I do see 100% CPU
utilization in every single one of these tests.

On Tue, Apr 23, 2024 at 11:01 AM Bowen Song via user
 wrote:

Have you checked the thread CPU utilisation of the client
side? You likely will need more than one thread to do
insertion in a loop to achieve tens of thousands of inserts
per second.


On 23/04/2024 21:55, Nathan Marz wrote:

Thanks for the explanation.

I tried again with commitlog_sync_group_window at 2ms,
concurrent_writes at 512, and doing 1000 individual inserts
at a time with the same loop + semaphore approach. This only
nets 9k / second.

I got much higher throughput for the other modes with
BatchStatement of 100 inserts rather than 100x more
individual inserts.

On Tue, Apr 23, 2024 at 10:45 AM Bowen Song via user
 wrote:

I suspect you are abusing batch statements. Batch
statements should only be used where atomicity or
isolation is needed. Using batch statements won't make
inserting multiple partitions faster. In fact, it often
will make that slower.

Also, the linear relationship between
commitlog_sync_group_window and write throughput is
expected. That's because the max number of uncompleted
writes is limited by the write concurrency, and a write
is not considered "complete" before it is synced to disk
when commitlog sync is in group or batch mode. That
means within each interval, only a limited number of
writes can be done. The ways to increase that include:
adding more nodes, syncing the commitlog at shorter
intervals, and allowing more concurrent writes.


On 23/04/2024 20:43, Nathan Marz wrote:

Thanks. I raised concurrent_writes to 128 and
set commitlog_sync_group_window to 20ms. This causes a
single execute of a BatchStatement containing 100
inserts to succeed. However, the throughput I'm seeing
is atrocious.

With these settings, I'm executing 10 BatchStatement
concurrently at a time using the semaphore + loop
approach I showed in my first message. So as requests
complete, more are sent out such that there are 10
in-flight at a time. Each BatchStatement has 100
individual inserts. I'm seeing only 730 inserts /
second. Again, with periodic mode I see 38k / second
and with batch I see 14k / second. My expectation was
that group commit mode throughput would be somewhere
between those two.

If I set commitlog_sync_group_window to 100ms, the
throughput drops to 14 / second.

If I set commitlog_sync_group_window to 10ms, the
throughput increases to 1587 / second.

If I set commitlog_sync_group_window to 5ms, the
throughput increases to 3200 / second.

If I set commitlog_sync_group_window to 1ms, the
throughput increases to 13k / second, which is slightly
less than batch commit mode.

Is group commit mode supposed to have better
performance than batch mode?


On Tue, Apr 23, 2024 at 8:46 AM 

Re: Trouble with using group commitlog_sync

2024-04-24 Thread Nathan Marz
Tried it again with one more client thread, and that had no effect on
performance. This is unsurprising as there are only 2 CPUs on this node and
they were already at 100%. These were good ideas, but I'm still unable to
even match the performance of batch commit mode with group commit mode.

On Tue, Apr 23, 2024 at 12:46 PM Bowen Song via user <
user@cassandra.apache.org> wrote:

> To achieve 10k loop iterations per second, each iteration must take 0.1
> milliseconds or less. Considering that each iteration needs to lock and
> unlock the semaphore (two syscalls) and make network requests (more
> syscalls), that's a lot of context switches. It may be a bit too much to ask
> of a single thread. I would suggest trying multi-threading or
> multi-processing, and seeing if the combined insert rate is higher.
>
> I should also note that executeAsync() also has implicit limits on the
> number of in-flight requests, which default to 1024 requests per connection
> and 1 connection per server. See
> https://docs.datastax.com/en/developer/java-driver/4.17/manual/core/pooling/
>
>
> On 23/04/2024 23:18, Nathan Marz wrote:
>
> It's using the async API, so why would it need multiple threads? Using the
> exact same approach I'm able to get 38k / second with periodic
> commitlog_sync. For what it's worth, I do see 100% CPU utilization in every
> single one of these tests.
>
> On Tue, Apr 23, 2024 at 11:01 AM Bowen Song via user <
> user@cassandra.apache.org> wrote:
>
>> Have you checked the thread CPU utilisation of the client side? You
>> likely will need more than one thread to do insertion in a loop to achieve
>> tens of thousands of inserts per second.
>>
>>
>> On 23/04/2024 21:55, Nathan Marz wrote:
>>
>> Thanks for the explanation.
>>
>> I tried again with commitlog_sync_group_window at 2ms, concurrent_writes
>> at 512, and doing 1000 individual inserts at a time with the same loop +
>> semaphore approach. This only nets 9k / second.
>>
>> I got much higher throughput for the other modes with BatchStatement of
>> 100 inserts rather than 100x more individual inserts.
>>
>> On Tue, Apr 23, 2024 at 10:45 AM Bowen Song via user <
>> user@cassandra.apache.org> wrote:
>>
>>> I suspect you are abusing batch statements. Batch statements should only
>>> be used where atomicity or isolation is needed. Using batch statements
>>> won't make inserting multiple partitions faster. In fact, it often will
>>> make that slower.
>>>
>>> Also, the linear relationship between commitlog_sync_group_window and
>>> write throughput is expected. That's because the max number of uncompleted
>>> writes is limited by the write concurrency, and a write is not considered
>>> "complete" before it is synced to disk when commitlog sync is in group or
>>> batch mode. That means within each interval, only a limited number of writes
>>> can be done. The ways to increase that include: adding more nodes, syncing
>>> the commitlog at shorter intervals, and allowing more concurrent writes.
>>>
>>>
>>> On 23/04/2024 20:43, Nathan Marz wrote:
>>>
>>> Thanks. I raised concurrent_writes to 128 and
>>> set commitlog_sync_group_window to 20ms. This causes a single execute of a
>>> BatchStatement containing 100 inserts to succeed. However, the throughput
>>> I'm seeing is atrocious.
>>>
>>> With these settings, I'm executing 10 BatchStatement concurrently at a
>>> time using the semaphore + loop approach I showed in my first message. So
>>> as requests complete, more are sent out such that there are 10 in-flight at
>>> a time. Each BatchStatement has 100 individual inserts. I'm seeing only 730
>>> inserts / second. Again, with periodic mode I see 38k / second and with
>>> batch I see 14k / second. My expectation was that group commit mode
>>> throughput would be somewhere between those two.
>>>
>>> If I set commitlog_sync_group_window to 100ms, the throughput drops to
>>> 14 / second.
>>>
>>> If I set commitlog_sync_group_window to 10ms, the throughput increases
>>> to 1587 / second.
>>>
>>> If I set commitlog_sync_group_window to 5ms, the throughput increases to
>>> 3200 / second.
>>>
>>> If I set commitlog_sync_group_window to 1ms, the throughput increases to
>>> 13k / second, which is slightly less than batch commit mode.
>>>
>>> Is group commit mode supposed to have better performance than batch mode?
>>>
>>>
>>> On Tue, Apr 23, 2024 at 8:46 AM Bowen Song via user <
>>> user@cassandra.apache.org> wrote:
>>>
 The default commitlog_sync_group_window is very long for SSDs. Try
 reducing it if you are using SSD-backed storage for the commit log. 10-15 ms
 is a good starting point. You may also want to increase the value of
 concurrent_writes; consider at least doubling or quadrupling it from the
 default. You'll need even higher write concurrency for longer
 commitlog_sync_group_window.

 On 23/04/2024 19:26, Nathan Marz wrote:

 "batch" mode works fine. I'm having trouble with "group" mode. The only
 config for that is "commitlog_sync_group_window", 

Re: Trouble with using group commitlog_sync

2024-04-23 Thread Bowen Song via user
To achieve 10k loop iterations per second, each iteration must take 0.1 
milliseconds or less. Considering that each iteration needs to lock and 
unlock the semaphore (two syscalls) and make network requests (more 
syscalls), that's a lot of context switches. It may be a bit too much to 
ask of a single thread. I would suggest trying multi-threading or 
multi-processing, and seeing if the combined insert rate is higher.


I should also note that executeAsync() also has implicit limits on the 
number of in-flight requests, which default to 1024 requests per 
connection and 1 connection per server. See 
https://docs.datastax.com/en/developer/java-driver/4.17/manual/core/pooling/



On 23/04/2024 23:18, Nathan Marz wrote:
It's using the async API, so why would it need multiple threads? Using 
the exact same approach I'm able to get 38k / second with periodic 
commitlog_sync. For what it's worth, I do see 100% CPU utilization in 
every single one of these tests.


On Tue, Apr 23, 2024 at 11:01 AM Bowen Song via user 
 wrote:


Have you checked the thread CPU utilisation of the client side?
You likely will need more than one thread to do insertion in a
loop to achieve tens of thousands of inserts per second.


On 23/04/2024 21:55, Nathan Marz wrote:

Thanks for the explanation.

I tried again with commitlog_sync_group_window at 2ms,
concurrent_writes at 512, and doing 1000 individual inserts at a
time with the same loop + semaphore approach. This only nets 9k /
second.

I got much higher throughput for the other modes with
BatchStatement of 100 inserts rather than 100x more individual
inserts.

On Tue, Apr 23, 2024 at 10:45 AM Bowen Song via user
 wrote:

I suspect you are abusing batch statements. Batch statements
should only be used where atomicity or isolation is needed.
Using batch statements won't make inserting multiple
partitions faster. In fact, it often will make that slower.

Also, the linear relationship between
commitlog_sync_group_window and write throughput is expected.
That's because the max number of uncompleted writes is
limited by the write concurrency, and a write is not
considered "complete" before it is synced to disk when
commitlog sync is in group or batch mode. That means within
each interval, only a limited number of writes can be done.
The ways to increase that include: adding more nodes, syncing
the commitlog at shorter intervals, and allowing more concurrent writes.


On 23/04/2024 20:43, Nathan Marz wrote:

Thanks. I raised concurrent_writes to 128 and
set commitlog_sync_group_window to 20ms. This causes a
single execute of a BatchStatement containing 100 inserts to
succeed. However, the throughput I'm seeing is atrocious.

With these settings, I'm executing 10 BatchStatement
concurrently at a time using the semaphore + loop approach I
showed in my first message. So as requests complete, more
are sent out such that there are 10 in-flight at a time.
Each BatchStatement has 100 individual inserts. I'm seeing
only 730 inserts / second. Again, with periodic mode I see
38k / second and with batch I see 14k / second. My
expectation was that group commit mode throughput would be
somewhere between those two.

If I set commitlog_sync_group_window to 100ms, the
throughput drops to 14 / second.

If I set commitlog_sync_group_window to 10ms, the throughput
increases to 1587 / second.

If I set commitlog_sync_group_window to 5ms, the throughput
increases to 3200 / second.

If I set commitlog_sync_group_window to 1ms, the throughput
increases to 13k / second, which is slightly less than batch
commit mode.

Is group commit mode supposed to have better performance
than batch mode?


On Tue, Apr 23, 2024 at 8:46 AM Bowen Song via user
 wrote:

The default commitlog_sync_group_window is very long for
SSDs. Try reducing it if you are using SSD-backed storage
for the commit log. 10-15 ms is a good starting point.
You may also want to increase the value of
concurrent_writes; consider at least doubling or quadrupling
it from the default. You'll need even higher write
concurrency for longer commitlog_sync_group_window.


On 23/04/2024 19:26, Nathan Marz wrote:

"batch" mode works fine. I'm having trouble with
"group" mode. The only config for that is
"commitlog_sync_group_window", and I have that set to
the default 1000ms.

On Tue, Apr 23, 2024 at 8:15 AM Bowen Song via user
 wrote:

Why would you want to set
commitlog_sync_batch_window to 1 second long when
 

Re: Trouble with using group commitlog_sync

2024-04-23 Thread Nathan Marz
It's using the async API, so why would it need multiple threads? Using the
exact same approach I'm able to get 38k / second with periodic
commitlog_sync. For what it's worth, I do see 100% CPU utilization in every
single one of these tests.

On Tue, Apr 23, 2024 at 11:01 AM Bowen Song via user <
user@cassandra.apache.org> wrote:

> Have you checked the thread CPU utilisation of the client side? You likely
> will need more than one thread to do insertion in a loop to achieve tens of
> thousands of inserts per second.
>
>
> On 23/04/2024 21:55, Nathan Marz wrote:
>
> Thanks for the explanation.
>
> I tried again with commitlog_sync_group_window at 2ms, concurrent_writes
> at 512, and doing 1000 individual inserts at a time with the same loop +
> semaphore approach. This only nets 9k / second.
>
> I got much higher throughput for the other modes with BatchStatement of
> 100 inserts rather than 100x more individual inserts.
>
> On Tue, Apr 23, 2024 at 10:45 AM Bowen Song via user <
> user@cassandra.apache.org> wrote:
>
>> I suspect you are abusing batch statements. Batch statements should only
>> be used where atomicity or isolation is needed. Using batch statements
>> won't make inserting multiple partitions faster. In fact, it often will
>> make that slower.
>>
>> Also, the linear relationship between commitlog_sync_group_window and
>> write throughput is expected. That's because the max number of uncompleted
>> writes is limited by the write concurrency, and a write is not considered
>> "complete" before it is synced to disk when commitlog sync is in group or
>> batch mode. That means within each interval, only a limited number of writes
>> can be done. The ways to increase that include: adding more nodes, syncing
>> the commitlog at shorter intervals, and allowing more concurrent writes.
>>
>>
>> On 23/04/2024 20:43, Nathan Marz wrote:
>>
>> Thanks. I raised concurrent_writes to 128 and
>> set commitlog_sync_group_window to 20ms. This causes a single execute of a
>> BatchStatement containing 100 inserts to succeed. However, the throughput
>> I'm seeing is atrocious.
>>
>> With these settings, I'm executing 10 BatchStatements concurrently using the
>> semaphore + loop approach I showed in my first message. So
>> as requests complete, more are sent out such that there are 10 in-flight at
>> a time. Each BatchStatement has 100 individual inserts. I'm seeing only 730
>> inserts / second. Again, with periodic mode I see 38k / second and with
>> batch I see 14k / second. My expectation was that group commit mode
>> throughput would be somewhere between those two.
>>
>> If I set commitlog_sync_group_window to 100ms, the throughput drops to 14
>> / second.
>>
>> If I set commitlog_sync_group_window to 10ms, the throughput increases to
>> 1587 / second.
>>
>> If I set commitlog_sync_group_window to 5ms, the throughput increases to
>> 3200 / second.
>>
>> If I set commitlog_sync_group_window to 1ms, the throughput increases to
>> 13k / second, which is slightly less than batch commit mode.
>>
>> Is group commit mode supposed to have better performance than batch mode?
>>
>>
>> On Tue, Apr 23, 2024 at 8:46 AM Bowen Song via user <
>> user@cassandra.apache.org> wrote:
>>
>>> The default commitlog_sync_group_window is very long for SSDs. Try
>>> reducing it if you are using SSD-backed storage for the commit log. 10-15 ms
>>> is a good starting point. You may also want to increase the value of
>>> concurrent_writes; consider at least doubling or quadrupling it from the
>>> default. You'll need even higher write concurrency for a longer
>>> commitlog_sync_group_window.
>>>
>>> On 23/04/2024 19:26, Nathan Marz wrote:
>>>
>>> "batch" mode works fine. I'm having trouble with "group" mode. The only
>>> config for that is "commitlog_sync_group_window", and I have that set to
>>> the default 1000ms.
>>>
>>> On Tue, Apr 23, 2024 at 8:15 AM Bowen Song via user <
>>> user@cassandra.apache.org> wrote:
>>>
 Why would you want to set commitlog_sync_batch_window to 1 second long
 when commitlog_sync is set to batch mode? The documentation
 
 on this says:

 *This window should be kept short because the writer threads will be
 unable to do extra work while waiting. You may need to increase
 concurrent_writes for the same reason*

 If you want to use batch mode, at least ensure
 commitlog_sync_batch_window is reasonably short. The default is 2
 milliseconds.


 On 23/04/2024 18:32, Nathan Marz wrote:

 I'm doing some benchmarking of Cassandra on a single m6gd.large
 instance. It works fine with periodic or batch commitlog_sync options, but
 I'm having tons of issues when I change it to "group". I have
 "commitlog_sync_group_window" set to 1000ms.

 My client is doing writes like this (pseudocode):

 Semaphore sem = new Semaphore(numTickets);
 while(true) {

 

Re: Trouble with using group commitlog_sync

2024-04-23 Thread Bowen Song via user
Have you checked the thread CPU utilisation of the client side? You 
likely will need more than one thread to do insertion in a loop to 
achieve tens of thousands of inserts per second.



On 23/04/2024 21:55, Nathan Marz wrote:

Thanks for the explanation.

I tried again with commitlog_sync_group_window at 2ms, 
concurrent_writes at 512, and doing 1000 individual inserts at a time 
with the same loop + semaphore approach. This only nets 9k / second.


I got much higher throughput for the other modes with BatchStatement 
of 100 inserts rather than 100x more individual inserts.


On Tue, Apr 23, 2024 at 10:45 AM Bowen Song via user 
 wrote:


I suspect you are abusing batch statements. Batch statements
should only be used where atomicity or isolation is needed. Using
batch statements won't make inserting multiple partitions faster.
In fact, it often will make that slower.

Also, the linear relationship between commitlog_sync_group_window
and write throughput is expected. That's because the max number of
uncompleted writes is limited by the write concurrency, and a
write is not considered "complete" before it is synced to disk
when commitlog sync is in group or batch mode. That means within
each interval, only a limited number of writes can be done. The ways
to increase that include: adding more nodes, syncing the commitlog at
shorter intervals, and allowing more concurrent writes.


On 23/04/2024 20:43, Nathan Marz wrote:

Thanks. I raised concurrent_writes to 128 and
set commitlog_sync_group_window to 20ms. This causes a single
execute of a BatchStatement containing 100 inserts to succeed.
However, the throughput I'm seeing is atrocious.

With these settings, I'm executing 10 BatchStatements concurrently
using the semaphore + loop approach I showed in my
first message. So as requests complete, more are sent out such
that there are 10 in-flight at a time. Each BatchStatement has
100 individual inserts. I'm seeing only 730 inserts / second.
Again, with periodic mode I see 38k / second and with batch I see
14k / second. My expectation was that group commit mode
throughput would be somewhere between those two.

If I set commitlog_sync_group_window to 100ms, the throughput
drops to 14 / second.

If I set commitlog_sync_group_window to 10ms, the throughput
increases to 1587 / second.

If I set commitlog_sync_group_window to 5ms, the throughput
increases to 3200 / second.

If I set commitlog_sync_group_window to 1ms, the throughput
increases to 13k / second, which is slightly less than batch
commit mode.

Is group commit mode supposed to have better performance than
batch mode?


On Tue, Apr 23, 2024 at 8:46 AM Bowen Song via user
 wrote:

The default commitlog_sync_group_window is very long for
SSDs. Try reducing it if you are using SSD-backed storage for
the commit log. 10-15 ms is a good starting point. You may
also want to increase the value of concurrent_writes;
consider at least doubling or quadrupling it from the default.
You'll need even higher write concurrency for a longer
commitlog_sync_group_window.


On 23/04/2024 19:26, Nathan Marz wrote:

"batch" mode works fine. I'm having trouble with "group"
mode. The only config for that is
"commitlog_sync_group_window", and I have that set to the
default 1000ms.

On Tue, Apr 23, 2024 at 8:15 AM Bowen Song via user
 wrote:

Why would you want to set commitlog_sync_batch_window to
1 second long when commitlog_sync is set to batch mode?
The documentation


on this says:

/This window should be kept short because the writer
threads will be unable to do extra work while
waiting. You may need to increase concurrent_writes
for the same reason/

If you want to use batch mode, at least ensure
commitlog_sync_batch_window is reasonably short. The
default is 2 millisecond.


On 23/04/2024 18:32, Nathan Marz wrote:

I'm doing some benchmarking of Cassandra on a single
m6gd.large instance. It works fine with periodic or
batch commitlog_sync options, but I'm having tons of
issues when I change it to "group". I have
"commitlog_sync_group_window" set to 1000ms.

My client is doing writes like this (pseudocode):

Semaphore sem = new Semaphore(numTickets);
while(true) {

sem.acquire();
session.executeAsync(insert.bind(genUUIDStr(),
genUUIDStr(), genUUIDStr()))
            .whenComplete((t, u) -> sem.release());

}


Re: Trouble with using group commitlog_sync

2024-04-23 Thread Nathan Marz
Thanks for the explanation.

I tried again with commitlog_sync_group_window at 2ms, concurrent_writes at
512, and doing 1000 individual inserts at a time with the same loop +
semaphore approach. This only nets 9k / second.

I got much higher throughput for the other modes with BatchStatement of 100
inserts rather than 100x more individual inserts.

On Tue, Apr 23, 2024 at 10:45 AM Bowen Song via user <
user@cassandra.apache.org> wrote:

> I suspect you are abusing batch statements. Batch statements should only
> be used where atomicity or isolation is needed. Using batch statements
> won't make inserting multiple partitions faster. In fact, it often will
> make that slower.
>
> Also, the linear relationship between commitlog_sync_group_window and
> write throughput is expected. That's because the max number of uncompleted
> writes is limited by the write concurrency, and a write is not considered
> "complete" before it is synced to disk when commitlog sync is in group or
> batch mode. That means within each interval, only a limited number of writes
> can be done. The ways to increase that include: adding more nodes, syncing the
> commitlog at shorter intervals, and allowing more concurrent writes.
>
>
> On 23/04/2024 20:43, Nathan Marz wrote:
>
> Thanks. I raised concurrent_writes to 128 and
> set commitlog_sync_group_window to 20ms. This causes a single execute of a
> BatchStatement containing 100 inserts to succeed. However, the throughput
> I'm seeing is atrocious.
>
> With these settings, I'm executing 10 BatchStatements concurrently using the
> semaphore + loop approach I showed in my first message. So
> as requests complete, more are sent out such that there are 10 in-flight at
> a time. Each BatchStatement has 100 individual inserts. I'm seeing only 730
> inserts / second. Again, with periodic mode I see 38k / second and with
> batch I see 14k / second. My expectation was that group commit mode
> throughput would be somewhere between those two.
>
> If I set commitlog_sync_group_window to 100ms, the throughput drops to 14
> / second.
>
> If I set commitlog_sync_group_window to 10ms, the throughput increases to
> 1587 / second.
>
> If I set commitlog_sync_group_window to 5ms, the throughput increases to
> 3200 / second.
>
> If I set commitlog_sync_group_window to 1ms, the throughput increases to
> 13k / second, which is slightly less than batch commit mode.
>
> Is group commit mode supposed to have better performance than batch mode?
>
>
> On Tue, Apr 23, 2024 at 8:46 AM Bowen Song via user <
> user@cassandra.apache.org> wrote:
>
>> The default commitlog_sync_group_window is very long for SSDs. Try
>> reducing it if you are using SSD-backed storage for the commit log. 10-15 ms
>> is a good starting point. You may also want to increase the value of
>> concurrent_writes; consider at least doubling or quadrupling it from the
>> default. You'll need even higher write concurrency for a longer
>> commitlog_sync_group_window.
>>
>> On 23/04/2024 19:26, Nathan Marz wrote:
>>
>> "batch" mode works fine. I'm having trouble with "group" mode. The only
>> config for that is "commitlog_sync_group_window", and I have that set to
>> the default 1000ms.
>>
>> On Tue, Apr 23, 2024 at 8:15 AM Bowen Song via user <
>> user@cassandra.apache.org> wrote:
>>
>>> Why would you want to set commitlog_sync_batch_window to 1 second long
>>> when commitlog_sync is set to batch mode? The documentation
>>> 
>>> on this says:
>>>
>>> *This window should be kept short because the writer threads will be
>>> unable to do extra work while waiting. You may need to increase
>>> concurrent_writes for the same reason*
>>>
>>> If you want to use batch mode, at least ensure
>>> commitlog_sync_batch_window is reasonably short. The default is 2
>>> milliseconds.
>>>
>>>
>>> On 23/04/2024 18:32, Nathan Marz wrote:
>>>
>>> I'm doing some benchmarking of Cassandra on a single m6gd.large
>>> instance. It works fine with periodic or batch commitlog_sync options, but
>>> I'm having tons of issues when I change it to "group". I have
>>> "commitlog_sync_group_window" set to 1000ms.
>>>
>>> My client is doing writes like this (pseudocode):
>>>
>>> Semaphore sem = new Semaphore(numTickets);
>>> while(true) {
>>>
>>> sem.acquire();
>>> session.executeAsync(insert.bind(genUUIDStr(), genUUIDStr(),
>>> genUUIDStr()))
>>> .whenComplete((t, u) -> sem.release());
>>>
>>> }
>>>
>>> If I set numTickets higher than 20, I get tons of timeout errors.
>>>
>>> I've also tried doing single commands with BatchStatement with many
>>> inserts at a time, and that fails with timeout when the batch size gets
>>> more than 20.
>>>
>>> Increasing the write request timeout in cassandra.yaml makes it time out
>>> at slightly higher numbers of concurrent requests.
>>>
>>> With periodic I'm able to get about 38k writes / second, and with batch
>>> I'm able to get about 14k / second.
>>>
>>> Any tips on 

Re: Trouble with using group commitlog_sync

2024-04-23 Thread Bowen Song via user
I suspect you are abusing batch statements. Batch statements should only 
be used where atomicity or isolation is needed. Using batch statements 
won't make inserting multiple partitions faster. In fact, it often will 
make that slower.


Also, the linear relationship between commitlog_sync_group_window and 
write throughput is expected. That's because the max number of 
uncompleted writes is limited by the write concurrency, and a write is 
not considered "complete" before it is synced to disk when commitlog 
sync is in group or batch mode. That means within each interval, only 
a limited number of writes can be done. The ways to increase that 
include: adding more nodes, syncing the commitlog at shorter intervals, 
and allowing more concurrent writes.
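The explanation above can be turned into a back-of-envelope model (a sketch with illustrative numbers, not a benchmark): in group mode a write only completes at the next commitlog fsync, so at most the number of in-flight writes can finish per sync window.

```java
// Rough throughput ceiling for group commitlog sync: each in-flight write
// waits for the next fsync, so at most `concurrentWrites` writes can
// complete per `windowMillis` window. Numbers here are illustrative only.
public class GroupSyncCeiling {
    static long ceilingPerSecond(long concurrentWrites, long windowMillis) {
        // windows per second = 1000 / windowMillis; each window completes
        // at most `concurrentWrites` writes.
        return concurrentWrites * 1000L / windowMillis;
    }

    public static void main(String[] args) {
        // 10 in-flight requests with a 20 ms window: at most 500
        // completions per second, regardless of hardware.
        System.out.println(ceilingPerSecond(10, 20));
        // Shrinking the window or raising concurrency lifts the ceiling.
        System.out.println(ceilingPerSecond(512, 2));
    }
}
```

This is also why the throughput numbers reported above scale roughly linearly with the inverse of the window: halving commitlog_sync_group_window doubles the number of sync points per second, and therefore the ceiling.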



On 23/04/2024 20:43, Nathan Marz wrote:
Thanks. I raised concurrent_writes to 128 and 
set commitlog_sync_group_window to 20ms. This causes a single execute 
of a BatchStatement containing 100 inserts to succeed. However, the 
throughput I'm seeing is atrocious.


With these settings, I'm executing 10 BatchStatements concurrently 
using the semaphore + loop approach I showed in my first message. 
So as requests complete, more are sent out such that there are 10 
in-flight at a time. Each BatchStatement has 100 individual inserts. 
I'm seeing only 730 inserts / second. Again, with periodic mode I see 
38k / second and with batch I see 14k / second. My expectation was 
that group commit mode throughput would be somewhere between those two.


If I set commitlog_sync_group_window to 100ms, the throughput drops to 
14 / second.


If I set commitlog_sync_group_window to 10ms, the throughput increases 
to 1587 / second.


If I set commitlog_sync_group_window to 5ms, the throughput increases 
to 3200 / second.


If I set commitlog_sync_group_window to 1ms, the throughput increases 
to 13k / second, which is slightly less than batch commit mode.


Is group commit mode supposed to have better performance than batch mode?


On Tue, Apr 23, 2024 at 8:46 AM Bowen Song via user 
 wrote:


The default commitlog_sync_group_window is very long for SSDs. Try
reducing it if you are using SSD-backed storage for the commit log.
10-15 ms is a good starting point. You may also want to increase
the value of concurrent_writes; consider at least doubling or
quadrupling it from the default. You'll need even higher write
concurrency for a longer commitlog_sync_group_window.


On 23/04/2024 19:26, Nathan Marz wrote:

"batch" mode works fine. I'm having trouble with "group" mode.
The only config for that is "commitlog_sync_group_window", and I
have that set to the default 1000ms.

On Tue, Apr 23, 2024 at 8:15 AM Bowen Song via user
 wrote:

Why would you want to set commitlog_sync_batch_window to 1
second long when commitlog_sync is set to batch mode? The
documentation


on this says:

/This window should be kept short because the writer
threads will be unable to do extra work while waiting.
You may need to increase concurrent_writes for the same
reason/

If you want to use batch mode, at least ensure
commitlog_sync_batch_window is reasonably short. The default
is 2 milliseconds.


On 23/04/2024 18:32, Nathan Marz wrote:

I'm doing some benchmarking of Cassandra on a single
m6gd.large instance. It works fine with periodic or batch
commitlog_sync options, but I'm having tons of issues when I
change it to "group". I have "commitlog_sync_group_window"
set to 1000ms.

My client is doing writes like this (pseudocode):

Semaphore sem = new Semaphore(numTickets);
while(true) {

sem.acquire();
session.executeAsync(insert.bind(genUUIDStr(),
genUUIDStr(), genUUIDStr()))
            .whenComplete((t, u) -> sem.release());

}

If I set numTickets higher than 20, I get tons of timeout
errors.

I've also tried doing single commands with BatchStatement
with many inserts at a time, and that fails with timeout
when the batch size gets more than 20.

Increasing the write request timeout in cassandra.yaml makes
it time out at slightly higher numbers of concurrent requests.

With periodic I'm able to get about 38k writes / second, and
with batch I'm able to get about 14k / second.

Any tips on what I should be doing to get group
commitlog_sync to work properly? I didn't expect to have to
do anything other than change the config.


Re: Trouble with using group commitlog_sync

2024-04-23 Thread Nathan Marz
Thanks. I raised concurrent_writes to 128 and
set commitlog_sync_group_window to 20ms. This causes a single execute of a
BatchStatement containing 100 inserts to succeed. However, the throughput
I'm seeing is atrocious.

With these settings, I'm executing 10 BatchStatements concurrently
using the semaphore + loop approach I showed in my first message. So as
requests complete, more are sent out such that there are 10 in-flight at a
time. Each BatchStatement has 100 individual inserts. I'm seeing only 730
inserts / second. Again, with periodic mode I see 38k / second and with
batch I see 14k / second. My expectation was that group commit mode
throughput would be somewhere between those two.

If I set commitlog_sync_group_window to 100ms, the throughput drops to 14 /
second.

If I set commitlog_sync_group_window to 10ms, the throughput increases to
1587 / second.

If I set commitlog_sync_group_window to 5ms, the throughput increases to
3200 / second.

If I set commitlog_sync_group_window to 1ms, the throughput increases to
13k / second, which is slightly less than batch commit mode.

Is group commit mode supposed to have better performance than batch mode?


On Tue, Apr 23, 2024 at 8:46 AM Bowen Song via user <
user@cassandra.apache.org> wrote:

> The default commitlog_sync_group_window is very long for SSDs. Try reducing
> it if you are using SSD-backed storage for the commit log. 10-15 ms is a
> good starting point. You may also want to increase the value of
> concurrent_writes; consider at least doubling or quadrupling it from the
> default. You'll need even higher write concurrency for a longer
> commitlog_sync_group_window.
>
> On 23/04/2024 19:26, Nathan Marz wrote:
>
> "batch" mode works fine. I'm having trouble with "group" mode. The only
> config for that is "commitlog_sync_group_window", and I have that set to
> the default 1000ms.
>
> On Tue, Apr 23, 2024 at 8:15 AM Bowen Song via user <
> user@cassandra.apache.org> wrote:
>
>> Why would you want to set commitlog_sync_batch_window to 1 second long
>> when commitlog_sync is set to batch mode? The documentation
>> 
>> on this says:
>>
>> *This window should be kept short because the writer threads will be
>> unable to do extra work while waiting. You may need to increase
>> concurrent_writes for the same reason*
>>
>> If you want to use batch mode, at least ensure
>> commitlog_sync_batch_window is reasonably short. The default is 2
>> milliseconds.
>>
>>
>> On 23/04/2024 18:32, Nathan Marz wrote:
>>
>> I'm doing some benchmarking of Cassandra on a single m6gd.large instance.
>> It works fine with periodic or batch commitlog_sync options, but I'm having
>> tons of issues when I change it to "group". I have
>> "commitlog_sync_group_window" set to 1000ms.
>>
>> My client is doing writes like this (pseudocode):
>>
>> Semaphore sem = new Semaphore(numTickets);
>> while(true) {
>>
>> sem.acquire();
>> session.executeAsync(insert.bind(genUUIDStr(), genUUIDStr(), genUUIDStr()))
>> .whenComplete((t, u) -> sem.release());
>>
>> }
>>
>> If I set numTickets higher than 20, I get tons of timeout errors.
>>
>> I've also tried doing single commands with BatchStatement with many
>> inserts at a time, and that fails with timeout when the batch size gets
>> more than 20.
>>
>> Increasing the write request timeout in cassandra.yaml makes it time out
>> at slightly higher numbers of concurrent requests.
>>
>> With periodic I'm able to get about 38k writes / second, and with batch
>> I'm able to get about 14k / second.
>>
>> Any tips on what I should be doing to get group commitlog_sync to work
>> properly? I didn't expect to have to do anything other than change the
>> config.
>>
>>


Re: Trouble with using group commitlog_sync

2024-04-23 Thread Bowen Song via user
The default commitlog_sync_group_window is very long for SSDs. Try 
reducing it if you are using SSD-backed storage for the commit log. 10-15 
ms is a good starting point. You may also want to increase the value of 
concurrent_writes; consider at least doubling or quadrupling it from the 
default. You'll need even higher write concurrency for a longer 
commitlog_sync_group_window.
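For reference, the settings discussed in this thread live in cassandra.yaml. A sketch of the suggested values (key names as used above; exact names and units vary between Cassandra versions, so check the yaml shipped with your build):

```yaml
commitlog_sync: group
# 10-15 ms suggested above for SSD-backed commit logs, not the 1000 ms default
commitlog_sync_group_window: 15ms
# raised from the default of 32, per the advice above
concurrent_writes: 128
```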



On 23/04/2024 19:26, Nathan Marz wrote:
"batch" mode works fine. I'm having trouble with "group" mode. The 
only config for that is "commitlog_sync_group_window", and I have that 
set to the default 1000ms.


On Tue, Apr 23, 2024 at 8:15 AM Bowen Song via user 
 wrote:


Why would you want to set commitlog_sync_batch_window to 1 second
long when commitlog_sync is set to batch mode? The documentation


on this says:

/This window should be kept short because the writer threads
will be unable to do extra work while waiting. You may need to
increase concurrent_writes for the same reason/

If you want to use batch mode, at least ensure
commitlog_sync_batch_window is reasonably short. The default is 2
milliseconds.


On 23/04/2024 18:32, Nathan Marz wrote:

I'm doing some benchmarking of Cassandra on a single m6gd.large
instance. It works fine with periodic or batch commitlog_sync
options, but I'm having tons of issues when I change it to
"group". I have "commitlog_sync_group_window" set to 1000ms.

My client is doing writes like this (pseudocode):

Semaphore sem = new Semaphore(numTickets);
while(true) {

sem.acquire();
session.executeAsync(insert.bind(genUUIDStr(), genUUIDStr(),
genUUIDStr()))
            .whenComplete((t, u) -> sem.release());

}

If I set numTickets higher than 20, I get tons of timeout errors.

I've also tried doing single commands with BatchStatement with
many inserts at a time, and that fails with timeout when the
batch size gets more than 20.

Increasing the write request timeout in cassandra.yaml makes it
time out at slightly higher numbers of concurrent requests.

With periodic I'm able to get about 38k writes / second, and with
batch I'm able to get about 14k / second.

Any tips on what I should be doing to get group commitlog_sync to
work properly? I didn't expect to have to do anything other than
change the config.


Re: Trouble with using group commitlog_sync

2024-04-23 Thread Nathan Marz
"batch" mode works fine. I'm having trouble with "group" mode. The only
config for that is "commitlog_sync_group_window", and I have that set to
the default 1000ms.

On Tue, Apr 23, 2024 at 8:15 AM Bowen Song via user <
user@cassandra.apache.org> wrote:

> Why would you want to set commitlog_sync_batch_window to 1 second long
> when commitlog_sync is set to batch mode? The documentation
> 
> on this says:
>
> *This window should be kept short because the writer threads will be
> unable to do extra work while waiting. You may need to increase
> concurrent_writes for the same reason*
>
> If you want to use batch mode, at least ensure commitlog_sync_batch_window
> is reasonably short. The default is 2 milliseconds.
>
>
> On 23/04/2024 18:32, Nathan Marz wrote:
>
> I'm doing some benchmarking of Cassandra on a single m6gd.large instance.
> It works fine with periodic or batch commitlog_sync options, but I'm having
> tons of issues when I change it to "group". I have
> "commitlog_sync_group_window" set to 1000ms.
>
> My client is doing writes like this (pseudocode):
>
> Semaphore sem = new Semaphore(numTickets);
> while(true) {
>
> sem.acquire();
> session.executeAsync(insert.bind(genUUIDStr(), genUUIDStr(), genUUIDStr()))
> .whenComplete((t, u) -> sem.release());
>
> }
>
> If I set numTickets higher than 20, I get tons of timeout errors.
>
> I've also tried doing single commands with BatchStatement with many
> inserts at a time, and that fails with timeout when the batch size gets
> more than 20.
>
> Increasing the write request timeout in cassandra.yaml makes it time out
> at slightly higher numbers of concurrent requests.
>
> With periodic I'm able to get about 38k writes / second, and with batch
> I'm able to get about 14k / second.
>
> Any tips on what I should be doing to get group commitlog_sync to work
> properly? I didn't expect to have to do anything other than change the
> config.
>
>


Re: Trouble with using group commitlog_sync

2024-04-23 Thread Bowen Song via user
Why would you want to set commitlog_sync_batch_window to 1 second long 
when commitlog_sync is set to batch mode? The documentation 
 
on this says:


   /This window should be kept short because the writer threads will be
   unable to do extra work while waiting. You may need to increase
   concurrent_writes for the same reason/

If you want to use batch mode, at least ensure 
commitlog_sync_batch_window is reasonably short. The default is 2 
milliseconds.



On 23/04/2024 18:32, Nathan Marz wrote:
I'm doing some benchmarking of Cassandra on a single m6gd.large 
instance. It works fine with periodic or batch commitlog_sync options, 
but I'm having tons of issues when I change it to "group". I have 
"commitlog_sync_group_window" set to 1000ms.


My client is doing writes like this (pseudocode):

Semaphore sem = new Semaphore(numTickets);
while(true) {

sem.acquire();
session.executeAsync(insert.bind(genUUIDStr(), genUUIDStr(),
genUUIDStr()))
            .whenComplete((t, u) -> sem.release());

}

If I set numTickets higher than 20, I get tons of timeout errors.

I've also tried doing single commands with BatchStatement with many 
inserts at a time, and that fails with timeout when the batch size 
gets more than 20.


Increasing the write request timeout in cassandra.yaml makes it time 
out at slightly higher numbers of concurrent requests.


With periodic I'm able to get about 38k writes / second, and with 
batch I'm able to get about 14k / second.


Any tips on what I should be doing to get group commitlog_sync to work 
properly? I didn't expect to have to do anything other than change the 
config.

Trouble with using group commitlog_sync

2024-04-23 Thread Nathan Marz
I'm doing some benchmarking of Cassandra on a single m6gd.large instance.
It works fine with periodic or batch commitlog_sync options, but I'm having
tons of issues when I change it to "group". I have
"commitlog_sync_group_window" set to 1000ms.

My client is doing writes like this (pseudocode):

Semaphore sem = new Semaphore(numTickets);
while(true) {

sem.acquire();
session.executeAsync(insert.bind(genUUIDStr(), genUUIDStr(), genUUIDStr()))
.whenComplete((t, u) -> sem.release());

}

If I set numTickets higher than 20, I get tons of timeout errors.

I've also tried doing single commands with BatchStatement with many inserts
at a time, and that fails with timeout when the batch size gets more than
20.

Increasing the write request timeout in cassandra.yaml makes it time out at
slightly higher numbers of concurrent requests.

With periodic I'm able to get about 38k writes / second, and with batch I'm
able to get about 14k / second.

Any tips on what I should be doing to get group commitlog_sync to work
properly? I didn't expect to have to do anything other than change the
config.
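The pattern in the pseudocode above can be sketched as runnable Java. The driver call is stubbed out by a hypothetical fakeExecuteAsync returning a plain CompletableFuture (in real code this would be the session's async execute), since the point here is the semaphore-based backpressure, not the driver API:

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicLong;

// Semaphore-bounded async writes: at most `numTickets` requests in flight,
// each completion handing its ticket back. fakeExecuteAsync is a stand-in
// for a real driver call, so this sketch runs without a cluster.
public class BoundedWrites {
    static final ExecutorService POOL = Executors.newFixedThreadPool(4);

    static CompletableFuture<Void> fakeExecuteAsync() {
        // Pretend network I/O completing on a worker thread.
        return CompletableFuture.runAsync(() -> { }, POOL);
    }

    static long run(int numTickets, int totalWrites) {
        Semaphore sem = new Semaphore(numTickets);      // caps in-flight requests
        AtomicLong completed = new AtomicLong();
        CountDownLatch done = new CountDownLatch(totalWrites);
        for (int i = 0; i < totalWrites; i++) {
            sem.acquireUninterruptibly();               // blocks once the cap is hit
            fakeExecuteAsync().whenComplete((t, u) -> {
                completed.incrementAndGet();
                sem.release();                          // hand the ticket back
                done.countDown();
            });
        }
        try {
            done.await();                               // wait for all callbacks
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println(run(20, 10_000));            // prints 10000
        POOL.shutdown();
    }
}
```

With a real commitlog in group mode, the in-flight cap (numTickets here, plus server-side concurrent_writes) times the number of sync windows per second bounds the achievable throughput, which matches the behavior reported in this thread.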


RE: Datacenter decommissioning on Cassandra 4.1.4

2024-04-23 Thread Michalis Kotsiouros (EXT) via user
Hello Alain,
Thanks a lot for the confirmation.
Yes, this procedure seems like a workaround. But for my use case, where 
system_auth contains a small amount of data and the consistency level for 
authentication/authorization is switched to LOCAL_ONE, I think it is good 
enough.
I completely get that this could be improved since there might be requirements 
from other users that cannot be covered with the proposed procedure.

BR
MK
From: Alain Rodriguez 
Sent: April 22, 2024 18:27
To: user@cassandra.apache.org
Cc: Michalis Kotsiouros (EXT) 
Subject: Re: Datacenter decommissioning on Cassandra 4.1.4

Hi Michalis,

It's been a while since I removed a DC for the last time, but I see there is 
now a protection to avoid accidentally leaving a DC without auth capability.

This was introduced in C* 4.1 through CASSANDRA-17478 
(https://issues.apache.org/jira/browse/CASSANDRA-17478).

The process of dropping a data center might have been overlooked while doing 
this work.

It's never correct for an operator to remove a DC from system_auth replication 
settings while there are currently nodes up in that DC.

I believe this assertion is not correct. As Jon and Jeff mentioned, usually we 
remove the replication before decommissioning any node in the case of removing 
an entire DC, for reasons exposed by Jeff. The existing documentation is also 
clear about this: 
https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/operations/opsDecomissionDC.html
 and 
https://thelastpickle.com/blog/2019/02/26/data-center-switch.html.

Michalis, the solution you suggest seems to be the (good/only?) way to go, even 
though it looks like a workaround, not really "clean" and something we need to 
improve. It was also mentioned here: 
https://dba.stackexchange.com/questions/331732/not-a-able-to-decommission-the-old-datacenter#answer-334890.
 It should work quickly, but only because this keyspace holds a fairly small 
amount of data; it will still not be as fast as it should be (a near no-op, as 
explained above by Jeff). It also obliges you to use the "--force" option, which 
could lead you to delete one of your nodes in another DC by mistake (in a loaded 
cluster, or a 3-node cluster with RF = 3, this could hurt). Having to operate 
using "nodetool decommission --force" cannot be standard, but for now I can't 
think of anything better for you. Maybe wait for someone else's confirmation, as 
it's been a while since I operated Cassandra :).

I think it would make sense to fix this somehow in Cassandra. Maybe we should 
ensure that no other keyspace has an RF > 0 for this data center instead of 
looking at active nodes, or that no client is connected to the nodes, or add 
a manual flag somewhere, or something else? Even though I understand the 
motivation to protect users against a wrongly distributed system_auth keyspace, 
I think this protection should not be kept with this implementation. If that 
makes sense, I can create a ticket for this problem.

C*heers,

Alain Rodriguez

casterix.fr




Le lun. 8 avr. 2024 à 16:26, Michalis Kotsiouros (EXT) via user 
mailto:user@cassandra.apache.org>> a écrit :
Hello Jon and Jeff,
Thanks a lot for your replies.
I completely get your points.
Some more clarification about my issue.
When trying to update the replication before the decommission, I get the 
following error message when I remove the replication for the system_auth keyspace.
ConfigurationException: Following datacenters have active nodes and must be 
present in replication options for keyspace system_auth: [datacenter1]

This error message does not appear in the rest of the application keyspaces.
So, may I change the procedure to:

  1.  Make sure no clients are still writing to any nodes in the datacenter.
  2.  Run a full repair with nodetool repair.
  3.  Change all keyspaces so they no longer reference the datacenter being 
removed apart from system_auth keyspace.
  4.  Run nodetool decommission using the --force option on every node in the 
datacenter being removed.
  5.  Change system_auth keyspace so they no longer reference the datacenter 
being removed.
BR
MK



From: Jeff Jirsa mailto:jji...@gmail.com>>
Sent: April 08, 2024 17:19
To: cassandra mailto:user@cassandra.apache.org>>
Cc: Michalis Kotsiouros (EXT) 
mailto:michalis.kotsiouros@ericsson.com>>
Subject: Re: Datacenter decommissioning on Cassandra 4.1.4

To Jon’s point, if you remove from replication after step 1 or step 2 (probably 
step 2 if your goal is to be strictly correct), the nodetool decommission phase 
becomes almost a no-op.

If you use the order 

RE: Datacenter decommissioning on Cassandra 4.1.4

2024-04-23 Thread Michalis Kotsiouros (EXT) via user
Hello Sebastien,
Yes, your approach is really interesting. I will test this in my system as
well. I think it reduces some risks involved in the procedure that was
discussed in the previous emails.
Just for the record, availability is a top priority for my use cases, which is
why I have switched the default consistency level for
authentication/authorization to LOCAL_ONE, as was used in previous C*
versions.
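
As a concrete illustration of that switch: on Cassandra 4.1 the consistency
levels used for auth queries are settable in cassandra.yaml. A minimal sketch,
assuming the option names below match your version's shipped cassandra.yaml
(please verify them there before relying on this):

```yaml
# cassandra.yaml -- consistency levels Cassandra uses internally for
# authentication/authorization queries against system_auth.
# LOCAL_ONE favors availability over consistency, similar to the
# behavior of earlier C* versions.
auth_read_consistency_level: LOCAL_ONE
auth_write_consistency_level: LOCAL_ONE
```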

BR
MK
-Original Message-
From: Sebastian Marsching  
Sent: April 22, 2024 21:58
To: Michalis Kotsiouros (EXT) via user 
Subject: Re: Datacenter decommissioning on Cassandra 4.1.4

Recently, I successfully used the following procedure when decommissioning a
datacenter:

1. Reduced the replication factor for this DC to zero for all keyspaces
except the system_auth keyspace. For that keyspace, I reduced the RF to one.
2. Decommissioned all nodes except one in the DC using the regular procedure
(no --force needed).
3. Decommissioned the last node using --force.
4. Set the RF for the system_auth keyspace to 0.

This procedure has two benefits:

1. Authentication on the nodes in the DC being decommissioned will work
until the last node has been decommissioned. This is important when
authentication is enabled for JMX. Otherwise, you cannot proceed when there
are too few nodes left to get a LOCAL_QUORUM on system_auth.
2. One does not have to use --force except when removing the last node.

It would be nice if the RF for the system_auth keyspace could be reduced to
zero before decommissioning the nodes. However, I think that implementing
this correctly may be hard. If there are no local replicas, queries with a
consistency level of LOCAL_QUORUM will probably fail, and this is the
consistency level used for all authentication and authorization related
queries. So, setting the RF to zero might break authentication and
authorization, which in turn might make it impossible to decommission the
nodes (without disabling authentication for that DC).

So, I guess that the code dealing with authentication and authorization
would have to be changed to use a CL of QUORUM instead of LOCAL_QUORUM when
system_auth is not replicated in the local DC.





Re: Datacenter decommissioning on Cassandra 4.1.4

2024-04-22 Thread Sebastian Marsching

Recently, I successfully used the following procedure when decommissioning a 
datacenter:

1. Reduced the replication factor for this DC to zero for all keyspaces except 
the system_auth keyspace. For that keyspace, I reduced the RF to one.
2. Decommissioned all nodes except one in the DC using the regular procedure 
(no --force needed).
3. Decommissioned the last node using --force.
4. Set the RF for the system_auth keyspace to 0.

This procedure has two benefits:

1. Authentication on the nodes in the DC being decommissioned will work until 
the last node has been decommissioned. This is important when authentication is 
enabled for JMX. Otherwise, you cannot proceed when there are too few nodes 
left to get a LOCAL_QUORUM on system_auth.
2. One does not have to use --force except when removing the last node.

It would be nice if the RF for the system_auth keyspace could be reduced to 
zero before decommissioning the nodes. However, I think that implementing this 
correctly may be hard. If there are no local replicas, queries with a 
consistency level of LOCAL_QUORUM will probably fail, and this is the 
consistency level used for all authentication and authorization related 
queries. So, setting the RF to zero might break authentication and 
authorization, which in turn might make it impossible to decommission the nodes 
(without disabling authentication for that DC).

So, I guess that the code dealing with authentication and authorization would 
have to be changed to use a CL of QUORUM instead of LOCAL_QUORUM when 
system_auth is not replicated in the local DC.
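
The four-step procedure described above might look like the following, assuming
a surviving DC "dc1" (RF 3), a leaving DC "dc2", and an application keyspace
"my_app" -- all placeholder names. This is a sketch of the idea, not a vetted
runbook:

```shell
# 1. Drop dc2 from replication for all application keyspaces, but
#    keep RF=1 for system_auth so auth (incl. JMX) still works.
cqlsh -e "ALTER KEYSPACE my_app WITH replication =
  {'class': 'NetworkTopologyStrategy', 'dc1': 3};"
cqlsh -e "ALTER KEYSPACE system_auth WITH replication =
  {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 1};"

# 2. On every dc2 node except the last one:
nodetool decommission

# 3. On the last dc2 node (the only remaining system_auth replica
#    in that DC), --force is needed:
nodetool decommission --force

# 4. Finally remove dc2 from system_auth replication:
cqlsh -e "ALTER KEYSPACE system_auth WITH replication =
  {'class': 'NetworkTopologyStrategy', 'dc1': 3};"
```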





Re: Datacenter decommissioning on Cassandra 4.1.4

2024-04-22 Thread Alain Rodriguez via user
Hi Michalis,

It's been a while since I removed a DC for the last time, but I see there
is now a protection to avoid accidentally leaving a DC without auth
capability.

This was introduced in C* 4.1 through CASSANDRA-17478 (
https://issues.apache.org/jira/browse/CASSANDRA-17478).

The process of dropping a data center might have been overlooked while
doing this work.

It's never correct for an operator to remove a DC from system_auth
> replication settings while there are currently nodes up in that DC.
>

I believe this assertion is not correct. As Jon and Jeff mentioned, usually
we remove the replication *before* decommissioning any node in the case of
removing an entire DC, for reasons exposed by Jeff. The existing
documentation is also clear about this:
https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/operations/opsDecomissionDC.html
and https://thelastpickle.com/blog/2019/02/26/data-center-switch.html.

Michalis, the solution you suggest seems to be the (good/only?) way to go,
even though it looks like a workaround, not really "clean" and something we
need to improve. It was also mentioned here:
https://dba.stackexchange.com/questions/331732/not-a-able-to-decommission-the-old-datacenter#answer-334890.
It should work quickly, but only because this keyspace has a fairly low
amount of data; it will still not be as fast as it should be (it should be
a near no-op, as explained above by Jeff). It also obliges you to use the
"--force" option, which could lead you to delete one of your nodes in
another DC by mistake (and in a loaded cluster, or a 3-node cluster with
RF = 3, this could hurt...). Having to operate using "nodetool decommission
--force" cannot be standard, but for now I can't think of anything better
for you. Maybe wait for someone else's confirmation; it's been a while
since I operated Cassandra :).

I think it would make sense to fix this somehow in Cassandra. Maybe we
should ensure that no other keyspace has an RF > 0 for this data center
instead of looking at active nodes, or that there is no client connected
to the nodes, or add a manual flag somewhere, or something else? Even
though I understand the motivation to protect users against a wrongly
distributed system_auth keyspace, I think this protection should not be
kept with this implementation. If that makes sense, I can create a ticket
for this problem.

C*heers,


Alain Rodriguez
casterix.fr


On Mon, Apr 8, 2024 at 16:26, Michalis Kotsiouros (EXT) via user <
user@cassandra.apache.org> wrote:

> Hello Jon and Jeff,
>
> Thanks a lot for your replies.
>
> I completely get your points.
>
> Some more clarification about my issue.
>
> When trying to update the Replication before the decommission, I get the
> following error message when I remove the replication for system_auth
> keyspace.
>
> ConfigurationException: Following datacenters have active nodes and must
> be present in replication options for keyspace system_auth: [datacenter1]
>
>
>
> This error message does not appear in the rest of the application
> keyspaces.
>
> So, may I change the procedure to:
>
>1. Make sure no clients are still writing to any nodes in the
>datacenter.
>2. Run a full repair with nodetool repair.
>3. Change all keyspaces so they no longer reference the datacenter
>being removed apart from system_auth keyspace.
>4. Run nodetool decommission using the --force option on every node in
>the datacenter being removed.
>5. Change system_auth keyspace so they no longer reference the
>datacenter being removed.
>
> BR
>
> MK
>
>
>
>
>
>
>
> *From:* Jeff Jirsa 
> *Sent:* April 08, 2024 17:19
> *To:* cassandra 
> *Cc:* Michalis Kotsiouros (EXT) 
> *Subject:* Re: Datacenter decommissioning on Cassandra 4.1.4
>
>
>
> To Jon’s point, if you remove from replication after step 1 or step 2
> (probably step 2 if your goal is to be strictly correct), the nodetool
> decommission phase becomes almost a no-op.
>
>
>
> If you use the order below, the last nodes to decommission will cause
> those surviving machines to run out of space (assuming you have more than a
> few nodes to start)
>
>
>
>
>
>
>
> On Apr 8, 2024, at 6:58 AM, Jon Haddad  wrote:
>
>
>
> You shouldn’t decom an entire DC before removing it from replication.
>
>
> —
>
>
> Jon Haddad
> Rustyrazorblade Consulting
> rustyrazorblade.com
> 
>
>
>
>
>
> On Mon, Apr 8, 2024 at 6:26 AM Michalis Kotsiouros (EXT) via user <
> user@cassandra.apache.org> wrote:
>
> Hello community,
>
> In our deployments, we usually rebuild the Cassandra datacenters for
> maintenance or recovery operations.
>
> The procedure used since the days of Cassandra 3.x was the one documented
> in datastax documentation. Decommissioning a datacenter | Apache
> Cassandra 3.x (datastax.com)
> 

Re: ssl certificate hot reloading test - cassandra 4.1

2024-04-18 Thread Tolbert, Andy
I think, in the context of what initially motivated this hot
reloading capability, a big win it provides is avoiding having to
bounce your cluster as your certificates near expiry.  If not watched
closely, you can get yourself into a state where every node in the
cluster has an expired cert, which is effectively an outage.

I see the appeal of draining connections on a change of trust,
although the necessity of being able to "do it live" (as opposed to
doing a bounce) seems less important than avoiding the outage
condition of your certificates expiring, especially since you can sort
of already do this without bouncing by toggling nodetool
disablebinary/enablebinary.  I agree with Dinesh that most operators
would prefer that it does not do that as interrupting connections can
be disruptive to applications if they don't have retries configured,
but I also agree it'd be a nice improvement to support draining
existing connections in some way.

+1 on the idea of having a "timed connection" capability brought up
here, and implementing it in a way such that connection lifetimes can
be dynamically adjusted.  This way it can be made such that on a trust
store change Cassandra could simply adjust the connection lifetimes
and they will be disconnected immediately or drained over a time
period like Josh proposed.

Thanks,
Andy


Re: ssl certificate hot reloading test - cassandra 4.1

2024-04-18 Thread Josh McKenzie
I think it's all part of the same issue and you're not derailing IMO Abe. For 
the user Pabbireddy here, the unexpected behavior was not closing internode 
connections on that keystore refresh. So ISTM, from a "featureset that would be 
nice to have here" perspective, we could theoretically provide:
 1. An option to disconnect all connections on cert update, disabled by default
 2. An option to drain and recycle connections on a time period, also disabled 
by default
Leave the current behavior in place but allow for these kinds of strong 
cert guarantees if a user needs it in their env.

On Mon, Apr 15, 2024, at 9:51 PM, Abe Ratnofsky wrote:
> Not to derail from the original conversation too far, but wanted to agree 
> that maximum connection establishment time on native transport would be 
> useful. That would provide a maximum duration before an updated client 
> keystore is used for connections, which can be used to safely roll out client 
> keystore updates.
> 
> For example, if the maximum connection establishment time is 12 hours, then 
> you can update the keystore on a canary client, wait 24 hours, confirm that 
> connectivity is maintained, then upgrade keystores across the rest of the 
> fleet.
> 
> With unbounded connection establishment, reconnection isn't tested as often 
> and issues can hide behind long-lived connections.
> 
>> On Apr 15, 2024, at 5:14 PM, Jeff Jirsa  wrote:
>> 
>> It seems like if folks really want the life of a connection to be finite 
>> (either client/server or server/server), adding in an option to quietly 
>> drain and recycle a connection on some period isn’t that difficult.
>> 
>> That type of requirement shows up in a number of environments, usually on 
>> interactive logins (cqlsh, login, walk away, the connection needs to become 
>> invalid in a short and finite period of time), but adding it to internode 
>> could also be done, and may help in some weird situations (if you changed 
>> certs because you believe a key/cert is compromised, having the connection 
>> remain active is decidedly inconvenient, so maybe it does make sense to add 
>> an expiration timer/condition on each connection).
>> 
>> 
>> 
>>> On Apr 15, 2024, at 12:28 PM, Dinesh Joshi  wrote:
>>> 
>>> In addition to what Andy mentioned, I want to point out that for the vast 
>>> majority of use-cases, we would like to _avoid_ interruptions when a 
>>> certificate is updated so it is by design. If you're dealing with a 
>>> situation where you want to ensure that the connections are cycled, you can 
>>> follow Andy's advice. It will require automation outside the database that 
>>> you might already have. If there is demand, we can consider adding a 
>>> feature to slowly cycle the connections so the old SSL context is not used 
>>> anymore.
>>> 
>>> One more thing you should bear in mind is that Cassandra will not load the 
>>> new SSL context if it cannot successfully initialize it. This is again by 
>>> design to prevent an outage when the updated truststore is corrupted or 
>>> could not be read in some way.
>>> 
>>> thanks,
>>> Dinesh
>>> 
>>> On Mon, Apr 15, 2024 at 9:45 AM Tolbert, Andy  
>>> wrote:
 I should mention, when toggling disablebinary/enablebinary between
 instances, you will probably want to give some time between doing this
 so connections can reestablish, and you will want to verify that the
 connections can actually reestablish.  You also need to be mindful of
 this being disruptive to inflight queries (if your client is
 configured for retries it will probably be fine).  Semantically to
 your applications it should look a lot like a rolling cluster bounce.
 
 Thanks,
 Andy
 
 On Mon, Apr 15, 2024 at 11:39 AM pabbireddy avinash
  wrote:
 >
 > Thanks Andy for your reply. We will test the scenario you mentioned.
 >
 > Regards
 > Avinash
 >
 > On Mon, Apr 15, 2024 at 11:28 AM, Tolbert, Andy  
 > wrote:
 >>
 >> Hi Avinash,
 >>
>> As far as I understand it, if the underlying keystore/truststore(s)
 >> Cassandra is configured for is updated, this *will not* provoke
 >> Cassandra to interrupt existing connections, it's just that the new
 >> stores will be used for future TLS initialization.
 >>
 >> Via: 
 >> https://cassandra.apache.org/doc/4.1/cassandra/operating/security.html#ssl-certificate-hot-reloading
 >>
 >> > When the files are updated, Cassandra will reload them and use them 
 >> > for subsequent connections
 >>
 >> I suppose one could do a rolling disablebinary/enablebinary (if it's
 >> only client connections) after you roll out a keystore/truststore
 >> change as a way of enforcing the existing connections to reestablish.
 >>
 >> Thanks,
 >> Andy
 >>
 >>
 >> On Mon, Apr 15, 2024 at 11:11 AM pabbireddy avinash
 >>  wrote:
 >> >
 >> > Dear Community,
 >> >
 >> > I hope this email 

Re: ssl certificate hot reloading test - cassandra 4.1

2024-04-15 Thread Abe Ratnofsky
Not to derail from the original conversation too far, but wanted to agree that 
maximum connection establishment time on native transport would be useful. That 
would provide a maximum duration before an updated client keystore is used for 
connections, which can be used to safely roll out client keystore updates.

For example, if the maximum connection establishment time is 12 hours, then you 
can update the keystore on a canary client, wait 24 hours, confirm that 
connectivity is maintained, then upgrade keystores across the rest of the fleet.

With unbounded connection establishment, reconnection isn't tested as often and 
issues can hide behind long-lived connections.

> On Apr 15, 2024, at 5:14 PM, Jeff Jirsa  wrote:
> 
> It seems like if folks really want the life of a connection to be finite 
> (either client/server or server/server), adding in an option to quietly drain 
> and recycle a connection on some period isn’t that difficult.
> 
> That type of requirement shows up in a number of environments, usually on 
> interactive logins (cqlsh, login, walk away, the connection needs to become 
> invalid in a short and finite period of time), but adding it to internode 
> could also be done, and may help in some weird situations (if you changed 
> certs because you believe a key/cert is compromised, having the connection 
> remain active is decidedly inconvenient, so maybe it does make sense to add 
> an expiration timer/condition on each connection).
> 
> 
> 
>> On Apr 15, 2024, at 12:28 PM, Dinesh Joshi  wrote:
>> 
>> In addition to what Andy mentioned, I want to point out that for the vast 
>> majority of use-cases, we would like to _avoid_ interruptions when a 
>> certificate is updated so it is by design. If you're dealing with a 
>> situation where you want to ensure that the connections are cycled, you can 
>> follow Andy's advice. It will require automation outside the database that 
>> you might already have. If there is demand, we can consider adding a feature 
>> to slowly cycle the connections so the old SSL context is not used anymore.
>> 
>> One more thing you should bear in mind is that Cassandra will not load the 
>> new SSL context if it cannot successfully initialize it. This is again by 
>> design to prevent an outage when the updated truststore is corrupted or 
>> could not be read in some way.
>> 
>> thanks,
>> Dinesh
>> 
>> On Mon, Apr 15, 2024 at 9:45 AM Tolbert, Andy wrote:
>>> I should mention, when toggling disablebinary/enablebinary between
>>> instances, you will probably want to give some time between doing this
>>> so connections can reestablish, and you will want to verify that the
>>> connections can actually reestablish.  You also need to be mindful of
>>> this being disruptive to inflight queries (if your client is
>>> configured for retries it will probably be fine).  Semantically to
>>> your applications it should look a lot like a rolling cluster bounce.
>>> 
>>> Thanks,
>>> Andy
>>> 
>>> On Mon, Apr 15, 2024 at 11:39 AM pabbireddy avinash
>>> wrote:
>>> >
>>> > Thanks Andy for your reply. We will test the scenario you mentioned.
>>> >
>>> > Regards
>>> > Avinash
>>> >
>>> > On Mon, Apr 15, 2024 at 11:28 AM, Tolbert, Andy wrote:
>>> >>
>>> >> Hi Avinash,
>>> >>
>>> >> As far as I understand it, if the underlying keystore/truststore(s)
>>> >> Cassandra is configured for is updated, this *will not* provoke
>>> >> Cassandra to interrupt existing connections, it's just that the new
>>> >> stores will be used for future TLS initialization.
>>> >>
>>> >> Via: 
>>> >> https://cassandra.apache.org/doc/4.1/cassandra/operating/security.html#ssl-certificate-hot-reloading
>>> >>
>>> >> > When the files are updated, Cassandra will reload them and use them 
>>> >> > for subsequent connections
>>> >>
>>> >> I suppose one could do a rolling disablebinary/enablebinary (if it's
>>> >> only client connections) after you roll out a keystore/truststore
>>> >> change as a way of enforcing the existing connections to reestablish.
>>> >>
>>> >> Thanks,
>>> >> Andy
>>> >>
>>> >>
>>> >> On Mon, Apr 15, 2024 at 11:11 AM pabbireddy avinash
>>> >> wrote:
>>> >> >
>>> >> > Dear Community,
>>> >> >
>>> >> > I hope this email finds you well. I am currently testing SSL 
>>> >> > certificate hot reloading on a Cassandra cluster running version 4.1 
>>> >> > and encountered a situation that requires your expertise.
>>> >> >
>>> >> > Here's a summary of the process and issue:
>>> >> >
>>> >> > Reloading Process: We reloaded certificates signed by our in-house 
>>> >> > certificate authority into the cluster, which was initially running 
>>> >> > with self-signed certificates. The reload was done node by node.
>>> >> >
>>> >> > Truststore and Keystore: The truststore and keystore passwords are the 
>>> >> > same across the cluster.
>>> >> >
>>> >> > 

Re: ssl certificate hot reloading test - cassandra 4.1

2024-04-15 Thread Jeff Jirsa
It seems like if folks really want the life of a connection to be finite 
(either client/server or server/server), adding in an option to quietly drain 
and recycle a connection on some period isn’t that difficult.

That type of requirement shows up in a number of environments, usually on 
interactive logins (cqlsh, login, walk away, the connection needs to become 
invalid in a short and finite period of time), but adding it to internode could 
also be done, and may help in some weird situations (if you changed certs 
because you believe a key/cert is compromised, having the connection remain 
active is decidedly inconvenient, so maybe it does make sense to add an 
expiration timer/condition on each connection).



> On Apr 15, 2024, at 12:28 PM, Dinesh Joshi  wrote:
> 
> In addition to what Andy mentioned, I want to point out that for the vast 
> majority of use-cases, we would like to _avoid_ interruptions when a 
> certificate is updated so it is by design. If you're dealing with a situation 
> where you want to ensure that the connections are cycled, you can follow 
> Andy's advice. It will require automation outside the database that you might 
> already have. If there is demand, we can consider adding a feature to slowly 
> cycle the connections so the old SSL context is not used anymore.
> 
> One more thing you should bear in mind is that Cassandra will not load the 
> new SSL context if it cannot successfully initialize it. This is again by 
> design to prevent an outage when the updated truststore is corrupted or could 
> not be read in some way.
> 
> thanks,
> Dinesh
> 
> On Mon, Apr 15, 2024 at 9:45 AM Tolbert, Andy wrote:
>> I should mention, when toggling disablebinary/enablebinary between
>> instances, you will probably want to give some time between doing this
>> so connections can reestablish, and you will want to verify that the
>> connections can actually reestablish.  You also need to be mindful of
>> this being disruptive to inflight queries (if your client is
>> configured for retries it will probably be fine).  Semantically to
>> your applications it should look a lot like a rolling cluster bounce.
>> 
>> Thanks,
>> Andy
>> 
>> On Mon, Apr 15, 2024 at 11:39 AM pabbireddy avinash
>> wrote:
>> >
>> > Thanks Andy for your reply. We will test the scenario you mentioned.
>> >
>> > Regards
>> > Avinash
>> >
>> > On Mon, Apr 15, 2024 at 11:28 AM, Tolbert, Andy wrote:
>> >>
>> >> Hi Avinash,
>> >>
>> >> As far as I understand it, if the underlying keystore/truststore(s)
>> >> Cassandra is configured for is updated, this *will not* provoke
>> >> Cassandra to interrupt existing connections, it's just that the new
>> >> stores will be used for future TLS initialization.
>> >>
>> >> Via: 
>> >> https://cassandra.apache.org/doc/4.1/cassandra/operating/security.html#ssl-certificate-hot-reloading
>> >>
>> >> > When the files are updated, Cassandra will reload them and use them for 
>> >> > subsequent connections
>> >>
>> >> I suppose one could do a rolling disablebinary/enablebinary (if it's
>> >> only client connections) after you roll out a keystore/truststore
>> >> change as a way of enforcing the existing connections to reestablish.
>> >>
>> >> Thanks,
>> >> Andy
>> >>
>> >>
>> >> On Mon, Apr 15, 2024 at 11:11 AM pabbireddy avinash
>> >> wrote:
>> >> >
>> >> > Dear Community,
>> >> >
>> >> > I hope this email finds you well. I am currently testing SSL 
>> >> > certificate hot reloading on a Cassandra cluster running version 4.1 
>> >> > and encountered a situation that requires your expertise.
>> >> >
>> >> > Here's a summary of the process and issue:
>> >> >
>> >> > Reloading Process: We reloaded certificates signed by our in-house 
>> >> > certificate authority into the cluster, which was initially running 
>> >> > with self-signed certificates. The reload was done node by node.
>> >> >
>> >> > Truststore and Keystore: The truststore and keystore passwords are the 
>> >> > same across the cluster.
>> >> >
>> >> > Unexpected Behavior: Despite the different truststore configurations 
>> >> > for the self-signed and new CA certificates, we observed no breakdown 
>> >> > in server-to-server communication during the reload. We did not upload 
>> >> > the new CA cert into the old truststore. We anticipated interruptions 
>> >> > due to the differing truststore configurations but did not encounter 
>> >> > any.
>> >> >
>> >> > Post-Reload Changes: After reloading, we updated the cqlshrc file with 
>> >> > the new CA certificate and key to connect to cqlsh.
>> >> >
>> >> > server_encryption_options:
>> >> >
>> >> > internode_encryption: all
>> >> >
>> >> > keystore: ~/conf/server-keystore.jks
>> >> >
>> >> > keystore_password: 
>> >> >
>> >> > truststore: ~/conf/server-truststore.jks
>> >> >
>> >> > truststore_password: 
>> >> >
>> 

Re: ssl certificate hot reloading test - cassandra 4.1

2024-04-15 Thread Dinesh Joshi
In addition to what Andy mentioned, I want to point out that for the vast
majority of use-cases, we would like to _avoid_ interruptions when a
certificate is updated so it is by design. If you're dealing with a
situation where you want to ensure that the connections are cycled, you can
follow Andy's advice. It will require automation outside the database that
you might already have. If there is demand, we can consider adding a
feature to slowly cycle the connections so the old SSL context is not used
anymore.

One more thing you should bear in mind is that Cassandra will not load the
new SSL context if it cannot successfully initialize it. This is again by
design to prevent an outage when the updated truststore is corrupted or
could not be read in some way.

thanks,
Dinesh
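
To make the reload observable rather than inferred: one way to check directly
which certificate a node is currently serving is an openssl probe. The hostname
and ports below are assumptions (9042 for the native protocol, 7001 for the
legacy dedicated SSL storage port -- your internode TLS may instead run on
storage_port 7000):

```shell
# Show subject/issuer/expiry of the cert on the client (native) port.
openssl s_client -connect node1.example.com:9042 </dev/null 2>/dev/null \
  | openssl x509 -noout -subject -issuer -enddate

# Same probe against the internode TLS port.
openssl s_client -connect node1.example.com:7001 </dev/null 2>/dev/null \
  | openssl x509 -noout -subject -issuer -enddate
```

Running this before and after touching the stores shows whether the new chain
is actually being presented, without bouncing anything.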

On Mon, Apr 15, 2024 at 9:45 AM Tolbert, Andy  wrote:

> I should mention, when toggling disablebinary/enablebinary between
> instances, you will probably want to give some time between doing this
> so connections can reestablish, and you will want to verify that the
> connections can actually reestablish.  You also need to be mindful of
> this being disruptive to inflight queries (if your client is
> configured for retries it will probably be fine).  Semantically to
> your applications it should look a lot like a rolling cluster bounce.
>
> Thanks,
> Andy
>
> On Mon, Apr 15, 2024 at 11:39 AM pabbireddy avinash
>  wrote:
> >
> > Thanks Andy for your reply. We will test the scenario you mentioned.
> >
> > Regards
> > Avinash
> >
> > On Mon, Apr 15, 2024 at 11:28 AM, Tolbert, Andy 
> wrote:
> >>
> >> Hi Avinash,
> >>
> >> As far as I understand it, if the underlying keystore/truststore(s)
> >> Cassandra is configured for is updated, this *will not* provoke
> >> Cassandra to interrupt existing connections, it's just that the new
> >> stores will be used for future TLS initialization.
> >>
> >> Via:
> https://cassandra.apache.org/doc/4.1/cassandra/operating/security.html#ssl-certificate-hot-reloading
> >>
> >> > When the files are updated, Cassandra will reload them and use them
> for subsequent connections
> >>
> >> I suppose one could do a rolling disablebinary/enablebinary (if it's
> >> only client connections) after you roll out a keystore/truststore
> >> change as a way of enforcing the existing connections to reestablish.
> >>
> >> Thanks,
> >> Andy
> >>
> >>
> >> On Mon, Apr 15, 2024 at 11:11 AM pabbireddy avinash
> >>  wrote:
> >> >
> >> > Dear Community,
> >> >
> >> > I hope this email finds you well. I am currently testing SSL
> certificate hot reloading on a Cassandra cluster running version 4.1 and
> encountered a situation that requires your expertise.
> >> >
> >> > Here's a summary of the process and issue:
> >> >
> >> > Reloading Process: We reloaded certificates signed by our in-house
> certificate authority into the cluster, which was initially running with
> self-signed certificates. The reload was done node by node.
> >> >
> >> > Truststore and Keystore: The truststore and keystore passwords are
> the same across the cluster.
> >> >
> >> > Unexpected Behavior: Despite the different truststore configurations
> for the self-signed and new CA certificates, we observed no breakdown in
> server-to-server communication during the reload. We did not upload the new
> CA cert into the old truststore. We anticipated interruptions due to the
> differing truststore configurations but did not encounter any.
> >> >
> >> > Post-Reload Changes: After reloading, we updated the cqlshrc file
> with the new CA certificate and key to connect to cqlsh.
> >> >
> >> > server_encryption_options:
> >> >
> >> > internode_encryption: all
> >> >
> >> > keystore: ~/conf/server-keystore.jks
> >> >
> >> > keystore_password: 
> >> >
> >> > truststore: ~/conf/server-truststore.jks
> >> >
> >> > truststore_password: 
> >> >
> >> > protocol: TLS
> >> >
> >> > algorithm: SunX509
> >> >
> >> > store_type: JKS
> >> >
> >> > cipher_suites: [TLS_RSA_WITH_AES_256_CBC_SHA]
> >> >
> >> > require_client_auth: true
> >> >
> >> > client_encryption_options:
> >> >
> >> > enabled: true
> >> >
> >> > keystore: ~/conf/server-keystore.jks
> >> >
> >> > keystore_password: 
> >> >
> >> > require_client_auth: true
> >> >
> >> > truststore: ~/conf/server-truststore.jks
> >> >
> >> > truststore_password: 
> >> >
> >> > protocol: TLS
> >> >
> >> > algorithm: SunX509
> >> >
> >> > store_type: JKS
> >> >
> >> > cipher_suites: [TLS_RSA_WITH_AES_256_CBC_SHA]
> >> >
> >> > Given this situation, I have the following questions:
> >> >
> >> > Could there be a reason for the continuity of server-to-server
> communication despite the different truststores?
> >> > Is there a possibility that the old truststore remains cached even
> after reloading the certificates on a node?
> >> > Have others encountered similar issues, and if so, what were your
> solutions?
> >> >
> >> > Any insights or 

Re: ssl certificate hot reloading test - cassandra 4.1

2024-04-15 Thread Tolbert, Andy
I should mention, when toggling disablebinary/enablebinary between
instances, you will probably want to give some time between doing this
so connections can reestablish, and you will want to verify that the
connections can actually reestablish.  You also need to be mindful of
this being disruptive to inflight queries (if your client is
configured for retries it will probably be fine).  Semantically to
your applications it should look a lot like a rolling cluster bounce.

Thanks,
Andy
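
The rolling disablebinary/enablebinary idea above could be sketched roughly as
below. The node list, SSH access, and fixed sleeps are assumptions; in practice
you would verify that clients have actually reestablished connections (and that
the app is healthy) before moving to the next node, rather than relying on a
timer:

```shell
#!/usr/bin/env bash
# Force clients to reconnect node by node after a keystore/truststore
# rollout, without restarting Cassandra itself.
set -euo pipefail

NODES=(node1 node2 node3)   # placeholder hostnames

for node in "${NODES[@]}"; do
  ssh "$node" nodetool disablebinary   # drop existing client connections
  sleep 5                              # let clients fail over to peers
  ssh "$node" nodetool enablebinary    # accept fresh (re-TLS'd) connections
  sleep 60                             # give connections time to reestablish
done
```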

On Mon, Apr 15, 2024 at 11:39 AM pabbireddy avinash
 wrote:
>
> Thanks Andy for your reply. We will test the scenario you mentioned.
>
> Regards
> Avinash
>
> On Mon, Apr 15, 2024 at 11:28 AM, Tolbert, Andy  
> wrote:
>>
>> Hi Avinash,
>>
>> As far as I understand it, if the underlying keystore/truststore(s)
>> Cassandra is configured for is updated, this *will not* provoke
>> Cassandra to interrupt existing connections, it's just that the new
>> stores will be used for future TLS initialization.
>>
>> Via: 
>> https://cassandra.apache.org/doc/4.1/cassandra/operating/security.html#ssl-certificate-hot-reloading
>>
>> > When the files are updated, Cassandra will reload them and use them for 
>> > subsequent connections
>>
>> I suppose one could do a rolling disablebinary/enablebinary (if it's
>> only client connections) after you roll out a keystore/truststore
>> change as a way of enforcing the existing connections to reestablish.
>>
>> Thanks,
>> Andy
>>
>>
>> On Mon, Apr 15, 2024 at 11:11 AM pabbireddy avinash
>>  wrote:
>> >
>> > Dear Community,
>> >
>> > I hope this email finds you well. I am currently testing SSL certificate 
>> > hot reloading on a Cassandra cluster running version 4.1 and encountered a 
>> > situation that requires your expertise.
>> >
>> > Here's a summary of the process and issue:
>> >
>> > Reloading Process: We reloaded certificates signed by our in-house 
>> > certificate authority into the cluster, which was initially running with 
>> > self-signed certificates. The reload was done node by node.
>> >
>> > Truststore and Keystore: The truststore and keystore passwords are the 
>> > same across the cluster.
>> >
>> > Unexpected Behavior: Despite the different truststore configurations for 
>> > the self-signed and new CA certificates, we observed no breakdown in 
>> > server-to-server communication during the reload. We did not upload the 
>> > new CA cert into the old truststore. We anticipated interruptions due to 
>> > the differing truststore configurations but did not encounter any.
>> >
>> > Post-Reload Changes: After reloading, we updated the cqlshrc file with the 
>> > new CA certificate and key to connect to cqlsh.
>> >
>> > server_encryption_options:
>> >
>> > internode_encryption: all
>> >
>> > keystore: ~/conf/server-keystore.jks
>> >
>> > keystore_password: 
>> >
>> > truststore: ~/conf/server-truststore.jks
>> >
>> > truststore_password: 
>> >
>> > protocol: TLS
>> >
>> > algorithm: SunX509
>> >
>> > store_type: JKS
>> >
>> > cipher_suites: [TLS_RSA_WITH_AES_256_CBC_SHA]
>> >
>> > require_client_auth: true
>> >
>> > client_encryption_options:
>> >
>> > enabled: true
>> >
>> > keystore: ~/conf/server-keystore.jks
>> >
>> > keystore_password: 
>> >
>> > require_client_auth: true
>> >
>> > truststore: ~/conf/server-truststore.jks
>> >
>> > truststore_password: 
>> >
>> > protocol: TLS
>> >
>> > algorithm: SunX509
>> >
>> > store_type: JKS
>> >
>> > cipher_suites: [TLS_RSA_WITH_AES_256_CBC_SHA]
>> >
>> > Given this situation, I have the following questions:
>> >
>> > Could there be a reason for the continuity of server-to-server 
>> > communication despite the different truststores?
>> > Is there a possibility that the old truststore remains cached even after 
>> > reloading the certificates on a node?
>> > Have others encountered similar issues, and if so, what were your 
>> > solutions?
>> >
>> > Any insights or suggestions would be greatly appreciated. Please let me 
>> > know if further information is needed.
>> >
>> > Thank you
>> >
>> > Best regards,
>> >
>> > Avinash


Re: ssl certificate hot reloading test - cassandra 4.1

2024-04-15 Thread pabbireddy avinash
Thanks, Andy, for your reply. We will test the scenario you mentioned.

Regards
Avinash

On Mon, Apr 15, 2024 at 11:28 AM, Tolbert, Andy  wrote:

> Hi Avinash,
>
> As far as I understand it, if the underlying keystore/truststore(s)
> Cassandra is configured with are updated, this *will not* provoke
> Cassandra to interrupt existing connections; the new
> stores will simply be used for future TLS initialization.
>
> Via:
> https://cassandra.apache.org/doc/4.1/cassandra/operating/security.html#ssl-certificate-hot-reloading
>
> > When the files are updated, Cassandra will reload them and use them for
> subsequent connections
>
> I suppose one could do a rolling disablebinary/enablebinary (if it's
> only client connections) after you roll out a keystore/truststore
> change as a way of enforcing the existing connections to reestablish.
>
> Thanks,
> Andy
>
>
> On Mon, Apr 15, 2024 at 11:11 AM pabbireddy avinash
>  wrote:
> >
> > Dear Community,
> >
> > I hope this email finds you well. I am currently testing SSL certificate
> hot reloading on a Cassandra cluster running version 4.1 and encountered a
> situation that requires your expertise.
> >
> > Here's a summary of the process and issue:
> >
> > Reloading Process: We reloaded certificates signed by our in-house
> certificate authority into the cluster, which was initially running with
> self-signed certificates. The reload was done node by node.
> >
> > Truststore and Keystore: The truststore and keystore passwords are the
> same across the cluster.
> >
> > Unexpected Behavior: Despite the different truststore configurations for
> the self-signed and new CA certificates, we observed no breakdown in
> server-to-server communication during the reload. We did not upload the new
> CA cert into the old truststore. We anticipated interruptions due to the
> differing truststore configurations but did not encounter any.
> >
> > Post-Reload Changes: After reloading, we updated the cqlshrc file with
> the new CA certificate and key to connect to cqlsh.
> >
> > server_encryption_options:
> >
> > internode_encryption: all
> >
> > keystore: ~/conf/server-keystore.jks
> >
> > keystore_password: 
> >
> > truststore: ~/conf/server-truststore.jks
> >
> > truststore_password: 
> >
> > protocol: TLS
> >
> > algorithm: SunX509
> >
> > store_type: JKS
> >
> > cipher_suites: [TLS_RSA_WITH_AES_256_CBC_SHA]
> >
> > require_client_auth: true
> >
> > client_encryption_options:
> >
> > enabled: true
> >
> > keystore: ~/conf/server-keystore.jks
> >
> > keystore_password: 
> >
> > require_client_auth: true
> >
> > truststore: ~/conf/server-truststore.jks
> >
> > truststore_password: 
> >
> > protocol: TLS
> >
> > algorithm: SunX509
> >
> > store_type: JKS
> >
> > cipher_suites: [TLS_RSA_WITH_AES_256_CBC_SHA]
> >
> > Given this situation, I have the following questions:
> >
> > Could there be a reason for the continuity of server-to-server
> communication despite the different truststores?
> > Is there a possibility that the old truststore remains cached even after
> reloading the certificates on a node?
> > Have others encountered similar issues, and if so, what were your
> solutions?
> >
> > Any insights or suggestions would be greatly appreciated. Please let me
> know if further information is needed.
> >
> > Thank you
> >
> > Best regards,
> >
> > Avinash
>


Re: ssl certificate hot reloading test - cassandra 4.1

2024-04-15 Thread Tolbert, Andy
Hi Avinash,

As far as I understand it, if the underlying keystore/truststore(s)
Cassandra is configured with are updated, this *will not* provoke
Cassandra to interrupt existing connections; the new
stores will simply be used for future TLS initialization.

Via: 
https://cassandra.apache.org/doc/4.1/cassandra/operating/security.html#ssl-certificate-hot-reloading

> When the files are updated, Cassandra will reload them and use them for 
> subsequent connections

I suppose one could do a rolling disablebinary/enablebinary (if it's
only client connections) after you roll out a keystore/truststore
change as a way of enforcing the existing connections to reestablish.

Thanks,
Andy


On Mon, Apr 15, 2024 at 11:11 AM pabbireddy avinash
 wrote:
>
> Dear Community,
>
> I hope this email finds you well. I am currently testing SSL certificate hot 
> reloading on a Cassandra cluster running version 4.1 and encountered a 
> situation that requires your expertise.
>
> Here's a summary of the process and issue:
>
> Reloading Process: We reloaded certificates signed by our in-house 
> certificate authority into the cluster, which was initially running with 
> self-signed certificates. The reload was done node by node.
>
> Truststore and Keystore: The truststore and keystore passwords are the same 
> across the cluster.
>
> Unexpected Behavior: Despite the different truststore configurations for the 
> self-signed and new CA certificates, we observed no breakdown in 
> server-to-server communication during the reload. We did not upload the new 
> CA cert into the old truststore. We anticipated interruptions due to the 
> differing truststore configurations but did not encounter any.
>
> Post-Reload Changes: After reloading, we updated the cqlshrc file with the 
> new CA certificate and key to connect to cqlsh.
>
> server_encryption_options:
>
> internode_encryption: all
>
> keystore: ~/conf/server-keystore.jks
>
> keystore_password: 
>
> truststore: ~/conf/server-truststore.jks
>
> truststore_password: 
>
> protocol: TLS
>
> algorithm: SunX509
>
> store_type: JKS
>
> cipher_suites: [TLS_RSA_WITH_AES_256_CBC_SHA]
>
> require_client_auth: true
>
> client_encryption_options:
>
> enabled: true
>
> keystore: ~/conf/server-keystore.jks
>
> keystore_password: 
>
> require_client_auth: true
>
> truststore: ~/conf/server-truststore.jks
>
> truststore_password: 
>
> protocol: TLS
>
> algorithm: SunX509
>
> store_type: JKS
>
> cipher_suites: [TLS_RSA_WITH_AES_256_CBC_SHA]
>
> Given this situation, I have the following questions:
>
> Could there be a reason for the continuity of server-to-server communication 
> despite the different truststores?
> Is there a possibility that the old truststore remains cached even after 
> reloading the certificates on a node?
> Have others encountered similar issues, and if so, what were your solutions?
>
> Any insights or suggestions would be greatly appreciated. Please let me know 
> if further information is needed.
>
> Thank you
>
> Best regards,
>
> Avinash


ssl certificate hot reloading test - cassandra 4.1

2024-04-15 Thread pabbireddy avinash
Dear Community,

I hope this email finds you well. I am currently testing SSL certificate
hot reloading on a Cassandra cluster running version 4.1 and encountered a
situation that requires your expertise.

Here's a summary of the process and issue:

   1. Reloading Process: We reloaded certificates signed by our in-house
   certificate authority into the cluster, which was initially running with
   self-signed certificates. The reload was done node by node.

   2. Truststore and Keystore: The truststore and keystore passwords are the
   same across the cluster.

   3. Unexpected Behavior: Despite the different truststore configurations for
   the self-signed and new CA certificates, we observed no breakdown in
   server-to-server communication during the reload. We did not upload the *new
   CA cert* into the *old truststore.* We anticipated interruptions due to
   the differing truststore configurations but did not encounter any.

   4. Post-Reload Changes: After reloading, we updated the cqlshrc file with
   the new CA certificate and key to connect to cqlsh.

server_encryption_options:

internode_encryption: all

keystore: ~/conf/server-keystore.jks

keystore_password: 

truststore: ~/conf/server-truststore.jks

truststore_password: 

protocol: TLS

algorithm: SunX509

store_type: JKS

cipher_suites: [TLS_RSA_WITH_AES_256_CBC_SHA]

require_client_auth: true

client_encryption_options:

enabled: true

keystore: ~/conf/server-keystore.jks

keystore_password: 

require_client_auth: true

truststore: ~/conf/server-truststore.jks

truststore_password: 

protocol: TLS

algorithm: SunX509

store_type: JKS

cipher_suites: [TLS_RSA_WITH_AES_256_CBC_SHA]

Given this situation, I have the following questions:

   - Could there be a reason for the continuity of server-to-server
   communication despite the different truststores?
   - Is there a possibility that the old truststore remains cached even
   after reloading the certificates on a node?
   - Have others encountered similar issues, and if so, what were your
   solutions?

Any insights or suggestions would be greatly appreciated. Please let me
know if further information is needed.

Thank you

Best regards,

Avinash


Trie Memtables

2024-04-09 Thread Jon Haddad
Hey all,

Tomorrow at 10:30am PDT I'm taking a look at Trie Memtables on my
live stream.  I'll do some performance comparisons between it and the
legacy SkipListMemtable implementation and see what I can learn.

https://www.youtube.com/live/Jp5R_-uXORQ?si=NnIoV3jqjHFoD8nF

or if you prefer a LinkedIn version:
https://www.linkedin.com/events/7183580733750304768/comments/

Jon


RE: Datacenter decommissioning on Cassandra 4.1.4

2024-04-08 Thread Michalis Kotsiouros (EXT) via user
Hello Jon and Jeff,
Thanks a lot for your replies.
I completely get your points.
Some more clarification about my issue.
When trying to update the replication before the decommission, I get the 
following error message when I remove the replication for the system_auth keyspace:
ConfigurationException: Following datacenters have active nodes and must be 
present in replication options for keyspace system_auth: [datacenter1]

This error message does not appear in the rest of the application keyspaces.
So, may I change the procedure to:

  1.  Make sure no clients are still writing to any nodes in the datacenter.
  2.  Run a full repair with nodetool repair.
  3.  Change all keyspaces so they no longer reference the datacenter being 
removed apart from system_auth keyspace.
  4.  Run nodetool decommission using the --force option on every node in the 
datacenter being removed.
  5.  Change system_auth keyspace so they no longer reference the datacenter 
being removed.
BR
MK
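
A minimal CQL sketch of steps 3 and 5 above, assuming a hypothetical
application keyspace `my_app`, that `datacenter1` is being removed, and that
the surviving datacenter is `datacenter2` with RF 3 (all names and the RF are
placeholders; adjust to your topology):

```shell
# Step 3: drop datacenter1 from each application keyspace's replication
cqlsh -e "ALTER KEYSPACE my_app WITH replication =
  {'class': 'NetworkTopologyStrategy', 'datacenter2': 3};"

# Step 5: only after every node in datacenter1 has been decommissioned,
# drop it from system_auth as well
cqlsh -e "ALTER KEYSPACE system_auth WITH replication =
  {'class': 'NetworkTopologyStrategy', 'datacenter2': 3};"
```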



From: Jeff Jirsa 
Sent: April 08, 2024 17:19
To: cassandra 
Cc: Michalis Kotsiouros (EXT) 
Subject: Re: Datacenter decommissioning on Cassandra 4.1.4

To Jon’s point, if you remove from replication after step 1 or step 2 (probably 
step 2 if your goal is to be strictly correct), the nodetool decommission phase 
becomes almost a no-op.

If you use the order below, the last nodes to decommission will cause those 
surviving machines to run out of space (assuming you have more than a few nodes 
to start)




On Apr 8, 2024, at 6:58 AM, Jon Haddad 
mailto:j...@jonhaddad.com>> wrote:

You shouldn’t decom an entire DC before removing it from replication.

—

Jon Haddad
Rustyrazorblade Consulting
rustyrazorblade.com


On Mon, Apr 8, 2024 at 6:26 AM Michalis Kotsiouros (EXT) via user 
mailto:user@cassandra.apache.org>> wrote:
Hello community,
In our deployments, we usually rebuild the Cassandra datacenters for 
maintenance or recovery operations.
The procedure used since the days of Cassandra 3.x was the one documented in 
datastax documentation. Decommissioning a datacenter | Apache Cassandra 3.x 
(datastax.com)
After upgrading to Cassandra 4.1.4, we have realized that there are stricter 
rules that do not allow removing the replication while active Cassandra nodes 
still exist in a datacenter.
This check makes the above-mentioned procedure obsolete.
I am thinking to use the following as an alternative:

  1.  Make sure no clients are still writing to any nodes in the datacenter.
  2.  Run a full repair with nodetool repair.
  3.  Run nodetool decommission using the --force option on every node in the 
datacenter being removed.
  4.  Change all keyspaces so they no longer reference the datacenter being 
removed.

What is the procedure followed by other users? Do you see any risk following 
the proposed procedure?

BR
MK



Re: Datacenter decommissioning on Cassandra 4.1.4

2024-04-08 Thread Jeff Jirsa
To Jon’s point, if you remove from replication after step 1 or step 2 (probably 
step 2 if your goal is to be strictly correct), the nodetool decommission phase 
becomes almost a no-op. 

If you use the order below, the last nodes to decommission will cause those 
surviving machines to run out of space (assuming you have more than a few nodes 
to start)



> On Apr 8, 2024, at 6:58 AM, Jon Haddad  wrote:
> 
> You shouldn’t decom an entire DC before removing it from replication.
> 
> —
> 
> Jon Haddad
> Rustyrazorblade Consulting
> rustyrazorblade.com 
> 
> 
> On Mon, Apr 8, 2024 at 6:26 AM Michalis Kotsiouros (EXT) via user 
> mailto:user@cassandra.apache.org>> wrote:
>> Hello community,
>> 
>> In our deployments, we usually rebuild the Cassandra datacenters for 
>> maintenance or recovery operations.
>> 
>> The procedure used since the days of Cassandra 3.x was the one documented in 
>> datastax documentation. Decommissioning a datacenter | Apache Cassandra 3.x 
>> (datastax.com) 
>> 
>> After upgrading to Cassandra 4.1.4, we have realized that there are 
>> stricter rules that do not allow removing the replication while active 
>> Cassandra nodes still exist in a datacenter.
>> 
>> This check makes the above-mentioned procedure obsolete.
>> 
>> I am thinking to use the following as an alternative:
>> 
>> Make sure no clients are still writing to any nodes in the datacenter.
>> Run a full repair with nodetool repair.
>> Run nodetool decommission using the --force option on every node in the 
>> datacenter being removed.
>> Change all keyspaces so they no longer reference the datacenter being 
>> removed.
>>  
>> 
>> What is the procedure followed by other users? Do you see any risk following 
>> the proposed procedure?
>> 
>>  
>> 
>> BR
>> 
>> MK
>> 



Re: Datacenter decommissioning on Cassandra 4.1.4

2024-04-08 Thread Jon Haddad
You shouldn’t decom an entire DC before removing it from replication.

—

Jon Haddad
Rustyrazorblade Consulting
rustyrazorblade.com


On Mon, Apr 8, 2024 at 6:26 AM Michalis Kotsiouros (EXT) via user <
user@cassandra.apache.org> wrote:

> Hello community,
>
> In our deployments, we usually rebuild the Cassandra datacenters for
> maintenance or recovery operations.
>
> The procedure used since the days of Cassandra 3.x was the one documented
> in datastax documentation. Decommissioning a datacenter | Apache
> Cassandra 3.x (datastax.com)
> 
>
> After upgrading to Cassandra 4.1.4, we have realized that there are
> stricter rules that do not allow removing the replication while active
> Cassandra nodes still exist in a datacenter.
>
> This check makes the above-mentioned procedure obsolete.
>
> I am thinking to use the following as an alternative:
>
>1. Make sure no clients are still writing to any nodes in the
>datacenter.
>2. Run a full repair with nodetool repair.
>3. Run nodetool decommission using the --force option on every node in
>the datacenter being removed.
>4. Change all keyspaces so they no longer reference the datacenter
>being removed.
>
>
>
> What is the procedure followed by other users? Do you see any risk
> following the proposed procedure?
>
>
>
> BR
>
> MK
>


Datacenter decommissioning on Cassandra 4.1.4

2024-04-08 Thread Michalis Kotsiouros (EXT) via user
Hello community,
In our deployments, we usually rebuild the Cassandra datacenters for 
maintenance or recovery operations.
The procedure used since the days of Cassandra 3.x was the one documented in 
datastax documentation. Decommissioning a datacenter | Apache Cassandra 3.x 
(datastax.com)
After upgrading to Cassandra 4.1.4, we have realized that there are stricter 
rules that do not allow removing the replication while active Cassandra 
nodes still exist in a datacenter.
This check makes the above-mentioned procedure obsolete.
I am thinking to use the following as an alternative:

  1.  Make sure no clients are still writing to any nodes in the datacenter.
  2.  Run a full repair with nodetool repair.
  3.  Run nodetool decommission using the --force option on every node in the 
datacenter being removed.
  4.  Change all keyspaces so they no longer reference the datacenter being 
removed.

What is the procedure followed by other users? Do you see any risk following 
the proposed procedure?

BR
MK


Re: Update: C/C NA Call for Presentations Deadline Extended to April 15th

2024-04-06 Thread Paulo Motta
Hi,

I would like to send a friendly reminder that the Community Over Code North
America 2024 call for presentations ends in a little less than 9 days on
Mon, 15 April 2024 22:59:59 UTC. Don't leave your Cassandra submissions to
the last minute! :-)

Thanks,

Paulo

On Tue, Mar 19, 2024 at 7:19 PM Paulo Motta  wrote:

> Hi,
>
> I wanted to update that the Call for Presentations deadline was extended
> by two weeks to April 15th, 2024 for Community Over Code North America
> 2024. Find more information on this blog post:
> https://news.apache.org/foundation/entry/apache-software-foundation-opens-cfp-for-community-over-code-north-america-2024
>
> We're looking for presentation abstracts in the following areas:
> * Customizing and tweaking Cassandra
> * Benchmarking and testing Cassandra
> * New Cassandra features and improvements
> * Provisioning and operating Cassandra
> * Developing with Cassandra
> * Anything else related to Apache Cassandra
>
> Please use this link to submit your proposal:
> https://sessionize.com/community-over-code-na-2024/
>
> Thanks,
>
> Paulo
>


Re: Query on Performance Dip

2024-04-05 Thread Jon Haddad
Try changing the chunk length parameter in the compression settings to 4 KB,
and reduce read-ahead to 16 KB if you're using EBS, or 4 KB if you're using a
decent local SSD or NVMe drive.

Counters read before write.

—
Jon Haddad
Rustyrazorblade Consulting
rustyrazorblade.com
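
As a rough sketch of those two changes (the keyspace, table, and block device
names are placeholders; verify which device backs your data directory before
changing read-ahead):

```shell
# 4 KB compression chunks on the hot table
cqlsh -e "ALTER TABLE my_ks.my_table WITH compression =
  {'class': 'LZ4Compressor', 'chunk_length_in_kb': 4};"

# blockdev --setra takes 512-byte sectors: 32 sectors = 16 KB (EBS),
# 8 sectors = 4 KB (local SSD/NVMe)
sudo blockdev --setra 32 /dev/nvme1n1
```

Note that existing SSTables keep their old chunk length until they are
rewritten by compaction or `nodetool upgradesstables -a`.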


On Fri, Apr 5, 2024 at 9:27 AM Subroto Barua  wrote:

> Follow-up question on the performance issue with 'counter writes': is there a
> parameter or condition that limits the allocation rate for
> 'CounterMutationStage'? I see 13-18mb/s for 4.1.4 vs 20-25mb/s for 4.0.5.
>
> The back-end infra is the same for both clusters, with the same test cases
> and data model.
> On Saturday, March 30, 2024 at 08:40:28 AM PDT, Jon Haddad <
> j...@jonhaddad.com> wrote:
>
>
> Hi,
>
> Unfortunately, the numbers you're posting have no meaning without
> context.  The speculative retries could be the cause of a problem, or you
> could simply be executing enough queries and you have a fairly high
> variance in latency which triggers them often.  It's unclear how many
> queries / second you're executing and there's no historical information to
> suggest if what you're seeing now is an anomaly or business as usual.
>
> If you want to determine if your theory that speculative retries are
> causing your performance issue, then you could try changing speculative
> retry to a fixed value instead of a percentile, such as 50MS.  It's easy
> enough to try and you can get an answer to your question almost immediately.
>
> The problem with this is that you're essentially guessing based on very
> limited information - the output of a nodetool command you've run "every
> few secs".  I prefer to use a more data driven approach.  Get a CPU flame
> graph and figure out where your time is spent:
> https://rustyrazorblade.com/post/2023/2023-11-07-async-profiler/
>
> The flame graph will reveal where your time is spent, and you can focus on
> improving that, rather than looking at a random statistic that you've
> picked.
>
> I just gave a talk at SCALE on distributed systems performance
> troubleshooting.  You'll be better off following a methodical process than
> guessing at potential root causes, because the odds of you correctly
> guessing the root cause in a system this complex is close to zero.  My talk
> is here: https://www.youtube.com/watch?v=VX9tHk3VTLE
>
> I'm guessing you don't have dashboards in place if you're relying on
> nodetool output with grep.  If your cluster is under 6 nodes, you can take
> advantage of AxonOps's free tier: https://axonops.com/
>
> Good dashboards are essential for these types of problems.
>
> Jon
>
>
>
> On Sat, Mar 30, 2024 at 2:33 AM ranju goel  wrote:
>
> Hi All,
>
> On debugging the cluster for the performance dip seen while using 4.1.4, I
> found a high Speculative retries value in nodetool tablestats during read
> operations.
>
> I ran the tablestats command below and checked its output every few
> seconds, and noticed that the retries keep rising. There is also an open
> ticket (https://issues.apache.org/jira/browse/CASSANDRA-18766) similar to
> this.
> /usr/share/cassandra/bin/nodetool -u  -pw  -p 
> tablestats  | grep -i 'Speculative retries'
>
>
>
> Speculative retries: 11633
>
> ..
>
> ..
>
> Speculative retries: 13727
>
>
>
> Speculative retries: 14256
>
> Speculative retries: 14855
>
> Speculative retries: 14858
>
> Speculative retries: 14859
>
> Speculative retries: 14873
>
> Speculative retries: 14875
>
> Speculative retries: 14890
>
> Speculative retries: 14893
>
> Speculative retries: 14896
>
> Speculative retries: 14901
>
> Speculative retries: 14905
>
> Speculative retries: 14946
>
> Speculative retries: 14948
>
> Speculative retries: 14957
>
>
> Suspecting this could be the cause of the performance dip.  Please chime in
> if anyone knows more about it.
>
>
> Regards
>
>
>
>
>
>
>
>
> On Wed, Mar 27, 2024 at 10:43 PM Subroto Barua via user <
> user@cassandra.apache.org> wrote:
>
> we are seeing similar perf issues with counter writes - to reproduce:
>
> cassandra-stress counter_write n=10 no-warmup cl=LOCAL_QUORUM -rate
> threads=50 -mode native cql3 user= password= -name 
>
>
> op rate: 39,260 ops (4.1) and 63,689 ops (4.0)
> latency 99th percentile: 7.7ms (4.1) and 1.8ms (4.0)
> Total GC count: 750 (4.1) and 744 (4.0)
> Avg GC time: 106 ms (4.1) and 110.1 ms (4.0)
>
>
> On Wednesday, March 27, 2024 at 12:18:50 AM PDT, ranju goel <
> goel.ra...@gmail.com> wrote:
>
>
> Hi All,
>
> Was going through this mail chain
> (https://www.mail-archive.com/user@cassandra.apache.org/msg63564.html)
>  and was wondering if this could cause a performance degradation in
> 4.1 without changing compactionThroughput.
>
> We are seeing a performance dip in read/write after upgrading 

Re: Query on Performance Dip

2024-04-05 Thread Subroto Barua via user
Follow-up question on the performance issue with 'counter writes': is there a 
parameter or condition that limits the allocation rate for 
'CounterMutationStage'? I see 13-18mb/s for 4.1.4 vs 20-25mb/s for 4.0.5.

The back-end infra is the same for both clusters, with the same test cases and 
data model.
On Saturday, March 30, 2024 at 08:40:28 AM PDT, Jon Haddad 
 wrote:  
 
 Hi,

Unfortunately, the numbers you're posting have no meaning without context.  The 
speculative retries could be the cause of a problem, or you could simply be 
executing enough queries and you have a fairly high variance in latency which 
triggers them often.  It's unclear how many queries / second you're executing 
and there's no historical information to suggest if what you're seeing now is 
an anomaly or business as usual.
If you want to determine if your theory that speculative retries are causing 
your performance issue, then you could try changing speculative retry to a 
fixed value instead of a percentile, such as 50MS.  It's easy enough to try and 
you can get an answer to your question almost immediately.
The problem with this is that you're essentially guessing based on very limited 
information - the output of a nodetool command you've run "every few secs".  I 
prefer to use a more data driven approach.  Get a CPU flame graph and figure 
out where your time is spent: 
https://rustyrazorblade.com/post/2023/2023-11-07-async-profiler/
The flame graph will reveal where your time is spent, and you can focus on 
improving that, rather than looking at a random statistic that you've picked.
I just gave a talk at SCALE on distributed systems performance troubleshooting. 
 You'll be better off following a methodical process than guessing at potential 
root causes, because the odds of you correctly guessing the root cause in a 
system this complex is close to zero.  My talk is here: 
https://www.youtube.com/watch?v=VX9tHk3VTLE
I'm guessing you don't have dashboards in place if you're relying on nodetool 
output with grep.  If your cluster is under 6 nodes, you can take advantage of 
AxonOps's free tier: https://axonops.com/
Good dashboards are essential for these types of problems.    
Jon


On Sat, Mar 30, 2024 at 2:33 AM ranju goel  wrote:

Hi All,
On debugging the cluster for the performance dip seen while using 4.1.4, I found 
a high Speculative retries value in nodetool tablestats during read operations.
I ran the tablestats command below and checked its output every few seconds, 
and noticed that the retries keep rising. There is also an open ticket 
(https://issues.apache.org/jira/browse/CASSANDRA-18766) similar to this.

/usr/share/cassandra/bin/nodetool -u  -pw  -p  tablestats  | grep -i 'Speculative retries'

                    

    Speculative retries: 11633
    ..
    ..
    Speculative retries: 13727

    Speculative retries: 14256
    Speculative retries: 14855
    Speculative retries: 14858
    Speculative retries: 14859
    Speculative retries: 14873
    Speculative retries: 14875
    Speculative retries: 14890
    Speculative retries: 14893
    Speculative retries: 14896
    Speculative retries: 14901
    Speculative retries: 14905
    Speculative retries: 14946
    Speculative retries: 14948
    Speculative retries: 14957




Suspecting this could be the cause of the performance dip. Please chime in if 
anyone knows more about it.
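
For watching that counter over time, a small polling loop could be used (the
JMX credentials and keyspace were elided in the original command and are shown
here as placeholder variables):

```shell
# Print a timestamped Speculative retries sample every 10 seconds
while true; do
  date '+%H:%M:%S'
  nodetool -u "$JMX_USER" -pw "$JMX_PASS" tablestats my_ks \
    | grep -i 'Speculative retries'
  sleep 10
done
```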




Regards













On Wed, Mar 27, 2024 at 10:43 PM Subroto Barua via user 
 wrote:

We are seeing similar perf issues with counter writes. To reproduce:

cassandra-stress counter_write n=10 no-warmup cl=LOCAL_QUORUM -rate 
threads=50 -mode native cql3 user= password= -name  


op rate: 39,260 ops (4.1) and 63,689 ops (4.0)
latency 99th percentile: 7.7ms (4.1) and 1.8ms (4.0)
Total GC count: 750 (4.1) and 744 (4.0)
Avg GC time: 106 ms (4.1) and 110.1 ms (4.0)

On Wednesday, March 27, 2024 at 12:18:50 AM PDT, ranju goel 
 wrote:  
 
 Hi All,

I was going through this mail chain 
(https://www.mail-archive.com/user@cassandra.apache.org/msg63564.html) and was 
wondering if this could cause a performance degradation in 4.1 without 
changing compactionThroughput.

We are seeing a performance dip in read/write after upgrading from 4.0 to 4.1.

Regards,
Ranju

  

Participate in the ASF 25th Anniversary Campaign

2024-04-03 Thread Brian Proffitt
Hi everyone,

As part of The ASF’s 25th anniversary campaign[1], we will be celebrating
projects and communities in multiple ways.

We invite all projects and contributors to participate in the following
ways:

* Individuals - submit your first contribution:
https://news.apache.org/foundation/entry/the-asf-launches-firstasfcontribution-campaign
* Projects - share your public good story:
https://docs.google.com/forms/d/1vuN-tUnBwpTgOE5xj3Z5AG1hsOoDNLBmGIqQHwQT6k8/viewform?edit_requested=true
* Projects - submit a project spotlight for the blog:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=278466116
* Projects - contact the Voice of Apache podcast (formerly Feathercast) to
be featured: https://feathercast.apache.org/help/
*  Projects - use the 25th anniversary template and the #ASF25Years hashtag
on social media:
https://docs.google.com/presentation/d/1oDbMol3F_XQuCmttPYxBIOIjRuRBksUjDApjd8Ve3L8/edit#slide=id.g26b0919956e_0_13

If you have questions, email the Marketing & Publicity team at
mark...@apache.org.

Peace,
BKP

[1] https://apache.org/asf25years/

[NOTE: You are receiving this message because you are a contributor to an
Apache Software Foundation project. The ASF will very occasionally send out
messages relating to the Foundation to contributors and members, such as
this one.]

Brian Proffitt
VP, Marketing & Publicity
VP, Conferences


Re: Query on Performance Dip

2024-03-30 Thread Jon Haddad
Hi,

Unfortunately, the numbers you're posting have no meaning without context.
The speculative retries could be the cause of a problem, or you could
simply be executing enough queries and you have a fairly high variance in
latency which triggers them often.  It's unclear how many queries / second
you're executing and there's no historical information to suggest if what
you're seeing now is an anomaly or business as usual.

If you want to determine if your theory that speculative retries are
causing your performance issue, then you could try changing speculative
retry to a fixed value instead of a percentile, such as 50MS.  It's easy
enough to try and you can get an answer to your question almost immediately.
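
That experiment could look like the following, assuming a hypothetical table
`my_ks.my_table` (remember to set the option back to the default percentile
once you have your answer):

```shell
# Pin speculative retry to a fixed 50 ms instead of a percentile
cqlsh -e "ALTER TABLE my_ks.my_table WITH speculative_retry = '50ms';"

# ... run the workload and compare latencies / retry counts ...

# Revert to the default 99th-percentile behavior
cqlsh -e "ALTER TABLE my_ks.my_table WITH speculative_retry = '99p';"
```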

The problem with this is that you're essentially guessing based on very
limited information - the output of a nodetool command you've run "every
few secs".  I prefer to use a more data-driven approach.  Get a CPU flame
graph and figure out where your time is spent:
https://rustyrazorblade.com/post/2023/2023-11-07-async-profiler/

The flame graph will reveal where your time is spent, and you can focus on
improving that, rather than looking at a random statistic that you've
picked.

I just gave a talk at SCALE on distributed systems performance
troubleshooting.  You'll be better off following a methodical process than
guessing at potential root causes, because the odds of correctly guessing
the root cause in a system this complex are close to zero.  My talk is
here: https://www.youtube.com/watch?v=VX9tHk3VTLE

I'm guessing you don't have dashboards in place if you're relying on
nodetool output with grep.  If your cluster is under 6 nodes, you can take
advantage of AxonOps's free tier: https://axonops.com/

Good dashboards are essential for these types of problems.

Jon



On Sat, Mar 30, 2024 at 2:33 AM ranju goel  wrote:

> Hi All,
>
> While debugging the performance dip seen on 4.1.4, I found a high
> speculative retries value in nodetool tablestats during read
> operations.
>
> I ran the tablestats command below and checked its output every few
> seconds, and noticed that the retries keep rising. There is also one open
> ticket (https://issues.apache.org/jira/browse/CASSANDRA-18766) similar to
> this.
> /usr/share/cassandra/bin/nodetool -u  -pw  -p 
> tablestats  | grep -i 'Speculative retries'
>
>
>
> Speculative retries: 11633
>
> ..
>
> ..
>
> Speculative retries: 13727
>
>
>
> Speculative retries: 14256
>
> Speculative retries: 14855
>
> Speculative retries: 14858
>
> Speculative retries: 14859
>
> Speculative retries: 14873
>
> Speculative retries: 14875
>
> Speculative retries: 14890
>
> Speculative retries: 14893
>
> Speculative retries: 14896
>
> Speculative retries: 14901
>
> Speculative retries: 14905
>
> Speculative retries: 14946
>
> Speculative retries: 14948
>
> Speculative retries: 14957
>
>
> I suspect this could be the cause of the performance dip.  Please chime in
> if anyone knows more about it.
>
>
> Regards
>
>
>
>
>
>
>
>
> On Wed, Mar 27, 2024 at 10:43 PM Subroto Barua via user <
> user@cassandra.apache.org> wrote:
>
>> we are seeing similar perf issues with counter writes - to reproduce:
>>
>> cassandra-stress counter_write n=10 no-warmup cl=LOCAL_QUORUM -rate
>> threads=50 -mode native cql3 user= password= -name 
>>
>>
>> op rate: 39,260 ops (4.1) and 63,689 ops (4.0)
>> latency 99th percentile: 7.7ms (4.1) and 1.8ms (4.0)
>> Total GC count: 750 (4.1) and 744 (4.0)
>> Avg GC time: 106 ms (4.1) and 110.1 ms (4.0)
>>
>>
>> On Wednesday, March 27, 2024 at 12:18:50 AM PDT, ranju goel <
>> goel.ra...@gmail.com> wrote:
>>
>>
>> Hi All,
>>
>> I was going through this mail chain
>> (https://www.mail-archive.com/user@cassandra.apache.org/msg63564.html)
>> and was wondering whether this could cause a performance degradation in
>> 4.1 without changing compactionThroughput.
>>
>> We are seeing a performance dip in read/write after upgrading from 4.0 to 4.1.
>>
>> Regards
>> Ranju
>>
>


Re: Query on Performance Dip

2024-03-30 Thread ranju goel
Hi All,

While debugging the performance dip seen on 4.1.4, I found a high
speculative retries value in nodetool tablestats during read
operations.

I ran the tablestats command below and checked its output every few
seconds, and noticed that the retries keep rising. There is also one open
ticket (https://issues.apache.org/jira/browse/CASSANDRA-18766) similar to
this.
/usr/share/cassandra/bin/nodetool -u  -pw  -p 
tablestats  | grep -i 'Speculative retries'



Speculative retries: 11633

..

..

Speculative retries: 13727



Speculative retries: 14256

Speculative retries: 14855

Speculative retries: 14858

Speculative retries: 14859

Speculative retries: 14873

Speculative retries: 14875

Speculative retries: 14890

Speculative retries: 14893

Speculative retries: 14896

Speculative retries: 14901

Speculative retries: 14905

Speculative retries: 14946

Speculative retries: 14948

Speculative retries: 14957


I suspect this could be the cause of the performance dip.  Please chime in
if anyone knows more about it.
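The value nodetool reports is a cumulative counter, so the rate of change between samples is more telling than the absolute number. A minimal sketch (hypothetical helper names, not part of any Cassandra tooling) that turns two timed samples of the grep output into a retries-per-second rate:

```python
import re

def speculative_retries(tablestats_output: str) -> list[int]:
    """Extract every 'Speculative retries: N' counter from nodetool tablestats text."""
    return [int(m) for m in re.findall(r"Speculative retries:\s*(\d+)", tablestats_output)]

def retry_rate(count_a: int, count_b: int, seconds: float) -> float:
    """Retries per second between two samples of the same cumulative counter."""
    return (count_b - count_a) / seconds

# Two samples of the grep output, taken (say) 10 seconds apart:
sample_1 = "Speculative retries: 14855"
sample_2 = "Speculative retries: 14905"

rate = retry_rate(speculative_retries(sample_1)[0],
                  speculative_retries(sample_2)[0],
                  seconds=10.0)
print(rate)  # 5.0 retries/second between the two samples
```

A steadily positive rate during reads confirms retries are firing continuously rather than in one burst.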


Regards








On Wed, Mar 27, 2024 at 10:43 PM Subroto Barua via user <
user@cassandra.apache.org> wrote:

> we are seeing similar perf issues with counter writes - to reproduce:
>
> cassandra-stress counter_write n=10 no-warmup cl=LOCAL_QUORUM -rate
> threads=50 -mode native cql3 user= password= -name 
>
>
> op rate: 39,260 ops (4.1) and 63,689 ops (4.0)
> latency 99th percentile: 7.7ms (4.1) and 1.8ms (4.0)
> Total GC count: 750 (4.1) and 744 (4.0)
> Avg GC time: 106 ms (4.1) and 110.1 ms (4.0)
>
>
> On Wednesday, March 27, 2024 at 12:18:50 AM PDT, ranju goel <
> goel.ra...@gmail.com> wrote:
>
>
> Hi All,
>
> I was going through this mail chain
> (https://www.mail-archive.com/user@cassandra.apache.org/msg63564.html)
> and was wondering whether this could cause a performance degradation in
> 4.1 without changing compactionThroughput.
>
> We are seeing a performance dip in read/write after upgrading from 4.0 to 4.1.
>
> Regards
> Ranju
>


Re: Query on Performance Dip

2024-03-27 Thread Subroto Barua via user
We are seeing similar perf issues with counter writes - to reproduce:

cassandra-stress counter_write n=10 no-warmup cl=LOCAL_QUORUM -rate 
threads=50 -mode native cql3 user= password= -name  


op rate: 39,260 ops (4.1) and 63,689 ops (4.0)
latency 99th percentile: 7.7ms (4.1) and 1.8ms (4.0)
Total GC count: 750 (4.1) and 744 (4.0)
Avg GC time: 106 ms (4.1) and 110.1 ms (4.0)

On Wednesday, March 27, 2024 at 12:18:50 AM PDT, ranju goel 
 wrote:  
 
 Hi All,

I was going through this mail chain
(https://www.mail-archive.com/user@cassandra.apache.org/msg63564.html) and was
wondering whether this could cause a performance degradation in 4.1 without
changing compactionThroughput.

We are seeing a performance dip in read/write after upgrading from 4.0 to 4.1.

Regards
Ranju

Re: Cassandra 5.0 Beta1 - vector searching results

2024-03-27 Thread Caleb Rackliffe
> For your #1 - if there are going to be 100+ million vectors, wouldn't I
want the search to go across nodes?

If you have a replication factor of 3 and 3 nodes, every node will have a
complete copy of the data, so you'd only need to talk to one node. If your
replication factor is 1, you'd have to talk to all three nodes.
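Replication factor is set per keyspace; for reference, a minimal sketch with placeholder keyspace and datacenter names:

```cql
-- 'doc' and 'dc1' are placeholder keyspace/datacenter names.
-- RF 3 on a 3-node cluster: every node holds a full copy, so one node can
-- serve the whole vector search.
CREATE KEYSPACE doc
  WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};

-- RF 1 would give each partition a single owner, forcing a cluster-wide
-- search to contact every node:
--   ... WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 1};
```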

On Wed, Mar 27, 2024 at 9:06 AM Joe Obernberger <
joseph.obernber...@gmail.com> wrote:

> Thank you all for the details on this.
> For your #1 - if there are going to be 100+ million vectors, wouldn't I
> want the search to go across nodes?
>
> Right now, we're running both weaviate (8 node cluster), our main
> cassandra 4 cluster (12 nodes), and a test 3 node cassandra 5 cluster.
> Weaviate does some interesting things like product quantization to reduce
> size and improve search speed.  They get amazing speed, but the drawback
> is, from what I can tell, they load the entire index into RAM.  We've been
> having a recurring issue where once it runs out of RAM, it doesn't get
> slow; it just stops working.  Weaviate enables some powerful
> vector+boolean+range queries.  I would love to only have one database!
>
> I'll look into how to do profiling - the terms you use are things I'm not
> familiar with, but I've got chatGPT and google... :)
>
> -Joe
> On 3/21/2024 10:51 PM, Caleb Rackliffe wrote:
>
> To expand on Jonathan’s response, the best way to get SAI to perform on
> the read side is to use it as a tool for large-partition search. In other
> words, if you can model your data such that your queries will be restricted
> to a single partition, two things will happen…
>
> 1.) With all queries (not just ANN queries), you will only hit as many
> nodes as your read consistency level and replication factor require. For
> vector searches, that means you should only hit one node, and it should be
> the coordinating node w/ a properly configured, token-aware client.
>
> 2.) You can use LCS (or UCS configured to mimic LCS) instead of STCS as
> your table compaction strategy. This will essentially guarantee your
> (partition-restricted) SAI query hits a small number of SSTable-attached
> indexes. (It’ll hit Memtable-attached indexes as well for any recently
> added data, so if you’re seeing latencies shoot up, it’s possible there
> could be contention on the Memtable-attached index that supports ANN
> queries. I haven’t done a deep dive on it. You can always flush Memtables
> directly before queries to factor that out.)
>
> If you can do all of the above, the simple performance of the local index
> query and its post-filtering reads is probably the place to explore
> further. If you manage to collect any profiling data (JFR, flamegraphs via
> async-profiler, etc) I’d be happy to dig into it with you.
>
> Thanks for kicking the tires!
>
> On Mar 21, 2024, at 8:20 PM, Brebner, Paul via user
>   wrote:
>
> 
>
> Hi Joe,
>
>
>
> Have you considered submitting something for Community Over Code NA 2024?
> The CFP is still open for a few more weeks, options could be my Performance
> Engineering track or the Cassandra track – or both 
>
>
>
>
> https://www.linkedin.com/pulse/cfp-community-over-code-na-denver-2024-performance-track-paul-brebner-nagmc/?trackingId=PlmmMjMeQby0Mozq8cnIpA%3D%3D
>
>
>
> Regards, Paul Brebner
>
>
>
>
>
>
>
> *From: *Joe Obernberger 
> 
> *Date: *Friday, 22 March 2024 at 3:19 am
> *To: *user@cassandra.apache.org 
> 
> *Subject: *Cassandra 5.0 Beta1 - vector searching results
>
> EXTERNAL EMAIL - USE CAUTION when clicking links or attachments
>
>
>
>
> Hi All - I'd like to share some initial results for the vector search on
> Cassandra 5.0 beta1.  3 node cluster running in kubernetes; fast Netapp
> storage.
>
> Have a table (doc.embeddings_googleflant5large) with definition:
>
> CREATE TABLE doc.embeddings_googleflant5large (
>  uuid text,
>  type text,
>  fieldname text,
>  offset int,
>  sourceurl text,
>  textdata text,
>  creationdate timestamp,
>  embeddings vector,
>  metadata boolean,
>  source text,
>  PRIMARY KEY ((uuid, type), fieldname, offset, sourceurl, textdata)
> ) WITH CLUSTERING ORDER BY (fieldname ASC, offset ASC, sourceurl ASC,
> textdata ASC)
>  AND additional_write_policy = '99p'
>  AND allow_auto_snapshot = true
>  AND bloom_filter_fp_chance = 0.01
>  AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
>  AND cdc = false
>  AND comment = ''
>  AND compaction = {'class':
> 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
> 'max_threshold': '32', 'min_threshold': '4'}
>  AND compression = {'chunk_length_in_kb': '16', 'class':
> 'org.apache.cassandra.io.compress.LZ4Compressor'}
>  AND memtable = 'default'
>  AND crc_check_chance = 1.0
>  AND default_time_to_live = 0
>  AND extensions = {}
>  AND gc_grace_seconds = 864000
>  AND incremental_backups = true
>  AND max_index_interval = 2048
>  AND memtable_flush_period_in_ms = 0

Re: Cassandra 5.0 Beta1 - vector searching results

2024-03-27 Thread Joe Obernberger

Thank you all for the details on this.
For your #1 - if there are going to be 100+ million vectors, wouldn't I 
want the search to go across nodes?


Right now, we're running both weaviate (8 node cluster), our main 
cassandra 4 cluster (12 nodes), and a test 3 node cassandra 5 cluster.  
Weaviate does some interesting things like product quantization to 
reduce size and improve search speed.  They get amazing speed, but the 
drawback is, from what I can tell, they load the entire index into RAM.  
We've been having a recurring issue where once it runs out of RAM, it 
doesn't get slow; it just stops working.  Weaviate enables some powerful 
vector+boolean+range queries.  I would love to only have one database!


I'll look into how to do profiling - the terms you use are things I'm 
not familiar with, but I've got chatGPT and google... :)


-Joe

On 3/21/2024 10:51 PM, Caleb Rackliffe wrote:
To expand on Jonathan’s response, the best way to get SAI to perform 
on the read side is to use it as a tool for large-partition search. In 
other words, if you can model your data such that your queries will be 
restricted to a single partition, two things will happen…


1.) With all queries (not just ANN queries), you will only hit as many 
nodes as your read consistency level and replication factor require. 
For vector searches, that means you should only hit one node, and it 
should be the coordinating node w/ a properly configured, token-aware 
client.


2.) You can use LCS (or UCS configured to mimic LCS) instead of STCS 
as your table compaction strategy. This will essentially guarantee 
your (partition-restricted) SAI query hits a small number of 
SSTable-attached indexes. (It’ll hit Memtable-attached indexes as well 
for any recently added data, so if you’re seeing latencies shoot up, 
it’s possible there could be contention on the Memtable-attached index 
that supports ANN queries. I haven’t done a deep dive on it. You can 
always flush Memtables directly before queries to factor that out.)
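Point 2 above is a table-level option; a minimal sketch, using the table from this thread as a stand-in:

```cql
-- Sketch: switch the thread's table from STCS to LCS so that a
-- partition-restricted SAI query touches few SSTable-attached indexes.
ALTER TABLE doc.embeddings_googleflant5large
  WITH compaction = {'class': 'LeveledCompactionStrategy'};

-- Per the parenthetical above, 'nodetool flush doc embeddings_googleflant5large'
-- forces Memtable data into SSTables before a test query, factoring out
-- contention on the Memtable-attached index.
```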


If you can do all of the above, the simple performance of the local 
index query and its post-filtering reads is probably the place to 
explore further. If you manage to collect any profiling data (JFR, 
flamegraphs via async-profiler, etc) I’d be happy to dig into it with you.


Thanks for kicking the tires!

On Mar 21, 2024, at 8:20 PM, Brebner, Paul via user 
 wrote:




Hi Joe,

Have you considered submitting something for Community Over Code NA 
2024? The CFP is still open for a few more weeks, options could be my 
Performance Engineering track or the Cassandra track – or both 


https://www.linkedin.com/pulse/cfp-community-over-code-na-denver-2024-performance-track-paul-brebner-nagmc/?trackingId=PlmmMjMeQby0Mozq8cnIpA%3D%3D

Regards, Paul Brebner

*From: *Joe Obernberger 
*Date: *Friday, 22 March 2024 at 3:19 am
*To: *user@cassandra.apache.org 
*Subject: *Cassandra 5.0 Beta1 - vector searching results

EXTERNAL EMAIL - USE CAUTION when clicking links or attachments




Hi All - I'd like to share some initial results for the vector search on
Cassandra 5.0 beta1.  3 node cluster running in kubernetes; fast Netapp
storage.

Have a table (doc.embeddings_googleflant5large) with definition:

CREATE TABLE doc.embeddings_googleflant5large (
 uuid text,
 type text,
 fieldname text,
 offset int,
 sourceurl text,
 textdata text,
 creationdate timestamp,
 embeddings vector,
 metadata boolean,
 source text,
 PRIMARY KEY ((uuid, type), fieldname, offset, sourceurl, textdata)
) WITH CLUSTERING ORDER BY (fieldname ASC, offset ASC, sourceurl ASC,
textdata ASC)
 AND additional_write_policy = '99p'
 AND allow_auto_snapshot = true
 AND bloom_filter_fp_chance = 0.01
 AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
 AND cdc = false
 AND comment = ''
 AND compaction = {'class':
'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
'max_threshold': '32', 'min_threshold': '4'}
 AND compression = {'chunk_length_in_kb': '16', 'class':
'org.apache.cassandra.io.compress.LZ4Compressor'}
 AND memtable = 'default'
 AND crc_check_chance = 1.0
 AND default_time_to_live = 0
 AND extensions = {}
 AND gc_grace_seconds = 864000
 AND incremental_backups = true
 AND max_index_interval = 2048
 AND memtable_flush_period_in_ms = 0
 AND min_index_interval = 128
 AND read_repair = 'BLOCKING'
 AND speculative_retry = '99p';

CREATE CUSTOM INDEX ann_index_googleflant5large ON
doc.embeddings_googleflant5large (embeddings) USING 'sai';
CREATE CUSTOM INDEX offset_index_googleflant5large ON
doc.embeddings_googleflant5large (offset) USING 'sai';
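With the ann_index above in place, a vector search is expressed via ORDER BY ... ANN OF. A sketch only: the WHERE values and the three-element vector literal are placeholders, and a real query needs a vector of the table's declared dimensionality (elided in the schema above):

```cql
SELECT uuid, type, textdata
FROM doc.embeddings_googleflant5large
WHERE uuid = 'doc-123' AND type = 'page'      -- partition-restricted, per Caleb's advice
ORDER BY embeddings ANN OF [0.12, 0.34, 0.56] -- placeholder embedding
LIMIT 10;
```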

nodetool status -r

UN cassandra-1.cassandra5.cassandra5-jos.svc.cluster.local 18.02 GiB
128 100.0% f2989dea-908b-4c06-9caa-4aacad8ba0e8  rack1
UN cassandra-2.cassandra5.cassandra5-jos.svc.cluster.local 17.98 GiB
128 100.0% ec4e506d-5f0d-475a-a3c1-aafe58399412  rack1

Community Over Code NA 2024 Travel Assistance Applications now open!

2024-03-27 Thread Gavin McDonald
Hello to all users, contributors and Committers!

[ You are receiving this email as a subscriber to one or more ASF project
dev or user
  mailing lists and is not being sent to you directly. It is important that
we reach all of our
  users and contributors/committers so that they may get a chance to
benefit from this.
  We apologise in advance if this doesn't interest you but it is on topic
for the mailing
  lists of the Apache Software Foundation; and it is important please that
you do not
  mark this as spam in your email client. Thank You! ]

The Travel Assistance Committee (TAC) are pleased to announce that
travel assistance applications for Community over Code NA 2024 are now
open!

We will be supporting Community over Code NA, Denver Colorado in
October 7th to the 10th 2024.

TAC exists to help those that would like to attend Community over Code
events, but are unable to do so for financial reasons. For more info
on this years applications and qualifying criteria, please visit the
TAC website at < https://tac.apache.org/ >. Applications are already
open on https://tac-apply.apache.org/, so don't delay!

The Apache Travel Assistance Committee will only be accepting
applications from those people that are able to attend the full event.

Important: Applications close on Monday 6th May, 2024.

Applicants have until the closing date above to submit their
applications (which should contain as much supporting material as
required to efficiently and accurately process their request), this
will enable TAC to announce successful applications shortly
afterwards.

As usual, TAC expects to deal with a range of applications from a
diverse range of backgrounds; therefore, we encourage (as always)
anyone thinking about sending in an application to do so ASAP.

For those that will need a Visa to enter the Country - we advise you apply
now so that you have enough time in case of interview delays. So do not
wait until you know if you have been accepted or not.

We look forward to greeting many of you in Denver, Colorado , October 2024!

Kind Regards,

Gavin

(On behalf of the Travel Assistance Committee)


Query on Performance Dip

2024-03-27 Thread ranju goel
Hi All,

Was going through this mail chain
(https://www.mail-archive.com/user@cassandra.apache.org/msg63564.html)
 and was wondering that if this could cause a performance degradation in
4.1 without changing compactionThroughput.

As seeing performance dip in Read/Write after upgrading from 4.0 to 4.1.

Regards
Ranju


Re: Cassandra 5.0 Beta1 - vector searching results

2024-03-25 Thread Brebner, Paul via user
Hi all, curious if there is support for the new Cassandra vector data type in 
any open-source Kafka Connect Cassandra Sink connectors please? i.e. To write 
vector data to Cassandra from Kafka. Regards, Paul

From: Caleb Rackliffe 
Date: Friday, 22 March 2024 at 1:52 pm
To: user@cassandra.apache.org 
Subject: Re: Cassandra 5.0 Beta1 - vector searching results

EXTERNAL EMAIL - USE CAUTION when clicking links or attachments


To expand on Jonathan’s response, the best way to get SAI to perform on the 
read side is to use it as a tool for large-partition search. In other words, if 
you can model your data such that your queries will be restricted to a single 
partition, two things will happen…

1.) With all queries (not just ANN queries), you will only hit as many nodes as 
your read consistency level and replication factor require. For vector 
searches, that means you should only hit one node, and it should be the 
coordinating node w/ a properly configured, token-aware client.

2.) You can use LCS (or UCS configured to mimic LCS) instead of STCS as your 
table compaction strategy. This will essentially guarantee your 
(partition-restricted) SAI query hits a small number of SSTable-attached 
indexes. (It’ll hit Memtable-attached indexes as well for any recently added 
data, so if you’re seeing latencies shoot up, it’s possible there could be 
contention on the Memtable-attached index that supports ANN queries. I haven’t 
done a deep dive on it. You can always flush Memtables directly before queries 
to factor that out.)

If you can do all of the above, the simple performance of the local index query 
and its post-filtering reads is probably the place to explore further. If you 
manage to collect any profiling data (JFR, flamegraphs via async-profiler, etc) 
I’d be happy to dig into it with you.

Thanks for kicking the tires!


On Mar 21, 2024, at 8:20 PM, Brebner, Paul via user  
wrote:

Hi Joe,

Have you considered submitting something for Community Over Code NA 2024? The 
CFP is still open for a few more weeks, options could be my Performance 
Engineering track or the Cassandra track – or both 

https://www.linkedin.com/pulse/cfp-community-over-code-na-denver-2024-performance-track-paul-brebner-nagmc/?trackingId=PlmmMjMeQby0Mozq8cnIpA%3D%3D

Regards, Paul Brebner



From: Joe Obernberger 
Date: Friday, 22 March 2024 at 3:19 am
To: user@cassandra.apache.org 
Subject: Cassandra 5.0 Beta1 - vector searching results
EXTERNAL EMAIL - USE CAUTION when clicking links or attachments




Hi All - I'd like to share some initial results for the vector search on
Cassandra 5.0 beta1.  3 node cluster running in kubernetes; fast Netapp
storage.

Have a table (doc.embeddings_googleflant5large) with definition:

CREATE TABLE doc.embeddings_googleflant5large (
 uuid text,
 type text,
 fieldname text,
 offset int,
 sourceurl text,
 textdata text,
 creationdate timestamp,
 embeddings vector,
 metadata boolean,
 source text,
 PRIMARY KEY ((uuid, type), fieldname, offset, sourceurl, textdata)
) WITH CLUSTERING ORDER BY (fieldname ASC, offset ASC, sourceurl ASC,
textdata ASC)
 AND additional_write_policy = '99p'
 AND allow_auto_snapshot = true
 AND bloom_filter_fp_chance = 0.01
 AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
 AND cdc = false
 AND comment = ''
 AND compaction = {'class':
'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
'max_threshold': '32', 'min_threshold': '4'}
 AND compression = {'chunk_length_in_kb': '16', 'class':
'org.apache.cassandra.io.compress.LZ4Compressor'}
 AND memtable = 'default'
 AND crc_check_chance = 1.0
 AND default_time_to_live = 0
 AND extensions = {}
 AND gc_grace_seconds = 864000
 AND incremental_backups = true
 AND max_index_interval = 2048
 AND memtable_flush_period_in_ms = 0
 AND min_index_interval = 128
 AND read_repair = 'BLOCKING'
 AND speculative_retry = '99p';

CREATE CUSTOM INDEX ann_index_googleflant5large ON
doc.embeddings_googleflant5large (embeddings) USING 'sai';
CREATE CUSTOM INDEX offset_index_googleflant5large ON
doc.embeddings_googleflant5large (offset) USING 'sai';

nodetool status -r

UN  cassandra-1.cassandra5.cassandra5-jos.svc.cluster.local 18.02 GiB
128 100.0% f2989dea-908b-4c06-9caa-4aacad8ba0e8  rack1
UN  cassandra-2.cassandra5.cassandra5-jos.svc.cluster.local  17.98 GiB
128 100.0% ec4e506d-5f0d-475a-a3c1-aafe58399412  rack1
UN  cassandra-0.cassandra5.cassandra5-jos.svc.cluster.local  18.16 GiB
128 100.0% 92c6d909-ee01-4124-ae03-3b9e2d5e74c0  rack1

nodetool tablestats doc.embeddings_googleflant5large

Total number of tables: 1

Keyspace: doc
 Read Count: 0
 Read Latency: NaN ms
 Write Count: 2893108
 Write Latency: 

Apache Cassandra Virtual Meetups this week

2024-03-25 Thread Paul Au
Hello Cassandra community!

There are two virtual events happening this week. Hope to see you all
there.

*Cassandra Contributor Call*
*CEP-34: mTLS Based Client and Internode Authenticators*
Presented by Jyothsna Konica  & Dinesh Josh
Tuesday, March 26 at 10:00AM PDT
https://www.meetup.com/cassandra-global/events/299617622/

*Cassandra Town Hall*
*Scalable Objects Persistence V2* | Gerardo Recinto
*Cassandra Corner: Behind the Scenes* | Aaron Ploetz
*State of Cassandra Quarterly Update* | Josh McKenzie
Thursday, March 28th at 8:00AM PDT
https://www.meetup.com/cassandra-global/events/299617844/


*Paul Au*
Community Manager
Constantia / DoK Community / Data Mesh Learning / Apache Cassandra
Contributor
LinkedIn 


Re: Cassandra 5.0 Beta1 - vector searching results

2024-03-21 Thread Caleb Rackliffe
To expand on Jonathan’s response, the best way to get SAI to perform on the read side is to use it as a tool for large-partition search. In other words, if you can model your data such that your queries will be restricted to a single partition, two things will happen…

1.) With all queries (not just ANN queries), you will only hit as many nodes as your read consistency level and replication factor require. For vector searches, that means you should only hit one node, and it should be the coordinating node w/ a properly configured, token-aware client.

2.) You can use LCS (or UCS configured to mimic LCS) instead of STCS as your table compaction strategy. This will essentially guarantee your (partition-restricted) SAI query hits a small number of SSTable-attached indexes. (It’ll hit Memtable-attached indexes as well for any recently added data, so if you’re seeing latencies shoot up, it’s possible there could be contention on the Memtable-attached index that supports ANN queries. I haven’t done a deep dive on it. You can always flush Memtables directly before queries to factor that out.)

If you can do all of the above, the simple performance of the local index query and its post-filtering reads is probably the place to explore further. If you manage to collect any profiling data (JFR, flamegraphs via async-profiler, etc) I’d be happy to dig into it with you.

Thanks for kicking the tires!

On Mar 21, 2024, at 8:20 PM, Brebner, Paul via user  wrote:







Hi Joe,

Have you considered submitting something for Community Over Code NA 2024? The CFP is still open for a few more weeks, options could be my Performance Engineering track or the Cassandra track – or both 

https://www.linkedin.com/pulse/cfp-community-over-code-na-denver-2024-performance-track-paul-brebner-nagmc/?trackingId=PlmmMjMeQby0Mozq8cnIpA%3D%3D

Regards, Paul Brebner



From: Joe Obernberger 
Date: Friday, 22 March 2024 at 3:19 am
To: user@cassandra.apache.org 
Subject: Cassandra 5.0 Beta1 - vector searching results


EXTERNAL EMAIL - USE CAUTION when clicking links or attachments




Hi All - I'd like to share some initial results for the vector search on
Cassandra 5.0 beta1.  3 node cluster running in kubernetes; fast Netapp
storage.

Have a table (doc.embeddings_googleflant5large) with definition:

CREATE TABLE doc.embeddings_googleflant5large (
 uuid text,
 type text,
 fieldname text,
 offset int,
 sourceurl text,
 textdata text,
 creationdate timestamp,
 embeddings vector,
 metadata boolean,
 source text,
 PRIMARY KEY ((uuid, type), fieldname, offset, sourceurl, textdata)
) WITH CLUSTERING ORDER BY (fieldname ASC, offset ASC, sourceurl ASC,
textdata ASC)
 AND additional_write_policy = '99p'
 AND allow_auto_snapshot = true
 AND bloom_filter_fp_chance = 0.01
 AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
 AND cdc = false
 AND comment = ''
 AND compaction = {'class':
'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
'max_threshold': '32', 'min_threshold': '4'}
 AND compression = {'chunk_length_in_kb': '16', 'class':
'org.apache.cassandra.io.compress.LZ4Compressor'}
 AND memtable = 'default'
 AND crc_check_chance = 1.0
 AND default_time_to_live = 0
 AND extensions = {}
 AND gc_grace_seconds = 864000
 AND incremental_backups = true
 AND max_index_interval = 2048
 AND memtable_flush_period_in_ms = 0
 AND min_index_interval = 128
 AND read_repair = 'BLOCKING'
 AND speculative_retry = '99p';

CREATE CUSTOM INDEX ann_index_googleflant5large ON
doc.embeddings_googleflant5large (embeddings) USING 'sai';
CREATE CUSTOM INDEX offset_index_googleflant5large ON
doc.embeddings_googleflant5large (offset) USING 'sai';

nodetool status -r

UN  cassandra-1.cassandra5.cassandra5-jos.svc.cluster.local 18.02 GiB
128 100.0% f2989dea-908b-4c06-9caa-4aacad8ba0e8  rack1
UN  cassandra-2.cassandra5.cassandra5-jos.svc.cluster.local  17.98 GiB
128 100.0% ec4e506d-5f0d-475a-a3c1-aafe58399412  rack1
UN  cassandra-0.cassandra5.cassandra5-jos.svc.cluster.local  18.16 GiB
128 100.0% 92c6d909-ee01-4124-ae03-3b9e2d5e74c0  rack1

nodetool tablestats doc.embeddings_googleflant5large

Total number of tables: 1

Keyspace: doc
 Read Count: 0
 Read Latency: NaN ms
 Write Count: 2893108
 Write Latency: 326.3586520174843 ms
 Pending Flushes: 0
 Table: embeddings_googleflant5large
 SSTable count: 6
 Old SSTable count: 0
 Max SSTable size: 5.108GiB
 Space used (live): 19318114423
 Space used (total): 19318114423
 Space used by snapshots (total): 0
 Off heap memory used (total): 4874912
 SSTable Compression Ratio: 0.97448
 Number of partitions (estimate): 58399
 Memtable cell count: 0

Re: Cassandra 5.0 Beta1 - vector searching results

2024-03-21 Thread Brebner, Paul via user
Hi Joe,

Have you considered submitting something for Community Over Code NA 2024? The 
CFP is still open for a few more weeks, options could be my Performance 
Engineering track or the Cassandra track – or both 

https://www.linkedin.com/pulse/cfp-community-over-code-na-denver-2024-performance-track-paul-brebner-nagmc/?trackingId=PlmmMjMeQby0Mozq8cnIpA%3D%3D

Regards, Paul Brebner



From: Joe Obernberger 
Date: Friday, 22 March 2024 at 3:19 am
To: user@cassandra.apache.org 
Subject: Cassandra 5.0 Beta1 - vector searching results
EXTERNAL EMAIL - USE CAUTION when clicking links or attachments




Hi All - I'd like to share some initial results for the vector search on
Cassandra 5.0 beta1.  3 node cluster running in kubernetes; fast Netapp
storage.

Have a table (doc.embeddings_googleflant5large) with definition:

CREATE TABLE doc.embeddings_googleflant5large (
 uuid text,
 type text,
 fieldname text,
 offset int,
 sourceurl text,
 textdata text,
 creationdate timestamp,
 embeddings vector,
 metadata boolean,
 source text,
 PRIMARY KEY ((uuid, type), fieldname, offset, sourceurl, textdata)
) WITH CLUSTERING ORDER BY (fieldname ASC, offset ASC, sourceurl ASC,
textdata ASC)
 AND additional_write_policy = '99p'
 AND allow_auto_snapshot = true
 AND bloom_filter_fp_chance = 0.01
 AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
 AND cdc = false
 AND comment = ''
 AND compaction = {'class':
'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
'max_threshold': '32', 'min_threshold': '4'}
 AND compression = {'chunk_length_in_kb': '16', 'class':
'org.apache.cassandra.io.compress.LZ4Compressor'}
 AND memtable = 'default'
 AND crc_check_chance = 1.0
 AND default_time_to_live = 0
 AND extensions = {}
 AND gc_grace_seconds = 864000
 AND incremental_backups = true
 AND max_index_interval = 2048
 AND memtable_flush_period_in_ms = 0
 AND min_index_interval = 128
 AND read_repair = 'BLOCKING'
 AND speculative_retry = '99p';

CREATE CUSTOM INDEX ann_index_googleflant5large ON
doc.embeddings_googleflant5large (embeddings) USING 'sai';
CREATE CUSTOM INDEX offset_index_googleflant5large ON
doc.embeddings_googleflant5large (offset) USING 'sai';

nodetool status -r

UN  cassandra-1.cassandra5.cassandra5-jos.svc.cluster.local 18.02 GiB
128 100.0% f2989dea-908b-4c06-9caa-4aacad8ba0e8  rack1
UN  cassandra-2.cassandra5.cassandra5-jos.svc.cluster.local  17.98 GiB
128 100.0% ec4e506d-5f0d-475a-a3c1-aafe58399412  rack1
UN  cassandra-0.cassandra5.cassandra5-jos.svc.cluster.local  18.16 GiB
128 100.0% 92c6d909-ee01-4124-ae03-3b9e2d5e74c0  rack1

nodetool tablestats doc.embeddings_googleflant5large

Total number of tables: 1

Keyspace: doc
 Read Count: 0
 Read Latency: NaN ms
 Write Count: 2893108
 Write Latency: 326.3586520174843 ms
 Pending Flushes: 0
 Table: embeddings_googleflant5large
 SSTable count: 6
 Old SSTable count: 0
 Max SSTable size: 5.108GiB
 Space used (live): 19318114423
 Space used (total): 19318114423
 Space used by snapshots (total): 0
 Off heap memory used (total): 4874912
 SSTable Compression Ratio: 0.97448
 Number of partitions (estimate): 58399
 Memtable cell count: 0
 Memtable data size: 0
 Memtable off heap memory used: 0
 Memtable switch count: 16
 Speculative retries: 0
 Local read count: 0
 Local read latency: NaN ms
 Local write count: 2893108
 Local write latency: NaN ms
 Local read/write ratio: 0.0
 Pending flushes: 0
 Percent repaired: 100.0
 Bytes repaired: 9.066GiB
 Bytes unrepaired: 0B
 Bytes pending repair: 0B
 Bloom filter false positives: 7245
 Bloom filter false ratio: 0.00286
 Bloom filter space used: 87264
 Bloom filter off heap memory used: 87216
 Index summary off heap memory used: 34624
 Compression metadata off heap memory used: 4753072
 Compacted partition minimum bytes: 2760
 Compacted partition maximum bytes: 4866323
 Compacted partition mean bytes: 154523
 Average live cells per slice (last five minutes): NaN
 Maximum live cells per slice (last five minutes): 0
 Average tombstones per slice (last five minutes): NaN
 Maximum tombstones per slice (last five minutes): 0
 Droppable tombstone ratio: 0.0

nodetool tablehistograms doc.embeddings_googleflant5large


Re: Cassandra 5.0 Beta1 - vector searching results

2024-03-21 Thread Jonathan Ellis
Hi Joe,

Thanks for testing out vector search!

Cassandra 5.0 is about six months behind on vector search progress.  Part
of this is keeping up with JVector releases but more of it is core
improvements to SAI.  Unfortunately there's no easy fix for the impedance
mismatch between a field where the state of the art is improving almost
daily, and a project with a release cycle measured in years.

DataStax's cutting-edge vector search work is public and open source [1]
but it's going to be a while before we have bandwidth to upstream it to
Apache, and longer before it can be released in 5.1 or 6.0.  If you're
interested in collaborating on this, I'm happy to get you pointed in the
right direction.

In the meantime, I can also recommend trying out DataStax's Astra [2]
service, where we deploy improvements regularly.  My guesstimate is that
Astra will be 2x faster at vanilla ANN queries (with no WHERE clause) and
10x-100x faster at queries with additional predicates, depending on the
cardinality.  (As an example of what needs to be upstreamed, we added a
primitive cost-based analyzer back in January to fix the kind of timeouts
you're seeing with offset=1, and we just committed a more sophisticated one
this week [3].)

If you're stuck with 5.0, my best advice is to compact as aggressively as
possible, since SAI queries are O(N) in the number of sstables.

[1] https://github.com/datastax/cassandra/tree/vsearch
[2] https://www.datastax.com/products/datastax-astra
[3]
https://github.com/datastax/cassandra/commit/eeb33dd62b9b74ecf818a263fd73dbe6714b0df0
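One way to act on the "compact as aggressively as possible" advice, besides a one-off major compaction (nodetool compact doc embeddings_googleflant5large), is to make STCS merge sooner. A sketch with illustrative threshold values — more frequent compaction trades extra I/O for fewer sstables per query:

```shell
# Sketch: check the current sstable count, then (illustratively) lower
# min_threshold so SizeTieredCompactionStrategy merges sstables sooner.
# The cqlsh invocation assumes cqlsh is on PATH and the cluster is local.
echo "would run: nodetool tablestats doc.embeddings_googleflant5large | grep 'SSTable count'"
cat <<'EOF'
ALTER TABLE doc.embeddings_googleflant5large
WITH compaction = {
  'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
  'min_threshold': '2',
  'max_threshold': '32'
};
EOF
```

The heredoc only prints the statement; against a live cluster it would be fed to cqlsh.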

On Thu, Mar 21, 2024 at 9:19 AM Joe Obernberger <
joseph.obernber...@gmail.com> wrote:

> Hi All - I'd like to share some initial results for the vector search on
> Cassandra 5.0 beta1.  3 node cluster running in kubernetes; fast Netapp
> storage.
>
> Have a table (doc.embeddings_googleflant5large) with definition:
>
> CREATE TABLE doc.embeddings_googleflant5large (
>  uuid text,
>  type text,
>  fieldname text,
>  offset int,
>  sourceurl text,
>  textdata text,
>  creationdate timestamp,
>  embeddings vector,
>  metadata boolean,
>  source text,
>  PRIMARY KEY ((uuid, type), fieldname, offset, sourceurl, textdata)
> ) WITH CLUSTERING ORDER BY (fieldname ASC, offset ASC, sourceurl ASC,
> textdata ASC)
>  AND additional_write_policy = '99p'
>  AND allow_auto_snapshot = true
>  AND bloom_filter_fp_chance = 0.01
>  AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
>  AND cdc = false
>  AND comment = ''
>  AND compaction = {'class':
> 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
> 'max_threshold': '32', 'min_threshold': '4'}
>  AND compression = {'chunk_length_in_kb': '16', 'class':
> 'org.apache.cassandra.io.compress.LZ4Compressor'}
>  AND memtable = 'default'
>  AND crc_check_chance = 1.0
>  AND default_time_to_live = 0
>  AND extensions = {}
>  AND gc_grace_seconds = 864000
>  AND incremental_backups = true
>  AND max_index_interval = 2048
>  AND memtable_flush_period_in_ms = 0
>  AND min_index_interval = 128
>  AND read_repair = 'BLOCKING'
>  AND speculative_retry = '99p';
>
> CREATE CUSTOM INDEX ann_index_googleflant5large ON
> doc.embeddings_googleflant5large (embeddings) USING 'sai';
> CREATE CUSTOM INDEX offset_index_googleflant5large ON
> doc.embeddings_googleflant5large (offset) USING 'sai';
>
> nodetool status -r
>
> UN  cassandra-1.cassandra5.cassandra5-jos.svc.cluster.local 18.02 GiB
> 128 100.0% f2989dea-908b-4c06-9caa-4aacad8ba0e8  rack1
> UN  cassandra-2.cassandra5.cassandra5-jos.svc.cluster.local  17.98 GiB
> 128 100.0% ec4e506d-5f0d-475a-a3c1-aafe58399412  rack1
> UN  cassandra-0.cassandra5.cassandra5-jos.svc.cluster.local  18.16 GiB
> 128 100.0% 92c6d909-ee01-4124-ae03-3b9e2d5e74c0  rack1
>
> nodetool tablestats doc.embeddings_googleflant5large
>
> Total number of tables: 1
> 
> Keyspace: doc
>  Read Count: 0
>  Read Latency: NaN ms
>  Write Count: 2893108
>  Write Latency: 326.3586520174843 ms
>  Pending Flushes: 0
>  Table: embeddings_googleflant5large
>  SSTable count: 6
>  Old SSTable count: 0
>  Max SSTable size: 5.108GiB
>  Space used (live): 19318114423
>  Space used (total): 19318114423
>  Space used by snapshots (total): 0
>  Off heap memory used (total): 4874912
>  SSTable Compression Ratio: 0.97448
>  Number of partitions (estimate): 58399
>  Memtable cell count: 0
>  Memtable data size: 0
>  Memtable off heap memory used: 0
>  Memtable switch count: 16
>  Speculative retries: 0
>  Local read count: 0
>  Local read latency: NaN ms
>  Local 

Cassandra 5.0 Beta1 - vector searching results

2024-03-21 Thread Joe Obernberger
Hi All - I'd like to share some initial results for the vector search on 
Cassandra 5.0 beta1.  3 node cluster running in kubernetes; fast Netapp 
storage.


Have a table (doc.embeddings_googleflant5large) with definition:

CREATE TABLE doc.embeddings_googleflant5large (
    uuid text,
    type text,
    fieldname text,
    offset int,
    sourceurl text,
    textdata text,
    creationdate timestamp,
    embeddings vector,
    metadata boolean,
    source text,
    PRIMARY KEY ((uuid, type), fieldname, offset, sourceurl, textdata)
) WITH CLUSTERING ORDER BY (fieldname ASC, offset ASC, sourceurl ASC, 
textdata ASC)

    AND additional_write_policy = '99p'
    AND allow_auto_snapshot = true
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND cdc = false
    AND comment = ''
    AND compaction = {'class': 
'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 
'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '16', 'class': 
'org.apache.cassandra.io.compress.LZ4Compressor'}

    AND memtable = 'default'
    AND crc_check_chance = 1.0
    AND default_time_to_live = 0
    AND extensions = {}
    AND gc_grace_seconds = 864000
    AND incremental_backups = true
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair = 'BLOCKING'
    AND speculative_retry = '99p';

CREATE CUSTOM INDEX ann_index_googleflant5large ON 
doc.embeddings_googleflant5large (embeddings) USING 'sai';
CREATE CUSTOM INDEX offset_index_googleflant5large ON 
doc.embeddings_googleflant5large (offset) USING 'sai';


nodetool status -r

UN  cassandra-1.cassandra5.cassandra5-jos.svc.cluster.local 18.02 GiB  
128 100.0% f2989dea-908b-4c06-9caa-4aacad8ba0e8  rack1
UN  cassandra-2.cassandra5.cassandra5-jos.svc.cluster.local  17.98 GiB  
128 100.0% ec4e506d-5f0d-475a-a3c1-aafe58399412  rack1
UN  cassandra-0.cassandra5.cassandra5-jos.svc.cluster.local  18.16 GiB  
128 100.0% 92c6d909-ee01-4124-ae03-3b9e2d5e74c0  rack1


nodetool tablestats doc.embeddings_googleflant5large

Total number of tables: 1

Keyspace: doc
    Read Count: 0
    Read Latency: NaN ms
    Write Count: 2893108
    Write Latency: 326.3586520174843 ms
    Pending Flushes: 0
    Table: embeddings_googleflant5large
    SSTable count: 6
    Old SSTable count: 0
    Max SSTable size: 5.108GiB
    Space used (live): 19318114423
    Space used (total): 19318114423
    Space used by snapshots (total): 0
    Off heap memory used (total): 4874912
    SSTable Compression Ratio: 0.97448
    Number of partitions (estimate): 58399
    Memtable cell count: 0
    Memtable data size: 0
    Memtable off heap memory used: 0
    Memtable switch count: 16
    Speculative retries: 0
    Local read count: 0
    Local read latency: NaN ms
    Local write count: 2893108
    Local write latency: NaN ms
    Local read/write ratio: 0.0
    Pending flushes: 0
    Percent repaired: 100.0
    Bytes repaired: 9.066GiB
    Bytes unrepaired: 0B
    Bytes pending repair: 0B
    Bloom filter false positives: 7245
    Bloom filter false ratio: 0.00286
    Bloom filter space used: 87264
    Bloom filter off heap memory used: 87216
    Index summary off heap memory used: 34624
    Compression metadata off heap memory used: 4753072
    Compacted partition minimum bytes: 2760
    Compacted partition maximum bytes: 4866323
    Compacted partition mean bytes: 154523
    Average live cells per slice (last five minutes): NaN
    Maximum live cells per slice (last five minutes): 0
    Average tombstones per slice (last five minutes): NaN
    Maximum tombstones per slice (last five minutes): 0
    Droppable tombstone ratio: 0.0

nodetool tablehistograms doc.embeddings_googleflant5large

doc/embeddings_googleflant5large histograms
Percentile   Read Latency   Write Latency   SSTables   Partition Size   Cell Count
                 (micros)        (micros)                      (bytes)
50%                  0.00            0.00       0.00           105778          124
75%                  0.00            0.00       0.00           182785          215
95%                  0.00            0.00       0.00           379022          446
98%                  0.00            0.00       0.00           545791          642
99%                  0.00            0.00       0.00           654949   

Re: Alternate apt repo for Debian installation?

2024-03-20 Thread Grant Talarico
Oh, nevermind. It looks like debian.cassandra.apache.org has come back
online and I can once again pull from the apt repo.

On Wed, Mar 20, 2024 at 2:15 PM Grant Talarico  wrote:

> I already tried those. My particular application requires a minimum
> version of 3.11.14 and I have 3.11.16 installed in my staging environment.
> The archive.apache.org repo only has 3.11.13 as its latest.
>
> On Wed, Mar 20, 2024 at 1:55 PM Bowen Song via user <
> user@cassandra.apache.org> wrote:
>
>> You can try https://archive.apache.org/dist/cassandra/debian/
>>
>> The deb files can be found here:
>> https://archive.apache.org/dist/cassandra/debian/pool/main/c/cassandra/
>> On 20/03/2024 20:47, Grant Talarico wrote:
>>
>> Hi there. Hopefully this is the right place to ask this question. I'm
>> trying to install the latest version of Cassandra 3.11 using debian
>> packages through the debian.cassandra.apache.org apt repo but it appears
>> to be down at the moment. Is there an alternate apt repo I might be able to
>> use as a backup?
>>
>> - Grant
>>
>>
>
> --
>
> *Grant Talarico IT Senior Systems Engineer*
>
>
> 901 Marshall St, Suite 200
> Redwood City, CA 94063
> http://www.imvu.com
>


-- 

*Grant Talarico IT Senior Systems Engineer*


901 Marshall St, Suite 200
Redwood City, CA 94063
http://www.imvu.com


Re: Alternate apt repo for Debian installation?

2024-03-20 Thread Grant Talarico
I already tried those. My particular application requires a minimum version
of 3.11.14 and I have 3.11.16 installed in my staging environment. The
archive.apache.org repo only has 3.11.13 as its latest.

On Wed, Mar 20, 2024 at 1:55 PM Bowen Song via user <
user@cassandra.apache.org> wrote:

> You can try https://archive.apache.org/dist/cassandra/debian/
>
> The deb files can be found here:
> https://archive.apache.org/dist/cassandra/debian/pool/main/c/cassandra/
> On 20/03/2024 20:47, Grant Talarico wrote:
>
> Hi there. Hopefully this is the right place to ask this question. I'm
> trying to install the latest version of Cassandra 3.11 using debian
> packages through the debian.cassandra.apache.org apt repo but it appears
> to be down at the moment. Is there an alternate apt repo I might be able to
> use as a backup?
>
> - Grant
>
>

-- 

*Grant Talarico IT Senior Systems Engineer*


901 Marshall St, Suite 200
Redwood City, CA 94063
http://www.imvu.com


Re: Alternate apt repo for Debian installation?

2024-03-20 Thread Bowen Song via user

You can try https://archive.apache.org/dist/cassandra/debian/

The deb files can be found here: 
https://archive.apache.org/dist/cassandra/debian/pool/main/c/cassandra/
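For reference, pointing apt at the archive means adding a sources entry like the following — a sketch only: the "311x" suite name mirrors the live repo's convention for the 3.11 line, and the signing keyring still has to be set up separately before installing:

```shell
# Sketch: write a sources.list entry pointing at the archived repo for the
# 3.11 series. Written to /tmp here for illustration; on a real system it
# would go to /etc/apt/sources.list.d/ and be followed by `apt update`.
cat > /tmp/cassandra-archive.list <<'EOF'
deb https://archive.apache.org/dist/cassandra/debian 311x main
EOF
cat /tmp/cassandra-archive.list
```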


On 20/03/2024 20:47, Grant Talarico wrote:
Hi there. Hopefully this is the right place to ask this question. I'm 
trying to install the latest version of Cassandra 3.11 using debian 
packages through the debian.cassandra.apache.org apt repo but it appears to be 
down at the moment. Is there an alternate apt repo I might be able to 
use as a backup?


- Grant


Alternate apt repo for Debian installation?

2024-03-20 Thread Grant Talarico
Hi there. Hopefully this is the right place to ask this question. I'm
trying to install the latest version of Cassandra 3.11 using debian
packages through the debian.cassandra.apache.org apt repo but it appears to
be down at the moment. Is there an alternate apt repo I might be able to
use as a backup?

- Grant


Tomorrow 10AM PDT - Examining LWT perf in 5.0

2024-03-19 Thread Jon Haddad
Hey folks,

I'm doing a working session tomorrow at 10am PDT, testing LWTs in C* 5.0.
I'll be running benchmarks and doing some performance analysis.  Come hang
out and bring your questions!

Jon

YouTube: https://www.youtube.com/watch?v=IoWh647LRQ0

LinkedIn:
https://www.linkedin.com/events/cassandra5workingsession-lightw7174223694586687490/comments/


Update: C/C NA Call for Presentations Deadline Extended to April 15th

2024-03-19 Thread Paulo Motta
Hi,

I wanted to update that the Call for Presentations deadline was extended by
two weeks to April 15th, 2024 for Community Over Code North America 2024.
Find more information on this blog post:
https://news.apache.org/foundation/entry/apache-software-foundation-opens-cfp-for-community-over-code-north-america-2024

We're looking for presentation abstracts in the following areas:
* Customizing and tweaking Cassandra
* Benchmarking and testing Cassandra
* New Cassandra features and improvements
* Provisioning and operating Cassandra
* Developing with Cassandra
* Anything else related to Apache Cassandra

Please use this link to submit your proposal:
https://sessionize.com/community-over-code-na-2024/

Thanks,

Paulo


Re: [EXTERNAL] Re: About Cassandra stable version having Java 17 support

2024-03-18 Thread Bowen Song via user

Short answer:

There's no definite answer to that question.


Longer answer:

I doubt such a date has been decided yet. It's largely driven by the 
time required to fix known issues and any potential new issues 
discovered during the BETA and RC process. If you want to track the 
progress, feel free to look at the project's Jira boards; there's a 
dedicated 5.0 GA board for that.


Furthermore, it's likely there will only be experimental support for 
Java 17 in Cassandra 5.0, which means it shouldn't be used in production 
environments.


So, would you like to keep waiting indefinitely for official Java 17 
support, or run Cassandra 4.1 on Java 11 today and upgrade when a newer 
version becomes available?



On 18/03/2024 13:10, Divyanshi Kaushik via user wrote:

Thanks for your reply.

As Cassandra has moved to Java 17 in its *5.0-BETA1* (latest release 
on 2023-12-05), can you please let us know when the team is planning 
to GA the Cassandra 5.0 version that has Java 17 support?


Regards,
Divyanshi

*From:* Bowen Song via user 
*Sent:* Monday, March 18, 2024 5:14 PM
*To:* user@cassandra.apache.org 
*Cc:* Bowen Song 
*Subject:* [EXTERNAL] Re: About Cassandra stable version having Java 
17 support


*CAUTION:* This email originated from outside the organization. Do not 
click links or open attachments unless you recognize the sender and 
know the content is safe.


Why Java 17? It makes no sense to choose an officially non-supported 
library version for a piece of software. That decision-making process 
is the problem, not the software's library version compatibility.



On 18/03/2024 09:44, Divyanshi Kaushik via user wrote:

Hi All,

As per my project requirement, Java 17 needs to be used. Can you
please let us know when you are planning to release the next
stable version of Cassandra having Java 17 support?

Regards,
Divyanshi
This email and any files transmitted with it are confidential,
proprietary and intended solely for the individual or entity to
whom they are addressed. If you have received this email in error
please delete it immediately.


Re: [EXTERNAL] Re: About Cassandra stable version having Java 17 support

2024-03-18 Thread Divyanshi Kaushik via user
Thanks for your reply.

As Cassandra has moved to Java 17 in its 5.0-BETA1 (latest release on 
2023-12-05), can you please let us know when the team is planning to GA the 
Cassandra 5.0 version that has Java 17 support?

Regards,
Divyanshi

From: Bowen Song via user 
Sent: Monday, March 18, 2024 5:14 PM
To: user@cassandra.apache.org 
Cc: Bowen Song 
Subject: [EXTERNAL] Re: About Cassandra stable version having Java 17 support


CAUTION: This email originated from outside the organization. Do not click 
links or open attachments unless you recognize the sender and know the content 
is safe.

Why Java 17? It makes no sense to choose an officially non-supported library 
version for a piece of software. That decision-making process is the problem, 
not the software's library version compatibility.


On 18/03/2024 09:44, Divyanshi Kaushik via user wrote:
Hi All,

As per my project requirement, Java 17 needs to be used. Can you please let us 
know when you are planning to release the next stable version of Cassandra 
having Java 17 support?

Regards,
Divyanshi
This email and any files transmitted with it are confidential, proprietary and 
intended solely for the individual or entity to whom they are addressed. If you 
have received this email in error please delete it immediately.


Re: About Cassandra stable version having Java 17 support

2024-03-18 Thread Bowen Song via user
Why Java 17? It makes no sense to choose an officially non-supported 
library version for a piece of software. That decision-making process is 
the problem, not the software's library version compatibility.



On 18/03/2024 09:44, Divyanshi Kaushik via user wrote:

Hi All,

As per my project requirement, Java 17 needs to be used. Can you 
please let us know when you are planning to release the next stable 
version of Cassandra having Java 17 support?


Regards,
Divyanshi
This email and any files transmitted with it are confidential, 
proprietary and intended solely for the individual or entity to whom 
they are addressed. If you have received this email in error please 
delete it immediately. 

Two weeks remaining to submit abstracts to Community Over Code 2024

2024-03-18 Thread Paulo Motta
Hi,

I'd like to send a friendly reminder that the deadline for submissions to
Community Over Code North America 2024 ends in two weeks, on April 1st, 2024.
This conference will be held in Denver, Colorado, October 7-10, 2024.

We're looking for abstracts in the following areas:
* Customizing and tweaking Cassandra
* Benchmarking and testing Cassandra
* New Cassandra features and improvements
* Provisioning Cassandra
* Developing with Cassandra
* Anything else related to Apache Cassandra

Please use this link to submit your proposal:
https://sessionize.com/community-over-code-na-2024/

It will be possible to update proposals after acceptance, so provisional
titles and abstracts are fine. At this moment we're interested in
collecting presentation ideas that can be refined later.

I recommend checking out these resources before submitting a proposal:
* How to Write a Successful Conference Abstract by Tim Berglund <
https://www.youtube.com/watch?v=N0g3QoCuqH4>
* How to Write an Abstract by Philip Koopman, Carnegie Mellon University <
https://users.ece.cmu.edu/~koopman/essays/abstract.html> (this article is
targeted to academic abstracts but some tips can be useful to presentation
abstracts)

I would be happy to answer any questions.

Cheers,

Paulo


Re: Documentation about TTL and tombstones

2024-03-18 Thread Sebastian Marsching

> It's actually correct to do it the way it is today.
> The insertion date does not matter; what matters is the time after tombstones 
> are supposed to be deleted.
> If the delete got to all nodes, sure, no problem, but if any of the nodes 
> didn't get the delete and you got rid of the tombstones before running 
> a repair, you might have nodes that still have that data.
> Then, following a repair, that data will be copied to other replicas, and the 
> data you thought you deleted will be brought back to life.

Sure, for regular data that does not have a TTL, this makes sense. But I claim 
that data with a TTL is deleted when it is inserted. It’s just that this delete 
only becomes effective at some future date.

In order to understand whether data might reappear, we have to consider six 
cases. Let us first consider the three cases where the INSERT / UPDATE did not 
overwrite any existing data that would have lived longer than the new data:

1. Let us assume that the data is successfully written to all nodes and no 
repair is run. After the TTL expires, the data turns into a tombstone, but 
because the data was present on all nodes, the tombstone is present on all 
nodes, so there is no risk of data reappearing.

2. Let us assume that this data is not written to all nodes but a repair is run 
within the TTL. After that, we effectively have the first situation, so there 
is no risk of data reappearing.

3. Let us assume that this data is not written to all nodes and no repair is 
run within the TTL. After the TTL has passed, the data expires on the nodes 
where it has been written. Now, we have tombstones on these nodes. If we get 
rid of the tombstones, there is no risk of the data reappearing, because there 
are no nodes that have the data, so even if we run a repair in the future, 
there is no risk that the data magically reappears.

Now, let us consider the cases where the newly inserted data overwrote data 
that either had no TTL or had a TTL that would have expired after the TTL of 
the newly inserted data. Again, there are three possible scenarios:

4. Let us assume that the data is successfully written to all nodes and no 
repair is run. After the TTL expires, the data turns into a tombstone, but 
because the data was present on all nodes, the tombstone is present on all 
nodes, so there is no risk of data reappearing.

5. Let us assume that this data is not written to all nodes but a repair is run 
within the TTL. After that, we effectively have the first situation, so there 
is no risk of data reappearing.

6. Let us assume that this data is not written to all nodes and no repair is 
run within the TTL. After the TTL has passed, the data expires on the nodes 
where it has been written. Now, we have tombstones on these nodes. If we get 
rid of the tombstones, there is the risk of the data reappearing, because the 
older data that was overwritten by the INSERT / UPDATE might still exist on 
some nodes, and as the data with the TTL never made it to these nodes, there is 
no tombstone on these nodes and thus the older data can reappear.

So, we only have to worry about the last scenario. In this scenario, we have to 
ensure that either the inserted data with the TTL is repaired (which brings us 
back to scenario 5), or that the tombstones are repaired before they are 
discarded.

This is why I claim that for data with a TTL, gc_grace_seconds should 
effectively start when the data is inserted, not when it is converted into a 
tombstone: It does not matter whether the data with the TTL is repaired or the 
tombstone is repaired. As long as either of these things happens between the 
data with the TTL being inserted and the tombstone being reclaimed, there is 
no risk of deleted or overwritten data reappearing.
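The scenarios above boil down to a single condition, sketched here as a tiny predicate (the function and its argument names are mine, not Cassandra's): overwritten data can only reappear when the new write overwrote longer-lived data, missed some replicas, and no repair ran within the TTL.

```shell
# Sketch of the scenario analysis above as one condition.
# $1 = overwrote longer-lived data?  $2 = reached all nodes?  $3 = repaired within TTL?
can_reappear() {
  if [ "$1" = yes ] && [ "$2" = no ] && [ "$3" = no ]; then echo yes; else echo no; fi
}
can_reappear no  no  no    # scenario 3 -> no
can_reappear yes yes no    # scenario 4 -> no
can_reappear yes no  yes   # scenario 5 -> no
can_reappear yes no  no    # scenario 6 -> yes
```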





About Cassandra stable version having Java 17 support

2024-03-18 Thread Divyanshi Kaushik via user
Hi All,

As per my project requirement, Java 17 needs to be used. Can you please let us 
know when you are planning to release the next stable version of Cassandra 
having Java 17 support?

Regards,
Divyanshi
This email and any files transmitted with it are confidential, proprietary and 
intended solely for the individual or entity to whom they are addressed. If you 
have received this email in error please delete it immediately.


Re: Documentation about TTL and tombstones

2024-03-17 Thread Gil Ganz
It's actually correct to do it the way it is today.
The insertion date does not matter; what matters is the time after tombstones
are supposed to be deleted.
If the delete got to all nodes, sure, no problem, but if any of the nodes
didn't get the delete and you got rid of the tombstones before
running a repair, you might have nodes that still have that data.
Then, following a repair, that data will be copied to other replicas, and
the data you thought you deleted will be brought back to life.

On Sat, Mar 16, 2024 at 5:39 PM Sebastian Marsching 
wrote:

> > That's not how gc_grace_seconds works.
> > gc_grace_seconds controls how much time *after* a tombstone is created
> it can actually be deleted, in order to give you enough time to
> run repairs.
> >
> > Say you have data that is about to expire on March 16 8am, and
> gc_grace_seconds is 10 days.
> > After Mar 16 8am that data will be a tombstone, and only after March 26
> 8am, a compaction  *might* remove it, if all other conditions are met.
>
> You are right. I do not understand why it is implemented this way, but you
> are 100 % correct that it works this way.
>
> I thought that gc_grace_seconds is all about being able to repair the
> table before tombstones are removed, so that deleted data cannot reappear.
> But when the data has a TTL, it should not matter whether the original data
> or the tombstone is synchronized as part of the repair process. After all,
> the original data should turn into a tombstone, so if it was present on all
> nodes, there is no risk of deleted data reappearing. Therefore, I think it
> would make more sense to start gc_grace_seconds when the data is inserted /
> updated. I don’t know why it was not implemented this way.
>
>


Re: Documentation about TTL and tombstones

2024-03-16 Thread Sebastian Marsching

> That's not how gc_grace_seconds works.
> gc_grace_seconds controls how much time *after* a tombstone is created it 
> can actually be deleted, in order to give you enough time to run repairs.
>
> Say you have data that is about to expire on March 16 8am, and 
> gc_grace_seconds is 10 days.
> After Mar 16 8am that data will be a tombstone, and only after March 26 8am, 
> a compaction  *might* remove it, if all other conditions are met.

You are right. I do not understand why it is implemented this way, but you are 
100 % correct that it works this way.

I thought that gc_grace_seconds is all about being able to repair the table 
before tombstones are removed, so that deleted data cannot reappear. But when 
the data has a TTL, it should not matter whether the original data or the 
tombstone is synchronized as part of the repair process. After all, the 
original data should turn into a tombstone, so if it was present on all nodes, 
there is no risk of deleted data reappearing. Therefore, I think it would make 
more sense to start gc_grace_seconds when the data is inserted / updated. I 
don’t know why it was not implemented this way.





Re: Documentation about TTL and tombstones

2024-03-16 Thread Gil Ganz
That's not how gc_grace_seconds works.
gc_grace_seconds controls how much time *after* a tombstone is created it can
actually be deleted, in order to give you enough time to run repairs.

Say you have data that is about to expire on March 16 8am, and
gc_grace_seconds is 10 days.
After Mar 16 8am that data will be a tombstone, and only after March 26
8am, a compaction  *might* remove it, if all other conditions are met.
gil
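The dates in Gil's example can be checked with date arithmetic — a sketch that assumes GNU coreutils `date` (the syntax differs on BSD/macOS) and uses UTC for concreteness, since the original timestamps carry no zone:

```shell
# Sketch: data expiring March 16 8am with gc_grace_seconds = 10 days
# becomes purgeable by compaction only after March 26 8am.
expires="2024-03-16 08:00 UTC"
purgeable=$(date -u -d "$expires + 10 days" '+%Y-%m-%d %H:%M')
echo "tombstone purgeable after: $purgeable"   # -> tombstone purgeable after: 2024-03-26 08:00
```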


On Fri, Mar 15, 2024 at 12:58 AM Sebastian Marsching <
sebast...@marsching.com> wrote:

>
> by reading the documentation about TTL
>
> https://cassandra.apache.org/doc/4.1/cassandra/operating/compaction/index.html#ttl
> It mentions that it creates a tombstone when data expires. How is that
> possible without writing a tombstone to the table? I thought TTL
> doesn't create tombstones, since the TTL is present together with the
> write-time timestamp at the row level.
>
>
> If you read carefully, you will notice that no tombstone is created and
> instead the data is *converted* into a tombstone. So, after the TTL has
> expired, the inserted data effectively acts as a tombstone. This is needed,
> because the now expired data might hide older data that has not expired
> yet. If the newer data was simply dropped after the TTL expired, older data
> might reappear.
>
> If I understand it correctly, you can avoid data with a TTL being
> converted into a tombstone by choosing a TTL that is greater than
> gc_grace_seconds. Technically, the data is still going to be converted into
> a tombstone when the TTL expires, but this tombstone will immediately be
> eligible for garbage collection.
>
>


Re: Documentation about TTL and tombstones

2024-03-14 Thread Sebastian Marsching

> by reading the documentation about TTL
> https://cassandra.apache.org/doc/4.1/cassandra/operating/compaction/index.html#ttl
> It mentions that it creates a tombstone when data expires. How is that 
> possible without writing a tombstone to the table? I thought TTL 
> doesn't create tombstones, since the TTL is present together with the 
> write-time timestamp at the row level.

If you read carefully, you will notice that no tombstone is created and instead 
the data is *converted* into a tombstone. So, after the TTL has expired, the 
inserted data effectively acts as a tombstone. This is needed, because the now 
expired data might hide older data that has not expired yet. If the newer data 
was simply dropped after the TTL expired, older data might reappear.

If I understand it correctly, you can avoid data with a TTL being converted 
into a tombstone by choosing a TTL that is greater than gc_grace_seconds. 
Technically, the data is still going to be converted into a tombstone when the 
TTL expires, but this tombstone will immediately be eligible for garbage 
collection.
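That interplay can be sketched on a hypothetical toy table — the keyspace/table name `ks.events` and all values are illustrative, not from the thread:

```cql
-- Hypothetical table with gc_grace_seconds (1 day) below the TTL (7 days),
-- so the tombstone an expired cell turns into is already past its grace period.
CREATE TABLE ks.events (id int PRIMARY KEY, payload text)
  WITH gc_grace_seconds = 86400;
INSERT INTO ks.events (id, payload) VALUES (1, 'x')
  USING TTL 604800;
-- TTL() and WRITETIME() show the remaining lifetime and the insertion time:
SELECT TTL(payload), WRITETIME(payload) FROM ks.events WHERE id = 1;
```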





Documentation about TTL and tombstones

2024-03-14 Thread Jean Carlo
Hello community,

by reading the documentation about TTL
https://cassandra.apache.org/doc/4.1/cassandra/operating/compaction/index.html#ttl
It mentions that it creates a tombstone when data expires. How is that
possible without writing a tombstone to the table? I thought TTL
doesn't create tombstones, since the TTL is present together with the
write-time timestamp at the row level.
Greetings

Jean Carlo

"The best way to predict the future is to invent it" Alan Kay


RE: SStables stored in directory with different table ID than the one found in system_schema.tables

2024-03-13 Thread Michalis Kotsiouros (EXT) via user
Hello everyone,

The recovery was performed successfully some days ago. Finally, the problematic 
datacenter was removed and added back to the cluster.

 

BR

MK

 

From: Michalis Kotsiouros (EXT) via user  
Sent: February 12, 2024 17:59
To: Sebastian Marsching ; user@cassandra.apache.org
Cc: Michalis Kotsiouros (EXT) 
Subject: RE: SStables stored in directory with different table ID than the one 
found in system_schema.tables

 

Hello Sebastian and community,

Thanks a lot for the post. It is really helpful.

After some additional observations, I am more concerned about trying to 
rename/move the sstables directory. I have observed that my client processes 
complain about missing columns even though those columns appear on the describe 
schema output.

My plan is to first try a restart of the Cassandra nodes and if that does not 
help to re-build the datacenter – remove it and then add it back to the cluster.

 

BR

MK

 

From: Sebastian Marsching <sebast...@marsching.com>
Sent: February 10, 2024 01:00
To: Bowen Song via user <user@cassandra.apache.org>
Cc: Michalis Kotsiouros (EXT) <michalis.kotsiouros@ericsson.com>
Subject: Re: SStables stored in directory with different table ID than the one 
found in system_schema.tables

 

You might find the following discussion from the mailing-list archive helpful:

 

https://lists.apache.org/thread/6hnypp6vfxj1yc35ptp0xf15f11cx77d

 

This thread discusses a similar situation and gives a few pointers on when it 
might be safe to simply move the SSTables around.

 

On 08.02.2024 at 13:06, Michalis Kotsiouros (EXT) via user 
<user@cassandra.apache.org> wrote:

 

Hello everyone,

I have found this post on-line and seems to be recent.

 

 Mismatch between Cassandra table uuid in linux file directory and 
system_schema.tables - Stack Overflow

The description seems to be the same as my problem as well.

In this post, the proposal is to copy the sstables to the dir with the ID found 
in system_schema.tables. I think it is equivalent to my assumption of renaming 
the directories…

Has anyone seen this before? Do you consider those approaches safe?

 

BR

MK

 

From: Michalis Kotsiouros (EXT) 
Sent: February 08, 2024 11:33
To: user@cassandra.apache.org  
Subject: SStables stored in directory with different table ID than the one 
found in system_schema.tables

 

Hello community,

I have a Cassandra server on 3.11.13 on SLES 12.5.

I have noticed in the logs the following line:

Datacenter A

org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find table for 
cfId d8c1bea0-82ed-11ee-8ac8-1513e17b60b1. If a table was just created, this is 
likely due to the schema not being fully propagated.  Please wait for schema 
agreement on table creation.

Datacenter B

org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find table for 
cfId 0fedabd0-11f7-11ea-9450-e3ff59b2496b. If a table was just created, this is 
likely due to the schema not being fully propagated.  Please wait for schema 
agreement on table creation.

 

This error results in failure of all streaming tasks.

I have checked the sstables directories and I see that:

 

In Datacenter A the sstables directory is:

-0fedabd0-11f7-11ea-9450-e3ff59b2496b

 

In Datacenter B the sstables directory are:

-0fedabd011f711ea9450e3ff59b2496b

- d8c1bea082ed11ee8ac81513e17b60b1

In this datacenter, although the -d8c1bea082ed11ee8ac81513e17b60b1 
dir is more recent, it is empty and all sstables are stored under 
-0fedabd011f711ea9450e3ff59b2496b

 

I have also checked the system_schema.tables in all Cassandra nodes and I see 
that for the specific table the ID is consistent across all nodes and it is:

d8c1bea0-82ed-11ee-8ac8-1513e17b60b1

 

So it seems that the schema is a bit of a mess in all my datacenters. I am not 
really interested in understanding how it ended up in this state, but more in 
how to recover.

Both datacenters seem to have this inconsistency between the id stored 
system_schema.tables and the one used in the sstables directory.
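As the directory listings above show, the only difference between the two spellings of the same table ID is the dashes in the UUID. A small helper (plain Python; the function names and the `mytable` prefix below are illustrative, not Cassandra tooling) can normalize both forms before comparing a directory name against the ID reported by system_schema.tables:

```python
def normalize_table_id(s: str) -> str:
    """Lowercase and strip dashes, so the dashed UUID reported by
    system_schema.tables compares equal to the dashless form used in
    some SSTable directory names."""
    return s.replace("-", "").lower()

def dir_matches_schema_id(dirname: str, schema_table_id: str) -> bool:
    # SSTable data directories are named <table_name>-<table_id>; after
    # normalizing, the directory name should end with the schema's id.
    return normalize_table_id(dirname).endswith(
        normalize_table_id(schema_table_id)
    )
```

With this, both the dashed directory from datacenter A and the dashless one from datacenter B match the same schema ID, and a stale directory (left over from an earlier incarnation of the table) does not.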

Do you have any proposal on how to recover?

I have thought of renaming the dir from 
-0fedabd011f711ea9450e3ff59b2496b to - 
d8c1bea082ed11ee8ac81513e17b60b1 but it does not look safe and I would not want 
to risk my data since this is a production system.

 

Thank you in advance.

 

BR

Michail Kotsiouros

 





Re: Question about commit consistency level for Lightweight Transactions in Paxos v2

2024-03-11 Thread Weng, Justin via user
So for upgrading Paxos to v2, the non-serial consistency level should be set to 
ANY or LOCAL_QUORUM, and the serial consistency level should still be SERIAL or 
LOCAL_SERIAL. Got it, thanks!

From: Laxmikant Upadhyay 
Date: Tuesday, 12 March 2024 at 7:33 am
To: user@cassandra.apache.org 
Cc: Weng, Justin 
Subject: Re: Question about commit consistency level for Lightweight 
Transactions in Paxos v2

You need to set both in the case of an LWT. Your regular, non-serial consistency 
level is only applied during the commit phase of the LWT.


On Wed, 6 Mar, 2024, 03:30 Weng, Justin via user, 
<user@cassandra.apache.org> wrote:
Hi Cassandra Community,

I’ve been investigating Cassandra Paxos v2 (as implemented in CEP-14), which 
improves the performance of lightweight transactions (LWT). But I’ve got a 
question about setting the commit consistency level for LWT after upgrading 
Paxos.

In cqlsh, gocql and the Python driver, there are two settings for consistency 
levels: the normal Consistency Level and the Serial Consistency Level. As 
mentioned in the cqlsh documentation, the Serial Consistency Level is only used 
for LWT and can only be set to either SERIAL or LOCAL_SERIAL. However, the 
Steps for Upgrading Paxos and CEP-14 mention that ANY or LOCAL_QUORUM can be 
used as the commit consistency level for LWT after upgrading Paxos to v2. 
Therefore, I have a question about how to correctly set the commit consistency 
level to ANY or LOCAL_QUORUM for LWT. Namely, which consistency level should I 
set, the normal Consistency Level or the Serial Consistency Level?

Any help would be really appreciated.

Thanks,
Justin


Call for Presentations: Cassandra @ Community Over Code North America 2024

2024-03-11 Thread Paulo Motta
Hi,

After a successful experience in ApacheCon 2022, the Cassandra track is
back to Community Over Code North America 2024 to be held in Denver,
Colorado, October 7-10, 2024.

I will be facilitating this track and I would like to request abstract
drafts in the following topics to be presented in this track:
- Customizing and tweaking Cassandra
- Benchmarking and testing Cassandra
- New features and improvements
- Provisioning Cassandra
- Developing with Cassandra
- Anything related to Apache Cassandra

If you are interested in presenting, please submit your title and abstract
drafts to https://communityovercode.org/call-for-presentations/ by April
1st (this is not a joke).

Provisional and generic abstracts are fine if you are unsure you will be
able to present. It will be possible to update them later if needed. At
this moment we're mostly interested in collecting rough ideas to be
presented in this track.

Please contact me if your employer is interested in sponsoring this event.
The sponsorship prospectus is available on
https://communityovercode.org/sponsors/ . If we get at least 2 sponsors we
may be able to offer a Cassandra community dinner/drinks night. ;-)

If you are planning to attend this conference and would like to
volunteer/help in the Cassandra track please contact me.

Let me know if you have any questions.

Cheers and see you in Denver! :)

Paulo


Re: Question about commit consistency level for Lightweight Transactions in Paxos v2

2024-03-11 Thread Laxmikant Upadhyay
You need to set both in the case of an LWT. Your regular, non-serial consistency
level is only applied during the commit phase of the LWT.


On Wed, 6 Mar, 2024, 03:30 Weng, Justin via user, 
wrote:

> Hi Cassandra Community,
>
>
>
> I’ve been investigating Cassandra Paxos v2 (as implemented in CEP-14),
> which improves the performance of lightweight transactions (LWT). But I’ve
> got a question about setting the commit consistency level for LWT after
> upgrading Paxos.
>
> In cqlsh, gocql and the Python driver, there are two settings for
> consistency levels: the normal Consistency Level and the Serial
> Consistency Level. As mentioned in the cqlsh documentation, the Serial
> Consistency Level is only used for LWT and can only be set to either
> SERIAL or LOCAL_SERIAL. However, the Steps for Upgrading Paxos and CEP-14
> mention that ANY or LOCAL_QUORUM can be used as the commit consistency
> level for LWT after upgrading Paxos to v2. Therefore, I have a question
> about how to correctly set the commit consistency level to ANY or
> LOCAL_QUORUM for LWT. Namely, which consistency level should I set, the
> normal Consistency Level or the Serial Consistency Level?
>
>
>
> Any help would be really appreciated.
>
>
>
> Thanks,
>
> Justin
>


Question about commit consistency level for Lightweight Transactions in Paxos v2

2024-03-05 Thread Weng, Justin via user
Hi Cassandra Community,

I’ve been investigating Cassandra Paxos v2 (as implemented in CEP-14), which 
improves the performance of lightweight transactions (LWT). But I’ve got a 
question about setting the commit consistency level for LWT after upgrading 
Paxos.

In cqlsh, gocql and the Python driver, there are two settings for consistency 
levels: the normal Consistency Level and the Serial Consistency Level. As 
mentioned in the cqlsh documentation, the Serial Consistency Level is only used 
for LWT and can only be set to either SERIAL or LOCAL_SERIAL. However, the 
Steps for Upgrading Paxos and CEP-14 mention that ANY or LOCAL_QUORUM can be 
used as the commit consistency level for LWT after upgrading Paxos to v2. 
Therefore, I have a question about how to correctly set the commit consistency 
level to ANY or LOCAL_QUORUM for LWT. Namely, which consistency level should I 
set, the normal Consistency Level or the Serial Consistency Level?

Any help would be really appreciated.

Thanks,
Justin

